In this tutorial I want to briefly show why it is important to correctly scale your train and test data. Although most machine learning practitioners automatically avoid the fallacy of scaling the test data with its own scaler instead of the one learned on the trainset, I think many do not know exactly why this matters. Here, I will give a concrete example of why you need to use the scaler fitted on the trainset for the testset as well.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

Let's first create some dummy data. For example, we can assume the following are three different users, each described by three variables. We also create the targets, which we can think of as the cluster each user belongs to:

train_data = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.85, 0.15, 0.]])
train_targets = np.array([2, 1, 0])
test_data = np.array([[0.8, 0.1, 0.1], [0.3, 0.5, 0.2], [0.85, 0.15, 0.]])

You see that user 1 and user 3 in the test_data are exactly the same as user 1 and user 3 in the train_data. This is on purpose. Let's create two standard scalers, meaning we will subtract the mean and divide by the standard deviation.

train_scaler = StandardScaler()
test_scaler = StandardScaler()

scaled_train_data = train_scaler.fit_transform(train_data)
scaled_test_data = test_scaler.fit_transform(test_data)
correctly_scaled_test_data = train_scaler.transform(test_data)

You can already see the difference between the two approaches. The first two scaled datasets are built with the fit_transform method of their own StandardScaler, while the last one uses the "trained" scaler from the trainset to scale the test_data. Let's have a look at the different means and variances:

train_scaler.mean_, train_scaler.var_
(array([0.78333333, 0.15      , 0.06666667]),
 array([0.00388889, 0.00166667, 0.00222222]))
test_scaler.mean_, test_scaler.var_
(array([0.65, 0.25, 0.1 ]), array([0.06166667, 0.03166667, 0.00666667]))
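As a quick sanity check, with the default settings these are simply the column-wise mean and (population) variance of the respective dataset, so you can reproduce them directly with numpy:

np.mean(train_data, axis=0), np.var(train_data, axis=0)
np.mean(test_data, axis=0), np.var(test_data, axis=0)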

And this is what the data looks like:

pd.DataFrame(scaled_train_data)
0 1 2
0 0.267261 -1.224745e+00 0.707107
1 -1.336306 1.224745e+00 0.707107
2 1.069045 -6.798700e-16 -1.414214
pd.DataFrame(scaled_test_data)
0 1 2
0 0.604040 -0.842927 -1.699675e-16
1 -1.409428 1.404879 1.224745e+00
2 0.805387 -0.561951 -1.224745e+00
pd.DataFrame(correctly_scaled_test_data)
0 1 2
0 0.267261 -1.224745e+00 0.707107
1 -7.750576 8.573214e+00 2.828427
2 1.069045 -6.798700e-16 -1.414214
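If you want to see what the trained scaler does under the hood, you can reproduce the correctly scaled test data by hand. This is just a small sanity check: subtract the train mean and divide by the train standard deviation (the scaler's scale_ attribute, i.e. the square root of var_):

# Reproduce the correct scaling manually with the trainset statistics.
manual_scaled_test_data = (test_data - train_scaler.mean_) / train_scaler.scale_
np.allclose(manual_scaled_test_data, correctly_scaled_test_data)  # True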

At first glance, the correctly scaled test data looks wrong, simply because the numbers for user 2 seem so far off. But that is only because user 2 lies far outside the distribution of the trainset. However, let's have a look at what happens when we fit a simple Linear Regression on our trainset and make predictions on both versions of the scaled testset:

lin_model = LinearRegression().fit(scaled_train_data, train_targets)

Remember, in our testset, we expect user 1 and user 3 to be classified as 2 and 0 respectively.

lin_model.predict(scaled_test_data)
array([1.33714607, 1.23420957, 0.42864436])

But this did not happen at all here. We see that neither the first user nor the third user is classified correctly, even though they are exactly the same as in the trainset. Let's see what the results are when using the scaler from our trainset:

lin_model.predict(correctly_scaled_test_data)
array([ 2.00000000e+00, -5.00000000e-01, -1.11022302e-16])

This time, the Linear Model correctly classified user 1 and user 3. So you see what a difference wrongly applied scaling makes. Keep this in mind when working with train and testsets.
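By the way, the easiest way to avoid this mistake altogether is to wrap the scaler and the model in a scikit-learn Pipeline. Here is a minimal sketch with the data from above; the pipeline fits the StandardScaler on the trainset only and automatically reuses it for every prediction:

from sklearn.pipeline import make_pipeline

# The scaler is fitted on train_data only and automatically
# applied to any data passed to predict().
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(train_data, train_targets)
pipeline.predict(test_data)

This gives the same predictions as the correctly scaled approach above.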

Lasse