Let me walk through an example for this very common question to show why we don’t want to standardize new (or test) data “from scratch.”

Let’s assume we have a simple training set consisting of 3 samples with 1 feature (let’s call this feature “length”):

  • train_1: 10 cm -> class_2
  • train_2: 20 cm -> class_2
  • train_3: 30 cm -> class_1

mean: 20, standard deviation: 8.2

After standardization, the transformed feature values are

  • train_std_1: -1.22 -> class_2
  • train_std_2: 0 -> class_2
  • train_std_3: 1.22 -> class_1
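
To make the arithmetic concrete, here is a minimal sketch of the standardization step using NumPy and scikit-learn’s StandardScaler (the sample values are the ones from the example above; note that StandardScaler uses the population standard deviation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])  # "length" in cm

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

print(scaler.mean_)         # [20.]
print(scaler.scale_)        # approx. [8.165] -- population standard deviation
print(X_train_std.ravel())  # approx. [-1.22  0.    1.22]
```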

Next, let’s assume our model has learned to classify samples with a standardized length value < 0.6 as class_2 (class_1 otherwise). So far so good. Now, let’s say we have 3 unlabeled data points that we want to classify:

  • new_4: 5 cm -> class ?
  • new_5: 6 cm -> class ?
  • new_6: 7 cm -> class ?

If we look at the unstandardized “length” values in our training dataset, it is intuitive to say that all of these samples likely belong to class_2. However, if we standardize them by re-computing the mean and standard deviation on the new data, we get similar values as before in the training set, and the classifier would assign class_2 only to samples 4 and 5, while (probably incorrectly) assigning class_1 to sample 6:

  • new_std_4: -1.22 -> class_2
  • new_std_5: 0 -> class_2
  • new_std_6: 1.22 -> class_1
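
In code, this is what the (incorrect) “from scratch” approach looks like, sketched with the same assumptions as above — fitting a brand-new StandardScaler on the unlabeled data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_new = np.array([[5.0], [6.0], [7.0]])  # unlabeled "length" values in cm

# Incorrect: fitting a *new* scaler on the new data
wrong_scaler = StandardScaler()
X_new_wrong = wrong_scaler.fit_transform(X_new)
print(X_new_wrong.ravel())  # approx. [-1.22  0.    1.22]

# With the learned rule "standardized length < 0.6 -> class_2",
# the 7 cm sample would now (probably incorrectly) be assigned class_1.
```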

However, if we use the parameters from our “training set standardization,” we’d get the values:

  • new_std_4: -1.84 -> class_2
  • new_std_5: -1.71 -> class_2
  • new_std_6: -1.59 -> class_2

The values 5 cm, 6 cm, and 7 cm are much lower than anything we have seen in the training set, so it only makes sense that their standardized values are also much lower than every standardized feature in the training set.
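
The correct procedure is to reuse the mean and standard deviation estimated on the training data, for example by calling transform (not fit_transform) with the scaler that was fitted on the training set — a sketch under the same assumptions as the snippets above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_new = np.array([[5.0], [6.0], [7.0]])

scaler = StandardScaler().fit(X_train)  # fit on the training data only
X_new_std = scaler.transform(X_new)     # reuse the training parameters
print(X_new_std.ravel())                # approx. [-1.84 -1.71 -1.59]

# All three values are below 0.6, so all three samples are assigned class_2.
```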




If you like this content and are looking for similar, more polished Q&As, check out my new book, Machine Learning Q and AI.