Machine Learning FAQ
Why can a model have low training loss but still generate poor text?
A model can have low training loss but still generate poor text because low loss only means it predicts the next token well under the conditions it was trained in. That is not the same as producing good multi-sentence outputs in open-ended use.
The training objective is local: for each position, predict the next token. But good text quality is global. It depends on coherence, factuality, structure, instruction-following, and avoiding drift over many steps.
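To make the locality concrete, here is a minimal sketch of the pretraining objective in PyTorch, with invented toy tensors. Each position is graded independently against a single reference token; nothing in the loss looks at the output as a whole.

```python
import torch
import torch.nn.functional as F

# Toy setup: one sequence, 4 positions, vocabulary of 10 tokens.
# logits[t] are the model's scores for the token at position t+1.
logits = torch.randn(4, 10)            # model outputs (one row per position)
targets = torch.tensor([3, 7, 1, 4])   # the actual next tokens

# The pretraining loss is just the average next-token cross-entropy.
# Nothing here scores coherence, factuality, or instruction-following;
# each position is judged independently against one "correct" token.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```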

There are several common reasons for this mismatch.
Overfitting
The model may drive training loss down by memorizing its training data while generalizing poorly to new text. In that case, validation loss matters more than training loss.
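The standard way to catch this is to track held-out loss alongside training loss. A rough sketch of the bookkeeping, assuming hypothetical `model`, `train_batches`, and `val_batches` objects with the obvious shapes:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_loss(model, batches):
    # Average next-token cross-entropy over (inputs, targets) batches.
    model.eval()
    losses = [F.cross_entropy(model(x), y).item() for x, y in batches]
    return sum(losses) / len(losses)

train_loss = mean_loss(model, train_batches)  # hypothetical data
val_loss = mean_loss(model, val_batches)

# The overfitting signature: training loss keeps falling while
# held-out loss stalls or rises, so the gap widens over training.
print(f"train {train_loss:.3f}  val {val_loss:.3f}  gap {val_loss - train_loss:.3f}")
```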
One-step versus rollout quality
During training, the model always sees correct prefixes. During generation, it conditions on its own outputs, so small mistakes can compound over many steps even if one-step prediction loss was low. This mismatch between train-time and generation-time conditioning is often called exposure bias.
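The contrast is easy to see in code. A sketch, assuming a hypothetical causal LM `model` that maps a 1-D tensor of token ids to per-position next-token logits:

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, tokens):
    # Training: every prediction is conditioned on the TRUE prefix.
    logits = model(tokens[:-1])              # (T-1, vocab)
    return F.cross_entropy(logits, tokens[1:])

def rollout(model, prompt, steps):
    # Generation: each prediction is conditioned on the model's OWN
    # previous outputs, so one bad sample shifts every later step.
    tokens = prompt.clone()
    for _ in range(steps):
        next_logits = model(tokens)[-1]      # logits for the next token
        next_token = torch.multinomial(next_logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_token])
    return tokens
```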
Objective mismatch
Cross-entropy does not directly optimize for truthfulness, helpfulness, style, or task completion. A model can be statistically good at continuation without being especially useful to a human.
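A toy numeric illustration of the mismatch, with invented probabilities: if a fluent false continuation and a fluent true one are equally probable under the model, cross-entropy scores them identically.

```python
import math

def mean_nll(probs):
    # Per-token negative log-likelihood, averaged over the sequence.
    return -sum(math.log(p) for p in probs) / len(probs)

# Invented probabilities the model assigns to each reference token of
# two continuations of "The capital of France is ...":
helpful = [0.6, 0.5, 0.9]          # "... Paris."  (true)
confident_wrong = [0.6, 0.5, 0.9]  # "... Lyon."   (false but fluent)

# Identical probabilities mean identical loss; cross-entropy never
# sees which continuation is actually true or useful.
print(mean_nll(helpful), mean_nll(confident_wrong))
```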
Decoding choices
Bad sampling settings can make a decent model look poor: too high a temperature produces incoherent rambling, while greedy decoding often gets stuck in repetitive loops. Generation quality depends on both the trained model and the decoding policy.
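As an illustration, here is a common temperature-plus-top-k sampler in sketch form (the default values are arbitrary, not recommendations):

```python
import torch

def sample_next(logits, temperature=0.8, top_k=50):
    # Temperature rescales the logits: higher values flatten the
    # distribution (more random), lower values sharpen it (more greedy).
    logits = logits / max(temperature, 1e-6)
    # Top-k truncation: mask everything outside the k best tokens,
    # which avoids sampling from the noisy low-probability tail.
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

The same trained weights can look dramatically different under, say, temperature 0.7 with top-k truncation versus unfiltered high-temperature sampling.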

This is one reason later repo chapters move beyond plain pretraining. Finetuning, instruction tuning, preference tuning, and better evaluation all exist because low next-token loss is not enough by itself.
Low loss is still valuable: it usually means the model learned something real. But it is only one part of the story.
In short, a model can have low training loss and still generate poor text because next-token prediction measures local fit to training data, while real generation quality depends on generalization, long-horizon behavior, decoding, and downstream alignment with what users actually want.