Machine Learning FAQ
What is a large language model (LLM), and how is it different from earlier language models?
A language model is a system that estimates how likely a sequence of tokens is or, more practically, predicts which token should come next given a context.
A large language model (LLM) is a modern language model that is trained on a very large text corpus and has a very large number of learnable parameters. In current practice, LLMs are usually based on the transformer architecture and trained with a self-supervised objective such as next-token prediction.
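To make the next-token prediction objective concrete, here is a minimal illustrative sketch. The vocabulary, logits, and numbers are made up for the example; a real LLM works with tens of thousands of vocabulary entries and billions of parameters, but the loss computation has the same shape.

```python
import numpy as np

# Toy illustration of the next-token prediction objective.
vocab = ["the", "cat", "sat", "on", "mat", "."]
context = ["the", "cat", "sat", "on", "the"]   # tokens seen so far
target = "mat"                                  # the token the model should predict next

# Pretend these are the model's raw scores (logits) for the next token.
logits = np.array([1.2, 0.1, -0.3, 0.0, 2.5, -1.0])

# Softmax turns logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The self-supervised training loss is the negative log-probability
# assigned to the actual next token (cross-entropy).
loss = -np.log(probs[vocab.index(target)])
print(dict(zip(vocab, probs.round(3))), "loss:", round(float(loss), 3))
```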
This may sound modest at first, but it has an important consequence: once a model becomes good at predicting tokens over many different kinds of text, it can often be adapted to many tasks without designing a separate model for each one. For example, the same pretrained model can later be used for question answering, summarization, text classification, code generation, or instruction following via prompting or finetuning.
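As a rough sketch of what "adapted via prompting" means, the snippet below frames several tasks as different prompts to the same model. The `generate` function here is a hypothetical stand-in, not a real API; in practice you would route it to whatever inference library or endpoint you actually use.

```python
def generate(prompt: str) -> str:
    # Hypothetical stand-in for a call to a pretrained LLM's completion
    # endpoint; it just echoes a placeholder so the example runs.
    return f"<model completion for: {prompt[:40]}...>"

# The *same* model handles different tasks purely through how the prompt is phrased.
def summarize(text: str) -> str:
    return generate(f"Summarize the following text in one sentence:\n{text}\n\nSummary:")

def classify_sentiment(text: str) -> str:
    return generate(f"Is the sentiment of this review positive or negative?\n{text}\n\nAnswer:")

def answer_question(question: str, passage: str) -> str:
    return generate(f"Passage: {passage}\n\nQuestion: {question}\nAnswer:")

print(summarize("Transformers replaced recurrent networks for most language tasks."))
```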
How is this different from earlier language models?
Earlier language models were usually much more limited in architecture, in scale, or in both.
Traditional statistical language models, such as n-gram models, estimated probabilities from counts of short token sequences. These models are conceptually simple, but they only use a fixed-size context and suffer from sparsity. If a particular phrase was rare or never appeared in the training data, the model had little basis for handling it well.
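A toy bigram model makes both points visible: probabilities come straight from counts, and any pair that never occurred in the training text gets probability zero. The corpus below is invented purely for illustration.

```python
from collections import Counter, defaultdict

# Toy bigram model: estimate P(next | previous) from raw counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def bigram_prob(prev: str, nxt: str) -> float:
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))   # seen in the corpus -> nonzero estimate
print(bigram_prob("the", "bird"))  # never seen -> 0.0, the sparsity problem
```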
Later neural language models, including recurrent neural networks (RNNs) and LSTMs, improved on this by learning dense vector representations instead of relying purely on counts. However, they process tokens sequentially, which limits how much of training can be parallelized and often makes it harder to scale them efficiently to very large datasets and model sizes.
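The sequential bottleneck is easy to see in a stripped-down RNN cell (dimensions here are arbitrary, chosen only for illustration): each hidden state depends on the previous one, so the time steps must be computed one after another.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny RNN cell: h_t is a function of the *previous* hidden state h_{t-1},
# so time steps cannot be computed in parallel the way attention can.
d_in, d_hidden = 8, 16
W_xh = rng.normal(size=(d_in, d_hidden))
W_hh = rng.normal(size=(d_hidden, d_hidden))

tokens = rng.normal(size=(5, d_in))    # a sequence of 5 token embeddings
h = np.zeros(d_hidden)
for x_t in tokens:                     # must walk the sequence step by step
    h = np.tanh(x_t @ W_xh + h @ W_hh)
print(h.shape)                         # (16,): final hidden state summarizes the sequence
```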
Modern LLMs differ from these earlier approaches in several important ways:
- They are typically based on transformers with self-attention, which model relationships between tokens more flexibly than count-based methods and are easier to scale in training than recurrent models (see the sketch after this list).
- They are trained at much larger scale in terms of data, parameters, and compute.
- They are often used as general-purpose pretrained models that can be adapted to many downstream tasks, rather than being built for only one narrow task from the start.
- They can capture longer-range context much more effectively than classic n-gram models, which are constrained by a short fixed window.
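For intuition about the self-attention mechanism mentioned above, here is a minimal scaled dot-product attention sketch over a toy sequence. Dimensions and weights are random placeholders; real transformers use multiple heads, learned weights, and much larger sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal single-head scaled dot-product self-attention (illustrative only).
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))        # token representations
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)            # every token scores every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
output = weights @ V                           # context-mixed representations
print(output.shape)                            # (4, 8)
```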
So, the key difference is not just that LLMs are “bigger.” They combine large-scale pretraining, transformer-based architectures, and general-purpose transfer in a way that earlier language models usually did not.
In short, an LLM is a large transformer-based language model trained on vast amounts of text to predict tokens. Earlier language models, by contrast, were often count-based or recurrent, much smaller in scale, and generally less effective as flexible, multi-purpose language systems.