NoPE (no positional encoding) refers to omitting the explicit injection of positional information in selected attention layers.

In practice, the relevant modern version is usually not “remove all positional encoding everywhere.” Instead, models tend to omit RoPE only in certain layers and keep it in others.

[Figure: annotated NoPE figure illustrating length generalization]
The original NoPE motivation was stronger length generalization when explicit positional encodings are removed (underlying paper: the original NoPE paper; figure source: The Big LLM Architecture Comparison).

Why NoPE Can Still Work

The obvious question is how a model can keep track of token order if no positional signal is added. The key point, as discussed in the original NoPE paper, is that the causal attention mask still imposes an ordering structure: a token can attend only to earlier tokens, never to future ones.

So even without explicit positional embeddings, the computation is still directional. This does not mean that explicit positional information is useless. It only means that some layers may not need as much of it as we might expect.
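
To make this concrete, here is a minimal PyTorch sketch of single-head causal self-attention with no positional encoding at all; the shapes and weight names are illustrative, not taken from any particular model. The causal mask alone is what makes the computation directional.

```python
import torch
import torch.nn.functional as F

def nope_causal_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head) projection weights.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)             # (seq_len, seq_len)
    causal = torch.tril(torch.ones_like(scores)).bool()   # lower-triangular mask
    scores = scores.masked_fill(~causal, float("-inf"))   # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v                  # no RoPE, no learned positions
```

Each position attends over a different prefix of the sequence, so the hidden states can still carry order information even though nothing position-specific was added to the inputs.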

The Practical Version Is Usually Mixed

SmolLM3 is a good recent example because it does not remove RoPE everywhere; instead, it omits RoPE in every fourth layer. Tiny Aya and Arcee Trinity apply a similar idea in a slightly different way, mixing NoPE and RoPE across different attention patterns.

That is the useful practical takeaway. NoPE is usually presented not as a wholesale replacement for positional encoding, but as a selective design choice in parts of the stack where explicit position may be less necessary or less convenient.

[Figure: SmolLM3 architecture with periodic NoPE layers]
SmolLM3 makes NoPE easy to notice because its architecture figure marks the cadence explicitly instead of hiding it inside a config detail.
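
In code, that cadence is easy to express. Here is a minimal sketch; `no_rope_interval` is a hypothetical name, and SmolLM3's actual configuration may organize this differently.

```python
# Omit RoPE in every fourth layer; all other layers keep it.
def layer_uses_rope(layer_idx: int, no_rope_interval: int = 4) -> bool:
    return (layer_idx + 1) % no_rope_interval != 0

# With 0-based indexing, layers 3, 7, 11, ... become NoPE layers.
print([i for i in range(12) if not layer_uses_rope(i)])  # [3, 7, 11]
```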

Why Models Still Keep RoPE Elsewhere

Recent hybrid and long-context models often keep RoPE in local or sliding-window attention layers while dropping it in selected global-attention layers. Kimi Linear is one example from The Big LLM Architecture Comparison: its MLA layers use NoPE, while the positional bias is effectively handled by the linear-attention side.
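
Here is a hedged sketch of that local/global division of labor, assuming a generic hybrid stack; the layer pattern, the `rope` helper, and all names below are illustrative rather than taken from Kimi Linear or any specific model.

```python
import torch

def rope(x, base=10000.0):
    # x: (seq_len, d_head) with even d_head; standard rotate-half RoPE.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Hypothetical 12-layer hybrid stack: three local (sliding-window) layers,
# then one global layer, repeated. Only the local layers apply RoPE.
LAYER_KINDS = ["local", "local", "local", "global"] * 3

def position_encode_qk(q, k, layer_idx):
    if LAYER_KINDS[layer_idx] == "global":
        return q, k            # NoPE: the causal mask is the only ordering signal
    return rope(q), rope(k)    # local layer: explicit relative positions via RoPE
```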

This is also why NoPE tends to appear together with other architecture choices rather than by itself. In many recent models, it is part of a broader hybrid-attention setup.

Sources