NoPE (no positional embeddings) is the counterintuitive idea that a transformer does not necessarily need any explicit positional encoding at all.

In practice, recent LLMs rarely go all-in on NoPE. Instead, they use it selectively in some layers while keeping RoPE in others. That compromise is what makes the idea practically interesting today: it is less about removing position everywhere and more about asking where explicit position is really necessary.

Annotated figure from the NoPE paper showing length generalization behavior, reused in The Big LLM Architecture Comparison.

Efficiency target: Better length behavior in selected layers by removing explicit positional injections

Main tradeoff: Weaker explicit positional signal, so most practical models use it only selectively

Typical models: SmolLM3, Tiny Aya, Arcee Trinity, Sarvam 105B

Why It Exists

The standard transformer story says that attention is order-agnostic, so it needs explicit position information. NoPE challenges that by asking whether the causal mask already gives the model enough directional structure for some tasks and some layers.

In the original NoPE framing, the attraction was stronger length generalization. In current LLM practice, the attraction is more pragmatic: if you can omit RoPE in some layers without paying too much in quality, you get a cleaner and sometimes more robust long-context recipe.

Why Causal Order Still Survives

Even without explicit positional embeddings, autoregressive models still use a causal mask. That means token t can only attend to earlier positions, not future ones. So there is still an implicit ordering signal in the computation, even though the model is no longer being told, directly, “this is position 1,024.”

That is the core intuition: NoPE removes explicit positional injection, not the directional structure of causal generation.
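A minimal sketch of that intuition, assuming standard PyTorch (not code from the paper or the article): queries and keys carry no positional encoding at all, and the only ordering information comes from the causal mask applied to the attention scores.

```python
# NoPE-style causal attention: no positional embeddings anywhere,
# only the causal mask enforces "token t sees positions <= t".
import torch
import torch.nn.functional as F

def nope_causal_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); the projections contain no positional information
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    # Upper-triangular mask blocks attention to future positions
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Tiny example: 6 tokens, 16-dim model
torch.manual_seed(0)
x = torch.randn(6, 16)
w = lambda: torch.randn(16, 16) / 16 ** 0.5
out = nope_causal_attention(x, w(), w(), w())
print(out.shape)  # torch.Size([6, 16])
```

Nothing in the code says "this is position 1,024," yet each row of the attention map can only mix information from earlier rows, which is the implicit ordering signal the section describes.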

Why Labs Use It Selectively

The SmolLM3 page in the architecture article is the clearest example. The team did not remove RoPE everywhere. Instead, they omitted it in every fourth layer. Tiny Aya and Arcee follow a similar philosophy, often pairing NoPE with full/global attention layers while keeping RoPE on local or sliding-window layers.
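A hypothetical sketch of that kind of layer schedule, keeping RoPE in three out of every four layers; the layer count and helper name here are illustrative, not taken from SmolLM3's actual configuration.

```python
# Illustrative "NoPE every fourth layer" schedule (assumed layer count)
num_layers = 36

def uses_rope(layer_idx: int) -> bool:
    # NoPE on every fourth layer (indices 3, 7, 11, ...); all others keep RoPE
    return (layer_idx + 1) % 4 != 0

schedule = ["RoPE" if uses_rope(i) else "NoPE" for i in range(num_layers)]
print(schedule[:8])  # ['RoPE', 'RoPE', 'RoPE', 'NoPE', 'RoPE', 'RoPE', 'RoPE', 'NoPE']
```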

That selective usage is the practical lesson. NoPE currently behaves more like a tuning dial inside a mixed stack than like a universal positional replacement.

How To Read It In The Gallery

On the gallery page, NoPE usually appears in phrases like "periodic NoPE layers" or "NoPE + RoPE". Those labels mean the model is mixing positional strategies across layer types rather than making one global decision for the entire stack.