Rotary Positional Embeddings (RoPE)
RoPE became the default positional recipe in modern LLMs because it fits naturally into attention and generalizes to longer contexts better than the older learned absolute-position setup.
Instead of adding a separate position embedding vector to the token representation, RoPE rotates the query and key vectors in a position-dependent way. That keeps positional information tied directly to the attention mechanism, and once Llama-style architectures became mainstream, RoPE spread with them.
Why It Exists
Attention by itself has no notion of order. Early GPT architectures solved that with learned absolute positional embeddings added to the token vectors. RoPE changes the place where position enters the computation: not at the token embedding input, but inside the attention mechanism itself.
That shift turned out to be important because it gave labs a positional scheme that played better with longer contexts and with later context-extension tricks like position interpolation or YaRN-style scaling.
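To make that concrete, here is a deliberately simplified sketch of linear position interpolation; the trained/target lengths and the `scaled_pos` helper are illustrative assumptions, not taken from a specific model. Because RoPE turns position into a rotation angle, the usable context can be stretched by rescaling the position index alone, without retraining any weights.

```python
# A minimal sketch of linear position interpolation, one of the simplest
# RoPE context-extension tricks. If a model was trained on positions
# 0..4095, squeeze longer sequences back into that range by scaling the
# position index before it enters the rotation. The lengths and the
# `scaled_pos` name are illustrative, not from a specific model.
def scaled_pos(pos: int, trained_len: int = 4096, target_len: int = 16384) -> float:
    scale = target_len / trained_len   # 4x extension -> scale factor of 4
    return pos / scale                 # position 12000 is rotated like 3000

# YaRN-style schemes refine this by treating low and high rotation
# frequencies differently, but the hook is the same: adjust the angle
# fed to RoPE, not the model weights.
```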
What Changes In Attention
RoPE rotates the query and key vectors using position-dependent sine and cosine terms: each consecutive pair of dimensions is treated as a point in 2D and rotated by an angle proportional to the token's position, with a different rotation frequency per pair. The practical effect is that relative position gets encoded directly into the dot product used by attention. So instead of adding a separate learned vector for “position 512,” the attention computation itself becomes position-aware.
This is also why RoPE is usually discussed together with q and k rather than with the embedding layer. It lives in the attention path, not the token-embedding path.
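Here is a minimal NumPy sketch of that rotation; the `rope_rotate` name is illustrative, not from a particular library. It rotates consecutive dimension pairs of a single head's q or k vector, then checks that the attention score depends only on the relative offset between positions.

```python
import numpy as np

def rope_rotate(x, pos, dim, base=10000.0):
    """Rotate vector x (shape [dim], dim even) as if it sits at token position `pos`.

    Consecutive dimension pairs (x[2i], x[2i+1]) are treated as 2D points
    and rotated by the angle pos * base**(-2i/dim), one frequency per pair.
    """
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                        # even / odd dims form pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # standard 2D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

dim = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=dim), rng.normal(size=dim)

# The q.k attention score depends only on the relative offset m - n:
# (q at 5, k at 2) gives the same score as (q at 105, k at 102).
s1 = rope_rotate(q, 5, dim) @ rope_rotate(k, 2, dim)
s2 = rope_rotate(q, 105, dim) @ rope_rotate(k, 102, dim)
print(np.allclose(s1, s2))  # True
```

The check at the bottom is the whole point: shifting both positions by 100 leaves the score unchanged, because the two rotations compose into a single rotation by the relative offset.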
Why It Became Standard
RoPE sits in a good middle ground. It is not as bare as NoPE, which removes explicit position information entirely, but it is more flexible than learned absolute embeddings. That is why most mainstream dense and sparse decoders still say RoPE in the architecture figure unless they are doing something more unusual.
Recent models then stack extra variations on top. Some use partial RoPE on only a subset of dimensions, while others mix RoPE and NoPE across different layer types.
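Partial RoPE is simple to picture in code. The sketch below reuses the illustrative `rope_rotate` helper from the previous example and rotates only the leading fraction of each head's dimensions; the `rotary_frac` knob and its 0.5 default are assumptions for illustration, not a specific model's setting.

```python
import numpy as np

def partial_rope(x, pos, rotary_frac=0.5, base=10000.0):
    """Apply RoPE to the leading fraction of dims; leave the rest untouched.

    This mirrors the partial-rotary idea (e.g. a GPT-NeoX-style rotary_pct):
    the rotated dims carry position, the tail dims stay position-free.
    """
    dim = x.shape[-1]
    rot_dim = int(dim * rotary_frac)
    rot_dim -= rot_dim % 2                          # keep whole 2D pairs
    rotated = rope_rotate(x[:rot_dim], pos, rot_dim, base)
    return np.concatenate([rotated, x[rot_dim:]])

q_head = np.random.default_rng(1).normal(size=8)
print(np.allclose(partial_rope(q_head, pos=7)[4:], q_head[4:]))  # True: tail untouched
```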
How To Read It In The Gallery
On the gallery page, RoPE is the quiet default. You will mostly notice it when a card says something more specific like partial RoPE, NoPE + RoPE, or RoPE only on sliding-window layers. Those are the cases where a model is no longer following the plain mainstream recipe.