A

Can you explain how transformer attention mechanisms work in large language models?

18:30

Claude · 18:30 · 1.8s · 312 tokens

The attention mechanism in transformers is the core innovation that made modern LLMs possible. Here's how it works:

Self-Attention (Scaled Dot-Product Attention)

For each token in your input sequence, the model computes three vectors:

  • Query (Q): What this token is "looking for"
  • Key (K): What this token "offers" to others
  • Value (V): The actual content to aggregate

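The attention weight that token i places on token j comes from a softmax over the scaled query-key dot products, and the output for token i is the weighted sum of all value vectors:

python
weights = softmax(Q @ K.T / sqrt(d_k))  # softmax over each row; row i = how much token i attends to each token j
output = weights @ V                     # row i = weighted sum of the value vectors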
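
To make the computation concrete, here is a minimal sketch of scaled dot-product attention in plain Python (lists instead of tensors, to stay dependency-free). The function names and the toy Q, K, V values are illustrative assumptions, not taken from any particular library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q, K are n x d_k and V is n x d_v. Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # attention weights for this query; they sum to 1
        # Output is the weighted sum of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: 3 tokens, d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the tokens, and each row of `outputs` is a blend of the value vectors under that distribution.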

Multi-Head Attention

Rather than running attention once, transformers run it in parallel across multiple "heads" — each learning different relationship patterns:

python
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
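
As a rough sketch of the formula above for self-attention (Q = K = V = X), the following plain-Python code projects the input once per head, runs attention in each head, concatenates the results, and applies W^O. The sizes (4 tokens, d_model = 8, 2 heads) and the random weights are assumptions for illustration only; a real model learns these matrices.

```python
import math
import random

def matmul(A, B):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T, shape n x n
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one tuple per head."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the heads along the feature dimension, then project with W_O.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, h = 4, 8, 2
d_head = d_model // h
X = rand_matrix(n, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(h * d_head, d_model, rng)
out = multi_head(X, heads, W_O)  # shape n x d_model
```

Splitting d_model across heads keeps the total cost close to that of one full-width head while letting each head form its own attention pattern.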

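Why it matters: Every token can attend directly to every other token, so long-range dependencies that RNNs struggled with can be captured within a single layer. The cost is O(n²) time and memory in the sequence length n. The sqrt(d_k) scaling keeps the dot products from growing with the dimension; without it, large scores push the softmax into saturated regions where gradients become very small.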
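
A small experiment can illustrate the effect of the sqrt(d_k) scaling. The sketch below assumes unit-variance random queries and keys with d_k = 512 (an arbitrary choice) and compares the softmax over raw and scaled scores.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

rng = random.Random(0)
d_k, n_keys = 512, 8  # illustrative sizes

# Random query and keys with unit-variance components.
q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]

# Raw dot products have a standard deviation of roughly sqrt(d_k).
raw = [sum(a * b for a, b in zip(q, k)) for k in keys]

unscaled = softmax(raw)
scaled = softmax([s / math.sqrt(d_k) for s in raw])

print("largest weight without scaling:", max(unscaled))
print("largest weight with scaling:   ", max(scaled))
```

Without scaling, the softmax tends to put nearly all of its mass on a single key, which is the saturated regime where gradients shrink. Dividing by sqrt(d_k) flattens the distribution.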

How does positional encoding fit into this? And why do newer models use RoPE instead of sinusoidal?

18:32

Claude · 18:33 · 2.2s · 428 tokens

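Great follow-up. Positional encoding is necessary because self-attention on its own is permutation-equivariant: it treats the input as an unordered set. Without position information, "cat sat on mat" and "mat on sat cat" would give each word exactly the same representation, just in a different order.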
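
This order-blindness is easy to check numerically. The sketch below uses made-up 2-d word embeddings and self-attention with identity projections (Q = K = V = X), both simplifying assumptions, and compares the per-word outputs for the two word orders.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Self-attention with identity projections and no position information."""
    d = len(X[0])
    out = []
    for q in X:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X])
        out.append([sum(w_j * v[c] for w_j, v in zip(w, X)) for c in range(d)])
    return out

# Hypothetical embeddings, chosen only for illustration.
emb = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "on": [1.0, 1.0], "mat": [0.5, -1.0]}

order_a = ["cat", "sat", "on", "mat"]
order_b = ["mat", "on", "sat", "cat"]

out_a = dict(zip(order_a, self_attention([emb[w] for w in order_a])))
out_b = dict(zip(order_b, self_attention([emb[w] for w in order_b])))

for word in order_a:
    # Same vector for each word in both orders, up to floating-point rounding.
    print(word, out_a[word], out_b[word])
```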

Original Sinusoidal Encoding (Vaswani et al. 2017)

Adds fixed sine/cosine waves of different frequencies to token embeddings:

python
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
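
A direct translation of these two formulas into plain Python might look like the following sketch. The function name and the table sizes are illustrative assumptions, and d_model is assumed to be even.

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / base ** (2 * i / d_model)
            # Dimension 2i gets the sine, dimension 2i+1 the cosine.
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=50, d_model=8)
# pe[pos] would be added element-wise to the embedding of the token at pos.
```

Low dimension indices oscillate quickly with position and high ones slowly, so each position gets a distinct pattern across the vector.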

Works, but the position information is added to the embeddings *before* attention, so it reaches Q and K only indirectly, mixed into the token content by the learned projections.

Rotary Position Embedding (RoPE)

Used in LLaMA, Mistral, and many other recent open-weight models. Instead of adding position to the embeddings, it rotates the Q and K vectors inside each attention layer by an angle proportional to position:

python
# Rotating query vector q at position m
q_rotated = q * cos(m*θ) + rotate_half(q) * sin(m*θ)
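
Here is one way the rotation could be written out in plain Python. It assumes the "rotate_half" convention, where dimension i is paired with dimension i + d/2, and the base of 10000 used in the RoPE paper. Implementations differ in how they pair dimensions, so treat this as a sketch rather than any specific model's code.

```python
import math

def rotate_half(x):
    """Map the halves (x1, x2) of x to (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def rope(x, m, base=10000.0):
    """Rotate the even-length vector x as if it sat at position m."""
    half = len(x) // 2
    # One frequency per 2-d pair; pair i is made of dims i and i + half.
    theta = [base ** (-2 * i / len(x)) for i in range(half)]
    cos = [math.cos(m * t) for t in theta] * 2
    sin = [math.sin(m * t) for t in theta] * 2
    rot = rotate_half(x)
    return [a * c + r * s for a, c, r, s in zip(x, cos, rot, sin)]

q = [1.0, 2.0, 3.0, 4.0]   # toy query vector
q_rotated = rope(q, m=5)   # the same query, placed at position 5
```

Because each pair is only rotated, the vector's length is unchanged; position affects the direction of Q and K, not their magnitude.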

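Why RoPE wins:

  1. Relative positions emerge naturally: the dot product of a rotated Q and K depends only on their *relative* offset, not on their absolute positions.
  2. Easier context extension: out of the box, extrapolation past the training length is still limited, but the rotation frequencies can be rescaled (position interpolation, NTK-aware scaling, YaRN) to extend the context with little or no fine-tuning.
  3. No learned position parameters: like the sinusoidal scheme, and unlike learned absolute embeddings, it is purely geometric and adds no extra weights.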
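
The relative-offset property can be checked with a short sketch. It views each 2-d pair of a vector as one complex number, so rotating by m·θ is multiplication by e^(i·m·θ). The frequencies and vectors below are arbitrary illustrative values.

```python
import cmath

def rope_complex(pairs, m, thetas):
    """Rotate each complex-valued pair by the angle m * theta."""
    return [z * cmath.exp(1j * m * t) for z, t in zip(pairs, thetas)]

def score(q, k):
    """Real dot product of the underlying 2-d pairs: Re(sum q * conj(k))."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

thetas = [1.0, 0.1]        # illustrative frequencies, one per pair
q = [1 + 2j, 0.5 - 1j]     # a 4-d query viewed as two complex pairs
k = [-1 + 0.5j, 2 + 1j]    # a 4-d key viewed the same way

# Same offset (2) at very different absolute positions.
s_near = score(rope_complex(q, 3, thetas), rope_complex(k, 1, thetas))
s_far = score(rope_complex(q, 103, thetas), rope_complex(k, 101, thetas))

# Different offset (1).
s_other = score(rope_complex(q, 3, thetas), rope_complex(k, 2, thetas))

print(s_near, s_far, s_other)
```

The first two scores agree up to floating-point rounding because the absolute positions cancel in q · conj(k), leaving only the offset. The third differs because the offset changed.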

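This is one reason open models built on RoPE, such as LLaMA and Mistral variants, have been extended to 100k+ token context windows. The positional schemes used by GPT-4, Claude, and Gemini have not been published in detail.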

