Positional Embeddings: An Intuitive Guide

Positional embeddings are a term that comes up often in discussions about transformers, and they can seem scary at first. In this article, we will explore intuitively what they are, why they’re useful, and how they work!
Note: This article will be a bit math-intensive in order to explain the actual inner workings. You have been informed!
Setting the Stage
Let us say we have two sentences: “The cat chased the dog” and “The dog chased the cat”. In each case, the animal being chased depends purely on the order of the words, even though the words themselves remain the same. How do we teach a computer that the order matters when the words are identical?
A naive approach would be to treat the first animal we see as the one doing the chasing, but should the sentence be “The dog was chased by the cat”, this approach fails.
Thus, to teach a computer that the order of words matters, we use positional embeddings.
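To make the problem concrete, here is a minimal sketch in Python. The helper name bag_of_words is only illustrative; it stands in for any representation that looks at which words occur and ignores their order. Under that view, the two example sentences are indistinguishable.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count the words in a sentence, discarding their order."""
    return Counter(sentence.lower().split())

a = "The cat chased the dog"
b = "The dog chased the cat"

# The word counts are identical, even though the sentences mean different things.
same_words = bag_of_words(a) == bag_of_words(b)

# Only the sequence of words tells the two sentences apart.
same_order = a.lower().split() == b.lower().split()

print(same_words, same_order)  # True False
```

Any model that only sees the word counts has no way to tell who chased whom, which is the gap positional embeddings are meant to fill.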
This article is Part 1 of 2 in a series about Transformers and the Attention Is All You Need paper. The transformer architecture lies at the heart of almost every large language model today. It allows these models to capture the highly complex relations found in natural language, which lets LLMs understand human language extremely well. In this article, we will look at how a computer might learn the order of words. Positional embeddings are very tightly related to transformers, and it is beneficial to understand them before understanding transformers.
What is an Embedding?
A vector embedding, also known simply as an embedding, can be considered a point in a high-dimensional space. Imagine a point in 3D space, but scaled to thousands of dimensions, not just three.
A vector with 1024 dimensions would look like [x1, x2, x3, …, x1024]
Why embeddings?
You might be asking yourself: if my end goal is to associate words with their order, and some sense of what comes first, why can’t we just assign numbers to the words directly? In the sentence “The cat chased the dog”, why can’t we just label the words as 1, 2, 3, 4, and 5? This may work, but it does not fully capture the meaning behind the words.
“Why doesn’t it capture the meaning?”
Labelling words with numbers does not work for two main reasons:
Language isn't linear. The relationship between word 1 and word 2 is not the same as the one between word 99 and word 100, even though both pairs are exactly one step apart. A single scalar per word cannot express these varying relationships.
It does not generalise. A sentence longer than anything the model was trained on simply continues the same counting pattern with larger numbers the model has never seen, so the model gains no new insight about those positions.
Core Working
Building the Intuition
Before getting into the solution, let's clarify what we mean by controlling the "locations" of words (which we'll now call tokens). We need a way to represent a token's position in a sentence that captures two important aspects:
Coarse-Grained Location (General Location): This refers to the general region of the sentence where a token appears. Is it at the beginning, in the middle, or towards the end? This provides a broad sense of order. For example, we want to know if a token is generally in the first half or the second half of the sentence. This helps with understanding long-range dependencies (relationships between words that are far apart).
Fine-Grained Location (Specific Location): This refers to the precise position of the token relative to its immediate neighbors. Is it immediately before or after another specific token? This is essential for understanding the local syntactic structure and distinguishing between similar phrases. For example, "the cat chased the dog" and "the dog chased the cat" have the same coarse-grained structure (beginning, middle, end) but very different fine-grained structure.
Our goal is to create a system that allows the model to easily distinguish between both these levels of positional information. We need a way to encode both the general "neighborhood" a token is in and its exact address within that neighborhood.
This is where Sinusoidal Embeddings come into the picture. While Sinusoidal Embeddings aren’t the only way to create positional embeddings, they are the approach used in the original Vaswani et al. paper on transformers.
Laying the Groundwork
The equations for sinusoidal positional embeddings are given by:
$$PE_{(pos, 2i)} = \sin \left( \frac{\text{pos}}{10000^{\frac{2i}{d}}} \right)$$
$$PE_{(pos, 2i+1)} = \cos \left( \frac{\text{pos}}{10000^{\frac{2i}{d}}} \right)$$
These may look scary, but let’s deconstruct them step by step. These equations define the values in the positional embedding, i.e. the value of the element at index 2i (or 2i+1) is given by the formula for that index. To fill in these values, we:
Start with a blank vector of size d, initially filled with zeros.
Assign sine values to all even indices (0, 2, 4, …) and cosine values to all odd indices (1, 3, 5, …).
In other words, every even index (2i) holds a sine value and every odd index (2i+1) holds a cosine value.
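The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name sinusoidal_embedding and the base parameter are assumptions made for this example, and d is assumed to be even so that every sine value has a matching cosine value.

```python
import math

def sinusoidal_embedding(pos, d, base=10000.0):
    """Return the d-dimensional sinusoidal positional embedding for one position."""
    if d % 2 != 0:
        raise ValueError("this sketch assumes an even d so sine/cosine come in pairs")
    pe = [0.0] * d                        # start with a blank vector of zeros
    for i in range(d // 2):               # i indexes each sine/cosine pair
        angle = pos / (base ** (2 * i / d))
        pe[2 * i] = math.sin(angle)       # even index 2i holds the sine value
        pe[2 * i + 1] = math.cos(angle)   # odd index 2i+1 holds the cosine value
    return pe

first = sinusoidal_embedding(pos=0, d=8)   # embedding for the first token
second = sinusoidal_embedding(pos=1, d=8)  # embedding for the second token
```

For pos = 0 every angle is 0, so the first token's embedding alternates between sin(0) = 0 and cos(0) = 1. Later positions produce a different pattern in each pair of entries.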
Let’s deconstruct the formula:
The term ‘pos’ is a scalar, representing the token’s index. (Token 1 has pos 0, token 2 has pos 1, and so on.)
‘i’ indexes the sine/cosine pair we are currently calculating. It runs from 0 to d/2 − 1, and each value of i fills two entries of the vector: 2i and 2i+1.
‘d’ is the dimension of the model, generally the same as the embedding vector dimension.
Denominator: The denominator grows exponentially with i, so the value inside the sine and cosine shrinks as we move further along the vector. For later values of i it gets very close to 0 at small positions, which means those entries change very slowly from one token to the next. 10000 acts as a hyperparameter that affects the range of frequencies we get; other values may be used, but 10000 is the most common one. The ‘2i/d’ in its exponent acts as a scaling factor that runs from 0 to just under 1. This is what allows for fine- vs. coarse-grained control: early entries oscillate quickly and separate neighbouring tokens, while later entries oscillate slowly and indicate the general region of the sentence.
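One way to see the effect of the denominator is to compute how many tokens it takes for each sine/cosine pair to complete one full cycle. The sketch below is only an illustration: the helper name wavelengths is made up for this example, and the tiny d = 8 is chosen just to keep the output short.

```python
import math

def wavelengths(d, base=10000.0):
    """Wavelength, in tokens, of each sine/cosine pair i = 0 .. d/2 - 1."""
    # sin(pos / base**(2i/d)) completes one cycle when pos grows by 2*pi*base**(2i/d).
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

for i, w in enumerate(wavelengths(d=8)):
    print(f"pair i={i}: repeats every {w:.1f} tokens")
```

With d = 8, the four pairs repeat roughly every 6.3, 62.8, 628.3 and 6283.2 tokens. The first pair changes noticeably between neighbouring tokens (fine-grained), while the last pair barely moves over an entire sentence (coarse-grained).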