Positional Embeddings: An Intuitive Guide


Positional embeddings come up often in discussions about transformers, and can seem scary at first. In this article we will explore intuitively what they are, why they’re useful, and how they work!

Note: This article gets a bit math-heavy in order to explain the actual inner workings. You have been informed!

Setting the Stage

Let us say we have two sentences: “The cat chased the dog” and “The dog chased the cat”. In each case, which animal is being chased depends purely on the order of the words, even though the words themselves are the same. How do we teach a computer that word order matters?

A naive approach would be to treat the first animal we see as the one doing the chasing, but for a sentence like “The dog was chased by the cat” this approach fails.

Thus, to teach a computer that the order of words matters, we use positional embeddings.

This article is Part 1 of 2 in a series about Transformers and the “Attention Is All You Need” paper. The transformer architecture lies at the heart of almost every large language model today. It allows them to capture the highly complex relations found in natural language, which lets LLMs understand human speech extremely well. In this article, we will see how a computer might learn to understand the order of words. Positional embeddings are very tightly related to transformers, and it is beneficial to understand them before understanding transformers.

What is an Embedding?

A vector embedding, often simply called an embedding, can be considered a point in a high-dimensional space. Imagine a point in 3D space, but scaled to thousands of dimensions, not just three.

A vector with 1024 dimensions would look like [x1, x2, x3, …, x1024]

Why embeddings?

You might be asking yourself: if my end goal is to associate words with order, and some sense of what comes first, why can’t we assign numbers to the words directly? In the sentence “The cat chased the dog”, why can’t we just label the words as 1, 2, 3, 4, and 5? This may work, but it does not fully capture the meaning behind the words.

“Why doesn’t it capture the meaning?”

Labelling words with numbers does not work for two main reasons:

  1. Language isn't linear. The relationship between word 1 and word 2 is not the same as the one between word 99 and word 100, yet a plain count treats every step as the same +1. A single scalar per word cannot express these differences.

  2. It does not generalise. The labels grow without bound, so a sentence longer than anything the model was trained on produces position values the model has never seen, and nothing it has learned tells it how to interpret them.

Core Working

Building the Intuition

Before getting into the solution, let's clarify what we mean by controlling the "locations" of words (which we'll now call tokens). We need a way to represent a token's position in a sentence that captures two important aspects:

Coarse-Grained Location (General Location): This refers to the general region of the sentence where a token appears. Is it at the beginning, in the middle, or towards the end? This provides a broad sense of order. For example, we want to know if a token is generally in the first half or the second half of the sentence. This helps with understanding long-range dependencies (relationships between words that are far apart).

Fine-Grained Location (Specific Location): This refers to the precise position of the token relative to its immediate neighbors. Is it immediately before or after another specific token? This is essential for understanding the local syntactic structure and distinguishing between similar phrases. For example, "the cat chased the dog" vs. "the dog chased the cat" has the same coarse-grained structure (beginning, middle, end) but very different fine-grained structure.

Our goal is to create a system that allows the model to easily distinguish between both these levels of positional information. We need a way to encode both the general "neighborhood" a token is in and its exact address within that neighborhood.

This is where Sinusoidal Embeddings come into the picture. While Sinusoidal Embeddings aren’t the only way to create positional embeddings, this is what was used in the original Vaswani et al. paper on transformers.

Laying the Groundwork

The equation for positional embeddings is given by:

$$PE_{(pos, 2i)} = \sin \left( \frac{\text{pos}}{10000^{\frac{2i}{d}}} \right)$$

$$PE_{(pos, 2i+1)} = \cos \left( \frac{\text{pos}}{10000^{\frac{2i}{d}}} \right)$$

These may look scary, but let’s deconstruct them step by step. These equations define each value in the positional embedding: the value at index 2i of the vector comes from the sine formula, and the value at index 2i+1 comes from the cosine formula. To fill in these values we:

  1. Start with a blank vector of d dimensions, initially filled with zeros.

  2. Assign sine values to all even indices (0, 2, 4, …) and cosine values to all odd indices (1, 3, 5, …).

Therefore, every even index (2i) holds a sine value and every odd index (2i+1) holds a cosine value.
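The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the equations, not the code from any particular library; the function name `positional_embedding` and the `base` parameter (standing in for the constant 10000) are names chosen here for the example.

```python
import math

def positional_embedding(pos: int, d: int, base: float = 10000.0) -> list[float]:
    """Sinusoidal embedding for a single position, following the two equations."""
    if d % 2 != 0:
        raise ValueError("d must be even so sine and cosine slots pair up")
    pe = [0.0] * d                       # step 1: blank vector of size d
    for i in range(d // 2):              # i indexes the (sine, cosine) pair
        angle = pos / (base ** (2 * i / d))
        pe[2 * i] = math.sin(angle)      # even index 2i   -> sine
        pe[2 * i + 1] = math.cos(angle)  # odd index 2i+1  -> cosine
    return pe

vec = positional_embedding(pos=20, d=8)
print([round(x, 4) for x in vec])
```

A tiny `d=8` is used so the whole vector is easy to read; for position 0 every sine slot is 0 and every cosine slot is 1, which makes a handy sanity check.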

Let’s deconstruct the formula:

  1. The term ‘pos’ is a scalar representing the token’s index. (Token 1 has pos 0, token 2 has pos 1, and so on.)

  2. ‘i’ indexes the sine/cosine pair we are currently filling in: pair i occupies indices 2i and 2i+1 of the vector, so i runs from 0 to d/2 − 1.

  3. ‘d’ is the dimension of the model, generally the same as the embedding vector dimension.

  4. Denominator: This is responsible for the exponentially decreasing scale. As i grows, 10000^(2i/d) grows from 1 towards 10000, so the scaling factor 1/10000^(2i/d) falls from 1 towards 1/10000, and the later dimensions change very slowly with pos. The value 10000 acts as a hyperparameter that sets the range of frequencies we get; other values may be used, but 10000 is the most common choice. The ‘2i/d’ in its exponent acts as a scaling factor. This is what allows for fine- versus coarse-grained control.
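To make the role of the denominator concrete, here is a small sketch that prints the scaling factor 1/10000^(2i/d) for a few pair indices, along with the wavelength (how many positions one full sine cycle spans). It assumes d = 128 purely for illustration, and `inverse_frequencies` is a name made up for this example.

```python
import math

def inverse_frequencies(d: int, base: float = 10000.0) -> list[float]:
    """Scaling factor 1 / base**(2i/d) for each pair index i = 0 .. d/2 - 1."""
    return [1.0 / (base ** (2 * i / d)) for i in range(d // 2)]

d = 128
inv_freq = inverse_frequencies(d)

# Wavelength: number of positions covered by one full cycle of that pair.
wavelengths = [2 * math.pi / w for w in inv_freq]

for i in (0, 16, 32, 48, 63):
    print(f"i={i:2d}  scale={inv_freq[i]:.6f}  wavelength={wavelengths[i]:.1f} positions")
```

The first pair has a scale of exactly 1 (a cycle of about 6.28 positions), the pair at i = 32 has a scale of 1/100, and the last pair is close to 1/10000, so its cycle spans tens of thousands of positions.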


To put this in context:

  1. This is what the index graph looks like: a line from 0 to 128, one point for each dimension.

  2. Since the exponent is in the denominator, the scaling factor decreases exponentially. It falls off very rapidly as we approach the higher dimension indices. This factor is also called the inverse frequency.

  3. We multiply this by the position and pass the result as the argument to sin and cos to find the embedding. (Remember, half of the values are generated by sine and the other half by cosine.)

    Congratulations! We have just replaced the scalar 20 with a vector embedding. Now to find a use for it.

  4. Calculating the embedding for pos=100 (the token at position 100) in the same way results in:

  5. Overlaying the two, the graph becomes:

Thus we see that the waves start out very chaotic. This is because the inverse frequency is highest at the beginning, so the argument to sin and cos changes quickly from one dimension to the next; it then rapidly falls off and the values stabilise. This difference, with erratic values at the beginning and smoother values towards the end, is what gives us fine-grained versus coarse-grained control over where to find words.
Higher frequencies, i.e. high inverse frequencies (left side), change quickly from one position to the next and help distinguish nearby positions.
Lower frequencies, i.e. low inverse frequencies (right side), change slowly across the sentence and help capture the general location and long-range dependencies.
This is the final result of scaling it to 1000 tokens and 128 dims:
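A matrix of that shape (1000 positions by 128 dimensions) can be computed with a short sketch like the one below. It also compares how much the first and the last sine/cosine pair change between neighbouring and distant positions, which is one way to see the fine versus coarse behaviour in numbers. The helper names `embedding_matrix` and `change` are invented for this example.

```python
import math

def embedding_matrix(num_tokens: int, d: int, base: float = 10000.0) -> list[list[float]]:
    """Row `pos` holds the sinusoidal embedding of that position."""
    rows = []
    for pos in range(num_tokens):
        row = []
        for i in range(d // 2):
            angle = pos / (base ** (2 * i / d))
            row.extend((math.sin(angle), math.cos(angle)))  # slots 2i and 2i+1
        rows.append(row)
    return rows

pe = embedding_matrix(num_tokens=1000, d=128)

def change(a: int, b: int, dims: range) -> float:
    """Largest absolute difference between positions a and b over the given dims."""
    return max(abs(pe[a][j] - pe[b][j]) for j in dims)

fast, slow = range(0, 2), range(126, 128)  # first and last (sin, cos) pair
print("neighbours 100 vs 101, fast pair:", round(change(100, 101, fast), 4))
print("neighbours 100 vs 101, slow pair:", round(change(100, 101, slow), 6))
print("far apart  100 vs 900, slow pair:", round(change(100, 900, slow), 4))
```

The fast pair moves a lot between adjacent positions, while the slow pair barely moves between neighbours but shifts noticeably across hundreds of positions.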

What’s the Use Case?

What are its offerings?

Having a way to generate embeddings at scale comes with unique advantages. Some of the key advantages are:

  1. With sinusoidal embeddings, the embedding for position pos + k (where k is any offset) can be calculated as a linear transformation of the embedding for position pos. This means there's a matrix, Mₖ (which depends only on the offset k, not on pos), that you can multiply the original embedding by to get the shifted embedding: PE(pos + k) = Mₖ * PE(pos). This is the biggest advantage: Vaswani et al. chose these functions on the hypothesis that this would let the model easily learn to attend by relative position. We will not go into the depth of deriving this. Separately, because every embedding is computed from a formula, we do not have to learn or store a vector for each position, which reduces the memory load.

  2. It generalises well. The formula can produce an embedding for any position, so inputs longer than anything in the training data, like really long messages, still get valid embeddings.
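The linear-transformation property in point 1 can be checked numerically. The sketch below assumes the interleaved sine/cosine layout from the equations earlier. Under that layout Mₖ is block diagonal, with one 2×2 rotation per sine/cosine pair, which follows from the angle-addition identities for sin and cos. The function names `positional_embedding` and `shift` are illustrative only.

```python
import math

def positional_embedding(pos, d, base=10000.0):
    """Interleaved [sin, cos, sin, cos, ...] embedding for one position."""
    pe = []
    for i in range(d // 2):
        angle = pos / (base ** (2 * i / d))
        pe.extend((math.sin(angle), math.cos(angle)))
    return pe

def shift(pe, k, base=10000.0):
    """Apply M_k to an embedding without knowing which position it came from."""
    d = len(pe)
    out = []
    for i in range(d // 2):
        w = 1.0 / (base ** (2 * i / d))         # frequency of pair i
        s, c = pe[2 * i], pe[2 * i + 1]         # sin(w*pos), cos(w*pos)
        ck, sk = math.cos(w * k), math.sin(w * k)
        # 2x2 block [[ck, sk], [-sk, ck]] applied to the column (s, c)
        out.extend((s * ck + c * sk, -s * sk + c * ck))
    return out

d, pos, k = 16, 37, 5
shifted = shift(positional_embedding(pos, d), k)
direct = positional_embedding(pos + k, d)
print(max(abs(a - b) for a, b in zip(shifted, direct)))  # tiny: rounding error only
```

Note that `shift` never receives `pos`: the same blocks work for every starting position, which is what "depends only on the offset k" means in practice.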

How is it specifically used?

Positional embeddings are incredibly important to the transformer architecture, as highlighted in the “Attention Is All You Need” paper. We will be going over this in depth in the next article.

These embeddings are not limited to text, however: they are also used in image generation models like Stable Diffusion. These models progressively create an image from noise, and each denoising step (timestep) can be treated like a token position and given its own embedding. This helps us distinguish between different moments in the inference process and allows us to change how much we denoise per step.

Summary

Positional embeddings are a highly useful tool for encoding the order of words and how they are arranged. By computing them, we are able to teach this non-linear positional information to a neural network and open the doors to the realm of Generative AI.