What Is Positional Encoding and What Is Its Mathematical Formula?
Why Do Transformers Need Positional Encoding?
Unlike RNNs or LSTMs, the Transformer doesn’t inherently process sequences in order. RNNs handle tokens one by one, remembering the previous token as they process the next. Transformers, however, process all tokens simultaneously using self-attention. This simultaneous processing is efficient but creates a dilemma:
How can the Transformer model know where each word is in the sentence if it processes them all at once?
Take this sentence, for example:
"Geoffrey loves exploring AI."
In a self-attention mechanism, the Transformer can understand relationships between words but not their positions. Without knowing the position, it might treat "Geoffrey" as the object rather than the subject of the sentence.
Thus, we need to encode positional information so the model understands the order in which words occur. This encoding of the position of every token in the sequence is called Positional Encoding.
How Positional Encoding Works
Now, let's put on our scientist hats and break it down. Positional encoding introduces a clever way to inject sequence information into the input embeddings. Here's the key idea:
Instead of feeding in plain word embeddings, we add positional embeddings, which are vectors representing the position of each token in the sequence.
We use sine and cosine functions to calculate these position vectors. Why? Because sine and cosine create patterns that are smooth, continuous, and allow the model to differentiate between nearby and distant words.
The positional encoding formula is as follows:
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$
where:
- $pos$: the position of the token in the sequence,
- $i$: the dimension index within the embedding vector,
- $d_{\text{model}}$: the dimensionality of the model (the embedding size).
These alternating sine and cosine functions ensure that each position in the sequence has a unique encoding, and the difference between encodings is meaningful.
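To make the formula concrete, here is a small worked example (using a toy embedding size of $d_{\text{model}} = 4$, chosen purely for illustration) for the token at position $pos = 1$:
$$
PE_{(1)} = \left[\sin\!\left(\tfrac{1}{10000^{0/4}}\right),\ \cos\!\left(\tfrac{1}{10000^{0/4}}\right),\ \sin\!\left(\tfrac{1}{10000^{2/4}}\right),\ \cos\!\left(\tfrac{1}{10000^{2/4}}\right)\right] \approx [0.841,\ 0.540,\ 0.010,\ 1.000]
$$
The first pair of dimensions oscillates quickly with position, while later pairs oscillate more slowly, which is what lets the model tell nearby and distant positions apart.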
Let’s take a step back and look at the idea visually. For the first word "Geoffrey" (position 0), you get the position vector [sin(0), cos(0), sin(0), cos(0), ...] = [0, 1, 0, 1, ...], and for the next word "loves" (position 1), the sine and cosine functions generate a different positional vector. As we go deeper into the sequence, these patterns provide the model with the necessary clues about token order.
Example Sentence: Positional Encoding in Action
Consider the sentence:
"Artificial Intelligence is transforming the world."
We tokenize it as:
- ["Artificial", "Intelligence", "is", "transforming", "the", "world"]
Without positional encoding, the model would treat each word independently of its position. But with positional encoding, each word gets a unique "position-aware" embedding. The sine and cosine functions generate different values for each word based on its position.
For example, "Artificial" (position 0) might get something like:
$$
[sin(0), cos(0), sin(0), cos(0)]
$$
While "transforming" (position 3) might get:
$$
[sin(3), cos(3), sin(3), cos(3)]
$$
The difference in these embeddings allows the Transformer to understand that "Artificial" comes before "transforming."
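Here is a minimal sketch in plain Python that reproduces these numbers (the embedding size of 4 and the word-level tokenization are illustrative assumptions, not what a real model would use):

```python
import math

def positional_encoding(pos, d_model=4):
    """Sinusoidal positional encoding for a single position."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even dimension: sine
        pe.append(math.cos(angle))  # odd dimension: cosine
    return pe

tokens = ["Artificial", "Intelligence", "is", "transforming", "the", "world"]
for pos, token in enumerate(tokens):
    print(pos, token, [round(v, 3) for v in positional_encoding(pos)])

# Position 0 ("Artificial")   -> [0.0, 1.0, 0.0, 1.0]
# Position 3 ("transforming") -> [0.141, -0.99, 0.03, 1.0]
```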
Implementing Positional Encoding in Python and PyTorch
Let’s bring this to life with some code, shall we? Below is an implementation of Positional Encoding in Python using PyTorch:
```python
import torch
import math

class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        # Create a long matrix to hold positional encodings
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # Apply sine to even indices in the embedding dimension
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices in the embedding dimension
        pe[:, 1::2] = torch.cos(position * div_term)
        # Shape: (max_len, 1, d_model) so it broadcasts over the batch dimension
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x has shape (seq_len, batch_size, d_model)
        # Add the positional encodings to the input
        return x + self.pe[:x.size(0), :]
```
In this code:
- `d_model`: the dimensionality of the input embeddings.
- `max_len`: the maximum sequence length we're allowing.
- `sin` and `cos`: we apply sine and cosine to alternating indices of the embedding dimension.
This module generates positional encodings and adds them to the input embeddings. The beauty of this is that, even though the model processes all tokens in parallel, it still understands their positions.
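As a quick sanity check, here is how the module might be used. The dimensions below are arbitrary, chosen just for illustration; note that this implementation expects input of shape `(seq_len, batch_size, d_model)`:

```python
d_model = 512
pos_encoder = PositionalEncoding(d_model)

# A dummy batch of embeddings: sequence length 10, batch size 2
token_embeddings = torch.randn(10, 2, d_model)
position_aware = pos_encoder(token_embeddings)

print(position_aware.shape)  # torch.Size([10, 2, 512])
```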
Conclusion: Positional Encoding—The Transformer’s Compass
Without Positional Encoding, the Transformer would be like a ship lost at sea, not knowing which way to go. But thanks to the mathematical elegance of sine and cosine, the model can now understand the sequential nature of language.
So the next time you’re using a Transformer for language tasks like translation or summarization, remember that beneath the layers of attention and neural magic lies a humble sinusoidal pattern, quietly guiding the model through the complexities of token order.
And that, my dear AI enthusiasts, is why we need positional encoding!