I love being a data scientist working in Natural Language Processing (NLP) right now. The breakthroughs and developments are occurring at an unprecedented pace. From the super-efficient ULMFiT framework to Google’s BERT, NLP is truly in the midst of a golden era.
And at the heart of this revolution is the concept of the Transformer. This has transformed the way we data scientists work with text data – and you’ll soon see how in this article.
Want an example of how useful the Transformer is? Take a look at the paragraph below:
The highlighted words refer to the same person – Griezmann, a popular football player. It’s not that difficult for us to figure out the relationships among such words spread across the text. However, it is quite an uphill task for a machine.
Capturing such relationships, along with the order of words in a sentence, is vital for a machine to understand a natural language. This is where the Transformer concept plays a major role.
Note: This article assumes a basic understanding of a few deep learning concepts:
Sequence-to-sequence (seq2seq) models in NLP are used to convert sequences of Type A to sequences of Type B. For example, translation of English sentences to German sentences is a sequence-to-sequence task.
Recurrent Neural Network (RNN) based sequence-to-sequence models have garnered a lot of traction ever since they were introduced in 2014. Much of the data we deal with today comes in the form of sequences: number sequences, text sequences, video frame sequences or audio sequences.
The performance of these seq2seq models was further enhanced with the addition of the Attention Mechanism in 2015. How quickly advancements in NLP have been happening in the last 5 years – incredible!
These sequence-to-sequence models are pretty versatile and they are used in a variety of NLP tasks, such as:
Let’s take a simple example of a sequence-to-sequence model. Check out the below illustration:
The above seq2seq model is converting a German phrase to its English counterpart. Let’s break it down:
Despite being so good at what they do, seq2seq models with attention have certain limitations:
The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. The Transformer was proposed in the paper Attention Is All You Need. It is recommended reading for anyone interested in NLP.
Quoting from the paper:
“The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.”
Here, “transduction” means the conversion of input sequences into output sequences. The idea behind the Transformer is to handle the dependencies between input and output with attention alone, dispensing with recurrence completely.
Let’s take a look at the architecture of the Transformer below. It might look intimidating but don’t worry, we will break it down and understand it block by block.
The above image is a superb illustration of Transformer’s architecture. Let’s first focus on the Encoder and Decoder parts only.
Now focus on the below image. Each Encoder block has one Multi-Head Attention layer followed by a Feed Forward Neural Network layer. The decoder, on the other hand, has an extra Masked Multi-Head Attention layer.
The encoder and decoder blocks are actually multiple identical encoders and decoders stacked on top of each other. Both the encoder stack and the decoder stack have the same number of units.
The number of encoder and decoder units is a hyperparameter. In the paper, 6 encoders and decoders have been used.
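Here is a minimal sketch of how data moves through the two stacks. The layer functions are hypothetical placeholders that only record the order in which they run (a real layer would apply attention and a feed-forward network), and the toy source and target tokens are made up for illustration:

```python
NUM_LAYERS = 6  # the value used in the paper; in general this is a hyperparameter

trace = []  # records the order in which the layers are applied


def encoder_layer(x, i):
    # Placeholder: a real encoder applies self-attention and a feed-forward network.
    trace.append(f"encoder {i + 1}")
    return x


def decoder_layer(y, encoder_output, i):
    # Placeholder: a real decoder also attends over encoder_output.
    trace.append(f"decoder {i + 1}")
    return y


def run_stacks(source, target, num_layers=NUM_LAYERS):
    # Each encoder feeds its output to the next encoder in the stack.
    x = source
    for i in range(num_layers):
        x = encoder_layer(x, i)
    encoder_output = x

    # Every decoder receives the output of the final encoder.
    y = target
    for i in range(num_layers):
        y = decoder_layer(y, encoder_output, i)
    return y


run_stacks(source=["src_1", "src_2"], target=["tgt_1", "tgt_2"])
print(trace)
```

The point to notice is that the encoder stack runs to completion first, and its final output is shared by all the decoders.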
Let’s see how this setup of the encoder and the decoder stack works:
An important thing to note here: in addition to the self-attention and feed-forward layers, the decoders also have an Encoder-Decoder Attention layer. This helps the decoder focus on the appropriate parts of the input sequence.
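To make the Encoder-Decoder Attention idea concrete, here is a rough NumPy sketch. The queries come from the decoder, while the keys and values come from the encoder output, so each target position gets a weighting over the source positions. The dimensions, the random weights and the function name `encoder_decoder_attention` are assumptions for illustration, not the paper's code:

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encoder_decoder_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    Q = decoder_states @ W_q    # queries: one per target position
    K = encoder_outputs @ W_k   # keys: one per source position
    V = encoder_outputs @ W_v   # values: one per source position
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights


rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 5, 3, 8  # toy sizes, chosen arbitrarily
encoder_outputs = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = encoder_decoder_attention(
    decoder_states, encoder_outputs, W_q, W_k, W_v
)
print(weights.shape)  # (3, 5): each target position weights all 5 source positions
```

Each row of `weights` sums to 1 and tells us how much a given output position "looks at" each input position.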
You might be thinking – what exactly does this “Self-Attention” layer do in the Transformer? Excellent question! This is arguably the most crucial component in the entire setup so let’s understand this concept.
According to the paper:
“Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.”
Take a look at the above image. Can you figure out what the term “it” in this sentence refers to?
Is it referring to the street or to the animal? It’s a simple question for us but not for an algorithm. When the model is processing the word “it”, self-attention tries to associate “it” with “animal” in the same sentence.
Self-attention allows the model to look at the other words in the input sequence to get a better understanding of a certain word in the sequence. Now, let’s see how we can calculate self-attention.
I have divided this section into various steps for ease of understanding.
1. First, we need to create three vectors from each of the encoder’s input vectors:
These vectors are trained and updated during the training process. We’ll learn more about their roles once we are done with this section.
2. Next, we will calculate self-attention for every word in the input sequence.
3. Consider this phrase: “Action gets results”. To calculate the self-attention for the first word “Action”, we will calculate scores for all the words in the phrase with respect to “Action”. This score determines the importance of the other words when we are encoding a certain word in an input sequence.
So, z1 is the self-attention vector for the first word of the input sequence “Action gets results”. We can get the vectors for the rest of the words in the input sequence in the same fashion:
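The whole calculation can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention as described in the paper: the scores are dot products of queries and keys, divided by the square root of the key dimension, passed through a softmax and used to weight the value vectors. The embedding size, the random input vectors and the random weight matrices are assumptions made just for the demo (in a real model they are learned):

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q  # query vectors, one row per word
    K = X @ W_k  # key vectors
    V = X @ W_v  # value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # score of every word with respect to every word
    weights = softmax(scores, axis=-1)   # each row sums to 1
    Z = weights @ V                      # weighted sum of the value vectors
    return Z, weights


rng = np.random.default_rng(0)
words = ["Action", "gets", "results"]
d_model, d_k = 8, 4  # toy sizes, chosen arbitrarily
X = rng.normal(size=(len(words), d_model))  # stand-in for the word embeddings
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Z, weights = self_attention(X, W_q, W_k, W_v)
z1 = Z[0]  # self-attention vector for "Action"
print(weights[0])  # how much "Action" attends to each of the three words
```

Because everything is expressed as matrix products, the vectors for all the words are computed in one go rather than one word at a time.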
Self-attention is computed not once but multiple times in the Transformer’s architecture, in parallel and independently. It is therefore referred to as Multi-head Attention. The outputs are concatenated and linearly transformed as shown in the figure below:
According to the paper “Attention Is All You Need”:
“Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.”
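Here is a rough sketch of that idea: several attention heads, each with its own projection matrices, run independently on the same input, and their outputs are concatenated and passed through one more linear transformation. The number of heads, the sizes and the random weights (including the output matrix `W_o`) are assumptions for illustration only:

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(X, heads, W_o):
    outputs = []
    for W_q, W_k, W_v in heads:
        # Each head computes self-attention independently of the others.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    concat = np.concatenate(outputs, axis=-1)  # join the heads side by side
    return concat @ W_o                        # final linear transformation


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 8, 2  # toy sizes, chosen arbitrarily
d_head = d_model // num_heads
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(num_heads)
]
W_o = rng.normal(size=(num_heads * d_head, d_model))
X = rng.normal(size=(seq_len, d_model))

out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (3, 8): same shape as the input
```

The loop over heads is only for readability; real implementations usually compute all heads in a single batched matrix operation.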
You can access the code to implement Transformer here.
Transformer is undoubtedly a huge improvement over the RNN based seq2seq models. But it comes with its own share of limitations:
So how do we deal with these pretty major issues? That’s the question folks who worked with Transformer asked. And out of this came Transformer-XL.
Transformer architectures can learn longer-term dependency. However, they can’t stretch beyond a certain level due to the use of fixed-length context (input text segments). A new architecture was proposed to overcome this shortcoming in the paper – Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.
In this architecture, the hidden states obtained in previous segments are reused as a source of information for the current segment. It enables modeling longer-term dependency as the information can flow from one segment to the next.
Think of language modeling as a process of estimating the probability of the next word given the previous words.
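As a tiny illustration of "probability of the next word", here is a count-based sketch. It is deliberately much simpler than a Transformer: it conditions on only the single previous word, whereas a Transformer language model conditions on all the previous words it can see. The toy corpus is made up for the demo:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each previous word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1


def next_word_probs(prev):
    # Turn the raw counts into a probability distribution over next words.
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}


probs = next_word_probs("the")
print(probs)  # "cat" follows "the" twice and "mat" once
```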
Al-Rfou et al. (2018) proposed the idea of applying the Transformer model for language modeling. As per the paper, the entire corpus can be split into fixed-length segments of manageable sizes. Then, we train the Transformer model on the segments independently, ignoring all contextual information from previous segments:
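The segmentation step can be sketched as follows. The helper name `split_into_segments` and the tiny segment length are assumptions for illustration; the integers stand in for token ids:

```python
def split_into_segments(tokens, segment_len):
    # Chop the corpus into consecutive, non-overlapping chunks.
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


tokens = list(range(10))  # stand-in for a tokenized corpus
segments = split_into_segments(tokens, segment_len=4)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Notice that when the model processes token 4, it has no access to tokens 0 to 3, even though they come right before it in the corpus. That is the context fragmentation problem.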
This architecture doesn’t suffer from the problem of vanishing gradients. But the context fragmentation limits its longer-term dependency learning. During the evaluation phase, the segment is shifted to the right by only one position. The new segment has to be processed entirely from scratch. This evaluation method is unfortunately quite compute-intensive.
During the training phase in Transformer-XL, the hidden state computed for the previous segment is used as an additional context for the current segment. This recurrence mechanism of Transformer-XL takes care of the limitations of using a fixed-length context.
During the evaluation phase, the representations from the previous segments can be reused instead of being computed from scratch (as is the case of the Transformer model). This, of course, increases the computation speed manifold.
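Here is a simplified sketch of the recurrence idea for a single attention layer. The queries come from the current segment only, while the keys and values are computed over the cached states of the previous segment concatenated with the current one. To keep it short, the sketch leaves out causal masking and the relative positional encodings that Transformer-XL also uses, and all sizes, weights and names are assumptions for illustration:

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_with_memory(segment, memory, W_q, W_k, W_v):
    # Extend the context with the cached states from the previous segment.
    if memory is None:
        context = segment
    else:
        context = np.concatenate([memory, segment], axis=0)
    Q = segment @ W_q   # queries: current segment only
    K = context @ W_k   # keys: memory + current segment
    V = context @ W_v   # values: memory + current segment
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights


rng = np.random.default_rng(0)
seg_len, d_model = 4, 8  # toy sizes, chosen arbitrarily
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
segments = [rng.normal(size=(seg_len, d_model)) for _ in range(3)]

memory = None
for segment in segments:
    out, weights = attend_with_memory(segment, memory, W_q, W_k, W_v)
    print(weights.shape)
    # Cache this segment's states for the next one (reused, not recomputed).
    memory = segment
```

The first segment attends over 4 positions, while the later ones attend over 8, since they can also see the cached segment.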
You can access the code to implement Transformer-XL here.
We all know how significant transfer learning has been in the field of computer vision. For instance, a deep learning model pre-trained on the ImageNet dataset can be fine-tuned for a new task and still give decent results on a relatively small labeled dataset.
Language model pre-training similarly has been quite effective for improving many natural language processing tasks (https://paperswithcode.com/paper/transformer-xl-attentive-language-models).
The BERT framework, a new language representation model from Google AI, uses pre-training and fine-tuning to create state-of-the-art models for a wide range of tasks. These tasks include question answering systems, sentiment analysis, and language inference.
BERT uses a multi-layer bidirectional Transformer encoder. Its self-attention layer performs self-attention in both directions. Google has released two variants of the model:
BERT achieves bidirectionality by pre-training on two tasks: Masked Language Model and Next Sentence Prediction. Let’s discuss these two tasks in detail.
BERT is pre-trained using the following two unsupervised prediction tasks.
According to the paper:
“The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.”
The Google AI researchers masked 15% of the words in each sequence at random. The task? To predict these masked words. A caveat here: the masked words were not always replaced by the [MASK] token, because [MASK] would never appear during fine-tuning.
So, the researchers used the below technique:
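The BERT paper reports the following split for the selected words: 80% of the time the word is replaced with [MASK], 10% of the time with a random word, and 10% of the time it is left unchanged. Here is a minimal sketch of that rule. It is a simplification: each word is selected independently with probability 0.15, and the function name, the toy sentence and the toy vocabulary are assumptions for illustration:

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means "nothing to predict here"
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # this position is not selected
        labels[i] = token  # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return masked, labels


tokens = "the man went to the store to buy a gallon of milk".split()
vocab = sorted(set(tokens))  # toy vocabulary built from the sentence itself
masked, labels = mask_tokens(tokens, vocab)
print(masked)
print(labels)
```

Positions whose label is None are never changed, so the model only has to make predictions at the selected positions.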
Generally, language models do not capture the relationship between consecutive sentences. BERT was pre-trained on this task as well.
For language model pre-training, BERT uses pairs of sentences as its training data. The selection of sentences for each pair is quite interesting. Let’s try to understand it with the help of an example.
Imagine we have a text dataset of 100,000 sentences and we want to pre-train a BERT language model using this dataset. That gives us 50,000 sentence pairs as training examples.
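In the BERT paper, for 50% of the pairs the second sentence is the actual next sentence (labeled IsNext), and for the other 50% it is a random sentence from the corpus (labeled NotNext). Here is a rough sketch of how such pairs could be built. The function name and the placeholder sentences are assumptions for illustration, and the sketch expects at least three sentences:

```python
import random


def make_nsp_pairs(sentences, seed=0):
    rng = random.Random(seed)
    pairs = []
    # Walk through the corpus two sentences at a time: N sentences give N / 2 pairs.
    for i in range(0, len(sentences) - 1, 2):
        first = sentences[i]
        if rng.random() < 0.5:
            # Keep the sentence that actually follows.
            pairs.append((first, sentences[i + 1], "IsNext"))
        else:
            # Pick a random sentence that is neither the first one nor its true successor.
            j = rng.randrange(len(sentences))
            while j in (i, i + 1):
                j = rng.randrange(len(sentences))
            pairs.append((first, sentences[j], "NotNext"))
    return pairs


sentences = [f"sentence {k}" for k in range(10)]  # placeholder corpus
pairs = make_nsp_pairs(sentences)
for pair in pairs:
    print(pair)
```

With 10 sentences we get 5 pairs, in the same way that 100,000 sentences give 50,000 pairs.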
Architectures like BERT demonstrate that unsupervised pre-training followed by task-specific fine-tuning is going to be a key element in many language understanding systems. Low resource tasks especially can reap huge benefits from these deep bidirectional architectures.
Below is a snapshot of a few NLP tasks where BERT plays an important role:
We should really consider ourselves lucky as so many state-of-the-art advancements are happening in NLP at such a rapid pace. Architectures like Transformers and BERT are paving the way for even more advanced breakthroughs to happen in the coming years.
I encourage you to implement these models and share your work in the comments section below. And if you have any feedback on this article or any doubts/queries, then do let me know and I will get back to you.