Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you’ve ever picked up a language that wasn’t your mother tongue, you’ll relate to this! There are so many layers to peel off and syntaxes to consider – it’s quite a challenge.
And it’s exactly the same with our machines. In order to get our computer to understand any text, we need to break that text down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in.
Simply put, we can’t work with text data if we don’t perform tokenization. Yes, it’s really that important!
And here’s the intriguing thing about tokenization – it’s not just about breaking down the text. Tokenization plays a significant role in dealing with text data. So in this article, we will explore the depths of tokenization in Natural Language Processing and how you can implement it in Python.
I recommend taking some time to go through the below resource if you’re new to NLP:
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers.
Tokens are the building blocks of Natural Language.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
For example, consider the sentence: “Never give up”.
The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.
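As a minimal sketch, Python's built-in str.split reproduces these three word tokens. The variable names are only for illustration:

```python
# Whitespace-based word tokenization using only built-in string methods.
sentence = "Never give up"

# split() with no arguments splits on runs of whitespace.
word_tokens = sentence.split()
print(word_tokens)  # ['Never', 'give', 'up']
```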
Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:
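Here is a small sketch of both views of the word. The character tokens come straight from list(); the subword split into "smart" and "er" is hand-picked for illustration and is not the output of a trained subword tokenizer:

```python
word = "smarter"

# Character tokenization: every character becomes a token.
char_tokens = list(word)
print(char_tokens)  # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokenization: a hand-picked split (assumption, for illustration only).
subword_tokens = [word[:5], word[5:]]
print(subword_tokens)  # ['smart', 'er']
```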
But then is this necessary? Do we really need tokenization to do all of this?
As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
For example, Transformer-based models – the State of The Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.
An RNN, for instance, receives and processes one token at each timestep.
Hence, tokenization is the first step in modeling text data. Tokenization is performed on the corpus to obtain tokens. These tokens are then used to prepare a vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that the vocabulary can be constructed by considering every unique token in the corpus or by considering only the top K most frequently occurring words.
Creating Vocabulary is the ultimate goal of Tokenization.
One of the simplest hacks to boost the performance of the NLP model is to create a vocabulary out of top K frequently occurring words.
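To make the two ways of building a vocabulary concrete, here is a rough sketch using collections.Counter. The toy corpus and the value of K are made up for illustration:

```python
from collections import Counter

# Toy corpus (hypothetical); in practice this would be the training text.
corpus = "the cat sat on the mat the cat ran"
tokens = corpus.split()

# Full vocabulary: every unique token in the corpus.
full_vocab = set(tokens)

# Restricted vocabulary: only the K most frequent tokens.
K = 2
token_counts = Counter(tokens)
top_k_vocab = [token for token, _ in token_counts.most_common(K)]

print(len(full_vocab))  # 6
print(top_k_vocab)      # ['the', 'cat']
```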
Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.
As discussed earlier, tokenization can be performed on word, character, or subword level. It’s a common question – which Tokenization should we use while solving an NLP task? Let’s address this question here.
Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter. Depending upon the delimiter, different word-level tokens are formed. Pretrained Word Embeddings such as Word2Vec and GloVe come under word tokenization.
But, there are a few drawbacks to this.
Drawbacks of Word Tokenization
One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words refer to the new words which are encountered at testing. These new words do not exist in the vocabulary. Hence, these methods fail in handling OOV words.
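A quick sketch of what an OOV word looks like in practice. Both sentences are invented for illustration; the point is only that a word-level vocabulary has no entry for a word it never saw during training:

```python
# Vocabulary built from hypothetical training text.
train_tokens = "never give up on your dreams".split()
vocabulary = set(train_tokens)

# Text seen at test time; "hopes" never appeared during training.
test_tokens = "never give up on your hopes".split()

# Any test token missing from the vocabulary is out of vocabulary.
oov_tokens = [token for token in test_tokens if token not in vocabulary]
print(oov_tokens)  # ['hopes']
```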
But wait – don’t jump to any conclusions yet!
Another issue with word tokens is connected to the size of the vocabulary. Generally, pre-trained models are trained on a large volume of the text corpus. So, just imagine building the vocabulary with all the unique words in such a large corpus. This explodes the vocabulary!
This opens the door to Character Tokenization.
Character Tokenization splits a piece of text into a set of characters. It overcomes the drawbacks of Word Tokenization that we saw above.
Drawbacks of Character Tokenization
Character tokens solve the OOV problem but the length of the input and output sentences increases rapidly as we are representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between the characters to form meaningful words.
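The growth in sequence length is easy to see even on a tiny input. In this sketch (spaces are kept as tokens, which is an assumption; some setups drop or replace them), the same sentence is over four times longer at the character level:

```python
sentence = "Never give up"

word_tokens = sentence.split()
char_tokens = list(sentence)  # spaces are kept as tokens in this sketch

print(len(word_tokens))  # 3
print(len(char_tokens))  # 13
```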
This brings us to another approach known as Subword Tokenization, which sits in between Word and Character tokenization.
Subword Tokenization splits the piece of text into subwords (or n-gram characters). For example, words like lower can be segmented as low-er, smartest as smart-est, and so on.
Transformer-based models – the SOTA in NLP – rely on Subword Tokenization algorithms for preparing the vocabulary. Now, I will discuss one of the most popular Subword Tokenization algorithms, known as Byte Pair Encoding (BPE).
Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-based models. BPE addresses the issues of Word and Character Tokenizers:
BPE is a word segmentation algorithm that merges the most frequently occurring character or character sequences iteratively. Here is a step by step guide to learn BPE.
We will understand the steps with an example.
Consider a corpus:
1a) Append the end of the word (say </w>) symbol to every word in the corpus:
1b) Tokenize words in a corpus into characters:
2. Initialize the vocabulary:
Iteration 1:
3. Compute frequency:
4. Merge the most frequent pair:
5. Save the best pair:
Repeat steps 3-5 for every iteration from now. Let me illustrate for one more iteration.
Iteration 2:
3. Compute frequency:
4. Merge the most frequent pair:
5. Save the best pair:
After 10 iterations, the BPE merge operations look like:
Pretty straightforward, right?
But, how can we represent the OOV word at test time using BPE learned operations? Any ideas? Let’s answer this question now.
At test time, the OOV word is split into sequences of characters. Then the learned operations are applied to merge the characters into larger known symbols.
– Neural Machine Translation of Rare Words with Subword Units, 2016
Here is a step by step procedure for representing OOV words:
Let’s see all this in action next!
We are now aware of how BPE works – learning the merge operations and applying them to OOV words. So, it’s time to implement our knowledge in Python.
The Python code for BPE is already available in the original paper itself (Neural Machine Translation of Rare Words with Subword Units, 2016).
Reading Corpus
We’ll consider a simple corpus to illustrate the idea of BPE. Nevertheless, the same idea applies to any other corpus as well:
Text Preparation
Tokenize the words into characters in the corpus and append </w> at the end of every word:
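A minimal sketch of this preparation step. The helper name prepare_word and the four-word corpus are my own stand-ins, not something taken from the original code:

```python
def prepare_word(word):
    """Split a word into space-separated characters and append the end-of-word symbol."""
    return " ".join(word) + " </w>"

# Toy corpus (assumption): a stand-in for the text read from file.
corpus = "low lower newest widest"
prepared = [prepare_word(word) for word in corpus.split()]
print(prepared)
# ['l o w </w>', 'l o w e r </w>', 'n e w e s t </w>', 'w i d e s t </w>']
```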
Compute the frequency of each word in the corpus:
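One way to do this is with collections.Counter. The corpus below is an assumed toy example (5 x "low", 2 x "lower", 6 x "newest", 3 x "widest"), so treat the counts as illustrative:

```python
from collections import Counter

# Assumed toy corpus: each word is repeated a fixed number of times.
corpus = "low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3

# Represent each word as space-separated characters plus </w>, then count.
vocab = Counter(" ".join(word) + " </w>" for word in corpus.split())
print(dict(vocab))
# {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
```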
Output:
Let’s define a function to compute the frequency of each pair of adjacent characters or character sequences. It accepts the corpus and returns the pairs with their frequencies:
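A sketch of such a function, written along the lines of the reference code in the paper. The name get_stats and the toy vocabulary are assumptions here; each pair is weighted by the frequency of the word it occurs in:

```python
from collections import defaultdict

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

# Assumed toy vocabulary: space-separated symbols mapped to word counts.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
pairs = get_stats(vocab)
print(pairs[("e", "s")])  # 9
print(pairs[("l", "o")])  # 7
```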
Now, the next task is to merge the most frequent pair in the corpus. We will define a function that accepts the corpus, best pair, and returns the modified corpus:
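Again as a sketch in the spirit of the paper's reference code (the name merge_vocab and the toy vocabulary are assumptions), the merge can be done with a regular expression that only matches whole symbols:

```python
import re

def merge_vocab(pair, vocab_in):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    bigram = re.escape(" ".join(pair))
    # The lookarounds ensure we match whole symbols, not parts of longer ones.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab_in.items()}

# Assumed toy vocabulary: space-separated symbols mapped to word counts.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merged = merge_vocab(("e", "s"), vocab)
print(merged)
# {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
```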
Next, it’s time to learn the BPE operations. As BPE is an iterative procedure, we will carry out and understand the steps for one iteration. Let’s compute the frequency of bigrams:
Output:
Find the most frequent pair:
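Picking the winner is a one-liner with max(). The pair counts below are typed in by hand for illustration rather than computed from a corpus:

```python
# Pair frequencies as they might look after counting (values assumed for illustration).
pairs = {("l", "o"): 7, ("o", "w"): 7, ("w", "e"): 8, ("e", "s"): 9, ("s", "t"): 9}

# max() with key=pairs.get returns the pair with the highest count;
# on ties it keeps the first such pair in dictionary order.
best = max(pairs, key=pairs.get)
print(best)  # ('e', 's')
```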
Output: (‘e’, ‘s’)
Finally, merge the best pair and save to the vocabulary:
Output:
We will follow similar steps for a fixed number of iterations:
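Putting the pieces together, a complete learning loop might look like the sketch below. It follows the structure of the reference code in the paper; the function names, the toy vocabulary, and the choice of 10 merges are assumptions for this example:

```python
import re
from collections import defaultdict

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_vocab(pair, vocab_in):
    """Merge every whole-symbol occurrence of the pair in the vocabulary."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab_in.items()}

# Assumed toy vocabulary: space-separated symbols mapped to word counts.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10
merges = []  # learned merge operations, in the order they were learned
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:  # nothing left to merge
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w'),
#  ('n', 'e'), ('ne', 'w'), ('new', 'est</w>'), ('low', '</w>'), ('w', 'i')]
```

The order of the merges matters: it is this ordered list, not just the final vocabulary, that gets replayed on new words later.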
Output:
The most interesting part is yet to come! That’s applying BPE to OOV words.
Applying BPE to OOV word
Now, we will see how to segment an OOV word into subwords using the learned operations. Take the OOV word to be “lowest”:
Applying BPE to an OOV word is also an iterative process. We will implement the steps discussed earlier in the article:
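Here is a simplified sketch of that process. It replays the merge operations in the order they were learned, which is one common way to apply BPE and may differ in detail from the procedure described earlier. The function name apply_bpe and the hard-coded merge list (assumed to come from a toy corpus) are my own:

```python
def apply_bpe(word, merges):
    """Segment a word by replaying learned merge operations in order."""
    # Start from characters plus the end-of-word symbol.
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Merge the pair wherever the two symbols sit next to each other.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Merge operations assumed to have been learned from a toy corpus, in learned order.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(apply_bpe("lowest", merges))  # ['low', 'est</w>']
```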
Output:
As you can see here, the unknown word “lowest” is segmented as low-est.
Tokenization is a powerful way of dealing with text data. We saw a glimpse of that in this article and also implemented tokenization using Python.
Go ahead and try this out on any text-based dataset you have. The more you practice, the better your understanding of how tokenization works (and why it’s such a critical NLP concept). Feel free to reach out to me in the comments below if you have any queries or thoughts on this article.