Day 5 – Introduction to Natural Language Processing

⬅️ Day 4 – Image Data Generator

In the last chapter, we discussed how the ImageDataGenerator tool can automatically assign labels to images and how image transformations (augmentation) can increase the effective size of a dataset. You can check my GitHub repository for updates. Today we’ll see how artificial intelligence can be used to understand human language.

Natural language processing

Natural language processing (NLP) is a branch of artificial intelligence that gives computers the ability to understand text and spoken words. Some examples where we use this technology frequently are chatbots, spell checkers, autocomplete, voice-to-text messaging, etc. NLP helps organizations analyze thousands of text and voice customer interactions to improve their services or products.

Let’s first see how NLP decomposes language into numbers that computers can understand.

Encoding language into numbers

There are several ways to encode language into numbers. One way is to encode individual letters: for example, you could assign a number to the letter ‘a’. But when building models, a separate number for each character makes it harder to capture the meaning of the text, since meaning lives mostly at the word level. Therefore, instead of assigning numbers to characters, we assign numbers to words.

For example, we might assign the word cat the value x and the word dog the value y. Let’s take two sentences. The first one is ‘I love puppies’. We can encode this as [1, 2, 3]. The second sentence can be ‘I love cats’. This will be encoded as [1, 2, 4]. Since the first two numbers match, it is easy for a computer to see the similarity between the two sentences. This process of converting words into numbers is called tokenization.
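
To make this concrete, here is a minimal hand-rolled sketch of that idea in plain Python, before we bring in any Keras tooling (the variable and function names are just for illustration).

# A tiny hand-rolled word index: every new word gets the next number
word_index = {}

def encode(sentence):
    tokens = []
    for word in sentence.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
        tokens.append(word_index[word])
    return tokens

print(encode('I love puppies'))   # [1, 2, 3]
print(encode('I love cats'))      # [1, 2, 4]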

Tokenization

TensorFlow Keras contains a module called preprocessing that provides different tools for preparing data for machine learning. Tokenizer is a tool that converts words into numbers, which we call tokens. Let’s see how this actually works. I have taken the earlier example sentences to create tokens for each word in those sentences.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love puppies',
    'I love cats'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

With Tokenizer(num_words = 100), we create a new Tokenizer object and cap the number of words to keep, by frequency, at 100. Since we are using a small corpus, this value is more than enough. Calling fit_on_texts builds the tokenized word index. The printed result shows a set of key/value pairs for the words, as below.

{'i': 1, 'love': 2, 'puppies': 3, 'cats': 4}
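
One detail worth noting: num_words does not shrink the word index itself. fit_on_texts still records every word it sees, and the limit only applies later, when texts_to_sequences keeps just the most frequent words. Here is a small sketch of that behaviour, using a deliberately tiny num_words.

# word_index still lists all four words, but with num_words = 3 only the
# two most frequent words ('i' and 'love') survive in the sequences
small_tokenizer = Tokenizer(num_words = 3)
small_tokenizer.fit_on_texts(sentences)
print(small_tokenizer.word_index)                     # {'i': 1, 'love': 2, 'puppies': 3, 'cats': 4}
print(small_tokenizer.texts_to_sequences(sentences))  # [[1, 2], [1, 2]]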

Let’s add a question sentence as given below and see how the tokenizer will understand it.

sentences = [
    'I love puppies',
    'I love cats',
    'Do you like cats?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

You’ll see that the tokenizer is smart enough to strip the question mark, so ‘cats?’ is recognized as the existing word ‘cats’ rather than a new token. This is controlled by the filters parameter, which removes punctuation by default.

{'i': 1, 'love': 2, 'cats': 3, 'puppies': 4, 'do': 5, 'you': 6, 'like': 7}
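
If you ever want different behaviour, for example keeping question marks, you can pass your own filters string when creating the tokenizer. Here is a small sketch, assuming you want ‘?’ to survive (the default filter string minus the question mark).

# Remove '?' from the default filter string so question marks are kept;
# 'cats?' then becomes its own token, separate from 'cats'
keep_qmarks = Tokenizer(num_words = 100,
                        filters = '!"#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n')
keep_qmarks.fit_on_texts(sentences)
print(keep_qmarks.word_index)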

Next, let’s see how to encode sentences into sequences of numbers.

Turning sentences into sequences

The tokenizer has a method called texts_to_sequences which will encode your sentences into sequences of numbers. Let’s modify the earlier code to try out this method.

sentences = [
    'I love puppies',
    'I love cats',
    'Do you like cats?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

After running this code block, you’ll get the output below, with one sequence for each of the 3 sentences.

[[1, 2, 4], [1, 2, 3], [5, 6, 7, 3]]

When training a neural network, the training data will never cover 100% of the cases the model sees later. Similarly, in NLP there can be thousands of words in your training data, but when the neural network is shown new text containing words it has never seen, it can produce poor predictions. Let’s see how we can overcome this problem.

Out-of-vocabulary tokens

There will be situations where your neural network is fed text containing words that were not in its training vocabulary. To handle these unknown words, we use the out-of-vocabulary (OOV) token. Let’s see what happens if we add new sentences as test data and reuse the tokenizer we fitted earlier.

test_data = [
    'I love rabbits',
    'Would you like cats more?'
]

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

The result will be as follows.

{'i': 1, 'love': 2, 'cats': 3, 'puppies': 4, 'do': 5, 'you': 6, 'like': 7}
[[1, 2], [6, 7, 3]]

As you can see, when you swap the tokens back to words, the sentences have lost their meaning: the unknown words were simply dropped. You can overcome this issue by adding a new parameter called oov_token when creating the tokenizer object.

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

This is the output for the above code.

{'<OOV>': 1, 'i': 2, 'love': 3, 'cats': 4, 'puppies': 5, 'do': 6, 'you': 7, 'like': 8}
[[2, 3, 1], [1, 7, 8, 4, 1]]

You’ll notice that the token list now has a new item “<OOV>” and the test sentences keep their original lengths as well. If you reverse-encode the sentences now, they read “I love <OOV>” and “<OOV> you like cats <OOV>”. This represents the sentences far more faithfully than the earlier output, where unknown words were silently dropped.
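
If you want to check this yourself, you can build a reverse map from token numbers back to words and decode the test sequences. A minimal sketch:

# Invert the word index so each token number maps back to its word
reverse_word_index = {index: word for word, index in word_index.items()}

for seq in test_sequences:
    print(' '.join(reverse_word_index[token] for token in seq))

# i love <OOV>
# <OOV> you like cats <OOV>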

Understanding padding

You might remember that in the earlier chapters where we used images, we always rescaled them to the same width and height. Similarly, once we tokenize sentences, the resulting sequences will have different lengths. To get them to the same shape and size, we use padding.

Let’s add a longer sentence to the earlier example and try to understand this.

sentences = [
    'I love puppies',
    'I love cats',
    'I love rabbits',
    'Would you like cats more than dogs?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

When you sequence the above sentences, you’ll get lists of numbers with different lengths.

[[1, 2, 4], [1, 2, 3], [1, 2, 5], [6, 7, 8, 3, 9, 10, 11]]

To make these the same length, we use the pad_sequences API. Simply import it and call pad_sequences on your sequences to get a padded set.

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences)
print(padded)

You’ll get a nicely formatted set of sequences as below.

[[ 0  0  0  0  1  2  4]
 [ 0  0  0  0  1  2  3]
 [ 0  0  0  0  1  2  5]
 [ 6  7  8  3  9 10 11]]

The shorter sequences are padded at the front with the number 0, a value that is not used in the word index. You can explore this API further to learn other options for shaping your data.
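
As a starting point for that exploration, here is a small sketch of the options I find most useful: padding='post' puts the zeros at the end, maxlen caps the sequence length, and truncating decides which end gets cut when a sequence is too long.

# Pad at the end instead of the front, cap every sequence at 5 tokens,
# and drop tokens from the start of any sequence that is longer than that
padded = pad_sequences(sequences,
                       padding = 'post',
                       maxlen = 5,
                       truncating = 'pre')
print(padded)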

In the next chapter let’s learn another interesting aspect of Natural Language Processing. Until then, happy coding! 😃🔥

Day 6 – Understanding Sequence and Time Series Data ➡️