A simple overview
Unlike images, which come as regular grids of pixel values, text is harder to work with. Sentences can be long or short. So what happens when we want to process individual characters, or whole words? How do we deal with that?
For example, take a word, say, ‘Dog’. How does it turn into a set of numbers that we can feed into a neural network? That is a big question already. And if we have ‘Dog’, what about ‘Cat’, and all the other words? Similarly, what happens when sentences have different lengths? How do we pad them? What if we have one set of words that we use for training, and another set of words that we actually want to predict on? How do we deal with the vocabulary tokens? How do we load the text, preprocess it, and set up our data so it can be fed into a neural network?
Fortunately, TensorFlow, Google’s open-source library, offers tools to process text for natural language processing. We know that neural networks generally deal in numbers: their functions, the calculation of weights, and the biases are all numbers. So how do we convert our text into numbers, and in a sensible way?
We’ll focus on text and see how to build a text classifier: a model that is trained on labeled text to find the sentiment in it, and that can then classify new text based on what it has seen.
In the above image, we could take a character encoding for each character in a set, for example the ASCII values. But will that help us understand the meaning of a word?
Consider the word ‘DEAR’ as shown here. A common simple character encoding is ASCII, the American Standard Code for Information Interchange, with the values as shown. So the word ‘DEAR’ is encoded using these values. The problem shows up with the word ‘READ’, which has a very different meaning but exactly the same letters, and therefore the same values in a different order. Training a neural network on just the letters looks like a daunting task. So instead we consider words: we give each word a value and use those values to train a network.
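To make the ‘DEAR’ versus ‘READ’ point concrete, here is a minimal Python sketch using the built-in ord function. The helper name ascii_codes is only illustrative and is not part of any library.

```python
def ascii_codes(word):
    """Return the ASCII code of each character in the word."""
    return [ord(ch) for ch in word]

dear = ascii_codes("DEAR")
read = ascii_codes("READ")

print(dear)  # [68, 69, 65, 82]
print(read)  # [82, 69, 65, 68]

# Both words use the same codes, only in a different order,
# so the codes by themselves carry nothing about meaning.
print(sorted(dear) == sorted(read))  # True
```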
Now, take this sentence as an example: ‘I love my daughter.’
Let’s put a value on each word. What that value is doesn’t matter; what matters is that we give one value per word, and that the same word gets the same value every time. A simple encoding for the sentence would give the word ‘I’ the value 1. Following on, we could give the words ‘love’, ‘my’ and ‘daughter’ the values 2, 3, and 4 respectively. The sentence ‘I love my daughter’ would then be encoded as 1, 2, 3, 4. Now, what if we have another sentence, ‘I love my son’? We’ve already encoded the words ‘I love my’ as 1, 2, 3, so we can reuse those and create a new token for the word ‘son’, which we haven’t seen before. Let’s give it the number 5.
Now if we look at the two sets of encodings, we can start to see some similarity between the sentences:
‘I love my daughter’ is 1, 2, 3, 4 and
‘I love my son’ is 1, 2, 3, 5.
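The hand encoding above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any library does it: the function name encode is hypothetical, and lowercasing the words (so that ‘Love’ and ‘love’ share one number) is an assumption.

```python
def encode(sentence, vocabulary):
    """Encode a sentence as a list of integers, extending the vocabulary as needed."""
    encoded = []
    for word in sentence.lower().split():  # lowercasing is an assumption
        if word not in vocabulary:
            # A word we have not seen gets the next unused number, starting at 1
            vocabulary[word] = len(vocabulary) + 1
        encoded.append(vocabulary[word])
    return encoded

vocabulary = {}
first = encode("I love my daughter", vocabulary)
second = encode("I love my son", vocabulary)

print(first)       # [1, 2, 3, 4]
print(second)      # [1, 2, 3, 5]
print(vocabulary)  # {'i': 1, 'love': 2, 'my': 3, 'daughter': 4, 'son': 5}
```

The two encodings differ only in the last number, which is the similarity between the sentences that a network can pick up on.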
So this is at least a beginning of how we can start training a neural network based on words. Fortunately, TensorFlow and Keras give us some APIs that make this very simple to do. The code is here:
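from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my daughter',
    'I, love my son',
]

# Limit the vocabulary to roughly the 100 most frequent words
tokenizer = Tokenizer(num_words=100)
# Build the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)
# index_word maps each number back to its word
index_word = tokenizer.index_word
print(index_word)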
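Notice that the second sentence contains a comma. As a rough pure-Python approximation of what we would expect from the tokenizer on these two sentences, the sketch below assumes that text is lowercased, punctuation is treated as whitespace, and words are numbered from most to least frequent. The function build_word_index is hypothetical and not part of Keras; the real Tokenizer has more options and may differ in details.

```python
import string
from collections import Counter

def build_word_index(texts):
    """Approximate a word-to-number mapping: lowercase, drop punctuation, rank by frequency."""
    # Replace every punctuation character with a space (an assumption of this sketch)
    to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(to_space).split())
    # most_common() orders by count; words with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = [
    'I love my daughter',
    'I, love my son',
]

word_index = build_word_index(sentences)
print(word_index)  # {'i': 1, 'love': 2, 'my': 3, 'daughter': 4, 'son': 5}
```

Under these assumptions ‘I,’ and ‘I’ end up as the same token, so the comma does not create a new word.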