
Finish. That. Idiom. NLP exercise.

Given this dataset

[Screenshot of the idiom spreadsheet]

We need to predict the final word of each idiom when given the first three words.

Why?

Because it was an assignment for my Discrete Mathematics class.

Data Prep

To make this work, we need to encode each word into a format our neural network can understand. This is done through one-hot encoding, where each unique word is represented by a binary vector. For instance, the word "Read" might be encoded as [1, 0, 0, ...] while another word like "hand" might be [0, 1, 0, ...]. This encoding is essential for numerical processing in machine learning.

It's a pretty simple function:

import numpy as np

def one_hot_encode(word, vocabulary):
    # vocabulary maps each unique word to its index in the vector
    vector = np.zeros(len(vocabulary))
    if word in vocabulary:
        vector[vocabulary[word]] = 1
    return vector

We use this function to build two lists: one containing the encoded vectors for the first three words of each idiom, and another containing the vectors for the last word. We'll use these later for training. (The process_input helper is sketched right after this code.) You can see this code:

feature_train = []
output_train = []

# d holds the tab-separated rows of the dataset
for line in d:
    line = line.split("\t")
    first_three = [line[1], line[2], line[3]]  # first three words of the idiom
    final_word = line[4]                       # the word we want to predict

    feature_train.append(process_input(first_three, vocabulary))
    output_train.append(one_hot_encode(final_word, vocabulary))
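
The process_input function itself isn't shown here; a minimal sketch, assuming it simply concatenates the one-hot vectors of the three input words into one long feature vector:

def process_input(words, vocabulary):
    # Concatenate the one-hot vectors of the three input words
    # into a single flat vector of length 3 * len(vocabulary).
    return np.concatenate([one_hot_encode(w, vocabulary) for w in words])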

TensorFlow

We created a neural network with two hidden layers (64 and 32 neurons). This setup strikes a balance between capturing enough complexity to recognize word patterns and keeping things efficient. The output layer uses a softmax activation to produce a probability for each word in our vocabulary.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64, input_shape=(input_size,), activation='relu'),  # first hidden layer
    layers.Dense(32, activation='relu'),                             # second hidden layer
    layers.Dense(output_size, activation='softmax')                  # one probability per vocabulary word
])

We followed this up by training on the feature_train and output_train lists.
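
The compile and fit calls aren't shown above; here is a minimal sketch, assuming categorical cross-entropy loss (the usual pairing with a softmax output) and an arbitrary optimizer and epoch count rather than the original settings:

X_train = np.array(feature_train)
y_train = np.array(output_train)

# Categorical cross-entropy matches the one-hot targets and softmax output;
# 'adam' and 100 epochs are assumptions, not the original hyperparameters.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, verbose=0)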

Predicting

To predict the next word, we pass the three input words, encoded in the same one-hot format, into the model. The model outputs a probability for each word in the vocabulary, and the word with the highest probability becomes our predicted fourth word.

For example, if the input is ['Read', 'between', 'the'], the model might predict the next word as "lines". The predict_next_word function handles this by using the model's output probabilities to select the most likely word.

def predict_next_word(model, words, vocabulary):
    X = process_input(words, vocabulary)
    X = np.expand_dims(X, axis=0)            # add a batch dimension
    prediction = model.predict(X)
    predicted_index = np.argmax(prediction)  # index of the most probable word

    # invert the vocabulary so we can map the index back to a word
    reverse_vocab = {v: k for k, v in vocabulary.items()}
    return reverse_vocab[predicted_index]


# user input is the first 3 words and the output will be the final word
words = ['Read', 'between', 'the']

print("Idiom:", " ".join(words),"...")
final_word = predict_next_word(model, words, vocabulary)
print("Prediction:", final_word)

Test it for yourself

Run it on Google Colab.