20 Jun 2018

When words become vectors


In order to use the existing mathematical framework to analyse natural language, the language has to be represented in a form which suits this framework – as a set of numbers. One can think of several ways to represent natural language in the form of numbers: map every letter of the alphabet to a number, map every word in the vocabulary to a number (symbolic representation), map every word to a vector in a vector space (one-hot vectors), etc. Computational linguists have tried several of these approaches, but the one which has been the most useful, and the driving force behind the latest developments in NLP, is the distributed representation of words (and phrases) as vectors in a space of meaning (the semantic space). Here, distributed stands for the smearing of a word's meaning across all dimensions of a vector, as opposed to the localization of meaning to one particular dimension, as done by one-hot encoding. There is an excellent lecture[1] by Chris Manning where he explains the different ways one could represent words as vectors.

In this post, I will explain and implement the simplest model to find vector representations of words – the word2vec model. Introduced by T. Mikolov in 2013, this was the beginning of a new era in NLP. The idea is simple – given a word, we want to predict its surrounding words using a model. The words are represented as vectors, and the weights of the model are these vectors themselves. Hence, by training this model we obtain a set of weights which are the vector representations of the words. This post focuses on the implementation of word2vec; hence, if you are not familiar with the model, I would highly recommend watching the lecture[1] by Chris Manning. The post also assumes a decent grasp of the basic Python API of tensorflow.

Now that we have a basic understanding of the model, let's get into the specifics of its implementation in tensorflow.

Model Implementation

We need the following components to implement the core word2vec model in tensorflow and as we will see later, these components can be seen as layers of our model:

  1. The input word embeddings (weights): These weights, represented as a matrix \( \boldsymbol{W_i} \), transform the one-hot encoded word representation \(\boldsymbol{i}\) into the distributed word-vector representation \( \boldsymbol{v} \).

  2. The output word embeddings (sets of weights): This set of matrices \( \boldsymbol{W_{o_1}}, \boldsymbol{W_{o_2}}, … \) holds the embeddings of words when they appear in the context, i.e., as outputs. The number of matrices depends on the size of the context that we choose for the model. For this implementation I have chosen a context of two words, one on either side of the input word.

  3. Softmax: To transform the dot-product similarity scores into probabilities.

  4. Negative log-likelihood/ Cross-entropy loss: The loss function for training the model.
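Concretely, for one context position, the standard formulation of the last two components is: given the score vector \( \boldsymbol{s} = \boldsymbol{v}\,\boldsymbol{W_{o_1}} \) and the one-hot target \( \boldsymbol{o} \),

\[ p_j = \frac{e^{s_j}}{\sum_k e^{s_k}}, \qquad L = -\sum_j o_j \log p_j . \]

Since \( \boldsymbol{o} \) is one-hot, the loss reduces to the negative log-probability that the model assigns to the true context word.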


Note: The names in the code follow the names in the image. However, since the image is large, please right-click and open it in a new window to see the names of the matrices and operations.

import tensorflow as tf
embd_dim = 10  # The dimension of the vector space in which we represent the words
# vocab_size is computed in the preprocessing section below
# create placeholders to feed the input word and the two context words
I = tf.placeholder(tf.float32, shape=(None, vocab_size))
O1 = tf.placeholder(tf.float32, shape=(None, vocab_size))
O2 = tf.placeholder(tf.float32, shape=(None, vocab_size))
# Input and output embeddings
Wi = tf.get_variable("Wi", shape=(vocab_size, embd_dim))
Wo1 = tf.get_variable("Wo1", shape=(embd_dim, vocab_size))
Wo2 = tf.get_variable("Wo2", shape=(embd_dim, vocab_size))

# create the model
Ei = tf.matmul(I, Wi)     # embedding of the input word
So1 = tf.matmul(Ei, Wo1)  # scores (logits) for the first context word
So2 = tf.matmul(Ei, Wo2)  # scores (logits) for the second context word
# The explicit softmax layers are only needed at inference time;
# softmax_cross_entropy_with_logits applies the softmax internally.
# Po1 = tf.nn.softmax(So1)
# Po2 = tf.nn.softmax(So2)
loss1 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=So1, labels=O1), name="loss1")
loss2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=So2, labels=O2), name="loss2")
loss = tf.add(loss1, loss2, name="total_loss")
training_op = tf.train.AdamOptimizer().minimize(loss)

Preprocessing input and output

In order to train this model, we need a corpus of natural language data. I have taken the text about various colors from Wikipedia to form our toy training corpus[2]. You are free to use this data for your experiments.

Loading the data and creating the corpus

from pathlib import Path
from math import floor, ceil
import numpy as np
data_dir = Path('./data')
files = ['wiki_' + color for color in ['black', 'blue', 'brown', 'color', 'cyan', 'green',
                                       'grey', 'indigo', 'magenta', 'orange', 'pink', 'purple',
                                       'red', 'violet', 'white', 'yellow']]
strings = []
for file in files:
    with open(data_dir / file) as f:
        strings.append(f.read())

corpus = (' '.join(strings)).split()
vocab = set(corpus)
vocab_size = len(vocab)
print("Corpus length : {}".format(len(corpus)))
print("Vocab length : {}".format(len(vocab)))

Corpus length : 28210
Vocab length : 3508

Creating the input and output matrices

As shown in the diagram of the model, we need to create the input (\( \boldsymbol{I} \)) and the output (\( \boldsymbol{O_1}\), \( \boldsymbol{O_2}\)) matrices. The rows of these matrices are one-hot representations of the triplets of words which form our training samples. For instance, if a sample in our training data is (dark, blue, pen), then the corresponding rows in \( \boldsymbol{I} \), \( \boldsymbol{O_1} \) and \( \boldsymbol{O_2} \) will be the one-hot representations of the words blue, dark and pen respectively.

word_index = {}
index_word = {}
# create word indices
for i, word in enumerate(vocab):
    word_index[word] = i
    index_word[i] = word

# create input and output samples

def windows(corpus, window_len=3):
    """Creates windows of neighboring words from the corpus"""
    corpus_len = len(corpus)
    if corpus_len < window_len:
        raise ValueError("Corpus length cannot be smaller than window length")
    half = window_len // 2
    if window_len % 2:  # odd window: equal context on both sides
        pre_pad = half
        post_pad = half
    else:  # even window: one fewer word before the centre
        pre_pad = half - 1
        post_pad = half
    for i in range(pre_pad, corpus_len - post_pad):
        yield corpus[i - pre_pad: i + post_pad + 1]
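For a window of three words, each sample is simply a consecutive triple from the corpus. A minimal self-contained illustration of what the generator yields:

```python
corpus = "the quick brown fox jumps".split()

# Each word (except the first and last) is paired with one
# neighbor on either side, giving consecutive triples:
triples = [corpus[i - 1:i + 2] for i in range(1, len(corpus) - 1)]
for t in triples:
    print(t)
# ['the', 'quick', 'brown']
# ['quick', 'brown', 'fox']
# ['brown', 'fox', 'jumps']
```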


def get_one_hot(index, vocab_size):
    """Create one hot vectors for words"""
    zeros = np.zeros(vocab_size)
    zeros[index] = 1.
    return zeros
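As a quick sanity check, a one-hot vector can always be decoded back to its word index with argmax. Here is a toy example with a hypothetical vocabulary of five words:

```python
import numpy as np

vocab_size = 5
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0  # encode the word at index 2

print(one_hot)                  # [0. 0. 1. 0. 0.]
print(int(np.argmax(one_hot)))  # 2 -- argmax recovers the index
```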

# hardcoding for window_size = 3
vI = []
vO1 = []
vO2 = []

for window in windows(corpus):
    # hard coding for window_size=3
    vI.append(get_one_hot(word_index[window[1]], vocab_size))
    vO1.append(get_one_hot(word_index[window[0]], vocab_size))
    vO2.append(get_one_hot(word_index[window[2]], vocab_size))

vnpI = np.vstack(vI)
vnpO1 = np.vstack(vO1)
vnpO2 = np.vstack(vO2)
print("Shape of Input: {}".format(vnpI.shape))
print("Shape of Output-1: {}".format(vnpO1.shape))
print("Shape of Output-2: {}".format(vnpO2.shape))
Shape of Input: (28208, 3508)
Shape of Output-1: (28208, 3508)
Shape of Output-2: (28208, 3508)


Now it's time to train the model. We will create a generator which feeds batches of 64 samples during every step of the training loop.

batch_size = 64
epochs = 100
num_samples = vnpI.shape[0]
num_steps = epochs*ceil(num_samples/batch_size)

def shuffle_in_unison(a, b, c):
    assert len(a) == len(b) == len(c)
    p = np.random.permutation(len(a))
    return a[p], b[p], c[p]

def train_batch(I, O1, O2, batch_size=64, steps=1000):
    for i in range(steps):
        I_batch = np.take(
            I, range(batch_size*i, batch_size*(i+1)), axis=0, mode='wrap')
        O1_batch = np.take(
            O1, range(batch_size*i, batch_size*(i+1)), axis=0, mode='wrap')
        O2_batch = np.take(
            O2, range(batch_size*i, batch_size*(i+1)), axis=0, mode='wrap')
        yield shuffle_in_unison(I_batch, O1_batch, O2_batch)
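The mode='wrap' argument of np.take is what lets the generator run for more steps than there are batches in the data: indices past the end wrap around to the start of the array. A small self-contained illustration:

```python
import numpy as np

data = np.arange(10)
# a "batch" of 4 starting at index 8 wraps around to the beginning:
# indices 8, 9, 10, 11 become 8, 9, 0, 1
batch = np.take(data, range(8, 12), mode='wrap')
print(batch)  # [8 9 0 1]
```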

init = tf.global_variables_initializer()
loss_summary = tf.summary.scalar("loss", loss)
logdir = 'logs/word2vec'
with tf.Session() as sess:
    sess.run(init)  # run the initializer; without this the variables hold no values
    logwriter = tf.summary.FileWriter(logdir, tf.get_default_graph())
    for step, (i_batch, o1_batch, o2_batch) in enumerate(train_batch(vnpI, vnpO1, vnpO2, batch_size=batch_size, steps=num_steps)):
        sess.run(training_op, feed_dict={
                 I: i_batch, O1: o1_batch, O2: o2_batch})
        if step % 100 == 0:
            summary_str = loss_summary.eval(feed_dict={I: i_batch, O1: o1_batch, O2: o2_batch})
            logwriter.add_summary(summary_str, step)
    summary_str = loss_summary.eval(feed_dict={I: i_batch, O1: o1_batch, O2: o2_batch})
    logwriter.add_summary(summary_str, step)
    saver = tf.train.Saver([Wi, Wo1, Wo2])
    saver.save(sess, logdir)
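Once training is done, the rows of \( \boldsymbol{W_i} \) can be queried directly, for example to find a word's nearest neighbors by cosine similarity. The sketch below uses a random stand-in matrix for the learned embeddings; in practice you would obtain the real one with sess.run(Wi) inside the session above:

```python
import numpy as np

np.random.seed(0)
embeddings = np.random.randn(3508, 10)  # stand-in for the trained Wi matrix

def nearest(word_idx, k=5):
    """Indices of the k words most similar to word_idx by cosine similarity."""
    v = embeddings[word_idx]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v)
    sims = embeddings @ v / norms
    # position 0 of the sorted list is the word itself (similarity 1.0)
    return np.argsort(-sims)[1:k + 1]

print(nearest(42))  # five indices of the closest words
```

With the real embeddings, you would map these indices back to words via the index_word dictionary built earlier.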

Visualization of the results

We will use the excellent tensorboard plugin called embedding projector to visualize our word vectors. In order to do that, we need to save the words, in the order of their indices, in a tab-separated file.

with open(logdir + '/words.tsv', 'w') as words_file:
    for i in range(vocab_size):
        words_file.write(index_word[i] + '\n')
Use $ tensorboard --logdir logs/ to launch TensorBoard. It should display the graph of the loss function during our training.

Training Progress

To see the word embeddings, we need to open the Projector tab. If you do not see this tab, go to the Inactive dropdown menu and select Projector. This opens the projector plugin. Now we need to load the words.tsv file into the projector. Use the load button on the left panel to browse the file system and load the file.

Select the input matrix

We can use projection techniques like PCA and t-SNE[3] to project our 10-dimensional vectors into 3 dimensions. I have used t-SNE here. This article on Distill gives a very quick visual overview of t-SNE and its hyperparameters.

Visualization using t-SNE


In this post, we have seen a simple implementation of the word2vec model in tensorflow. However, it has to be noted that this is a bare-bones implementation, meant to demonstrate the concepts behind word2vec. It is by no means efficient or optimized for production use. There are several efficient and improved versions of this model which use techniques such as hierarchical softmax and negative sampling. There is a blog post by Sebastian Ruder which discusses these in greater detail. Moreover, if you are not dealing with domain-specific data, the best way to incorporate word vectors into production models is to use pretrained embeddings like Word2Vec by Google or GloVe by Stanford. These are trained on enormous amounts of data and are very good general-purpose representations of words. If, however, you are dealing with domain-specific data and have a big corpus which you want to use for training a word2vec model, then I would recommend Gensim, a special-purpose library designed to do just this.

The code used in this post is available on GitHub under the MIT License. You are free to use it in your projects. I would appreciate a mention of this post in your work if you do so.


[1] Lecture by Chris Manning

[2] Toy training data

[3] Maaten, Laurens van der, and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9.Nov (2008): 2579–2605.