Using Word2Vec and t-SNE

Word2Vec is cool. So is t-SNE. But trying to figure out how to train a model and reduce the vector space can feel really, really complicated. While working on a sprint-residency at Bell Labs, Cambridge last fall – which has morphed into a project where live wind data blows a text through Word2Vec space – I wrote a set of Python scripts to make using these tools easier.

This tutorial is not meant to cover the ins-and-outs of how Word2Vec and t-SNE work, or about machine learning more generally. Instead, it walks you through the basics of how to train a model and reduce its vector space so you can move on and make cool stuff with it. (If you do make something awesome from this tutorial, please let me know!)

Above: a Word2Vec model trained on a large language dataset, showing the telltale swirls and blobs from the t-SNE reduction.

1. INSTALL REQUIRED LIBRARIES

First, you’ll need to install a few libraries to get things running. Luckily, unlike Torch or OpenCV, they’re really pretty easy to install using package managers like pip.

To train your Word2Vec model, you’ll need some plain text input for it to learn from. Larger files, like a Wikipedia dump*, will produce a more robust model but will take way, way longer to train and reduce. A good place to start would be a novel downloaded from Internet Archive.

Keep in mind that misspellings get learned too, so a “clean” file can make a big difference. Also important to think about (though sadly out of the scope here) is that any bias present in your source text will get baked into your Word2Vec model as well. Gender relationships, connections between ideas – Word2Vec captures these from its input just like any other connections between words. TLDR: it’s worth picking your source text carefully, and important not to think of a machine learning model as a pure representation of language.
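As a rough sketch of the kind of cleanup that can help – lowercasing and stripping stray characters so “Box” and “box” aren’t learned as separate words (adjust to taste; clean_text here is just a hypothetical helper, not one of the included scripts):

```python
import re

def clean_text(text):
    """Lowercase and strip everything but letters, apostrophes, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z'\s]", " ", text)    # drop digits, punctuation, etc.
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(clean_text("Nor, having only length, breadth, and thickness..."))
# nor having only length breadth and thickness
```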

You can put your source text anywhere, though I keep mine in the ModelsAndData/ folder to keep everything organized, and these scripts will save there too.

* If you do use Wikipedia, you’ll want to strip the wiki tags from the text. There are a few ways to do it, but I suggest Wikipedia Extractor, which is very reliable and makes it really easy. Why reinvent the wheel, right?

2a. OPTIONAL: TAG PARTS-OF-SPEECH

While this won’t be an issue for most projects, you may want finer-grained modeling of language, especially with words that are spelled the same but have different meanings (homonyms). For example, the word “box” can be an object (noun) or an action (verb). To preserve these differences, we can tag the words with their parts-of-speech: when training, box_NN will be seen as a separate entity from box_VB.

We can add POS tags with the help of the pattern library, which does all the heavy lifting, and TagTextForTraining.py, which wraps it up and outputs a text file for training. Open the script and modify it to include your input text file and a new filename to save. Run it in the Terminal – this could take quite a long time, depending on the size of the input.

The resulting file should look something like this, in the format <word>_<POS> :

Nor_CC ,_, having_VBG only_JJ length_NN ,_, breadth_NN ,_, and_CC thickness_NN
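The formatting step itself is simple once a tagger has produced (word, tag) pairs – a minimal sketch (the tagger call is left out; pattern’s tagger, or any POS tagger that yields pairs like these, would do):

```python
def join_pos(tagged_words):
    """Turn (word, POS) pairs into word_POS tokens for training."""
    return " ".join(f"{word}_{pos}" for word, pos in tagged_words)

pairs = [("Nor", "CC"), (",", ","), ("having", "VBG"), ("only", "JJ")]
print(join_pos(pairs))  # Nor_CC ,_, having_VBG only_JJ
```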

3. TRAIN YOUR MODEL

Open the TrainModel.py file in a text editor and modify the settings to suit your input, then run it in the Terminal to train and save your model.

4. REDUCE THE VECTOR SPACE

The reduction is done with the TwoStageReduce.py script – open it like before and modify the variables as needed. You will want to set:

model_filename : the trained model from the last step

model_name : used to format the names of several output files later

num_dimensions : how many dimensions for the final reduction – 2 will let us visualize the model in an image, so let’s leave it at that

run_init_reduction and init_dimensions : for large data sets, going straight to a 2D t-SNE would make our computer run out of memory and choke; instead, we can do an initial reduction with incremental PCA (a more memory-friendly but less precise method) to make our vector space more manageable before running t-SNE – a setting of 20D seems about right on my machine

only_most_common and num_common : we can also shrink the vector space by keeping only the most common words; the word list is loaded from a file in ModelsAndData/ and num_common specifies how many words to keep – try 10k as a good starting point, then bump it up to 50k if you need it

tagged_pos : set to True if you trained your model with parts-of-speech; if so, the POS tags have to be stripped before matching against the common-word list

When ready, run your script! This will take the longest of any step – I’ve had it take up to several hours. First it will load your model, then reduce the vocabulary as specified, do an initial reduction, a final reduction, and normalize the vectors to a range of -1 to 1. It will save each of these variations as csv files, making them easy to use for visualizations, etc.
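The two-stage idea – a rough PCA pass first, then t-SNE – can be sketched with scikit-learn (an assumption on my part; TwoStageReduce.py may differ in its details, and the random vectors here just stand in for the word vectors pulled from your model):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE

# Stand-in for the model's word vectors: 500 words x 100 dimensions
vectors = np.random.default_rng(0).standard_normal((500, 100))

# Stage 1: incremental PCA down to ~20D (memory-friendly, less precise)
reduced = IncrementalPCA(n_components=20).fit_transform(vectors)

# Stage 2: t-SNE down to 2D for visualization
embedded = TSNE(n_components=2, random_state=0).fit_transform(reduced)

# Normalize each axis to the range -1 to 1
normalized = embedded / np.abs(embedded).max(axis=0)
```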

Here’s a sample from the normalized output:

all,0.93686179195,0.529668204145
yellow,0.285336010389,-0.831806126844
four,0.0697747693962,-0.473679892537
ceased,-0.142252643262,0.0181344152395
sleep,-0.692804587806,-0.387154042506
go,-0.456888914722,0.439223057063
follow,0.361306455284,-0.493592481291
abundant,-0.202242506346,-0.106282559638
seemed,0.923450066346,0.248309784978
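Since the output is plain csv, loading it back into your own projects is easy – for example, with a small helper like this (load_positions is hypothetical, not one of the included scripts):

```python
import csv

def load_positions(path):
    """Read a word,x,y csv into a dict of word -> (x, y)."""
    positions = {}
    with open(path, newline="") as f:
        for word, x, y in csv.reader(f):
            positions[word] = (float(x), float(y))
    return positions
```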

4a. GRID

Optionally, you may want to convert your vector space into a nice, even grid. This can be helpful for visualizing data that is clumped together, or for things like searching. The TsneToGrid.py script uses the rasterfairy module (installed in the lib/ folder, since it can’t be installed with pip). Change the input/output files and run it.

The resulting file maps each word to an x,y position on the grid:

all,21,0
yellow,5,3
four,6,15
ceased,17,18
sleep,11,15
go,14,15
follow,9,11
abundant,22,22
seemed,18,4

The script will print the output dimensions of the grid (such as 25×26 words) which you’ll want to note if you’re doing any kind of visualization or interactive project.

5. VISUALIZE

Raw columns of numbers are hard to read, so visualizing the vector space can be really helpful. The included Processing sketch will load your 2D csv file and output a png file, showing the characteristic t-SNE blobs and tails (or a grid, if you changed it in the previous step).

Open the sketch, change the input and output filenames, and any other settings you want to change.
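If you’d rather stay in Python than switch to Processing, the same idea works in a few lines of matplotlib (a stand-in for the included sketch, not a replacement for it – the rows are inlined here for the example, and the output filename is a placeholder):

```python
import matplotlib
matplotlib.use("Agg")  # draw straight to a file, no display needed
import matplotlib.pyplot as plt

# A few rows from the normalized csv, inlined for the example
rows = [("all", 0.9369, 0.5297),
        ("yellow", 0.2853, -0.8318),
        ("sleep", -0.6928, -0.3872),
        ("seemed", 0.9235, 0.2483)]

plt.figure(figsize=(8, 8))
for word, x, y in rows:
    plt.plot(x, y, ".", color="black", markersize=2)  # one dot per word
    plt.annotate(word, (x, y), fontsize=6)            # label it
plt.axis("off")
plt.savefig("tsne_preview.png", dpi=150)
```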