embedding lookup tables

one thing in theano i couldn't immediately find examples for was a simple embedding lookup table, a critical component for anything with NLP. turns out that it's just one of those things that's so simple no one bothered writing it down :/
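as a minimal sketch of the idea, here's the lookup written in plain numpy rather than theano (the sizes, the seed and the variable names are made up for illustration). an embedding table is just a matrix with one row per item, and a "lookup" is integer indexing into it. as far as i know, indexing a theano shared matrix with an integer vector behaves the same way.

```python
import numpy as np

# hypothetical sizes: 6 items, each with a 2d embedding
rng = np.random.default_rng(0)
E = rng.standard_normal((6, 2))

# a single index returns that row of E
single = E[4]                                   # shape (2,)

# a vector of indexes returns one row per index (repeats are fine)
idxs = np.array([4, 0, 0, 2])
seq = E[idxs]                                   # shape (4, 2)

# a matrix of indexes, e.g. a batch of 2 sequences of length 3,
# returns a 3d tensor with one embedding per index
batch_idxs = np.array([[0, 1, 2], [3, 4, 5]])
batch = E[batch_idxs]                           # shape (2, 3, 2)
```

the output always has the shape of the index array with the embedding dimension tacked on the end.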

this type of packing of the data into matrices is crucial to enable linear algebra libs and GPUs to really fire up.

a trivial full end to end example

consider the following as-simple-as-i-can-think-up "network" that uses embeddings:

given 6 items we want to train 2d embeddings such that the first two items have the same embeddings, the third and fourth have the same embeddings and the last two
have the same embeddings. additionally we want all other combos to have different embeddings.
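here's a rough numpy sketch of that setup, not the theano version. the assumptions are mine: every unordered pair is a training example, the label is 1 for the three "same" pairs and 0 otherwise, the cost is \( |label - a \cdot b| \), and the learning rate, epoch count and hand-derived gradient are picked for illustration (theano would derive the gradient for you).

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((6, 2))   # 6 items, 2d embeddings

# label 1.0 for pairs that should match, 0.0 for every other pair
same = {(0, 1), (2, 3), (4, 5)}
examples = [(i, j, 1.0 if (i, j) in same else 0.0)
            for i in range(6) for j in range(i + 1, 6)]

def cost(E, i, j, label):
    # L1 distance between the label and the dot product of the two rows
    return abs(label - E[i] @ E[j])

def total_cost(E):
    return sum(cost(E, i, j, label) for i, j, label in examples)

def sgd_step(E, i, j, label, lr=0.01):
    # copy the rows so updating E[i] doesn't change the value used for E[j]
    a, b = E[i].copy(), E[j].copy()
    s = np.sign(label - a @ b)
    # d/da |label - a.b| = -s * b and d/db |label - a.b| = -s * a
    E[i] += lr * s * b
    E[j] += lr * s * a

initial = total_cost(E)
for epoch in range(500):
    for k in rng.permutation(len(examples)):
        sgd_step(E, *examples[k])
final = total_cost(E)
print(initial, final)
```

each step only ever touches the two rows of `E` that were looked up for that example.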

plotting this shows the convergence of the embeddings (labels denote initial embedding location)...

0 & 1 come together, as do 2 & 3 and 4 & 5. ta da!

a note on dot products

it's interesting to observe the effect of this (somewhat) arbitrary cost function i picked.

for the pairs where we wanted the embeddings to be the same the cost function, \( |1 - a \cdot b | \), is minimised when the dot product is 1 and this happens when the vectors
are the same and have unit length. you can see this is the case for pairs 0 & 1 and 4 & 5 which have come together and ended up on the unit circle. but what about 2 & 3?
they've gone to the origin and the dot product of the origin with itself is 0, so for that pair the cost is a full 1, nowhere near its minimum of 0! why?

it's because of the other constraint we added. for all the pairs where we wanted the embeddings to be different the cost function, \( |0 - a \cdot b | \), is minimised when
the dot product is 0. this happens when the vectors are orthogonal (or when one of them is the zero vector). both 0 & 1 and 4 & 5 can be on the unit circle and orthogonal
to each other, but in 2d there's no third direction left, so for 2 & 3 to have a zero dot product with both of those pairs they have to be at the origin. since my loss is an L1 loss (instead of, say, an L2 squared loss) the pair 2 & 3 is overall better at the origin because it
gets more from minimising this constraint than worrying about the first.
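to make that trade-off concrete, here's a back-of-the-envelope version. it assumes the other two pairs have already settled: 0 & 1 sit at a unit vector \( u \), 4 & 5 sit at a unit vector \( v \) orthogonal to \( u \), and 2 & 3 share a single point \( r(\cos\theta \, u + \sin\theta \, v) \). these symbols are mine, introduced just for this sketch. the part of the cost that involves 2 & 3 is one "same" term plus eight "different" terms:

\[ C(r, \theta) = |1 - r^2| + 4r \left( |\cos\theta| + |\sin\theta| \right) \]

since \( |\cos\theta| + |\sin\theta| \ge 1 \), for \( 0 \le r \le 1 \) we get

\[ C(r, \theta) \ge 1 - r^2 + 4r = 1 + r(4 - r) \ge 1 \]

with equality only at \( r = 0 \), and for \( r > 1 \) the \( 4r \) term alone is already bigger than 4. so under these assumptions the cheapest place for 2 & 3 is the origin, where they pay 1 for not matching each other and nothing for the other eight pairs.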

the pair 2 & 3 has come together not because we were training their embeddings to be the same but because we were also training them to be different from the other four.
this wouldn't be a problem if we were using 3d embeddings since they could all be both on the unit sphere and orthogonal at the same time.

you can also see how the points never fully converge. in 2d with this loss it's impossible to get the cost down to 0 so they continue to get bumped around. in 3d, as
just mentioned, the cost can be 0 and the points would converge.
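a quick numeric check of those claims, as a numpy sketch. it assumes the total cost is the sum of \( |label - a \cdot b| \) over all 15 unordered pairs, and the three configurations are hand-picked for illustration rather than taken from a training run.

```python
import numpy as np

same = {(0, 1), (2, 3), (4, 5)}

def total_cost(E):
    # sum of |label - a.b| over all 15 unordered pairs of rows
    total = 0.0
    for i in range(6):
        for j in range(i + 1, 6):
            label = 1.0 if (i, j) in same else 0.0
            total += abs(label - E[i] @ E[j])
    return total

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
origin = np.zeros(2)
diag = np.array([1.0, 1.0]) / np.sqrt(2)

# 2d, with 2 & 3 parked at the origin
at_origin = np.array([x, x, origin, origin, y, y])
# 2d, with 2 & 3 forced onto the unit circle between the other pairs
on_circle = np.array([x, x, diag, diag, y, y])
# 3d, with each pair on its own axis: rows are e0, e0, e1, e1, e2, e2
in_3d = np.repeat(np.eye(3), 2, axis=0)

print(total_cost(at_origin))   # 1.0
print(total_cost(on_circle))   # roughly 5.66
print(total_cost(in_3d))       # 0.0
```

so in 2d the origin costs 1, squeezing 2 & 3 onto the unit circle costs a lot more, and only the 3d layout gets all the way to 0.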

a note on inc_subtensor optimisation

there's one non-trivial optimisation you can do regarding your embeddings that relates to how sparse the embedding update is.
in the above example we have 6 embeddings in total and, even though we only update 2 of them at a time, we are calculating the
gradient with respect to the entire t_E matrix. the end result is that we calculate (and apply) a gradient that for the majority of rows is just zeros.
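the difference between the two kinds of update can be sketched in numpy (a stand-in for the idea, not theano code; the pair, label and learning rate are made up). the dense version builds a full gradient the same shape as the embedding matrix, while the sparse version only computes and applies the rows that were looked up, which is the job theano's `inc_subtensor` does.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((6, 2))
idxs = np.array([2, 3])            # the pair in the current example
label, lr = 1.0, 0.01

a, b = E[idxs[0]], E[idxs[1]]
s = np.sign(label - a @ b)
# gradient of |label - a.b| wrt just the two looked-up rows, shape (2, 2)
grad_rows = np.stack([-s * b, -s * a])

# dense update: scatter into a full (6, 2) gradient that is mostly zeros
# (plain assignment is fine here because the two indexes are distinct)
dense_grad = np.zeros_like(E)
dense_grad[idxs] = grad_rows
E_dense = E - lr * dense_grad

# sparse update: only touch the rows that were looked up
# (np.add.at also accumulates correctly if an index is repeated)
E_sparse = E.copy()
np.add.at(E_sparse, idxs, -lr * grad_rows)
```

both give the same result, but the sparse one does work proportional to the number of rows touched rather than the size of the whole table, which matters a lot more with a real vocabulary than with 6 items.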