Smooth embeddings for arXiv scientific paper titles

Following up on my recent project creating fake arXiv abstracts with RNNs, I have developed a way to embed paper titles into a vector space. The approach is heavily inspired by this paper by Bowman et al. I followed a slightly simplified version: I train a seq2seq network to autoencode the titles, and use the hidden state that gets passed from the encoder to the decoder as the embedding. On its own, however, this does not produce very smooth embeddings, which Bowman et al. address by placing a variational autoencoder between the encoder and decoder RNNs. I was lazy and simply added a small amount of noise to the hidden representation during training, which had a similar effect.
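To make the noise trick concrete, here is a minimal sketch in plain Python. It assumes the noise is i.i.d. Gaussian and is only applied during training; the function name `add_training_noise` and the scale `sigma=0.1` are made up for illustration, since the actual distribution and value are not stated above.

```python
import random

def add_training_noise(hidden, sigma=0.1, training=True, rng=None):
    """Return a copy of `hidden` with Gaussian noise added to each component.

    `sigma` is a hypothetical noise scale, not the value used in the project.
    With training=False the state is passed through unchanged, so the
    embedding of a given title is deterministic at inference time.
    """
    if not training:
        return list(hidden)
    rng = rng or random.Random()
    return [h + rng.gauss(0.0, sigma) for h in hidden]

# Toy encoder state standing in for the real RNN hidden state.
encoder_state = [0.5, -1.2, 0.3]
noisy_state = add_training_noise(encoder_state, sigma=0.1, rng=random.Random(0))
clean_state = add_training_noise(encoder_state, training=False)
```

The intuition is that the decoder has to reconstruct the same title from a small neighbourhood around each embedding, which pushes nearby points to decode to similar titles.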

Having such an embedding allows one to do some pretty entertaining things. First of all, one can interpolate between two paper titles by taking their embeddings and sampling a number of points that lie between them. Here is one such example:

signature of antiferromagnetic long-range order in the optical spectrum of strongly correlated potential
signature of antiferromagnetic long-range order in the optical excitations of highly correlated systems
signature of antiferromagnetic order nuclei in the 0d term of quenched systems
existence of antiferromagnetic order nuclei in the static region of mesoscopic systems
existence of self-gravitating one-dimensional rings of the maxwell chain "
existence of self-gravitating random static solutions of the toda system
existence of axially symmetric field solutions of the einstein-vlasov system
existence of axially symmetric static solutions of the einstein-vlasov system

(Note that I normalised everything to lower case.) Here are some more examples, and here are some examples with more fine-grained sampling between the points.
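The interpolation itself can be sketched as follows, assuming the points are spaced evenly on the straight line between the two embeddings (the sampling scheme is not spelled out above, so linear interpolation is an assumption). The vectors here are toy stand-ins, and `interpolate` is a hypothetical helper name.

```python
def interpolate(start, end, num_points):
    """Return `num_points` vectors evenly spaced from `start` to `end`, inclusive."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    if len(start) != len(end):
        raise ValueError("embeddings must have the same dimension")
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)  # t runs from 0.0 to 1.0
        points.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return points

# Toy 2-dimensional embeddings standing in for two encoded titles.
emb_a = [0.0, 2.0]
emb_b = [4.0, 0.0]
path = interpolate(emb_a, emb_b, 5)
```

Each vector in `path` would then be fed to the decoder to produce one title in the sequence; a larger `num_points` gives the more fine-grained sampling.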

Another thing one can do is calculate “analogies”, as with word2vec embeddings (“king is to queen as man is to woman”), by adding and subtracting the respective vectors, i.e. queen - king + man = woman. This seems to work reasonably well for some examples, and was not reported by Bowman et al. For example, I got:

which is kind of weird but not too bad. Another try got the network waxing philosophical:

"What is the origin of species?"
- "On the origin of species"
+ "On the theory of relativity"
= "what is theory. a theory"

In general it seems able to substitute words quite well when the words sit in similar positions. Doing this for three completely random titles with no obvious relation leads to gibberish output, but I wouldn’t expect anything else.
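The vector arithmetic behind these analogies is just a componentwise subtraction and addition. Here is a minimal sketch with hand-picked toy vectors; the helper name `analogy` and the 2-dimensional "embeddings" are assumptions for illustration only.

```python
def analogy(base, minus, plus):
    """Return base - minus + plus, computed componentwise."""
    if not (len(base) == len(minus) == len(plus)):
        raise ValueError("all embeddings must have the same dimension")
    return [b - m + p for b, m, p in zip(base, minus, plus)]

# Toy vectors chosen by hand so that the word2vec-style identity holds exactly.
king = [1.0, 1.0]
queen = [1.0, 0.0]
man = [0.0, 1.0]
woman = [0.0, 0.0]

result = analogy(queen, king, man)  # queen - king + man
```

For titles, `base`, `minus` and `plus` would be the encoder outputs of the three titles, and the result would be passed to the decoder rather than matched against a vocabulary.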

Future ideas include putting a dense layer before and after the hidden units, in order to get even more robust embeddings (right now the embedding is simply the concatenated states of the two RNN layers). Another idea is to somehow separate the “semantic” and “syntactic” aspects of the embedding, so that some dimensions would cover the subject matter, and others the grammatical structure in which the idea is presented.