Following up on my recent project creating fake arXiv abstracts with RNNs, I have developed a way to embed paper titles into a vector space. My approach is heavily inspired by this paper by Bowman et al. I followed a slightly simplified route: I autoencode the titles with a seq2seq network and use the hidden state that gets passed from the encoder to the decoder as the embedding. By itself this does not produce very smooth embeddings, which Bowman et al. address by inserting a variational autoencoder between the encoder and decoder RNNs. I was lazy and simply added a small amount of noise to the hidden representation during training, which had a similar effect.
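The noise trick can be sketched in a few lines. This is an illustrative numpy sketch, not my actual training code; the noise scale `sigma` and the 512-dimensional state are assumed values for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_hidden(hidden, sigma=0.1, train=True):
    """Add Gaussian noise to the encoder's final hidden state during
    training; pass it through unchanged at inference time.
    sigma is an assumed hyperparameter, not the value I actually used."""
    if not train:
        return hidden
    return hidden + sigma * rng.normal(size=hidden.shape)

# toy encoder state standing in for the concatenated states of the 2 RNN layers
h = np.zeros(512)
h_train = noisy_hidden(h, sigma=0.1)        # decoder sees a perturbed state
h_eval = noisy_hidden(h, train=False)       # embeddings are computed cleanly
```

The perturbation forces the decoder to tolerate small displacements in the latent space, which is what makes points between two real titles decode to something sensible.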

Having such an embedding allows one to do some pretty entertaining things. First of all, one can interpolate between two paper titles by taking their embeddings and sampling a number of points that lie between them. Here is one such example:

signature of antiferromagnetic long-range order in the optical spectrum of strongly correlated potential
signature of antiferromagnetic long-range order in the optical excitations of highly correlated systems
signature of antiferromagnetic order nuclei in the 0d term of quenched systems
existence of antiferromagnetic order nuclei in the static region of mesoscopic systems
existence of self-gravitating one-dimensional rings of the maxwell chain "
existence of self-gravitating random static solutions of the toda system
existence of axially symmetric field solutions of the einstein-vlasov system
existence of axially symmetric static solutions of the einstein-vlasov system

(note that I normalised everything to lower case). Here are some more examples, and here are some examples with more fine-grained sampling between the points.
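The interpolation itself is just linear: blend the two embeddings and feed each intermediate point to the decoder. A minimal numpy sketch (the 2-d vectors and step count are toy values for illustration; the real embeddings are the RNN states):

```python
import numpy as np

def interpolate(z_a, z_b, n_steps=8):
    """Return n_steps points on the straight line between two title
    embeddings, including both endpoints. Each row would be handed to
    the decoder to generate an intermediate title."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1 - t) * z_a + t * z_b for t in ts])

# toy 2-d embeddings standing in for real title vectors
z_a = np.array([0.0, 0.0])
z_b = np.array([1.0, 2.0])
path = interpolate(z_a, z_b, n_steps=8)
```

More steps between the endpoints gives the finer-grained transitions shown in the linked examples.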

Another thing one can do is calculate “analogies”, as with word2vec embeddings (“king is to queen as man is to woman”), by adding and subtracting the respective vectors, i.e. queen − king + man = woman. This seems to work reasonably well for some examples, and was not reported by Bowman et al. For example, I got:

which is kind of weird but not too bad. Another try got the network waxing philosophical:

"What is the origin of species?"
- "On the origin of species"
+ "On the theory of relativity"
= "what is theory. a theory"

In general it seems to handle word substitutions quite well when the words appear in similar positions. Doing this for 3 completely random titles with no obvious relation leads to gibberish output, but I wouldn’t expect anything else.
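The analogy arithmetic is the same as in word2vec, only on title embeddings; the result vector is then decoded (or, as a sanity check, matched against known titles by cosine similarity). A toy sketch with made-up 2-d vectors:

```python
import numpy as np

def analogy(z_minus, z_plus, z_query):
    """Compute z_query - z_minus + z_plus, e.g. queen - king + man."""
    return z_query - z_minus + z_plus

def nearest(z, bank):
    """Index of the most cosine-similar embedding in a bank of titles."""
    sims = bank @ z / (np.linalg.norm(bank, axis=1) * np.linalg.norm(z) + 1e-9)
    return int(np.argmax(sims))

# toy 2-d embeddings standing in for real title vectors
z_king = np.array([1.0, 0.0])
z_queen = np.array([0.0, 1.0])
z_man = np.array([1.0, 1.0])
z_result = analogy(z_king, z_man, z_queen)  # queen - king + man
```

In my setup the result vector goes straight into the decoder rather than a nearest-neighbour lookup, which is why the outputs can be novel (and occasionally philosophical) titles rather than existing ones.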

Future ideas include putting a dense layer before and after the hidden units, in order to get even more robust embeddings (right now they are the concatenated states of the 2 RNN layers). Another idea is to somehow separate “semantic” and “syntactic” aspects of the embedding, so that some dimensions would cover the subject matter, and others the grammatical structure the idea is presented in.

I scraped the entirety of arXiv abstracts to do some experiments. To get started, I trained a char-rnn on all the q-bio abstracts and generated a bunch of synthetic abstracts. Some of the results were quite fun, see below:

Various brain areas reveal spatiotemporal activity patterns that repeat over time: resulting intracellular elements of genetic regulatory networks are quantified. Using a ” experimental study of neural networks, the framework of cellular Markov models to the importance of complexity induces a identification of challenges for understanding specialized biological structures.

Modelling forest composition function for meaningful laws in cortical networks, in the light of simplifying assumption of interaction networks with the same importance they exploit networks used by previous models in topological detail. Existing methods largely depend on a kinetic SIR model under physical networks. We have used the stationary law of overlapping phylogenetic tree distributions as a popular utility. Making use of eigenvalue laws and a scheme augmented along the population and eventually simplify a network .

It also tries to generate LaTeX but it doesn’t get it quite right yet:

Geometry of DNA looping where the residence of 26 ‘ alleles diffusing out than amplitude distributions ( $ F ( x ) $ -test are abrupt at short times $ O ( n = 0.5 ) < $ ^ { 2+ } $ due to a balance matrix , and the synergism of the model and a statistical mechanics level comparable .

I experimented with generating arXiv categories and titles along with the abstracts.

Abstract: In a reply that is robust from male molecules in the ecosystem and have presented to apply it to the city in proteomics evolution. An important entity presently processing a MS/MS spectrum outbreak, monitoring, requires on tests but not only an important difficulty in big datasets, opening gained from the usual graph lens and (human) sensitivity analysis. Future dimension test subjects are valid how the epigenetic basis for protein sequence directionality increases the increase within size state. We corrected review particular methods.
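The generation side of a char-rnn is a simple sampling loop: at each step the network emits logits over the character vocabulary, one character is sampled, and it is fed back in. A sketch of the sampling step (the temperature value is an assumption for illustration; the trained network itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_char(logits, temperature=1.0):
    """Sample the next character index from a char-rnn's output logits.
    Lower temperature makes the abstracts more conservative, higher
    temperature makes them more adventurous (and more gibberish-prone)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # softmax, numerically stable
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# at low temperature the most likely character dominates
idx = sample_char([10.0, 0.0, 0.0], temperature=0.1)
```

The half-formed LaTeX above is what you get when this loop has learned the local syntax of `$ ... $` spans but not their long-range structure.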

It has been a while since the last posts in this series, so the title “weekly” doesn’t really apply. The previous 2 bite-sized commitments to improve my Vim skills worked well, with about 75% of the things I tried to internalize now in daily use. This week I will focus on some movements, sticking to the number 4.

H and L move to the top and bottom of the screen.

ctrl+F and ctrl+B move a page forward and back.
I find the behaviour of ctrl+F a bit strange, because it places you on the second-to-last line of the current page.

' followed by a mark moves to the line of that mark.
Useful automatic marks are . for the location of the last edit and ' for the location before the last jump.

nG moves to line n.
I used to use :n for this, but that is not really a movement and thus can’t be combined with other commands.

For more movements also have a look at this Vim Movements Wallpaper. I won’t use it personally, but it makes for a handy reference.

Built a 4-bit synthesizer together with a friend, based on this project. We assembled it on a breadboard and still need to transfer it to a PCB. It is based on an ATMega48 microcontroller, and the sound is generated through an R-2R ladder DAC (the resistors soldered into a chunk in the picture).

Here is a sample of the lo-fi goodness:

The synth is controlled via MIDI and can generate a single voice in one of 4 waveforms at a time. Several synths can be daisy-chained, though, to be controlled via one MIDI connector and feed a single output, so hopefully once we get to soldering everything together we can do that.