Unreasonable Effectiveness of Counting

Approximating deep learning with simple counts*

Counting is simple and effective. Anyone who’s analyzed data can speak to its ubiquity. It’s useful as a first pass at data as well as an input to more sophisticated models.

Deep learning is currently the most popular framework for artificial intelligence (AI). In contrast to counting, deep learning models are considered opaque black boxes that perform mysterious computations; combine this with analogies to the human brain and you have the perfect storm to titillate people’s imaginations.

Language Modeling

A language model is essentially a probability distribution over sequences of words.

A canonical example is the phrase: "how are you". A language model should score English-sounding sentences higher than scrambled ones.

how are you (:+1:)

you are how (:-1:)

The language model is constructed from probabilities of words and phrases. Probabilities are simply ratios of counts. The following is the probability of “how” relative to the entire dataset.

prob(how) = count(how) / count(*)

The word probabilities can be combined to get a sentence-level language model score via simple multiplication.

prob(how are you) = prob(how) * prob(are) * prob(you)
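
As a minimal sketch of the two formulas above, assuming a tiny made-up corpus (the `corpus` sentences and the helper names `prob` and `unigram_score` are illustrative and are not the dataset or code used for this post):

```python
from collections import Counter

# Toy corpus, made up for illustration only.
corpus = ["how are you", "how are they", "you are here"]
tokens = [word for sentence in corpus for word in sentence.split()]

counts = Counter(tokens)  # count(word)
total = len(tokens)       # count(*)


def prob(word):
    """prob(word) = count(word) / count(*)"""
    return counts[word] / total


def unigram_score(sentence):
    """Multiply the individual word probabilities together."""
    score = 1.0
    for word in sentence.split():
        score *= prob(word)
    return score
```

Calling `unigram_score("how are you")` multiplies `prob("how")`, `prob("are")` and `prob("you")`, exactly as in the formula.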

Since word order isn’t taken into account, prob(you are how) would score equally. Phrase probabilities can be used to take word order into account. The following is the probability of “are” when “how” is the previous word.

prob(are | how) = prob(how are) / prob(how)

This is the probability of “are” given (“|”) the context “how”. This should be much higher than prob(how | are) since, presumably, no one says “are how”.
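
The conditional probability can be sketched directly from counts. This assumes the common estimate count(how are) / count(how), which is roughly what the ratio of probabilities reduces to; the toy corpus and the name `cond_prob` are made up for illustration.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
corpus = ["how are you", "how are they", "you are here"]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs


def cond_prob(word, prev):
    """prob(word | prev), estimated as count(prev word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0  # unseen context: no evidence either way
    return bigrams[(prev, word)] / unigrams[prev]
```

On this toy data, `cond_prob("are", "how")` is high because “how” is always followed by “are”, while `cond_prob("how", "are")` is zero because “are how” never occurs.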

These probabilities can be extended to an arbitrary phrase length. At the sentence level, the probabilities factor into the following:

prob(how are you) = prob(you | how are) * prob(are | how) * prob(how)
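
The factorization above might be implemented along these lines. This is a sketch under stated assumptions: a made-up toy corpus, n-grams counted up to length three, and hypothetical helper names (`cond_prob`, `sentence_prob`, `max_context`).

```python
from collections import Counter

# Toy corpus, made up for illustration only.
corpus = ["how are you", "how are they", "you are here"]

ngram_counts = Counter()
total_words = 0
for sentence in corpus:
    words = sentence.split()
    total_words += len(words)
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            ngram_counts[tuple(words[i:i + n])] += 1


def cond_prob(word, context):
    """prob(word | context) as a ratio of counts; context is a tuple of words."""
    if not context:
        return ngram_counts[(word,)] / total_words
    if ngram_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / ngram_counts[context]


def sentence_prob(sentence, max_context=2):
    """Multiply prob(word | preceding words) across the sentence."""
    words = sentence.split()
    score = 1.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - max_context):i])
        score *= cond_prob(word, context)
    return score
```

With these counts, `sentence_prob("how are you")` is positive while `sentence_prob("you are how")` is zero, since the trigram “you are how” never appears in the toy corpus.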

These values can be smoothed to perform better, but conceptually, all language models can be represented as a composition of simple counts.
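
The post does not say which smoothing method is meant. As one illustrative possibility, the sketch below uses add-k (Laplace) smoothing, which adds a small constant to every count so that unseen phrases no longer get a probability of exactly zero. The corpus and the name `smoothed_cond_prob` are assumptions for the example.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
corpus = ["how are you", "how are they", "you are here"]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)  # number of distinct words


def smoothed_cond_prob(word, prev, k=1.0):
    """Add-k estimate of prob(word | prev); k=1 is classic add-one smoothing."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)
```

Here `smoothed_cond_prob("how", "are")` is small but nonzero, and it stays well below `smoothed_cond_prob("are", "how")`, so the ordering between likely and unlikely phrases is preserved.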

Recurrent Neural Network (RNN)

A recurrent neural network is a flavor of deep learning that approximates the language model. But let’s keep it ominous — as a black box.

Blur & Contrast

At FoxType, we have a dev tool to play around with various features of a language model. To compare, we trained a count-based language model and an RNN language model in parallel on the same dataset. The dev tool scores each word of an arbitrary sentence.

Let’s say we want to detect unlikely words in the context of the sentence. Stacking the models together gives us a good feel of how the words interact.

Guess which one is which, or maybe it doesn’t matter. Lower is better.

We found that many of the sentences we tested felt similar (or at least not significantly different).

This is a common thread in machine learning where more complicated models are rivaled by simpler models. Even the venerable word2vec algorithm is well-approximated by count tables.

An RNN can theoretically capture infinite context, so a naive expectation would be to use it as a silver bullet for language. But realistically, an RNN would have to predict each word in the sequence, which is hard. All of a sudden, a tweet could have a search space larger than that of Go (let alone chess). This speaks more to the infinite variation of language than to a failure of deep learning.

This is not to say deep learning for language modeling isn’t useful. There are legitimate trade-offs, from memory use to interpretability. The point is that these models are still closer to a calculator than to actual intelligence. Now that deep learning is so accessible, there is no reason not to try it out.

Back to raw counts.

One of the strengths of language models is that they generalize. It’s also a weakness. The sentence above is obviously weird and the language model highlights a few possible words to look at.

We can complement this information by adding bigram counts. A bigram is a sequence of two words.

count(i can) = 276151

We can stack the bigram counts on top of the language model scores.

Language model scores and bigram counts. Commas (‘,’) were not included in the dictionary by mistake.

This gives us a different perspective on the data. The count of “your trouble” is comparatively low.

Even better, one of the trigrams shows that “solve your trouble” is an absurd sequence of words. This makes it very clear that “solve your trouble” is the problem.

count(solve your trouble) = 0
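
A rough sketch of this kind of lookup: count every bigram and trigram in a corpus, then list the n-grams of a test sentence that were never seen. The corpus, the test sentence, and the helper names (`ngrams`, `unseen_ngrams`) are made up for illustration; the real counts come from a much larger dataset.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
corpus = [
    "i can solve your problem",
    "i can help you",
    "we can solve your problem",
    "sorry for your trouble",
]


def ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for n in (2, 3):
        counts.update(ngrams(words, n))


def unseen_ngrams(sentence, n):
    """N-grams of the sentence whose corpus count is zero."""
    return [" ".join(g) for g in ngrams(sentence.split(), n) if counts[g] == 0]
```

On this toy data, `unseen_ngrams("i can solve your trouble", 3)` returns only “solve your trouble”, while every bigram in the sentence has been seen at least once.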

Counts are transparent, which is very useful for bounding the problem. There’s no distinction between “i can solve” and “solve your trouble” in the RNN because the latter is normalized to oblivion.

So we may cheekily say:

count language model + raw counts > RNN language model

Of course, not every example is this trivial. Deep learning can really shine when there are many interacting variables. But this shows that counting can take you a long way.