Deep Learning Can be Applied to Natural Language Processing

This post is a rebuttal to a recent article suggesting that neural networks cannot be applied to natural language because language is not produced by a continuous function. The post also delves into some additional points on deep learning.

There is an article making the rounds on LinkedIn that attempts to argue against the use of Deep Learning in the domain of NLP. The article, written by Riza Berkan and titled “Is Google Hyping it? Why Deep Learning cannot be Applied to Natural Languages Easily”, presents several arguments for why DL cannot possibly work and asserts that Google is exaggerating its claims. The latter argument is of course borderline conspiracy theory.

Yannick Vesley has written a rebuttal, “Neural Networks are Quite Neat: a Reply to Riza Berkan”, where he responds to each point that Berkan makes. Vesley’s points are on the mark; however, one cannot ignore the feeling that DL theory has a few unexplained parts in it.

However, before I get into that, I think it is very important for readers to understand that DL is currently an experimental science. That is, DL capabilities are actually discovered by researchers by surprise. There is certainly a lot of engineering that goes into the optimization and improvement of these machines. However, their capabilities are ‘unreasonably effective’; in short, we don’t have very good theories to explain them.

It is clear that there are gaps in our understanding, in the form of at least three open questions:

1. How is DL able to search high dimensional discrete spaces?

2. How is DL able to perform generalization if it appears to be performing rote memorization?

3. How do (1) and (2) arise from simple components?

Berkan’s arguments exploit our current lack of a solid explanation to promote his own alternative approach. He argues that a symbolicist approach is the road to salvation. Unfortunately, nowhere in his arguments does he acknowledge the brittleness of the symbolicist approach, its lack of generalization and its lack of scalability. Has anyone created a rule-based system that classifies images from low-level features and rivals DL? I don’t think so.

DL practitioners, however, aren’t stopping their work just because they don’t have airtight theoretical foundations. DL works, and works surprisingly well. DL in its present state is an experimental science, and it is absolutely clear that there is something going on under the covers that we don’t fully understand. A lack of understanding, however, does not invalidate the approach.

Consider, for example, what experiments that train networks on randomly relabeled data suggest:

1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.

2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
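The memorization behavior in these three points can be illustrated at toy scale. The sketch below is not a deep network trained by gradient descent; it is a stand-in that assumes a fixed random ReLU hidden layer with a least-squares readout, and all sizes and variable names are hypothetical. It only shows that a model with many more parameters than training samples can fit labels that carry no pattern at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: far more hidden units than training samples.
n_samples, n_inputs, n_hidden = 50, 10, 500

# Random inputs with completely random +/-1 labels: there is nothing to learn.
X = rng.normal(size=(n_samples, n_inputs))
y_random = rng.choice([-1.0, 1.0], size=n_samples)

# Fixed random ReLU hidden layer (an assumption standing in for a trained network).
W = rng.normal(size=(n_inputs, n_hidden))
b = rng.normal(size=n_hidden)

def hidden(inputs):
    """Map inputs to random ReLU features."""
    return np.maximum(inputs @ W + b, 0.0)

# Fit only the output weights; lstsq returns the minimum-norm solution.
w_out, *_ = np.linalg.lstsq(hidden(X), y_random, rcond=None)

# Accuracy on the memorized training set.
train_accuracy = np.mean(np.sign(hidden(X) @ w_out) == y_random)

# Accuracy on fresh points whose labels are just as random.
X_new = rng.normal(size=(200, n_inputs))
y_new = rng.choice([-1.0, 1.0], size=200)
new_accuracy = np.mean(np.sign(hidden(X_new) @ w_out) == y_new)

print(f"accuracy on memorized random labels: {train_accuracy:.2f}")
print(f"accuracy on fresh random labels:     {new_accuracy:.2f}")
```

The training labels should be reproduced essentially perfectly, while accuracy on fresh random labels should hover around chance, which is what pure memorization looks like.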

The point here that surprises most Machine Learning practitioners is the ‘brute-force memorization’. See, ML has always been about curve fitting. In curve fitting you find a sparse set of parameters that describe your curve and you use that to fit the data. The generalization that comes into play relates to the ability to interpolate between points. The major disconnect here is that DL systems have exhibited impressive generalization, yet that generalization cannot be explained if we consider them to be just memory stores.
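The classical curve-fitting view can be sketched in a few lines. The data, the quadratic form and the noise level below are all hypothetical choices made for illustration: twenty noisy points are summarized by three parameters, and generalization amounts to evaluating the fitted curve between the training points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_curve(x):
    """Hypothetical underlying curve used to generate the data."""
    return 0.5 * x**2 - x + 2.0

# Twenty noisy observations of the curve.
x_train = np.linspace(-3.0, 3.0, 20)
y_train = true_curve(x_train) + rng.normal(scale=0.1, size=x_train.size)

# A sparse description: three coefficients stand in for all twenty points.
coeffs = np.polyfit(x_train, y_train, deg=2)

# Generalization as interpolation: query a point that is not in the training grid.
x_between = 0.4
y_pred = np.polyval(coeffs, x_between)

print("fitted coefficients:", coeffs)
print(f"prediction at x={x_between}: {y_pred:.3f} (noise-free value {true_curve(x_between):.3f})")
```

Nothing in this sketch stores individual data points, which is why the memorizing behavior of DL networks looks so out of place from this point of view.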

However, if we consider them as holographic memory stores, then that problem of generalization has a decent explanation. In “Deep Learning are Holographic Memories” I point out the experimental evidence that:

The Swapout learning procedure tells us that if you sample any subnetwork of the entire network, the resulting prediction will be similar to that of any other subnetwork you sample. This is just like holographic memory, where you can slice off pieces and still recreate the whole.

As it turns out, the universe itself is driven by a similar theory called the Holographic Principle. In fact, this serves as a very good base camp from which to begin a more solid explanation of the capabilities of Deep Learning. In “The Holographic Principle: Why Deep Learning Works” I introduce a technical approach that uses Tensor Networks to reduce the high dimensional problem space to a space that is computable within acceptable response times.

So, going back to the question of whether NLP can be handled by Deep Learning approaches: we certainly know that it can work. After all, are you not reading and comprehending this text?

In 2015, Chris Manning, an NLP practitioner, wrote about the concerns of the field regarding Deep Learning (see: Computational Linguistics and Deep Learning). It is very important to take note of his arguments, since they are not in conflict with the capabilities of Deep Learning. His two arguments for why NLP experts need not worry are as follows:

(1) It just has to be wonderful for our field for the smartest and most influential people in machine learning to be saying that NLP is the problem area to focus on; and (2) Our field is the domain science of language technology; it’s not about the best method of machine learning — the central issue remains the domain problems.

The first argument isn’t a criticism of Deep Learning. The second argument explains that he doesn’t believe in one-size-fits-all generic machine learning that works for all domains. That is not in conflict with the above Holographic Principle approach that indicates the importance of the network structure.

To conclude, I hope this article puts an end to the claim that DL is not applicable to NLP.

If perhaps you still aren’t convinced, then maybe Chris Manning himself should convince you:

Bio: Carlos Perez is a software developer presently writing a book on "Design Patterns for Deep Learning". This is where he sources his ideas for his blog posts.