We’re on the cusp of deep learning for the masses. You can thank Google later

Google quietly did something revolutionary on Thursday. It open sourced a tool called word2vec, prepackaged deep-learning software designed to understand the relationships between words with no human guidance. Just feed it a textual data set and let the underlying predictive models get to work.

“This is a really, really, really big deal,” said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. “… It’s going to enable whole new classes of products that have never existed before.” Think of Siri on steroids, for starters, or perhaps emulators that could mimic your writing style down to the tone.

When deep learning works, it works great

To understand Howard’s excitement, let’s go back a few days. It was Monday and I was watching him give a presentation in Chicago about how deep learning was dominating the competition on Kaggle, the online platform where organizations present vexing predictive problems and data scientists compete to create the best models. Whenever someone has used a deep learning model to tackle one of the challenges, he told the room, it has performed better than any model previously devised for that specific problem.

But there’s a catch: deep learning is really hard. So far, only a handful of teams in hundreds of Kaggle competitions have used it. Most of them have included Geoffrey Hinton or have been associated with him.

Hinton is a University of Toronto professor who pioneered the use of deep learning for image recognition and is now a distinguished engineer at Google, as well. What got Google really interested in Hinton — at least to the point where it hired him — was his work in an image-recognition competition called ImageNet. For years the contest’s winners had been improving only incrementally on previous results, until Hinton and his team used deep learning to improve by an order of magnitude.

Neural networks: A way-simplified overview

Deep learning, Howard explained, is essentially a bigger, badder take on the neural network models that have been around for some time. It’s particularly useful for analyzing image, audio, text, genomic and other multidimensional data that doesn’t lend itself well to traditional machine learning techniques.

Neural networks work by analyzing inputs (e.g., words or images), recognizing the features that make them up and learning how those features relate to each other. With images, for example, a neural network model might recognize various formations of pixels, or intensities of pixels, as features.
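As a rough illustration of what that means numerically, here is a minimal forward pass through a one-hidden-layer network in Python. The weights are random stand-ins for learned features; this is a generic sketch, not any particular production system:

```python
import numpy as np

def forward(x, w_hidden, w_out):
    """One forward pass through a tiny two-layer network.

    Each hidden unit computes a weighted sum of the input features
    and squashes it with a sigmoid; the output layer combines the
    hidden activations into a single prediction between 0 and 1.
    """
    hidden = 1.0 / (1.0 + np.exp(-(x @ w_hidden)))  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-(hidden @ w_out)))

# Three input features, four hidden units, one output.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])
w_hidden = rng.normal(size=(3, 4))
w_out = rng.normal(size=4)
print(forward(x, w_hidden, w_out))  # a value between 0 and 1
```

Training amounts to nudging those weight matrices so the output matches the labels, which is where the real work (and computing power) goes.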

A very simple neural network. Source: Wikipedia Commons

Trained against a set of labeled data, the output of a neural network might be the classification of an input as a dog or a cat, for example. In cases where there is no labeled training data (an approach called self-taught, or unsupervised, learning), neural networks can identify the common features of their inputs and group similar inputs together, even though the models can’t predict what those inputs actually are. That’s how Google researchers constructed neural networks that were able to recognize cats and human faces without having been trained to do so.
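To make that label-free grouping concrete, here is a toy sketch. It uses a hand-rolled k-means loop on invented 2-d points rather than a neural network, but it shows the same idea: the algorithm is never told what the groups are, it just finds inputs that look alike:

```python
import numpy as np

def kmeans(points, k=2, iters=10, seed=0):
    """Group unlabeled points by proximity: nothing tells the model
    what the clusters *are*; it just pulls similar inputs together."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recenter.
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = np.argmin(dists, axis=1)
        centers = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    return labels

# Two obvious blobs; the algorithm separates them without labels.
points = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])
labels = kmeans(points)
print(labels)  # first two points share one label, last two the other
```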

Stacking neural networks to do deep learning

In deep learning, multiple neural networks are “stacked” on top of each other, or layered, to create models that are even better at prediction because each new layer learns from the ones before it. In Hinton’s approach, each layer randomly omits features during training — a process called “dropout” — to minimize the chances the model will overfit itself to the data it was trained on. Overfitting is a technical way of saying the model won’t work as well when analyzing new data.

So dropout or similar techniques are critical to helping deep learning models understand the real causality between the inputs and the outputs, Howard explained during a call on Thursday. It’s like looking at the same thing under the same lighting all the time versus looking at it in different lighting and from different angles. You’ll see new aspects and won’t see others, he said, “But the underlying structure is going to be the same each time.”
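A sketch of the dropout idea in its common “inverted dropout” form — a generic illustration, not Hinton’s or Google’s actual code:

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Randomly zero a fraction p_drop of activations during training.

    Surviving units are scaled up by 1/(1 - p_drop) so the expected
    activation stays the same; at test time the layer runs unchanged,
    so the model never leans too hard on any one feature.
    """
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(42)
h = np.ones(10)
print(dropout(h, 0.5, rng))  # roughly half zeros, the rest scaled to 2.0
```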

An example of what features a neural network might learn from images. Source: Hinton et al

Still, it’s difficult to create accurate models and to program them to run across the number of computing cores necessary to process them in a reasonable timeframe. It can also be difficult to train them on enough data to guarantee accuracy in an unsupervised environment. That’s why so much of the cutting-edge work in the field is still done by experts such as Hinton, Jeff Dean and Andrew Ng, all of whom had or still have strong ties to Google.

There are open source tools such as Theano and PyLearn2 that try to minimize the complexity, Howard told the audience on Monday, but a user-friendly, commercialized software package could be revolutionary. If data scientists in places outside Google could simply (a relative term if ever there was one) input their multidimensional data and train models to learn it, that could make other approaches to predictive modeling all but obsolete. It wouldn’t be inconceivable, Howard noted, that a software package like this could emerge within the next year.

Enter word2vec

Which brings us back to word2vec. Google calls it “an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words.” Those “architectures” are two new natural-language-processing techniques developed by Google researchers Tomas Mikolov, Ilya Sutskever and Quoc Le (Google Fellow Jeff Dean was also involved, although modestly, he told me). They’re like neural networks, only simpler, so they can be trained on larger data sets.
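To see how the two architectures differ, here is an illustrative sketch — my own toy code, not Google’s implementation — of the training examples each one extracts from a sentence:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each word is used to predict every word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context predicts the center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]
```

Skip-gram uses the center word to predict each context word one at a time; CBOW goes the other way, predicting the center word from its pooled context, which makes it cheaper per example.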

Kaggle’s Howard calls word2vec the “crown jewel” of natural language processing. “It’s the English language compressed down to a list of numbers,” he said.

Word2vec is designed to run on a system as small as a single multicore machine (Google tested its underlying techniques over days across more than 100 cores on its data center servers). Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they’re related to other words (e.g., countries and capitals). It’s able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and “linguistic regularities” (e.g., “vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’)”).
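Here is what that vector arithmetic looks like mechanically, using tiny made-up 3-dimensional vectors. Real word2vec vectors are learned from a corpus and have hundreds of dimensions, but the “closest vector wins” logic is the same:

```python
import numpy as np

# Toy "word vectors" invented purely for illustration.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.2, 0.1]),
    "woman": np.array([0.5, 0.2, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.4]),
}

def closest(target, words):
    """Return the word whose vector is most similar (by cosine) to target."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(words, key=lambda w: cos(vec[w], target))

# vector('king') - vector('man') + vector('woman') lands near 'queen'.
target = vec["king"] - vec["man"] + vec["woman"]
print(closest(target, ["queen", "apple", "man"]))  # queen
```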

Source: Google

Right now, the word2vec Google Code page notes, “The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences.”

This is accomplished by turning words into numbers that correlate with their characteristics, Howard said. Words that express positive sentiment, adjectives, nouns associated with sporting events — they’ll all have certain numbers in common based on how they’re used in the training data (so bigger data is better).

Smarter models mean smarter apps

If this is all too esoteric, think about these methods applied to auto-correct or word suggestions in text-messaging apps. Current methods for doing this might be as simple as suggesting words that are usually paired together, Howard explained, meaning a suggestion could be based solely on the word immediately before it. Using deep-learning-based approaches, a texting app could take the entire sentence into account, for example, because the app would have a better understanding of what all the words really mean in context.
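The baseline Howard describes is easy to sketch: a bigram model that counts, for each word, which word most often follows it in the training text. The code and training sentence below are toy illustrations, not any real keyboard app:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which. Suggestions depend only on the
    single preceding word -- the simple baseline described above."""
    follows = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def suggest(follows, prev):
    """Suggest the most frequent follower of prev, or None if unseen."""
    counts = follows.get(prev)
    return counts.most_common(1)[0][0] if counts else None

model = train_bigram("i am happy . i am here . you are happy .")
print(suggest(model, "am"))  # 'happy' (ties break by first occurrence)
```

A deep-learning approach would instead condition on a representation of the whole sentence, which is exactly the extra context this baseline throws away.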

Maybe you could average out all the numbers in a tweet, Howard suggested, and get a vector output that would accurately infer the sentiment, subject and level of formality of the tweet. Really, the possibilities are limited only to the types of applications people can think up to take advantage of word2vec’s deep understanding of natural language.
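Howard’s averaging trick is simple to sketch with hypothetical vectors; the words and dimensions below are invented for illustration, with the first dimension playing the role of a sentiment axis:

```python
import numpy as np

# Hypothetical 2-d word vectors, invented for this example only.
vec = {
    "great":   np.array([0.9, 0.1]),
    "game":    np.array([0.4, 0.8]),
    "awful":   np.array([-0.8, 0.1]),
    "tonight": np.array([0.0, 0.3]),
}

def tweet_vector(tweet):
    """Average the vectors of the known words in a tweet into one vector."""
    vectors = [vec[w] for w in tweet.lower().split() if w in vec]
    return np.mean(vectors, axis=0)

print(tweet_vector("great game tonight"))  # leans positive on the first axis
```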

An example output file from word2vec that has grouped similar words

The big caveat, however, is that researchers and industry data scientists still need to learn how to use word2vec. There hasn’t been a lot of research done on how best to use these types of models, Howard said, and the thousands of researchers working on other methods of natural language processing aren’t going to jump ship to Google’s tools overnight. Still, he believes the community will come around, and that word2vec and its underlying techniques could make all other approaches to natural language processing obsolete.

And this is just the start. A year from now, Howard predicts, deep learning will have surpassed a whole class of algorithms in other fields (i.e., things other than speech recognition, image recognition and natural language processing), and a year after that it will be integrated into all sorts of software packages. The only questions, and they’re admittedly big ones, are how smart deep learning models can get (and whether they’ll run into another era of hardware constraints like the one graphical processing units helped resolve earlier this millennium) and how accessible software packages like word2vec can make deep learning, even for relatively unsophisticated users.

“Maybe in 10 years’ time,” Howard proposed, “we’ll get to that next level.”

We are using an implementation of Google’s algorithms as a backup to our own simple text-message chat bot. The scope of the training for our BizTexter app is limited, but it seems to answer questions pretty well for our users.

It is interesting that Howard “believes the community will come around and word2vec and its underlying techniques could make all other approaches to natural language processing obsolete,” but doesn’t that contradict the no-free-lunch theorem? Or could it be that there are other alternatives to deep learning that perhaps no one has discovered yet?

I had a basic question on the code they released, perhaps you can point me in the right direction? I got it working, and poked around at the output, but at a fundamental level: It includes “both Continuous Bag-of-Words (CBOW) and the Skip-gram model” options when absorbing the text.

Do you know the difference? AKA: When to use one vs. the other?
Googling on the phrases returns lots of scholarly articles on either, but no consolidated “CBOW is for ___ while Skip-gram is for ____”

Although I cannot pretend to fully understand the scientific concepts at work here, I can certainly see why Google would be interested in them. Google already uses technology like this to recognise individual faces in a sea of photographs in Picasa. I am sure they’d love to use something similar to get better relationships between images in Google Images. But the commercial holy grail right now would be to use deep learning techniques to better understand and interpret the content of pages being crawled by their search engine – i.e., to be able to readily identify well written, authoritative content on a subject without having to rely so much on social signals. It also makes me wonder how interested organisations like Homeland Security might be in technology of this kind …

I think a lot of you are missing the point of this article, or you simply feel the need to advertise your own companies and products. He hasn’t said that deep learning was invented by google. He didn’t say that there aren’t any existing deep learning implementations. As a matter of fact, he mentioned two existing open source deep learning projects.

As Ian Goodfellow points out, there are actually quite a few open source software packages for deep learning. Personally, I released DeepLearnToolbox ( https://github.com/rasmusbergpalm/DeepLearnToolbox ) two years ago.
It’s nice to see the technology being picked up by the big players in the industry though.

The work going on at Google is great, but this is hardly the first time a deep learning tool has been open sourced. Plenty of open source deep learning tools have existed for years. An example of one that I have been involved with personally is pylearn2 from Yoshua Bengio’s lab: https://github.com/lisa-lab/pylearn2

Thanks for the comments, Elliot and Dave. Both of your products seem interesting; I’ll have to learn more about them. Regarding Google, I think it’s leading on the research front and probably will for a while, if only because it has so much data and computing power, and so many smart people (including, now, Geoff Hinton).

But, yeah, it’s definitely a good sign that we have services like yours starting to come online.

Yes, Google is definitely doing lots to legitimize deep learning–without their interest, many of the companies we’re talking to would not have known to even look there. Because neural networks have such a long and disappointing history, many ML researchers and practitioners have been leery of the idea of using them, even if the newer methods are significantly more advanced (and tons more effective) than what came before. But more and more, the proof that they work really well for certain very difficult problems is hard to deny.

What I think we’re going to see is Google lead the research charge for exactly the reasons you suggest, but it’s going to unfold something like the relational database did: IBM Research invents it in 1970, seven years later Oracle and a handful of other startups are founded with a focus on a practical RDBMS, fast forward to the present and it’s weird *not* to use a relational database to store structured data, and IBM is now only a small part of a much larger and more complex market. The difference this time is that it will be a bigger and faster change: the world is bigger and more globalized in every conceivable way than it was in the ’70s, and good ideas just travel faster these days, even in the enterprise.

This is actually a pretty good article about deep learning; I just wish it wasn’t such a Google love fest. Deep learning is nothing new (the “big breakthrough” was seven years ago, in 2006), and Google is hardly the only group and/or company doing amazing things with the technology.

Alchemy API, for instance, recently added deep learning techniques to their products. Ersatz (my company’s deep learning paas) is *entirely* dedicated to deep learning and has been in beta since January–we work with many types of data for many types of problems, not just nlp.

Deep learning is hard to do right, very powerful, and absolutely the way future “holy crap, they can do that?” technologies will be built–but google is hardly the only company doing serious work in this area.

This is definitely interesting — but the idea of training word embeddings isn’t really that new — and the idea certainly wasn’t invented at Google. R. Collobert released embeddings some 4+ years ago, and Richard Socher from Stanford released open source code for training word vector representations back in 2012 (available on his website) as did Eric H. Huang (available on GitHub). There have been word vectors available and tools to train them in the OSS community for years. T. Mikolov also released tools back in 2010.

There are lots of ways to train these sort of embeddings (feed-forward neural nets, recurrent nets, computationally “cheaper” approaches such as skip-bigram) and Google’s tool employs just a few of the available approaches (each has its own benefits and drawbacks). It can do skip bigram and CBOW, but better results can be achieved with things like RNNLMs (if you have the computational horsepower).

Deep learning is extremely exciting stuff — it’s revolutionizing performance in many areas of AI (speech recognition, natural language processing, computer vision, etc) but the “newness” of open source tools to train word embeddings was really some years back. As the author of this post rightfully pointed out, the real magic isn’t just in creating these word vectors, but how you actually use them. It’s true that just throwing these things into a SVM gives some interesting results (see work from J. Turian for example #s) but there are much more advanced techniques that give significantly more mileage.

Full disclosure: I’m the CEO of AlchemyAPI, a company that has been working in the deep learning field for some time (see our TechCrunch article from earlier this year) hence my inherent bias and opinionated nature on this subject :)

The “deep learning for the masses” headline still seems pretty spot on, though. Of course we stand on the shoulders of those before us, but if there’s been another case of easy-to-use deep learning tools pre-trained on 100 million articles, I’m ignorant of it. This is the kind of thing most companies would keep to themselves as a competitive advantage, and I can’t help but read a bit of that into the replies talking this down. In any case, we all agree that deep learning seems to be the way we’re heading.