5 technologies that will help big data cross the chasm

We’re on the cusp of a real turning point for big data. Its applications are becoming clearer, its tools are getting easier and its architectures are maturing in a hurry. It’s no longer just about log files, clickstreams and tweets. It’s not just about Hadoop and what’s possible (or not) with MapReduce.

With each passing day, big data is becoming more about creativity — if someone can think of an application, they can probably build it. That makes the concept of big data a lot more tangible and a lot more useful to a lot more companies, and it makes the market for big data a lot more lucrative.

Here are five technologies helping spur a shift in thinking from “Why would I want to use some technology that Yahoo built? And how?” to “We have a problem that needs solving. Let’s find the right tool to solve it.”

Apache Spark

When it comes to open source big data projects, they don’t get much hotter than Apache Spark. The data-processing framework is garnering a lot of users and a lot of support — including from Hadoop vendors MapR and Cloudera — because it promises to be almost everything for Hadoop deployments (arguably the foundation of most enterprise big data environments) that MapReduce wasn’t. It’s fast, it’s easy to program and it’s flexible.
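To give a flavor of that simplicity, here is a minimal PySpark sketch (the input file name is hypothetical) of a word count, the canonical MapReduce example, expressed in a handful of lines:

```python
# A minimal word count in PySpark (input path is hypothetical). The same
# job in classic Java MapReduce typically takes dozens of lines of code.
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")  # local mode, for illustration

counts = (sc.textFile("logs.txt")                # read the (hypothetical) input
            .flatMap(lambda line: line.split())  # split each line into words
            .map(lambda word: (word, 1))         # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # sum the counts per word

print(counts.take(10))  # peek at ten (word, count) pairs
sc.stop()
```

The same API also handles iterative jobs such as machine learning and graph processing, which are painful to express as chained MapReduce passes; that flexibility is a big part of Spark’s appeal.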

Sensors

A lot of talk about sensors focuses on the volume and speed at which they generate data, but what’s often ignored are the strategic decisions that go into choosing the right sensors to gather the right data. If there are real-world measurements that need to be taken, or events that need to be logged, there’s probably a fairly inexpensive sensor available to do the job. Sensors are integral to smarter cars, of course, but also to everything from agriculture to hospital sanitation.

And if there’s not a usable sensor commercially available, it’s not inconceivable to build one from scratch. A team of university researchers, for example, built an inexpensive sensor that measures the wing speed of insects using a cheap laser pointer and a digital recorder. It helped them capture more, and better, data than previous researchers, resulting in a significantly more accurate model for classifying bugs.

The setup used to measure the insects’ data.
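To make the idea concrete, here is a toy sketch (not the researchers’ actual pipeline; the species names and frequencies are invented) of how a recorded signal could be reduced to a wing-beat frequency and matched to a species:

```python
# Toy illustration of wing-beat classification (not the actual research code).
# Find the dominant frequency in a recording with an FFT, then pick the
# species whose known wing-beat frequency is closest. Values are invented.
import numpy as np

SAMPLE_RATE = 8000  # Hz; assumed recorder sample rate

def wingbeat_frequency(signal):
    """Return the strongest frequency component of a 1-D recording."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
    return freqs[spectrum.argmax()]

# Hypothetical reference wing-beat frequencies (Hz) per species.
SPECIES = {"housefly": 190.0, "honeybee": 230.0, "mosquito": 600.0}

def classify(signal):
    f = wingbeat_frequency(signal)
    return min(SPECIES, key=lambda s: abs(SPECIES[s] - f)), f

# Demo: a synthetic 600 Hz tone should classify as a mosquito.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of samples
print(classify(np.sin(2 * np.pi * 600 * t)))
```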

That type of creativity highlights what’s possible thanks to the convergence of sensors, consumer electronics, big data and, presumably, the maker movement and 3-D printing. If more, different and better data leads to better analysis, it’s easier than ever to collect that data yourself rather than wait for someone else to do it.

Artificial intelligence

Thanks to the proliferation of data in the form of photos, videos, speech and text, there’s now an incredible amount of effort going into building algorithms and systems that can help computers understand those inputs. From a big data perspective, the interesting thing about these approaches — whether they’re called deep learning, cognitive computing or some other flavor of artificial intelligence — is that they’re not yet really about analytics in the same way so many other big data projects are.

AI researchers aren’t so much concerned — yet — with uncovering trends or finding the needle in the haystack as they are with automating tasks that humans can already do. The big difference, of course, is that, done right, these systems can perform tasks such as object or facial recognition, or text analysis, much faster and at a much greater scale than humans can. As they get more accurate and require less training, these systems could power everything from intelligent ad platforms to much smarter self-driving cars.

Remarkably, the techniques for doing all this stuff are being democratized at a rapid clip and will soon be accessible to a lot more people via software, open source libraries and even APIs. Google and Facebook are spending hundreds of millions of dollars advancing the state of the art in AI, but anyone brave enough to give it a whirl can get their hands on similar capabilities for very little money, if not for free.
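As a small illustration of that accessibility, here is a sketch using scikit-learn, a free open source library: a handwritten-digit recognizer built from a small neural network (a toy model, not a state-of-the-art deep network):

```python
# A handwritten-digit recognizer in a few lines with scikit-learn, as an
# example of how accessible trained models have become. This is a small
# multi-layer perceptron, not a state-of-the-art deep network.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()  # 8x8 grayscale digit images bundled with sklearn
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
model.fit(X_train, y_train)                      # train on 75% of the images
print("accuracy:", model.score(X_test, y_test))  # evaluate on the rest
```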

Quantum computing

Commercial quantum computing is still a way off, but we can already see what might be possible when it arrives. According to D-Wave Systems, the company that has sold prototype versions of its quantum computer to Google, NASA and Lockheed Martin, the machines are particularly good at advanced machine learning tasks and difficult optimization problems. Google is testing computer vision algorithms that could eventually run on smartphones; Lockheed is trying to improve software validation for flight systems.

It’s powerful stuff that could help companies of all stripes tackle some difficult computing and analytics tasks that today’s most advanced systems and techniques can’t. Or, at least, quantum computing should be able to solve those problems faster and more efficiently.
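For a sense of what those problems look like, D-Wave’s machines minimize QUBO (quadratic unconstrained binary optimization) objectives. The sketch below brute-forces a tiny, made-up QUBO classically; an annealer would instead sample low-energy assignments of the same objective, which is where the hoped-for speedup comes in:

```python
# Brute-force a tiny QUBO (quadratic unconstrained binary optimization)
# problem, the kind of objective D-Wave-style annealers minimize. The
# coefficients here are invented for illustration.
from itertools import product

# Q[(i, j)] couples binary variables x_i and x_j (diagonal terms are biases).
Q = {(0, 0): -1.0, (1, 1): -1.0, (2, 2): -1.0,
     (0, 1):  2.0, (1, 2):  2.0}

def energy(x):
    """Objective value: sum of Q[i, j] * x_i * x_j over all couplings."""
    return sum(coeff * x[i] * x[j] for (i, j), coeff in Q.items())

# Classical brute force over all 2^3 assignments; an annealer samples instead.
best = min(product([0, 1], repeat=3), key=energy)
print(best, energy(best))  # lowest-energy assignment: (1, 0, 1) -> -2.0
```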

Before that can happen, though, mainstream businesses will need access to quantum resources and some knowledge of how to use them. D-Wave is vowing to make the resources available via the cloud, and is working on compilers to simplify the programming aspect. There’s a lot of ground to cover before that happens, but the technology is moving fast, and quantum computer instances delivered via the Amazon Web Services or Google clouds aren’t out of the realm of possibility.

Great list, and Spark definitely belongs here. I think Spark will become a pretty useful hammer in everyone’s big data toolbox, and the adoption of YARN will help everyone get there. I think that it’ll get way beyond the current ML platforms into interactive queries and even user-facing workloads. Once it’s on the same cluster, sharing the CPU, RAM, disk, and network as MapReduce/HBase/others, businesses will have to make a comprehensive performance management plan for all of these components.

I thought that what was developed last year in machine learning and AI was neat, but this year and the next should see even more amazing possibilities through the use of Big Data.

Things like what the Allen Institute has supported: a program that can accurately match diagrams to supporting text in geometry problems; a program that learns EVERYTHING about images (and videos), all unsupervised (needs lots of data!) — concepts, parts, actions, you name it; and various programs that dramatically improve question-answering technology and can even handle informal and malformed questions. What they’re funding for this coming year looks even more exciting, and includes deep machine reading and commonsense reasoning.

Then there are the upcoming talks at ICML, one of which is titled “Distributed Representations of Sentences and Documents,” by Quoc Le and Tomas Mikolov. Get that? “Sentences and documents,” not “words and phrases.”

There are the neural nets being developed by people at Oxford that can already beat Socher’s sentiment classifier. There are several papers about this.

There’s the work of Tom Mitchell and others on fusing brain-scan data with data generated from large corpora to produce better word vectors. It appears that adding just a dash of brain data dramatically improves the quality of word vector representations, and allows the embedding dimension to be much larger without degradation in performance.

There’s the recent work of Zweig on how to set up word vectors to handle antonyms (by using thesaurus data).

There’s also a computer vision program that can learn “everything about anything.” It’s unsupervised, and isn’t up to the level of the best supervised algorithms; on the other hand, it’s very general.

One could imagine many uses for this, given enough brain data; for example, maybe it would improve parsers (as in a paper by Baroni), sentiment classifiers (by initializing models with brain+text word vectors), language translation, etc.

It will be interesting to see the outcome of the CoNLL “shared task” on unrestricted grammar correction. The task: you feed in an essay, and the computer returns a corrected version, taking into account any type of error. Think of what that could mean for foreign students… or even for people who don’t write well.

I love the technology explosion and the fountain of ideas it is unleashing, but given the amount of disruption and make-it-yourself going on, we are a long, long way from even reaching the chasm, much less crossing it. This is still the Early Market, a time for visionary sponsors to underwrite game-changing projects to garner dramatic competitive advantage.

Well, far be it from me to disagree with the guy who wrote the book on crossing chasms ;-) I don’t know the exact time frame (commercial quantum computing, for example, is a few years off, at least), but I do think the applications for big data are becoming much clearer, and the technologies are easier to consume and able to do more of what mainstream companies will expect. As the tech gets commercialized, companies will be able to move from idea to pilot pretty easily.

Reblogged this on Timothy Dockins and commented:
I am positioning myself, it appears, on the forefront of the application of AI to predictive analytics. I have a decent academic background in machine learning: I started focusing on AI in my undergrad and continued it in graduate school. I’m currently researching at the forefront of transfer learning, where we try to leverage past learning to improve learning performance on new tasks, all in an effort to learn more, faster. Part of that effort delves into automatically discovering new features in data that aren’t readily apparent. I’ve been looking at Deep Belief Networks, Convolutional Neural Nets, Sparse Coding, and now Restricted Boltzmann Machines. These tools, combined with something like Apache Spark, could really dig deep into some Big Data!