Big Data Is Less About Size, And More About Freedom

Editor’s note: Big Data has been around for a long time in credit card transactions, phone call records, and financial markets. Companies like AT&T, Visa, Bank of America, eBay, Google, Amazon and more have massive databases they mine for competitive advantage. But lately, Big Data is finding its way to the smallest startups. The Web and cloud computing bring Big Data everywhere. But what exactly is pushing Big Data forward?

We are in a Renaissance for computer science, engineering, and learning from data right now. The scale of data and computations is an important issue, but the data age is less about the raw size of your data, and more about the cool stuff you can do with it. Now that there is so much data, it is time to unlock its value. Really neat things are happening already, like the way the people of the world can educate themselves on all manner of issues and topics, or the way data and computing serve as leverage in other scientific and technical endeavors. There will be lots of amazing stuff on the web, but innovation will come in other domains as well.

The recent big data trend is about the democratization of large data more than its growth. In articles like the Economist’s recent piece on the data deluge, we hear about big data everywhere. We hear about what big data and the cloud mean for the enterprise, but enterprises have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data are also nothing new. It has been happening at least since the days of Yahoo! and Google, and they have done it without the data warehousing folks.

Now, focused early-stage startups can get up and running faster than ever. Less technical analysts at companies like Facebook and Twitter can access massive amounts of data easily. Even individuals can undertake cool projects with big data, as Pete Skomoroch of Data Wrangling did with trending topics for Wikipedia.

Why Now?

We do not have to build all our own hardware and software infrastructure anymore.

Pioneers such as Amazon have given us the cloud, where we have the capability to run very large server clusters at a low startup cost. Pioneers like Google have paved the way for open source projects like Hadoop and HBase, which are backed by big company contributors like Facebook.

The combination has paved the way for a new class of data-driven startups like Aardvark (just acquired by Google) and Factual. It has reduced both cost and time to market for these startups, as we showed with Flightcaster. And it has allowed startups that were not necessarily data driven to become more analytical as they evolved, such as Facebook, LinkedIn, Twitter, and many others.

So we have big data, the cloud, and open source facilitating new data-driven startups. I like to break this trend down from the technical perspective into three chunks: storing data, processing data, and learning from data. I define “learning from data” to mean data mining, AI, machine learning, statistics, and so on.

Supersize my data. Oh wait, I’ll just have a Medium.

The first time I heard the “Medium Data” idea was from Christophe Bisciglia and Todd Lipcon at Cloudera. I think the concept is great. Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times. For instance, a GigaOm article about big data in the cloud states:

What is becoming increasingly clear is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.

The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data.

The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem. How much data do you have and what are you trying to do with it? Do you need to do offline batch processing of huge amounts of data to compute statistics? Do you need all your data available online to back queries from a web application or a service API?

Once your data and its processing are large enough to require distributing the data and the work among machines across network boundaries, things get a lot harder. You have to deal with distributed computing and make tradeoffs like a real computer scientist.

Big Data & The Cloud: Viral Buzzwords 4.0!

The cloud and hosted services present very interesting opportunities. One of the greatest is that people can leverage the a la carte economics of elastic computing to do things that were prohibitively expensive when they had to build and maintain their own hardware infrastructure. The interesting parts about the current cloud are its lack of entrance friction and elastic cost efficiency, the speed with which new entrants can set up, and the elastic capability to run a 100-machine cluster for one hour if that is what is needed.
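To make the a la carte economics concrete, here is a minimal sketch of the arithmetic. The hourly rate is a made-up number, and the comparison assumes a job that parallelizes perfectly, which real workloads rarely do. The point is only that billing by machine-hour makes a short, wide burst cost about the same as a long, narrow run.

```python
def cluster_cost(machines: int, hours: float, rate_per_machine_hour: float) -> float:
    """Pay-as-you-go cost: billed per machine-hour used, not per machine owned."""
    return machines * hours * rate_per_machine_hour


# Hypothetical price, chosen only to keep the arithmetic easy to follow.
RATE = 0.10  # dollars per machine-hour (assumption, not a real quote)

# Same 100 machine-hours of work, two very different wall-clock times.
burst = cluster_cost(machines=100, hours=1, rate_per_machine_hour=RATE)
slow = cluster_cost(machines=1, hours=100, rate_per_machine_hour=RATE)

print(f"100 machines for 1 hour:  ${burst:.2f}")
print(f"1 machine for 100 hours: ${slow:.2f}")
```

Under these assumptions the bill is the same, but the answer arrives in an hour instead of four days. With owned hardware, the 100-machine option means buying 100 machines first.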

We started Flightcaster almost a year ago, and it is a good example of how startups can leverage cloud compute and storage resources, mix some open source like Hadoop with some data mining, and create interesting new technologies with relatively low capital upfront.

The cloud is not cheaper in general. Once people scale to a certain point, they move off the cloud onto dedicated hardware—not the other way around. That may change, and better hosted services may play a role in the transition, but that will take a while. In the meantime, the interesting part of the cloud is the use of elastic resources and the ability to get up and going quickly. The interesting part is the freedom it gives startups to try things they would never otherwise do.

Another notable thing about the cloud is the new architectures emerging as a result of economic and resource tradeoffs.

Storage of large amounts of data in the cloud is much cheaper with blobstores like Amazon S3 than it is to maintain an always-up cluster for a distributed datastore. If you do mostly offline batch processing and you do not need bulk storage to be online, then it is an attractive setup.

Storage and NoSQL

A Big Data stack…will also need to emerge before cloud computing will be broadly embraced by the enterprise. In many ways, this cloud stack has already been implemented, albeit in primitive form, at large-scale Internet data centers, which quickly encountered the scaling limitations of traditional SQL databases as the volume of data exploded. Instead, high-performance, scalable/distributed, object-orientated data stores are being developed internally and implemented at scale…large web properties have been building their own so-called “NoSQL” databases, also known as distributed, non-relational database systems (DNRDBMS).

There are several misguided points here. First, there is not going to be a big data or cloud stack. Distributed systems are about making tradeoffs, and the move is toward problem-specific solutions rather than one-size-fits-all stacks. Second, enterprises already have their solution: expensive data warehousing and consulting support. Will open source projects like Hadoop supported by people like Cloudera take a chunk of the business? Sure. But as I mentioned earlier, the most interesting part about big data and the cloud is not cheaper alternatives for the enterprise. It is the opportunities they facilitate for data-driven startups.

There is a lot of talk about the NoSQL movement. The big idea here is that distributed systems are hard, require tradeoffs, and sometimes we are better off with data storage and processing that are specific to what we are doing with the data. Sometimes even with a small amount of data on a single node, there are better alternatives to SQL queries and relational databases—time series data has long been a good example.

Processing and Hadoop: The Elephant In The Room

There is a broad range of needs for processing large amounts of data. These range from simple needs like calculations for log analysis that just need to occur at scale, to middle of the road needs like BI, to complex needs like scalable modern machine learning and retrieval systems.

There are different approaches one can use to service specific needs. Again, we see the pattern of moving away from one-size-fits-all stacks, and toward building for your needs. That said, there are very generic abstractions like Map-Reduce that work well for a lot of use cases. Distributed systems are hard to get right, so when something like Hadoop gets a lot of momentum, it retains that momentum until alternatives have the time to mature enough to solve the hard problems with fault tolerance, performance, and so forth. Not everyone is Leonardo da Vinci, so people should not attempt to create these systems on their own unless they really know what they are doing. In that sense, the cloud and big data are facilitators of open source.
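For readers who have not seen the Map-Reduce pattern, here is a minimal single-process sketch in Python. It is not Hadoop's API: the function names and the two-field log format are assumptions made for illustration. It only shows the shape of the abstraction (map each record to key/value pairs, group by key, reduce each group) using the log-analysis case mentioned above.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (status_code, 1) for one log line of the assumed form '<path> <status>'."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group values by key; a real framework does this across the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


logs = ["/home 200", "/about 200", "/missing 404"]  # toy input
counts = run_job(logs)
print(counts)
```

Because the map and reduce functions touch one record or one key at a time, a framework is free to run them on many machines at once. The hard parts the paragraph above mentions, such as fault tolerance and performance, live in the framework and not in these few lines.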

An important aspect of processing at scale is abstraction. Writing complex or even simple computations in raw Map-Reduce is verbose for programmers and intimidating for others who might want to play with the data. Abstractions over Map-Reduce like Pig and Hive make simple things easy, and abstractions like Cascading make hard things possible. The Map-Reduce paradigm, and Hadoop in particular, have been a big success. That said, Map-Reduce is not the only important piece of compute infrastructure. Message queues serve as the backbone of a lot of compute architectures; implementations of AMQP, such as RabbitMQ, are a prime example. You can accomplish a lot with producers, consumers, and a messaging system. Distributed storage and processing systems can also be very tricky to configure and deploy, requiring a pretty deep understanding of the system, hence the business case for folks like Cloudera.
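The producer/consumer idea can be sketched in a few lines. This example uses Python's in-process `queue.Queue` and threads as a stand-in for a real broker such as RabbitMQ. It does not use AMQP or any broker API, and the "work" (squaring a number) and the function names are placeholders chosen for illustration.

```python
import queue
import threading


def producer(q, items, n_consumers):
    """Publish every item, then one stop signal (None) per consumer."""
    for item in items:
        q.put(item)
    for _ in range(n_consumers):
        q.put(None)


def consumer(q, results, lock):
    """Pull messages until a stop signal arrives; squaring is placeholder work."""
    while True:
        item = q.get()
        if item is None:
            break
        with lock:
            results.append(item * item)


def run(items, n_consumers=2):
    q = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(q, results, lock))
        for _ in range(n_consumers)
    ]
    for worker in workers:
        worker.start()
    producer(q, items, n_consumers)
    for worker in workers:
        worker.join()
    return sorted(results)  # arrival order is not deterministic, so sort


squares = run(range(5))
print(squares)
```

The producer never needs to know how many consumers exist or how fast they are; the queue absorbs the difference. A networked broker plays the same role across machines, with durability and routing added on top.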

Learning from Big Data

The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it

Unfortunately for those of us working on these problems in real life, it is not so simple. The archetypal data-renaissance man is mathematician, statistician, computer scientist, machine learner, and engineer all rolled into one. There are opportunities where you can lack some of these skills and work with a team that supplements your weak points—a startup is not one of those.

Now that we can store so much data, it is attractive to do previously unimaginable things with it. We are sure to see cool applications in fields from the internet to biotechnology to nanotechnology and fundamental materials science research. Almost all advances in every field of science and technology are now heavily dependent upon data and computing. Machine learning is serving a fantastic role as a bridge between mathematical and statistical models and the worlds of AI, computer science, and software engineering. We are exploring applications in learning from text, social networks, data from scientific experiments, and any other data sources we can get our hands on.

The data renaissance does present some difficult issues. There are not many places one can receive a good education in working on these problems at large scale. Scaling our modeling and optimization algorithms is hard. We need to figure out how to partition and parallelize, or sometimes trade speed and scale for approximately correct calculations. Another issue is that we are often using simplistic models, albeit with pretty good results in many cases. We would like to move toward a deeper approximation of real intelligence.
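As a toy illustration of those two options, the sketch below computes a mean in two ways. The first partitions the data, computes a small summary per partition, and combines the summaries, which is the shape of a parallel computation even though here it runs in one process. The second trades exactness for speed by looking at only a random sample. The data, partition count, and sample size are arbitrary choices for the example.

```python
import random


def partition(data, n_parts):
    """Split data into n_parts interleaved chunks."""
    return [data[i::n_parts] for i in range(n_parts)]


def partial_stats(chunk):
    """Per-partition summary: small enough to ship between machines."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Merge per-partition (sum, count) summaries into the exact mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def sampled_mean(data, sample_size, seed=0):
    """Approximate mean from a random sample; cheaper, but only roughly right."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    return sum(sample) / len(sample)


data = list(range(1, 1001))  # toy dataset
exact = combine([partial_stats(chunk) for chunk in partition(data, 4)])
approx = sampled_mean(data, sample_size=100)
print(exact, approx)
```

A mean is easy because its summary (sum and count) combines cleanly. Many modeling and optimization algorithms have no such tidy decomposition, which is where the real difficulty lies.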

But the data renaissance is here. Be a part of it.
