Cloud

Today at Spark Summit, Databricks CEO Ion Stoica announced the company's first product, Databricks Cloud. This was one of two important announcements from Databricks today. The other was that Databricks raised a $33M Series B round of funding from Andreessen Horowitz. But IMO the availability of Databricks Cloud is the more significant of the two.

A few months back, I had the opportunity to have a teleconference with Ion. During the discussion, while talking about the business model, he put more emphasis on Databricks’s Spark certification service. He was (rightly) vague about other developments and the business model. After the call, I discussed with a colleague a possible product built around an IDE for programmers and data scientists and the monitoring of Spark clusters. But what we saw today was much better.

Databricks Cloud has put Apache Spark on the cloud. Big Data on the cloud is not new. However, what Databricks has provided is an interactive, SQL-based web tool that lets data scientists play with data and see the output visually in different forms such as trends and charts. It also provides a powerful WYSIWYG dashboard builder.

By making Spark and Spark Streaming available on the cloud, Databricks joins Google and Amazon, both of which have streaming services with an analytics stack available on the cloud for building real-time analytics and dashboards. However, the key difference is that Amazon and Google built those services for programmers who write streaming applications. In contrast, Databricks Cloud is better suited to data scientists.

Databricks Cloud has a web-based interactive tool called Databricks Notebook. Though details of the technology it is built on are not yet available (in fact, the Databricks website is still silent on the announcement), the concept and the look and feel are astonishingly similar to IPython’s Notebook. And the name is similar too!! Is it reusing the rich client from IPython’s Notebook? Of course, it also seems to differ in places. A few big differences: the Databricks Notebook heavily demonstrates the power of using SQL on data, and it also makes the power of machine learning available to data scientists. Anyway, I am looking forward to seeing more details about the service and its pricing.

Anyway, for the last couple of months, I had been exploring the business viability of Spark Analytics as a Service on the cloud. It just got killed! Good that it happened sooner rather than later 🙂

One of the main announcements at the Google I/O conference yesterday, other than wearables and Android L, was the availability of a streaming analytics service called Dataflow on Google Cloud. This service will enable application developers to quickly assemble a real-time, high-volume data ingestion and processing pipeline and scale it using Google Cloud infrastructure. Google Dataflow combines Google’s internal streaming engine, MillWheel, with an easy-to-program big data processing abstraction called FlumeJava. Dataflow is compatible with Google BigQuery, so the output from Dataflow can be fed to BigQuery. Using this pipeline stack, building applications such as real-time analytics and dashboards will be easier.
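
To make the "assemble a pipeline" idea concrete, here is a toy sketch in plain Python. It is not the Dataflow, MillWheel or FlumeJava API. The `Pipeline` class, the `keep` and `count_per_window` helpers and the `(timestamp, key)` event shape are all made up for illustration. It only shows the general style of chaining stages first and running them later, and it processes a small in-memory list rather than a real stream.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (timestamp in seconds, event name).
Event = Tuple[int, str]
Stage = Callable[[Iterator], Iterator]


class Pipeline:
    """Toy pipeline: stages are recorded by `apply` and executed only by `run`."""

    def __init__(self, source: Iterable[Event]) -> None:
        self._source = source
        self._stages: List[Stage] = []

    def apply(self, stage: Stage) -> "Pipeline":
        self._stages.append(stage)
        return self  # returning self allows chaining

    def run(self) -> list:
        data: Iterator = iter(self._source)
        for stage in self._stages:
            data = stage(data)
        return list(data)


def keep(predicate: Callable[[Event], bool]) -> Stage:
    """Stage that passes through only the events matching `predicate`."""
    def stage(events: Iterator) -> Iterator:
        return (e for e in events if predicate(e))
    return stage


def count_per_window(window_seconds: int) -> Stage:
    """Stage that counts events per (window start, event name)."""
    def stage(events: Iterator) -> Iterator:
        counts: Counter = Counter()
        for ts, name in events:
            window_start = ts - ts % window_seconds
            counts[(window_start, name)] += 1
        return iter(sorted(counts.items()))
    return stage


events = [(0, "click"), (12, "view"), (59, "click"), (61, "click")]
result = (
    Pipeline(events)
    .apply(keep(lambda e: e[1] == "click"))
    .apply(count_per_window(60))
    .run()
)
print(result)  # [((0, 'click'), 2), ((60, 'click'), 1)]
```

The per-window counts are the kind of output that a dashboard or a downstream query store would consume. A real streaming engine would also have to deal with unbounded input, late data and fault tolerance, none of which this sketch attempts.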

Google published papers on both of these technologies a few years back. However, Google has not open sourced them. As with other Google papers, these papers triggered the development of similar technologies outside Google. For example, Cloudera developed Crunch, which is based on the concepts from FlumeJava, and then contributed it to Apache. There is another project called Puma based on the same concepts.

On the streaming side, various other open source projects have come up. Twitter has open sourced its stream processor, Storm. Apache Spark, the now-popular open source project from AMPLab (UC Berkeley), has streaming as one of its important components. Yahoo open sourced S4 to Apache, and recently LinkedIn open sourced Samza, which is based on Kafka, a distributed messaging system that LinkedIn had open sourced earlier. Sometime later I will compare these streaming technologies on my texploration blog at WordPress.

However, though all these technologies aim to solve the same real-time stream processing problem that Google Dataflow is solving, none of them is a cloud service. That is where Amazon comes into the picture. Google is fighting this war in the cloud.

Back in December 2013, Amazon made available an AWS cloud service called Kinesis for real-time processing of streaming data. This service makes it quick to assemble an application that processes the massive streaming data generated by mobile games or sensors. It is cost effective: just 2.8 cents per million records to ingest! Of course, Amazon earns money not only from ingestion but also from storage and from further processing and use of the data. The service can be used for dashboards as well as real-time processing applications.
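
To get a feel for what 2.8 cents per million records means at volume, here is a small back-of-the-envelope calculation. It uses only the per-record rate quoted above. The function names and the example rate of 1,000 records per second are my own assumptions, and the estimate ignores any other charges (capacity, storage, downstream processing) that the service may apply.

```python
from decimal import Decimal

# Rate quoted in this post: 2.8 cents per million records.
CENTS_PER_MILLION_RECORDS = Decimal("2.8")


def ingestion_cost_usd(records: int) -> Decimal:
    """Per-record charge only, in US dollars, for the given number of records."""
    millions = Decimal(records) / Decimal(1_000_000)
    return millions * CENTS_PER_MILLION_RECORDS / Decimal(100)


def monthly_ingestion_cost_usd(records_per_second: int, days: int = 30) -> Decimal:
    """Per-record charge for a steady stream over `days` days."""
    total_records = records_per_second * 60 * 60 * 24 * days
    return ingestion_cost_usd(total_records)


# Assumed example workload: a steady 1,000 records per second for 30 days.
print(monthly_ingestion_cost_usd(1_000))  # 72.576
```

So a steady 1,000 records per second, about 2.6 billion records over 30 days, would cost roughly 73 dollars in per-record charges under these assumptions.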

Google is yet to price the service. I am not sure whether pricing will be based on the data being processed, but I am sure it will be in the same range as Amazon’s.

However, the main obstacle for both of them is that neither of these technologies is open source. Hence, it will be a one-way entry for the customers using them. Vendor lock-in! Hence, the biggest competition for them will be application developers using Spark, Storm, etc. on EC2 or Google Cloud. Metamarkets had a similar problem, which it tackled by open sourcing Druid (BTW, though Druid can be used for real-time processing and dashboards, its architecture and programming model are different from most of the above). I hope Google open sources MillWheel too.

Big Data without algorithms is dumb data. Algorithms such as machine learning, text processing and data mining extract knowledge from the data and make it smart data. These algorithms make the data consumable or actionable for humans and businesses. Such actionable data can drive business decisions or predict the products that customers are most likely to buy next. Amazon and Netflix are popular examples of how learnings from data can be used to influence customer decisions. Hence, machine learning algorithms are very important in the era of Big Data. BTW, in the field of Big Data, ‘machine learning’ is understood more broadly (than what machine learning professionals really mean by it) and includes purely statistical algorithms as well as other algorithms that are not based on ‘learning’.

Earlier today, on 16th June, Microsoft announced a preview of a machine learning service called AzureML on its Azure cloud platform. With this service, business analysts can easily apply machine learning algorithms, such as those used in predictive analytics, to their data.

Machine learning itself has been popular for the last few years. Microsoft has recognized the trend and jumped on it. Among the big players offering machine learning services on the cloud, Google was the pioneer, with its Prediction API launched as a cloud service a few years back.

Traditionally, data scientists use tools such as Matlab, R and Python (NumPy, scikit-learn) for analyzing data. Programmers use open source libraries such as Apache Mahout and Weka for developing application services that use machine learning algorithms. However, having machine learning algorithms is not enough; scaling them to big data is very important.

Last year Cloudera acqui-hired Myrrix and open sourced its machine learning on Hadoop as Oryx. Berkeley’s AMPLab has open sourced its Big Data machine learning work, called MLbase, in Apache Spark, an open source big data stack that is rapidly becoming popular.

The momentum in machine learning has already fueled a good amount of venture funding in this area.

Another startup, wise.io, raised $2.5 million from VCs led by Voyager Ventures. Wise.io makes it easy to predict customer behavior using machine learning.

AlpineLabs, which came out of EMC, raised a Series B round last year from Sierra Ventures, Mission Ventures and others. It provides a studio and an easy-to-assemble set of standard machine learning and analytics algorithms.

Oregon-based BigML raised $1.2 million last year to provide an easy-to-use machine learning cloud service.

There is an interesting machine learning project called Vowpal Wabbit (VW) that started at Yahoo and continued at Microsoft. Interestingly, however, instead of VW, Microsoft is making the R language and its algorithms available on the Azure cloud.

Anyway, the trend of making machine learning services easy to run on Big Data and on the cloud will continue. But having the tools and algorithms available will not be enough to solve the problem. We need qualified people who understand which algorithms to use for which cases and how to use (parameterize) them. Moreover, what we really need is applications that use such algorithms to solve business problems without users even needing to understand the algorithms. In my opinion, what we will see in the future is vertical applications and services that abstract (use but hide) machine learning or prediction algorithms to serve domain-specific business needs.

The Global Indian Technology Professionals Association (GITPRO) is hosting a conference on “Emerging Technologies and Opportunities for Professionals and Entrepreneurs” on 18th Feb 2012 in Palo Alto, CA. With three parallel tracks focused on Technology, Career & Leadership and Startups, this conference is well suited to everybody from technologists to entrepreneurs.

Steve Blank, the iconic serial entrepreneur and entrepreneurship coach at Stanford University, and Anand Deshpande, the CEO of Persistent Systems, will each deliver a keynote at the conference.

The Technology track is full of experts on Big Data, Hadoop, Cloud, Mobile and Social Computing. They come from Greenplum, Cloudera, HortonWorks, Microsoft, IBM, ThisMoment, AdMaxim, and GloMantra.

The Startup Bootcamp at GITPRO World 2012 will cover everything an entrepreneur should know, from launching a startup to a successful exit. Successful startup entrepreneurs, VCs, and sales and marketing executives will be guiding aspiring entrepreneurs.

GITPRO World 2012 has sessions focused specifically on career and leadership topics, covering aspects such as managing with influence, evolving from an individual role to manager and leader, mid-career accelerators and mid-career switches, and job opportunities in 2012.

This event will provide a unique opportunity for Indian technology professionals to network with fellow professionals, technical experts, industry leaders, entrepreneurs, career coaches, and VCs.

GITPRO is a global networking platform for Indian technology professionals, supporting their professional and self-development and their contribution back to the profession, society, and people of the US and India. GITPRO, started in 2009, has chapters in Silicon Valley, Contra Costa Valley, Seattle, DC and Denver in the US, and in Bangalore, Hyderabad and Pune in India.