Social, mobile, and application logs have been the three main contributors to Big Data. To handle such data, many stakeholders and experts are promoting a Data Lake architecture, in which the collected data is brought to a central platform where it is stored, processed, and made available for interactive analysis as well as for applications.

Recently, the Internet of Things (IoT) joined the party. It is clear that IoT would take the data explosion to the next level. As per one report, there will be 26 billion IoT devices installed by 2020. Wearable devices, sensor devices, home appliances, etc. would be producing huge amounts of machine data.

The synergy of IoT and Big Data is obvious. However, there is an incompatibility too, especially when it comes to applying the Data Lake architecture to IoT. IoT data would be structured data continuously streamed in, but not all data tuples would be interesting enough for long-term storage. The machine-generated data would not only be huge, but the utility of storing it centrally would be limited.

There are two types of processing that would commonly be done on IoT data:

1. Actionable processing: often done locally and immediately

2. Batch processing for patterns, aggregates, and learnings: done both locally and centrally

Local processing would be needed to identify anomalies or changes of state in the observables: for example, a machine failure, or, in the case of Google’s Nest, people leaving home, indicating that temperature control should be stopped or lights switched off. Another type of local processing is learning patterns from usage. However, there is not much value in bringing the entire data centrally to process it.

Hence, an IoT data processing architecture would need a combination of local ‘smart’ processors as well as a central Big Data processor. The data from all local sensors can be processed by local processors, which send aggregated data, insights and, in some cases, sample data to the central Big Data platform. The processors on the edges can collect data from local sensors, process it, learn from it and, if needed, act on the insights.

In this architecture, the central Big Data platform would receive processed data and some sample data from all local processors at some interval. The central platform would further do aggregation, pattern recognition, and machine learning on the data from all the sources. Such processing can not only identify and report local anomalies but can also detect and learn from patterns across the entire network. The central processing would be able to identify clusters of behavioral patterns and trends. Some of the insights identified centrally would also be useful for edge processing, particularly for comparing local behavior with global behavior.

Hence, an IoT architecture would involve ‘intelligent’ data processing both on local networks, where the data will be small, and centrally, where the data will be huge, not because the entire data is coming to it but because data would be coming from a large number of small networks. A key thing to remember is that local processing may work on small data, but it would still be intelligent processing. Google’s Nest is a classic example of such processing.
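To make the edge/central split concrete, here is a deliberately minimal, hypothetical sketch (all names invented): an edge processor acts on anomalies locally and immediately, and only a periodic summary plus a small sample travels to the central Big Data platform.

```python
# Hypothetical sketch of the edge/central architecture described above:
# the edge processor handles raw readings locally and forwards only
# aggregates and a small sample to the central platform.
from statistics import mean

class EdgeProcessor:
    def __init__(self, anomaly_threshold):
        self.anomaly_threshold = anomaly_threshold
        self.readings = []

    def ingest(self, value):
        """Process one sensor reading locally; act immediately on anomalies."""
        self.readings.append(value)
        if value > self.anomaly_threshold:
            return "anomaly"          # actionable processing: local, immediate
        return "ok"

    def summarize(self):
        """Periodic aggregate to be sent to the central Big Data platform."""
        summary = {
            "count": len(self.readings),
            "mean": mean(self.readings),
            "max": max(self.readings),
            "sample": self.readings[:3],   # a small sample, not the full stream
        }
        self.readings.clear()
        return summary

edge = EdgeProcessor(anomaly_threshold=90)
statuses = [edge.ingest(v) for v in [70, 72, 95, 68]]
report = edge.summarize()   # only this summary travels to the center
```

The point of the sketch is the asymmetry: the raw stream never leaves the edge, yet the local logic is still doing intelligent work.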

One of the main announcements at the Google I/O conference yesterday, other than wearables and Android L, was the availability of a streaming analytics service called Dataflow on Google Cloud. This service will enable application developers to quickly assemble a real-time, high-volume data ingestion and processing pipeline and make it scalable using Google Cloud infrastructure. Google Dataflow combines Google’s internal streaming engine, called MillWheel, with an easy-to-program big data processing abstraction called FlumeJava. Dataflow is compatible with Google BigQuery, so its output can be fed to BigQuery. Using this pipeline stack, building applications like real-time analytics and dashboards will be easier.
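This is not Dataflow’s actual API, but a framework-free sketch of the core idea such streaming pipelines automate: ingest a stream of events, group them into fixed time windows, and emit per-window aggregates. The event format here is hypothetical.

```python
# A minimal sketch of windowed aggregation, the kind of computation a
# streaming pipeline service automates at scale and in real time.
from collections import defaultdict

def windowed_counts(events, window_seconds):
    """events: iterable of (timestamp, key) pairs.
    Returns counts keyed by (window_start, key)."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)   # fixed (tumbling) windows
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "click"), (3, "click"), (7, "view"), (12, "click")]
result = windowed_counts(events, window_seconds=10)
# two windows: [0, 10) and [10, 20)
```

A real engine adds what this sketch omits: distribution across machines, out-of-order event handling, and continuous rather than batch evaluation.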

Google published papers on both of these technologies a few years back. However, Google had not open sourced them. As with other Google papers, these papers also triggered development of similar technologies outside Google. For example, Cloudera developed Crunch, which is based on the concepts from FlumeJava, and then open sourced it to Apache. There is another project called Puma based on the same concepts.

On the streaming side, various other open source projects have come up. Twitter has open sourced its stream processor called Storm. AMPLab’s (UC Berkeley) now-popular open source Apache Spark has streaming as one of its important components. Yahoo open sourced S4 to Apache, and recently LinkedIn also open sourced Samza, which is based on Kafka, a distributed messaging system that LinkedIn had earlier open sourced. Sometime later I will compare these streaming technologies on my texploration blog at wordpress.

However, though all these technologies aim to solve the same real-time stream processing problem that Google Dataflow is solving, none of them are cloud services. That is where Amazon comes into the picture. Google is fighting this war in the cloud.

Back in December 2013, Amazon made available an AWS cloud service called Kinesis for real-time streaming data processing. This service makes it quick to assemble applications that process the massive streaming data that mobile games or sensors generate. It is cost effective: just 2.8 cents per million records ingested! Of course, Amazon earns money not only from processing but also from storing and further processing and using the data. The service can be used for dashboards as well as real-time processing applications.
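At the quoted rate of 2.8 cents per million records, a back-of-the-envelope ingestion cost is easy to compute (ingestion pricing only; storage and downstream processing are billed separately):

```python
# Back-of-the-envelope Kinesis ingestion cost at the rate quoted above:
# 2.8 cents (USD 0.028) per million records.
COST_PER_MILLION_USD = 0.028

def ingestion_cost_usd(records):
    return records / 1_000_000 * COST_PER_MILLION_USD

daily = ingestion_cost_usd(1_000_000_000)  # a billion records a day: ~$28
```

That cheapness is the point: ingestion is nearly free, and the money is made on everything that happens to the data afterwards.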

Google is yet to price the service. I am not sure whether it will be based on the amount of data processed, but I expect it to be in the same range as Amazon’s.

However, the main obstacle both of them will face is that neither of these technologies is open source. Hence, it will be a one-way entry for customers using them. Vendor lock-in! The biggest competition to them would be application developers using Spark or Storm etc. on EC2 or Google Cloud. Metamarkets had a similar problem, which it tackled by open sourcing Druid (BTW, though Druid can be used for real-time processing and dashboards, its architecture and programming model are different from most of the above). I hope Google open sources MillWheel too.

Big Data has been a disruptive movement, and one of the worlds it has disrupted is storage. This disruption has created a big opportunity for innovation and profits in the storage industry, giving birth to many startups.

One obvious thing about Big Data is that big data needs big storage. And the performance of big data processing depends on data storage and data movement.

There are three well-discussed, key data-related aspects of big data:

1. Data is stored on commodity hardware, including commodity storage.

2. Data locality: one of the key architectural aspects is to move the processing code to the data instead of moving data to the computing node.

3. Replication-based fault tolerance: a typical replication factor is three, which results in a need for three times the storage space.

These three principles of Big Data have created challenges that needed to be met with innovation. Naturally, established vendors reacted with two approaches: in-house R&D innovation as well as acquisitions.

Typical problems associated with data storage and movement are size, access speed, the data movement pipe, etc. The size problem is dealt with by optimized compression and de-duplication. For example, to deal with the problem of data volume, Dell bought Ocarina, which does storage optimization with compression and de-duplication.
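To illustrate the de-duplication idea mentioned above, here is a toy, content-based sketch (not any vendor’s actual implementation): each unique chunk is stored once, keyed by its hash, and repeats become mere references.

```python
# Toy content-based de-duplication: identical chunks are detected by
# hashing their contents, so repeated data is stored only once.
import hashlib

def dedupe(chunks):
    store = {}          # hash -> chunk, each unique chunk stored once
    refs = []           # per-chunk references into the store
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
        refs.append(digest)
    return store, refs

chunks = [b"block-A", b"block-B", b"block-A", b"block-A"]
store, refs = dedupe(chunks)
# 4 logical chunks, but only 2 unique blocks physically stored
```

Production systems layer on chunk boundaries, compression, and reference counting, but the space saving comes from exactly this observation.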

Storage access performance is dealt with by faster media technologies such as SSD and flash. With the active interest in in-memory databases like SAP HANA, as well as computing frameworks like Spark that depend heavily on memory, there is active interest in SSD- and flash-based storage. Last month, May 2014, EMC acquired the privately funded DSSD, which makes rack-scale flash storage better suited for IO-intensive workloads like in-memory databases; EMC had invested in this startup early on. Some notable startups in the area of flash-based technologies are Nimble Storage, Amplidata, VelocityIO, Coraid, etc.

Many accelerators and faster access pipes/switches deal with the data movement issue. A couple of years back, NetApp bought CacheIQ, a NAS accelerator focused on caching. Last year, Violin Memory acquired GridIron, a flash-cache-based SAN accelerator.

To take this further, innovative startups like Nutanix, Tintri, etc. are providing software-defined (virtualized) storage. These startups are being quickly followed by the existing players. In March this year, VMware announced VSAN, a virtual SAN based on the principles of Software Defined Storage. Earlier, EMC had also acquired ScaleIO.

Hadoop’s HDFS itself has been enhanced to take advantage of this variety of storage types. The HDFS 2.3 release (April 2014) was significant in this respect: from this release, HDFS has support for in-memory caching and a heterogeneous storage hierarchy. We will continue to see innovation in storage technologies. Choosing the right combination, and configuring and managing it, will be an important task for big data deployments.

BTW, this post does not claim to be a comprehensive survey of storage technologies or startups. The names I mentioned are just examples to make a point. I have not covered all the players, and the names I mentioned are not necessarily preferred choices.

Big Data without algorithms is dumb data. Algorithms for machine learning, text processing, and data mining extract knowledge out of the data and make it smart data. These algorithms make the data consumable or actionable for humans and businesses. Such actionable data can drive business decisions or predict the products that customers are most likely to buy next. Amazon and Netflix are popular examples of how learnings from data can be used to influence customer decisions. Hence, machine learning algorithms are very important in the era of Big Data. BTW, in the field of Big Data, ‘machine learning’ is interpreted more broadly (than what machine learning professionals really mean by it) and includes pure statistical algorithms as well as other algorithms that are not based on ‘learning’.

Earlier today, on 16th June, Microsoft announced a preview of a machine learning service called AzureML on its Azure cloud platform. With this service, business analysts can easily apply machine learning algorithms, such as those for predictive analytics, to their data.

Machine learning itself has been popular for the last few years, and Microsoft has recognized the trend and jumped on it. When it comes to big players offering machine learning services on the cloud, Google pioneered this with its PredictionEngine cloud service a few years back.

Traditionally, data scientists use tools like Matlab, R, and Python (NumPy, SciKit, sklearn), among others, for analyzing data. Programmers use open source libraries like Apache Mahout and Weka for developing application services using machine learning algorithms. However, having machine learning algorithms is not good enough; scaling the machine learning algorithms to big data is very important.
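For readers who have not used these tools, here is a deliberately tiny, dependency-free sketch of the kind of predictive model such libraries and services expose: a 1-nearest-neighbor classifier. Real tools wrap far more sophisticated algorithms behind a similar fit/predict interface, and the churn data here is entirely made up.

```python
# A toy 1-nearest-neighbor classifier: predict the label of the training
# example closest (in squared Euclidean distance) to the query point.
def predict_1nn(train, point):
    """train: list of (features, label) pairs; returns the nearest label."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda ex: dist2(ex[0], point))
    return label

# Hypothetical churn data: (monthly_spend, support_tickets) -> outcome
train = [((10, 5), "churn"), ((80, 0), "stay"),
         ((75, 1), "stay"), ((12, 4), "churn")]
prediction = predict_1nn(train, (15, 3))
```

The scaling problem mentioned above is precisely that this naive loop over all training data stops being feasible once the data no longer fits on one machine.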

Last year, Cloudera did an acqui-hire of Myrrix and open sourced its machine learning on Hadoop as Oryx. Berkeley’s AMPLab has open sourced its big data machine learning work, called MLBase, in Apache Spark, an open source big data stack that is rapidly becoming popular.

The momentum in machine learning has already fueled a good amount of venture funding in this area.

Another startup, wise.io, raised $2.5 million from VCs led by Voyager Ventures. Wise.io makes it easy to predict customer behavior using machine learning.

AlpineLabs, which came out of EMC, raised a Series B last year from Sierra Ventures, Mission Ventures, and others. It provides a studio and an easy-to-assemble set of standard machine learning and analytics algorithms.

Oregon-based BigML raised $1.2 million last year to provide an easy-to-use machine learning cloud service.

There is an interesting machine learning project called Vowpal Wabbit that started at Yahoo and continued at Microsoft. Interestingly, though, instead of VW, Microsoft is making the R language and its algorithms available on the Azure cloud.

Anyway, the trend of making machine learning services easy to run on Big Data and in the cloud will continue. But having the tools and algorithms available is not enough to solve the problem. We need qualified people who understand which algorithms to use for which cases and how to use (parameterize) them. Moreover, what we really need are applications that use such algorithms to solve business problems without users even needing to understand the algorithms. In my opinion, what we will see in the future is vertical applications/services that abstract (use but hide) machine learning and prediction algorithms to serve domain-specific business needs.

Facebook today announced Presto, a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. Facebook developed Presto, even though Hive already provides a query interface over big data, because it needed lower-latency interactive queries over the web scale of data that Facebook has. Looking at the results, it seems that Facebook engineering has been able to accomplish this mission. Presto is deployed in three (maybe more) geographical regions, with a single cluster scaled to 1,000 nodes. It is being used by 1,000+ employees firing more than 30,000 queries (up 3,000 since June, when Presto was first revealed at the WebScale event) over a petabyte of data.

So, what does it do differently?

1. Optimized query planner and scheduler firing multiple stages

2. Stages are executed in parallel

3. Intermediate results are kept in memory, as opposed to being persisted on HDFS, thus saving IO cost

More importantly, in contrast with Hive, Presto does not use MapReduce for query processing.
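Points 2 and 3 above can be illustrated with a small, engine-agnostic sketch (this is not Presto’s code): chaining stages as generators keeps intermediate results flowing in memory, whereas MapReduce-style execution materializes each stage’s output to disk before the next stage starts.

```python
# Pipelined query stages sketched as Python generators: rows stream from
# scan through filter into the aggregate without any intermediate
# result ever being written out.
def scan(rows):
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row

def aggregate(rows):
    total = 0
    for row in rows:
        total += row["amount"]
    return total

rows = [{"region": "EU", "amount": 10},
        {"region": "US", "amount": 7},
        {"region": "EU", "amount": 5}]
# in a real engine the stages run in parallel across machines;
# here they simply pipeline one row at a time
result = aggregate(filter_stage(scan(rows), lambda r: r["region"] == "EU"))
```

Replace each generator hand-off with a write to and read from HDFS and you have, roughly, the IO cost that a MapReduce-based engine like Hive pays per stage.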

One thing I liked about Presto is that it is built on a pluggable architecture. It can work with Hive, Scrub, and potentially your own storage. That opens up a good opportunity for its adoption. Of course, we need to compare this with Impala from Cloudera.

At the WebScale conference at Facebook’s Menlo Park office back in June, it was mentioned that Presto would explore probabilistic sampling for quicker results at bounded error (a concept that BlinkDB implemented). I am not sure where that stands; however, BlinkDB already supports Presto in addition to Hive and Shark.

Facebook engineering revealed the design behind its humongous social graph, which handles “thousands of data types and handles over a billion read requests and millions of write requests every second.” It’s called TAO (The Associated Objects), a homegrown project that Facebook initiated in 2009 and that has been in production for a while. This project does the heavy lifting behind Facebook’s key value of connecting objects and putting them on the social graph. All the likes, comments, and connections are captured via TAO. Here is a paper explaining TAO in more detail.

The TAO service runs across a collection of server clusters geographically distributed and organized logically as a tree. Separate clusters are used for storing objects and associations persistently, and for caching them in RAM and Flash memory.

Client requests are always sent to caching clusters running TAO servers. In addition to satisfying most read requests from a write-through cache, TAO servers orchestrate the execution of writes and maintain cache consistency among all TAO clusters. We continue to use MySQL to manage persistent storage for TAO objects and associations.

The data set managed by TAO is partitioned into hundreds of thousands of shards. All objects and associations in the same shard are stored persistently in the same MySQL database, and are cached on the same set of servers in each caching cluster.

As per Facebook software engineer Mark Marchukov, who posted the blog: “There are two tiers of caching clusters in each geographical region. Clients talk to the first tier, called followers. If a cache miss occurs on the follower, the follower attempts to fill its cache from a second tier, called a leader. Leaders talk directly to a MySQL cluster in that region. All TAO writes go through followers to leaders. Caches are updated as the reply to a successful write propagates back down the chain of clusters. Leaders are responsible for maintaining cache consistency within a region. They also act as secondary caches, with an option to cache objects and associations in Flash. Last but not least, they provide an additional safety net to protect the persistent store during planned or unplanned outages.”
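The follower/leader read path in that quote can be sketched in a few lines (names and structure here are my own simplification, not TAO’s actual code): a follower cache is consulted first; on a miss it fills from the leader, which in turn reads MySQL.

```python
# Simplified two-tier read path: follower cache -> leader cache -> MySQL.
# Each tier fills its own cache as the value propagates back up.
class Tier:
    def __init__(self, name, backing):
        self.name = name
        self.cache = {}
        self.backing = backing    # next tier down, or None for the leader

    def get(self, key, database):
        if key in self.cache:
            return self.cache[key], self.name        # cache hit at this tier
        if self.backing is not None:
            value, _ = self.backing.get(key, database)
        else:
            value = database[key]                    # leader reads MySQL
        self.cache[key] = value                      # fill on the way back
        return value, self.name + ":filled"

mysql = {"user:42": {"name": "alice"}}
leader = Tier("leader", None)
follower = Tier("follower", leader)

first, where1 = follower.get("user:42", mysql)   # miss -> leader -> MySQL
second, where2 = follower.get("user:42", mysql)  # served from follower cache
```

The real system adds what matters at Facebook’s scale: write-through behavior, cross-region consistency, and the flash secondary caches the quote mentions.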