This week felt like Christmas and my birthday at once! On Monday Google open sourced its deep learning framework, and on Friday Microsoft open sourced its distributed machine learning framework. At first I thought they were competing projects!

After skimming through both projects, I can say they follow different approaches to the problem of distributed machine learning. Microsoft's DMTK (Distributed Machine Learning Toolkit) is an approach to distributing large, high-dimensional data for model training. At first glance, it uses special sampling techniques to create and distribute training data throughout the cluster. The intention of DMTK is to provide a framework for building distributed algorithms on top, such as the word embedding algorithm included in the project. Right now it does not provide any samples or tutorials.

Google's TensorFlow is designed as a deep learning framework. Deep learning is a relatively new technology that has delivered very good results in handwriting, speech, and image recognition. A few years back, deep learning crushed the previous state of the art on the MNIST handwriting benchmark. Today deep learning powers speech recognition technologies like Siri, Cortana & Co. Ask your phone – it really works 😉

The tutorials and examples on tensorflow.org are very well written and insightful! The architecture of TensorFlow is smart: TensorFlow builds a graph of operations, and data is then fed through that graph. A similar concept is used by Apache Spark and Apache Flink for data processing. Right now TensorFlow only supports single-machine execution; Google plans to release a distributed version of TensorFlow that operates in clusters.
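The operations-graph idea can be illustrated without TensorFlow itself: you first describe the computation as a graph of nodes, and only later apply data to it. Here is a minimal pure-Python sketch of that concept (the class and names are my own illustration, not TensorFlow's API):

```python
# Minimal sketch of a dataflow/operations graph: nodes are defined
# first, and no computation happens until the graph is run with data.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def run(self, feed):
        # Placeholders look up their value in the feed dict;
        # operation nodes recursively evaluate their inputs.
        if self.op == "placeholder":
            return feed[self]
        args = [n.run(feed) for n in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError(self.op)

# Build the graph y = (a + b) * a; nothing is computed yet.
a = Node("placeholder")
b = Node("placeholder")
y = Node("mul", Node("add", a, b), a)

# Only now is data applied to the graph.
print(y.run({a: 3, b: 4}))  # (3 + 4) * 3 = 21
```

Because the graph exists before any data flows through it, an engine is free to optimize, parallelize, or (in the planned distributed version) partition it across machines.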

In my personal opinion, TensorFlow looks a lot like Google's Dataflow with a focus on deep learning. Dataflow competes with Spark. We know that Spark has issues with memory, and because it runs on the JVM it carries some CPU overhead. TensorFlow, on the other hand, is written in C++, which helps with memory usage and CPU utilization (and it supports GPUs). Where Spark uses mini-batches, TensorFlow uses tensors to transfer and process data. The difference is that tensors only support integers, floating point numbers, and strings (the primary data types for machine learning). My personal hypothesis is that TensorFlow was open sourced to define the standard for deep learning (the hottest topic in machine learning right now). If the standard is accepted, then Google's Cloud can provide the computation resources for production deployments of TensorFlow. Microsoft, not wanting to leave the market completely to Google, answered by open sourcing its own platform. 😉

The dynamics of Big Data have pushed the development of many new SQL-like engines on top of Hadoop in recent years. They were built for different applications and with different purposes in mind. This post summarizes the most popular engines currently available or in development, with both open source and commercial licenses.

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Hive-on-Tez

Hive-on-Tez uses the Tez application framework to optimize the execution of MapReduce jobs.

Hive-on-Spark

Hive-on-Spark enables Hive to run on top of Spark. (Still under development)

Apache Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Spark SQL

Apache Spark is a fast and general engine for large-scale data processing. Spark SQL is Spark’s module for working with structured data.

Cloudera Impala

Impala is a fully integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop – combining the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open source Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise.

Apache Drill

Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores. Users can query the data using standard SQL and BI tools without having to create and manage schemas. Some of the key features are:

Schema-free JSON document model similar to MongoDB and Elasticsearch

Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs

Extremely user and developer friendly

Pluggable architecture enables connectivity to multiple datastores

Apache Tajo

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities.

Apache Phoenix

Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
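The compile-SQL-to-scans idea can be sketched in a few lines. This is not Phoenix's actual code, just an illustration of turning a row-key range predicate into a scan that a key-ordered store like HBase could serve (all names here are hypothetical):

```python
# Sketch: turn "WHERE pk BETWEEN lo AND hi" into an HBase-style
# range scan over a sorted key space (purely illustrative).
def compile_range(lo, hi):
    # A "scan" here is just the (start, stop) key pair a region
    # server would receive.
    return {"start_row": lo, "stop_row": hi}

def run_scan(table, scan):
    # table: dict of row_key -> row, standing in for an HBase table.
    return [row for key, row in sorted(table.items())
            if scan["start_row"] <= key <= scan["stop_row"]]

table = {"k1": {"v": 10}, "k2": {"v": 20}, "k3": {"v": 30}}
scan = compile_range("k1", "k2")
print(run_scan(table, scan))  # rows k1 and k2
```

Because keys are stored sorted, a compiled range scan touches only the relevant slice of the table instead of every row, which is where the milliseconds-for-small-queries performance comes from.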

Phoenix-on-Spark

The phoenix-spark plugin extends Phoenix’s MapReduce support to allow Spark to load Phoenix tables as RDDs or DataFrames, and enables persisting them back to Phoenix.

Facebook Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Apache Flink

Apache Flink is an open source platform for scalable batch and stream data processing. It includes a Table API with a SQL-like expression language embedded in Java and Scala. (still in development)

Apache Kylin

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. It was originally contributed by eBay Inc.

Apache MRQL

MRQL (pronounced "miracle") is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, Spark, and Flink. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate queries in four modes:

in Map-Reduce mode using Apache Hadoop,

in BSP mode (Bulk Synchronous Parallel mode) using Apache Hama,

in Spark mode using Apache Spark, and

in Flink mode using Apache Flink.

IBM Big SQL

Big SQL leverages IBM’s strength in SQL engines to provide ANSI SQL access, via JDBC or ODBC, to data across any system, seamlessly, whether that data lives in Hadoop or a relational database. This means that developers familiar with the SQL language can access data in Hadoop without having to learn new languages or skills.

Pivotal HAWQ

HAWQ is an advanced enterprise SQL on Hadoop analytic engine built around a robust and high-performance massively-parallel processing (MPP) SQL framework evolved from the Pivotal Greenplum Database. HAWQ runs natively on Apache Hadoop clusters by tightly integrating with HDFS and YARN. HAWQ supports multiple Hadoop file formats such as Apache Parquet, native HDFS, and Apache Avro. HAWQ is configured and managed as a Hadoop service in Apache Ambari. HAWQ is 100% ANSI SQL compliant (supporting ANSI SQL-92, SQL-99, and SQL-2003, plus OLAP extensions) and supports open database connectivity (ODBC) and Java database connectivity (JDBC), as well. Most business intelligence, data analysis and data visualization tools work with HAWQ out of the box without the need for specialized drivers. (Proposal for Apache Incubator)

Microsoft PolyBase

PolyBase allows you to use T-SQL statements to access data stored in Hadoop or Azure Blob Storage and query it in an adhoc fashion. It also lets you query semi-structured data and join the results with relational data sets stored in SQL Server. PolyBase is optimized for data warehousing workloads and intended for analytical query scenarios.

Teradata Aster SQL-MapReduce

SQL-MapReduce is a framework created by Teradata Aster to allow developers to write powerful and highly expressive SQL-MapReduce functions in languages such as Java, C#, Python, C++, and R and push them into the discovery platform for high performance analytics. Analysts can then invoke SQL-MapReduce functions using standard SQL or R through Aster Database, the first discovery platform that allows applications to be fully embedded within the database engine to enable ultra-fast, deep analysis of massive data sets.

Oracle Big Data SQL

Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL, and the security of Oracle Database to all your data. It also includes a unique Smart Scan service that minimizes data movement and maximizes performance.

Cascading Lingual

Cascading Lingual is a powerful extension to Cascading that simplifies application development and integration by providing an ANSI SQL interface for Apache Hadoop. Now you can connect existing business intelligence (BI) tools, optimize computing costs, and accelerate application development with Hadoop.

RainStor

RainStor takes it a step further and provides an end-to-end application that runs on Hadoop 2.0, enabling organizations to get up and running quickly and achieve faster time to business value. RainStor has supported Hadoop since 2011 and was, in fact, the first to announce native SQL on Hadoop; building on these capabilities, it has also provided enterprise-grade security since mid-2013. Extending that, RainStor 6 delivers full management capabilities, including YARN integration, making it a first-class citizen on Hadoop.

Splout SQL

Splout allows serving an arbitrarily big dataset with high QPS rates while providing full SQL query syntax. Splout is appropriate for web page serving and many low-latency lookups, in scenarios such as mobile or web applications with demanding performance requirements.

This week Facebook open sourced a project called osquery, which offers the ability to access low-level operating system information through simple SQL queries (more precisely, SQL as understood by SQLite). More information on how to navigate the tables can be found on the GitHub page.
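Since osquery speaks the SQLite dialect, you can get a feel for its query language with Python's built-in sqlite3 module. The "processes" table below is a hand-populated stand-in for illustration, not osquery's real table or data:

```python
import sqlite3

# osquery exposes OS state as SQLite tables; here a plain in-memory
# table stands in for osquery's "processes" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processes (pid INTEGER, name TEXT)")
conn.executemany("INSERT INTO processes VALUES (?, ?)",
                 [(1, "init"), (42, "sshd"), (99, "bash")])

# The same SQLite dialect osquery understands:
rows = conn.execute(
    "SELECT pid, name FROM processes WHERE name LIKE 's%'"
).fetchall()
print(rows)  # [(42, 'sshd')]
```

In the real tool you would type the same SELECT into the osquery shell, and the rows would come from the live operating system instead of a hand-filled table.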

make deps will take care of installing everything you need to compile osquery.

If there are errors in your sources.list, make deps will fail and osquery will not be installed, because the required packages are not available. Therefore, make sure your package lists are up to date and that sudo apt-get update (and sudo apt-get upgrade) completes without errors. In case of errors, fix the file by editing it: sudo gedit /etc/apt/sources.list