Open Source, Data Science, Startups, and Life

Analytics Stack

With new analytics technologies coming out so fast, it’s hard to keep up with the best tool for the job. Take Berkeley’s Data Analytics Stack (BDAS), featuring Spark, Shark, and Mesos, for advanced analytics and mining. Should I use this or stick with Apache Hadoop, Hive, and Mahout? How do you decide? In my experience, this is the most common stack:

Configuration:

Hadoop: a distributed file system (HDFS) for data collection.

Database: HBase or Cassandra to enable random reads

Analysis: Hive, Pig, or Impala for advanced analysis

Real-Time: Storm or Spark

Visualization: Tableau Software or, if you have programmers, D3.js

Applications: Datameer, Alpine Data Labs, WibiData, Wise.io, others?

Infrastructure: On-premises or hosted?

Add-ons: Hue, Sqoop, and Flume.

Example of a possible configuration
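
One way to make the list above concrete is to write the layers down as data and pick one tool per layer. Here is a minimal Python sketch of that idea. The names STACK, ADD_ONS, and pick_configuration are made up for illustration and are not part of any of these tools, the layer keys are my own labels, and the applications layer is left out to keep it short.

```python
# Hypothetical layer -> candidate tools mapping, taken from the list above.
STACK = {
    "storage": ["Hadoop (HDFS)"],
    "database": ["HBase", "Cassandra"],
    "analysis": ["Hive", "Pig", "Impala"],
    "real_time": ["Storm", "Spark"],
    "visualization": ["Tableau Software", "D3.js"],
    "infrastructure": ["on-premises", "hosted"],
}

# Add-ons are not alternatives to each other, so they are kept as a plain list.
ADD_ONS = ["Hue", "Sqoop", "Flume"]


def pick_configuration(stack, preferences=None):
    """Pick one tool per layer: the preferred one if given, else the first listed."""
    preferences = preferences or {}
    config = {}
    for layer, candidates in stack.items():
        choice = preferences.get(layer, candidates[0])
        if choice not in candidates:
            raise ValueError(f"{choice!r} is not a listed option for {layer!r}")
        config[layer] = choice
    return config


# Example: prefer Cassandra and Storm, take the first option everywhere else.
config = pick_configuration(STACK, {"database": "Cassandra", "real_time": "Storm"})
for layer, tool in config.items():
    print(f"{layer}: {tool}")
print("add-ons: " + ", ".join(ADD_ONS))
```

Defaulting to the first candidate is an arbitrary choice for the sketch, not a recommendation of one tool over another.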

Is this generally what you see? Are there additional configurations I am missing? Feel free to leave a comment or contact me directly.