Big Data, an ecosystem

The Big Data ecosystem consists of multiple products, each of them having some or all of the characteristics we evoked in our definitions. Not only will they handle high volumes of data (Volume), but they will also allow you to process data without a specific schema (Variety) and to do so quickly (Velocity). What follows is a list of products with a short description of what each of them does.

Management layer

Apache Ambari

A Hadoop cluster management server, with a web GUI to set up HDP or HDF.

Ambari is an open source project from the Apache Foundation. You can download it for free from the main web site, http://ambari.apache.org. But Ambari is also the core of the Big Data platforms delivered by the company HortonWorks (http://www.hortonworks.com). If you install an Ambari Server instance, whether you get the software from HortonWorks or from the Apache Foundation web site, you are setting up the same thing. HortonWorks participates in the development of Ambari and of all the software installed and managed through it.
With Ambari, you can install a Hadoop cluster as a HortonWorks Data Platform (HDP), or set up a HortonWorks DataFlow (HDF), the platform built around Apache NiFi.
Most of the software solutions listed hereafter can be installed and managed by Ambari (HDFS, YARN, Zookeeper, Sqoop, MapReduce, Mahout, TEZ, Pig, Oozie, Hive, HBase, SOLR, NiFi, Kafka, Storm, Ranger, Flume, ...).
Note that the company HortonWorks is one of the major contributors to the various Hadoop open source projects.

Data storage layer

Tools to help you store your (big) data.
More information about this layer, especially about the NoSQL and indexing data stores enumerated here, is given in the section about NoSQL data stores.

Apache Hadoop Distributed File System (HDFS)

Core of the Hadoop framework with YARN and MapReduce

A distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
It is one of the four core components of the Hadoop cluster:

Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

HDFS – the distributed file system described here;

Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for the scheduling of users' applications;

Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing, based on YARN.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
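As an illustration, here is a minimal sketch of putting a file into HDFS from Python by wrapping the standard hdfs dfs command-line client. It assumes a configured Hadoop client on the machine; the paths and the sales.csv file are hypothetical.

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' sub-command and return its output."""
        result = subprocess.run(["hdfs", "dfs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    hdfs("-mkdir", "-p", "/data/raw")        # create a directory in the cluster
    hdfs("-put", "sales.csv", "/data/raw/")  # copy a local file into HDFS
    print(hdfs("-ls", "/data/raw"))          # list the directory content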

A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data. (Wikipedia)
A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load. (Wikipedia)

MongoDB

Scalable NoSQL database designed for OLTP workloads

The disproportionate success of MongoDB is largely based on its innovation as a data structure store that lets us more easily and expressively model the "things" at the heart of our applications…
Having the same basic data model in our code and in the database is the superior method for most use cases, as it dramatically simplifies the task of application development and eliminates the layers of complex mapping code that are otherwise required.

Easy out-of-the-box experience

Can take advantage of cloud infrastructure (to easily deploy new nodes for scaling out)

Document-based, not a pure key-value store

Automatic backup capabilities

Replication

Designed for OLTP workloads, not for complex transactions

Support for automatic sharding

Online Transaction Processing (OLTP) applications are high-throughput, insert- or update-intensive database applications. These applications are used concurrently by hundreds of users. (Wikipedia)
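A minimal sketch with the pymongo client, assuming a MongoDB server on localhost; the shop database, the orders collection and the documents are hypothetical. It shows the document model at work: the structure stored is the one used in the code, nested values included.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # hypothetical server
    db = client["shop"]                                # hypothetical database

    # The stored document mirrors the structure used in the application code:
    db.orders.insert_one({
        "customer": "Alice",
        "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
        "total": 34.50,
    })

    # Query on a nested field, with no mapping layer in between:
    for order in db.orders.find({"items.sku": "A-1"}):
        print(order["customer"], order["total"])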

Elastic (Elasticsearch)

A tool to build a distributed search engine, using Apache Lucene for the indexing, with a REST API and the JSON data format so that it can be used from any programming language.
It is also multi-tenant, near real-time and able to do full-text search.
It is able to search any kind of document.
It is a distributed search engine, but it lacks distributed transactions.
As such, Elastic itself is also an ecosystem built around the distributed search engine.
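Because the interface is REST plus JSON, any language with an HTTP client can use it. A minimal sketch in Python, assuming a recent Elasticsearch node on localhost; the articles index and the documents are hypothetical.

    import requests

    ES = "http://localhost:9200"  # hypothetical Elasticsearch node

    # Index a JSON document (refresh=true makes it searchable immediately):
    requests.put(f"{ES}/articles/_doc/1",
                 params={"refresh": "true"},
                 json={"title": "Big Data, an ecosystem",
                       "body": "The Big Data ecosystem consists of multiple products."})

    # Full-text search through the same REST/JSON interface:
    hits = requests.post(f"{ES}/articles/_search",
                         json={"query": {"match": {"body": "ecosystem"}}}).json()
    print(hits["hits"]["hits"])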

Analytic layer

Tools that perform the analytic work your business applications and users require.

Anaconda

A framework to build, distribute and execute applications performing analytics and visualization, allowing you to run any kind of analysis on multiple data sets across a variety of systems.

It is based on Python, and commercial support is available.
It targets batch, interactive and real-time processing of huge amounts of data.
It looks more like a framework to develop applications, or bundles of applications, that need to be deployed somewhere to be executed, such as on a Hadoop cluster. It is filled with a lot of "little" components that can help you build everything you need.
It includes various packages based on Python or the R language; it even integrates Elasticsearch. RStudio can be used, and GIS data can be ingested.
It helps you develop, build, version, deploy and run your code. Because it integrates and includes a lot of various other projects, it can be used for almost everything, from the deployment and management of a Hadoop cluster to the execution of applications performing complex data analytics and visualization, offering interfaces to dig into the Big Data.

Apache Drill

Apache Drill is a low-latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data.
For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop (a sketch follows the list below).
You can thus query different datastores simultaneously and join their results into the same result set:

MongoDB

HDFS

Hive

MapR-DB & MapR-FS

HBase

Amazon S3

Microsoft Azure Blob Storage

Google Cloud Storage

OpenStack Swift

NAS storage

Local file storage

Drill provides a REST API to integrate with any application or programming language.
On top of Drill, you can put your own business, analyst or data science tool, like MicroStrategy, SAS, D3, Excel, …, to interact with Drill's underlying non-SQL datastores.
It runs as a single application on your laptop, or on a multi-node cluster for a distributed implementation of the engine.
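As a sketch of the join idea above, the following Python snippet submits one SQL statement to Drill's REST API (default port 8047) that spans a MongoDB collection and files on HDFS. The storage plugin names (mongo, dfs), the paths and the columns are assumptions.

    import requests

    DRILL = "http://localhost:8047"  # hypothetical Drill node, default port

    # One SQL statement spanning two datastores; the plugin names and paths
    # (mongo.app.users, dfs./logs/events) are illustrative assumptions.
    sql = """
    SELECT u.name, COUNT(*) AS events
    FROM mongo.app.`users` u
    JOIN dfs.`/logs/events` e ON e.user_id = u._id
    GROUP BY u.name
    """

    resp = requests.post(f"{DRILL}/query.json",
                         json={"queryType": "SQL", "query": sql})
    for row in resp.json()["rows"]:
        print(row)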

Apache Hive

Pseudo-SQL engine to give structure to unstructured content

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL.
At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
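A minimal sketch using the third-party PyHive client, assuming a running HiveServer2 on its default port; the username, the weblogs table and the HDFS location are hypothetical. It shows the "structure projection" idea: an external table is declared over raw files already in HDFS, then queried in HiveQL.

    from pyhive import hive  # third-party client for HiveServer2

    conn = hive.Connection(host="localhost", port=10000, username="etl")
    cur = conn.cursor()

    # Project a structure onto raw, tab-separated files already in HDFS:
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
            ip STRING, ts STRING, url STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION '/data/raw/weblogs'
    """)

    # Then query those files with plain HiveQL:
    cur.execute("SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url")
    print(cur.fetchall())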

Apache Phoenix

A pure SQL layer on top of HBase

Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.
On top of that, using JDBC and SQL (a sketch follows the list below):

Builds on already well-known and widely used technologies, which can help spread HBase usage among enterprises

Reduces the amount of code users need to write

Allows for performance optimizations transparent to the user

Opens the door for leveraging and integrating lots of existing tooling
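A minimal sketch with the phoenixdb Python client against the Phoenix Query Server, assumed to be on its default port 8765; the users table is hypothetical. Note that Phoenix uses UPSERT rather than INSERT.

    import phoenixdb  # Python client for the Phoenix Query Server

    conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
    cur = conn.cursor()

    # Standard SQL DDL, but the table is physically stored in HBase:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)")

    cur.execute("UPSERT INTO users VALUES (1, 'Alice')")  # Phoenix's insert/update

    cur.execute("SELECT * FROM users WHERE id = 1")
    print(cur.fetchone())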

Apache Pig

A script language and framework to create MapReduce tasks on a Hadoop cluster

Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high-level, similar to that of SQL for relational database management systems. Pig Latin can be extended using UDFs (User Defined Functions), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
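As a sketch of the UDF mechanism, here is a tiny UDF written in Python. When the script is registered with USING jython, Pig provides the outputSchema decorator, which declares the type of the returned value; the function name and the sample Pig Latin lines are illustrative.

    # udfs.py - a Pig UDF written in Python (executed by Pig through Jython).
    # Pig injects the outputSchema decorator when the script is registered.
    @outputSchema("domain:chararray")
    def extract_domain(url):
        """Return the host part of a URL, e.g. 'http://a.b/c' -> 'a.b'."""
        if url is None:
            return None
        return url.split("//")[-1].split("/")[0]

    # In the Pig Latin script it would then be called like this:
    #   REGISTER 'udfs.py' USING jython AS myudfs;
    #   pages   = LOAD '/data/urls' AS (url:chararray);
    #   domains = FOREACH pages GENERATE myudfs.extract_domain(url);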

Apache TEZ

Framework allowing the creation of applications that run complex sets of tasks for processing data

By allowing projects like Apache Hive and Apache Pig to run a complex DAG (directed acyclic graph) of tasks, Tez can process in a single job data that previously required multiple MapReduce jobs. For example, a pipeline that needs three distinct jobs with a pure MapReduce implementation can run as a single job with TEZ.

Apache Storm

A real-time computational engine on a distributed infrastructure

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queuing and database technologies you already use. A Storm topology consumes streams of data and processes (transforms, computes, …) those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.
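A minimal sketch of one stage (a bolt) of a Storm topology, written with the third-party streamparse Python client; the stream of words and the field names are hypothetical.

    from collections import Counter
    from streamparse import Bolt  # third-party Python client for Storm

    class WordCountBolt(Bolt):
        """One stage of a topology: counts words as they stream through."""
        outputs = ["word", "count"]

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            word = tup.values[0]                  # consume one tuple
            self.counts[word] += 1
            self.emit([word, self.counts[word]])  # pass the result downstream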

Apache Sqoop

The link between unstructured datastores (Hadoop) and structured ones (RDBMS)

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database.
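A minimal sketch of an incremental import launched from Python; sqoop itself does the work, so this is just its command line. The JDBC URL, credentials, table and HDFS paths are hypothetical.

    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbserver/shop",  # source RDBMS (hypothetical)
        "--username", "etl",
        "--password-file", "/user/etl/.pwd",
        "--table", "orders",                        # table to copy
        "--target-dir", "/data/orders",             # destination in HDFS
        "--incremental", "append",                  # only rows added since...
        "--check-column", "id",
        "--last-value", "0",                        # ...this value of the column
    ], check=True)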

Apache Kafka

A real-time (low-latency), high-throughput message queuing platform

As with any message queuing system, its purpose is to take messages coming in from senders and deliver them, reliably, in the form expected by the receivers. A commit mechanism is in place to assure the sender that the message has really been delivered.
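A minimal sketch with the kafka-python client, assuming a broker on localhost; the clicks topic, the consumer group and the message are hypothetical.

    from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

    # Producer side: publish a message on a topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("clicks", b'{"user": "alice", "page": "/home"}')
    producer.flush()  # block until the broker has acknowledged the message

    # Consumer side: read the stream back, committing offsets as it goes.
    consumer = KafkaConsumer("clicks",
                             bootstrap_servers="localhost:9092",
                             group_id="analytics",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)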

Apache NiFi

Distributed tool to Extract, Transform and Load data in a flow-based model with a graphical interface

Apache NiFi is a software project which enables the automation of data flow between systems. NiFi helps move and track data. The project is written using flow-based programming and provides a web-based user interface to manage data flows in real time.
Each process you define in a flow is independent of the others. Processes exchange data in the form of a so-called flow file, containing the data sent to the flow plus some attributes (metadata).
Processes (called processors) can, for example, ingest data from external systems, transform or route the flow files, or deliver data to other systems.

Visualization layer

The tools that focus more on presenting the data than on allowing complex analysis and transformation of it.

D3

Generic visualization of data in the web browser

A JavaScript library allowing the presentation of data in the form of a web page using HTML, CSS and SVG. It is based on web standards and can use the full capabilities of all modern web browsers.
E.g.: generate an HTML table from an array of numbers.
Its goal: efficient manipulation of documents based on data.
The data has to be loaded and passed to the web page; D3 is served by the web server but runs in the browser, where it is used to represent the data.

QGIS

An open source set of libraries and tools to create, edit, visualise, analyse and publish geospatial information.

View data

Explore data and compose maps

Create, edit, manage and export data

Analyse data

Publish maps on the Internet

This is a tool supporting functionality extensions through a plugin architecture: create your own plugin if you want to add specific features. Plugins can be written in Python. The core of the application is written in C++ using the Qt framework.
This means that QGIS is a desktop tool, with a (nice) GUI to help you create your geographical representations.
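A minimal sketch of such a plugin: QGIS discovers it through classFactory() and itself calls initGui() when loading and unload() when removing it. The plugin name and the action it performs are illustrative.

    from qgis.PyQt.QtWidgets import QAction

    class HelloPlugin:
        def __init__(self, iface):
            self.iface = iface  # handle to the running QGIS application

        def initGui(self):
            # Add a toolbar button when the plugin is loaded.
            self.action = QAction("Say hello", self.iface.mainWindow())
            self.action.triggered.connect(self.run)
            self.iface.addToolBarIcon(self.action)

        def unload(self):
            self.iface.removeToolBarIcon(self.action)

        def run(self):
            self.iface.messageBar().pushMessage("Hello from a QGIS plugin")

    def classFactory(iface):  # entry point called by QGIS
        return HelloPlugin(iface)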

Apache Zeppelin

This is an implementation of a Notebook interface (also called a computational notebook or data science notebook): a virtual notebook environment used for literate programming, mixing natural-language prose with executable code. It pairs the functionality of word processing software with the programming languages supported by the notebook.
It is used for various purposes, like data ingestion, data discovery, data analytics, and data visualization and collaboration.
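A sketch of what a Zeppelin paragraph looks like: the leading % directive selects the interpreter for that paragraph (%md, %sql, %spark and others exist alongside %python). The computation itself is illustrative.

    %python
    # An executable Zeppelin paragraph; the interpreter is chosen by the
    # %directive on the first line.
    events = [3, 5, 8]
    print("average:", sum(events) / len(events))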

Programming languages

The programming languages often used to develop the applications or jobs sitting in any of the above-mentioned layers.

R environment

A programming language oriented to analytics, and its Integrated Development Environment

R is a free software environment for statistical computing and graphics. It is a GNU project.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices, and more.

Python

A scripting and programming language, not related to analytics or Big Data as such

It works as an interpreted script, and tools exist to compile or freeze programs into binaries.
It comes with tons of libraries to extend the core of the language.
This is not specifically oriented to the Big Data & analytics world.
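As a small illustration of the language in a data context, here is a self-contained script using only the standard library; the events.csv file and its country column are hypothetical.

    import csv
    from collections import Counter

    counts = Counter()
    with open("events.csv", newline="") as f:  # hypothetical input file
        for row in csv.DictReader(f):
            counts[row["country"]] += 1

    for country, n in counts.most_common(3):
        print(country, n)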

Ruby

Another programming language, not related to analytics or Big Data as such

A dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write.
This is not specifically oriented to the Big Data & analytics world.