Big Data Analytics (BDA) comes into the picture when we deal with the enormous amount of data that has been generated over the past decade with advances in science and technology across different fields. Processing this large amount of data and extracting valuable meaning from it in a short span of time is a challenging task, especially given the four V's that arise whenever we discuss BDA: the Volume, Velocity, Variety, and Veracity of the data.

Hadoop is a distributed framework designed to handle large data sets. It can scale out to several thousand nodes and process enormous amounts of data in a parallel, distributed fashion. Apache Hadoop consists of two core components: HDFS (the Hadoop Distributed File System) and MapReduce (MR). Hadoop follows a write-once, read-many access model.

Hive is developed on top of Hadoop as its data warehouse framework for querying and analyzing data stored in HDFS. Hive is open-source software that lets programmers analyze large data sets on Hadoop. It makes operations such as ad-hoc queries, analysis of huge data sets, and data encapsulation simple to express and execute.
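As a sketch of what an ad-hoc Hive query looks like from Python, the snippet below uses the third-party PyHive driver; the host, port, table name `page_views`, and its columns are all hypothetical, and a running HiveServer2 is assumed:

```python
# Hypothetical HiveQL query: top 10 users by visit count in a
# `page_views` table (table and columns are illustrative only).
QUERY = (
    "SELECT user_id, COUNT(*) AS visits "
    "FROM page_views "
    "GROUP BY user_id "
    "ORDER BY visits DESC "
    "LIMIT 10"
)

def fetch_top_visitors(host="localhost", port=10000):
    # PyHive (pip install pyhive) speaks to HiveServer2; this call
    # only works against a live Hadoop/Hive deployment.
    from pyhive import hive
    conn = hive.Connection(host=host, port=port)
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        return cur.fetchall()
    finally:
        conn.close()
```

Hive compiles such a query into MapReduce jobs behind the scenes, which is what makes SQL-style analysis possible over files in HDFS.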

Apache Pig is developed on top of Hadoop. It provides a dataflow environment for processing large data sets, offering a high-level language as an alternative abstraction on top of MapReduce (MR). Pig programs are automatically parallelized. For scripting, Pig provides the Pig Latin language.
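A Pig Latin script describes a dataflow of relational steps. As a minimal sketch, the Python below mirrors a hypothetical LOAD → FILTER → GROUP → COUNT pipeline (the field names and sample records are invented for illustration), with the corresponding Pig Latin shown in comments:

```python
# Pig Latin equivalent of this dataflow (illustrative):
#   logs    = LOAD 'access.log' AS (user:chararray, status:int);
#   errors  = FILTER logs BY status >= 500;
#   grouped = GROUP errors BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(errors);
from collections import Counter

records = [("alice", 200), ("bob", 500), ("alice", 503), ("bob", 502)]

# FILTER: keep only server-error records.
errors = [(user, status) for user, status in records if status >= 500]

# GROUP + COUNT: tally error records per user.
counts = Counter(user for user, _ in errors)
```

On a real cluster, Pig translates each of these steps into MapReduce jobs that run in parallel over the data in HDFS.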

HBase is a column-oriented distributed database in the Hadoop environment. It can store massive amounts of data, from terabytes to petabytes. HBase is scalable, distributed big data storage on top of the Hadoop ecosystem. Compared with traditional relational databases, it possesses some special features: in particular, it is built for low-latency operations.
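To make "column-oriented" concrete, here is a toy sketch of HBase's data model in plain Python: a table maps a row key to cells addressed as "family:qualifier" (the row key and column names below are hypothetical; real HBase adds versioning, regions, and persistence):

```python
# Toy model of an HBase table: row key -> {"family:qualifier": value}.
table = {}

def put(row_key, column, value):
    # Store one cell; columns live inside named column families.
    table.setdefault(row_key, {})[column] = value

def get(row_key, column):
    # Point reads by row key are the low-latency operation HBase targets.
    return table.get(row_key, {}).get(column)

put("user#1001", "info:name", "alice")
put("user#1001", "metrics:visits", "42")
```

Because rows are sparse, different rows can carry entirely different columns, which is what distinguishes this model from a fixed relational schema.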

MapReduce is mainly used for parallel processing of large data sets stored in a Hadoop cluster. It was originally a programming model designed by Google to provide parallelism, data distribution, and fault tolerance. MR processes data in the form of key-value pairs; a key-value (KV) pair is a mapping between two linked data items, a key and its value.
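The key-value flow above can be sketched in plain Python as the classic word count, showing the three phases the framework applies: map, shuffle (group by key), and reduce:

```python
# Toy MapReduce word count: map -> shuffle -> reduce over KV pairs.
from collections import defaultdict

def run_word_count(documents):
    # Map: each input record becomes a list of (key, value) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]
    # Shuffle: the framework groups all values sharing the same key.
    shuffled = defaultdict(list)
    for key, value in mapped:
        shuffled[key].append(value)
    # Reduce: each key's values are combined into one result.
    return {key: sum(values) for key, values in shuffled.items()}
```

On a real cluster, map and reduce tasks run in parallel on the nodes holding the data, and the shuffle moves intermediate pairs between them over the network.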

The Hadoop ecosystem comprises services such as HDFS and MapReduce for storing and processing large data sets. In addition to these services, the ecosystem provides several tools for different kinds of data-modeling operations, such as Hive for querying and fetching data stored in HDFS.

The Hadoop Distributed File System (HDFS) lets us store data in a distributed environment and, thanks to its design, runs on standard commodity machines, whereas existing distributed systems typically require high-end machines for storage and processing. It also provides a high level of fault tolerance, since each block of data is replicated to three nodes by default: even if one node goes down, the other nodes act as a backup recovery mechanism.
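A minimal sketch of this replication idea, with hypothetical node names and a simplified round-robin placement (real HDFS placement is rack-aware):

```python
# Toy HDFS-style replication: each block is copied to `replication`
# DataNodes, so a single node failure never loses data.
import itertools

def place_blocks(blocks, nodes, replication=3):
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

def surviving_copies(placement, failed_node):
    # After a node failure, every block still has replicas elsewhere.
    return {blk: [n for n in ns if n != failed_node]
            for blk, ns in placement.items()}
```

In real HDFS the NameNode tracks this placement and re-replicates under-replicated blocks automatically when a DataNode is lost.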

BDA plays a vital role in helping organizations solve their business problems. The successful implementation of an accurate big data analytics solution in a given organization therefore depends on understanding the business problem; validating the business use case against the technology available is the next step of the design phase.

In big data analytics, Hadoop plays a vital role in solving typical business problems, and each domain presents its own particular business opportunities. Within the Hadoop ecosystem, each component plays a unique role in data processing, data validation, and data storage. MongoDB and Cassandra are NoSQL databases that act as stores for huge amounts of data in many different formats.

Cassandra is a NoSQL database designed to handle large amounts of data spread across several nodes in a cluster. It is a distributed, masterless database. It falls under the key-value type of NoSQL storage and provides a schema-less model: data is stored in key-value format.
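A minimal sketch of the masterless idea: any coordinator can locate a key's owner by hashing the key onto the cluster, so no single master is needed. This is heavily simplified; the node names are hypothetical, and real Cassandra uses a token ring with a configurable replication factor:

```python
# Toy Cassandra-style partitioning: the key's hash picks its owner node.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def owner(key, nodes=NODES):
    # Hash the key and map it onto one of the nodes deterministically,
    # so every node computes the same answer without asking a master.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]
```

Because placement is a pure function of the key, reads and writes can be routed by any node, which is what lets Cassandra scale out without a coordinator bottleneck.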

Python with Apache Hadoop is used to store, process, and analyze incredibly large data sets. For streaming applications, we use Python to write MapReduce programs that run on a Hadoop cluster. Hadoop has become a standard in distributed data processing but historically relied on Java; today there are numerous open-source projects that support Hadoop in Python. Python also works with other Hadoop ecosystem projects and components such as HBase, Hive, Spark, Storm, Flume, Accumulo, and a few others.
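With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from stdin and write tab-separated key-value lines to stdout. The sketch below expresses that protocol as Python functions over line iterables so the logic is easy to follow and test:

```python
# Sketch of a Hadoop Streaming word count. On a cluster, each function
# would read sys.stdin and print to sys.stdout instead.
def streaming_mapper(lines):
    # Emit one "word<TAB>1" line per word, the Streaming KV format.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase, so
    # all lines for the same word arrive consecutively.
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

On a real cluster these would be submitted as scripts via the hadoop-streaming jar (roughly `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...`; the exact jar path depends on the installation), with Hadoop supplying the sort-and-shuffle between the two phases.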