Blog

25 key tools from Apache’s 15-year history

The very first time I watched Harry Potter’s Chamber of Secrets I was in awe, and out of curiosity I jumped to the next part, skipping the first one; gradually my watch list fell out of order and became more of a random pick. But once the last part, Deathly Hallows Part 2, was released, I spent a whole day watching the entire series in sequence. Completing the series gave me a satisfaction and pleasure I will never forget. It is always difficult to get something in order, but once we do, it is the happiest moment of our day. I experienced that happiness once again when I completed this A-to-Z list of Apache’s analytics tools.

In 2015, Apache launched stable releases of a few Big Data tools, and that was a remarkable achievement. Most of the big data tools are written in either Java or Scala; only Apache Flink is written in both Java and Scala, and it is just a month-old kid.

We have a team of techies who keep an eye on the latest tools in Big Data, and believe me, their list was really, really good. These tools started making a real impact on the software sector within two to three months of their release, so I thought of sharing this awesome list with you.

Apache Big Data Tools:

Cluster management and governance:

Apache Ambari, a software project of the Apache Software Foundation, is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Ambari was a sub-project of Hadoop but is now a top-level project in its own right.

Data Ingestion:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
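A Flume data flow is typically wired up in a properties file that names a source, a channel, and a sink for an agent. A minimal sketch, modeled on the example in Flume’s user guide (the agent name `a1`, the netcat source, and the logger sink are illustrative choices, not from this post):

```properties
# One agent (a1) with a single source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read newline-separated events from a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events (a real flow might write to HDFS instead)
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Swapping the channel type (for example, to a durable file channel) is how the “tunable reliability” mentioned above is exercised.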

Apache Kafka is an open-source message broker project, written in Scala, developed by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.
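That transaction-log influence can be illustrated with a toy in-memory model (this is a sketch of the concept, not the Kafka client API): producers append records to an ordered log, each record gets a monotonically increasing offset, and each consumer reads independently from whatever offset it last processed.

```python
class ToyLog:
    """A minimal in-memory stand-in for a Kafka-style partition log."""

    def __init__(self):
        self._records = []  # append-only; list position == offset

    def append(self, record):
        """Producer side: append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Consumer side: fetch records starting at a given offset."""
        return self._records[offset:offset + max_records]


log = ToyLog()
for event in ["signup", "click", "purchase"]:
    log.append(event)

# Two consumers track their own offsets independently:
print(log.read(0))  # a fresh consumer replays from the beginning
print(log.read(2))  # a caught-up consumer sees only the newest record
```

Because the log is never mutated in place, slow consumers do not block fast ones; they simply hold smaller offsets.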

Data Ingestion – Import / Export:

Apache Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase.
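An incremental import of the kind described above is driven entirely from the command line. A sketch using Sqoop’s documented incremental-import flags (the JDBC URL, table, column, and paths are placeholders):

```
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 10000 \
  --target-dir /data/orders
```

On each run of a saved job, Sqoop remembers the new high-water mark of the check column, so only rows added since the last import are fetched.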

Data processing:

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
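The pipeline shape Crunch wraps, map then group-by-key then reduce, can be sketched in a few lines of plain Python. This is a single-process illustration of the classic MapReduce word-count pattern, not the Crunch API itself (which is Java):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data tools", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'tools': 1}
```

Crunch’s contribution is letting you compose many such user-defined map and reduce functions into one pipeline and test them without a cluster.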

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. Spark is well-suited to machine learning algorithms.

Apache Storm is a distributed computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open-sourced after being acquired by Twitter. It uses custom-created “spouts” and “bolts” to define information sources and manipulations to allow batch, distributed processing of streaming data.

The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

Data Processing – Query Framework:

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is an open-source version of Google’s Dremel system, which is available as an infrastructure service called Google BigQuery.

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
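Hive exposes this through HiveQL, a SQL dialect it compiles into jobs on the cluster. A sketch of what that looks like (table and column names here are illustrative, not from the post):

```sql
-- Define a table over delimited files already sitting in HDFS
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A familiar-looking query that Hive turns into a distributed job
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

The point is the summarization-and-query role described above: analysts write SQL, Hive handles the parallel execution.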

Data Processing – Batch and Stream Data:

Apache Flink, like Hadoop and Spark, is a community-driven open source framework for distributed Big Data Analytics. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. It aims to bridge the gap between MapReduce-like systems and shared-nothing parallel database systems. To that end, Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner.

Data processing – Query Engine:

Apache Phoenix is an open source, massively parallel, relational database layer on top of NoSQL stores such as Apache HBase. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; upsert and delete rows singly and in bulk; and query data through SQL.
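In practice that means writing plain SQL over HBase, with Phoenix’s UPSERT statement covering both insert and update. A sketch (the table and column names are illustrative):

```sql
-- DDL: Phoenix maps this table onto an underlying HBase table
CREATE TABLE metrics (
    host VARCHAR NOT NULL,
    ts   TIMESTAMP NOT NULL,
    cpu  DOUBLE
    CONSTRAINT pk PRIMARY KEY (host, ts)
);

-- UPSERT inserts the row, or updates it if the key already exists
UPSERT INTO metrics (host, ts, cpu) VALUES ('web1', NOW(), 0.75);

-- Queried through the ordinary JDBC driver
SELECT host, AVG(cpu) FROM metrics GROUP BY host;
```

The JDBC driver mentioned above is what makes this usable from standard Java tooling with no HBase-specific client code.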

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources.

Data processing – Scripting language:

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
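The canonical example of that high-level language, Pig Latin, is word count: a handful of relational-style statements that Pig compiles into parallel jobs (the file paths here are placeholders):

```
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_out';
```

Each statement describes a data transformation, and the parallelization mentioned above is handled by the runtime rather than the author.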

Data processing – XML Query processor:

Apache VXQuery is a standards-compliant XML Query processor implemented in Java. The focus is on the evaluation of queries on large amounts of XML data. Specifically, the goal is to evaluate queries on large collections of relatively small XML documents.

Data Storage:

Apache ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly.

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Distributed Data management and Governance:

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework.

Distributed Job Management and Governance:

Apache Curator is a set of Java libraries that make using Apache ZooKeeper much easier. While ZooKeeper comes bundled with a Java client, using the client is non-trivial and error prone.

Hadoop Distribution platform:

Apache Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark.

Job Scheduler:

Apache Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based Workflow Engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java Web-Application that runs in a Java Servlet-Container.

Apache Database Tools:

Apache Accumulo is a sorted, distributed key/value store based on Google’s BigTable design. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.
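Those multi-datacenter and replication properties surface directly in CQL, Cassandra’s query language, where replication is declared per keyspace. A sketch (the keyspace, table, datacenter names, and replica counts are illustrative):

```sql
-- Per-datacenter replica counts are set when the keyspace is created
CREATE KEYSPACE shop
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};

CREATE TABLE shop.orders (
    user_id  uuid,
    order_ts timestamp,
    total    decimal,
    PRIMARY KEY (user_id, order_ts)
);

INSERT INTO shop.orders (user_id, order_ts, total)
VALUES (uuid(), toTimestamp(now()), 42.50);
```

With no master node, any replica can accept the write; consistency per operation is tunable by the client.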

Apache Cayenne is an open source persistence framework licensed under the Apache License, providing object-relational mapping (ORM) and remoting services. With a wealth of unique and powerful features, Cayenne can address a wide range of persistence needs. Cayenne seamlessly binds one or more database schemas directly to Java objects, managing atomic commit and rollbacks, SQL generation, joins, sequences, and more.

Apache Empire-db is a relational database abstraction layer and data persistence component that allows developers to take a much more SQL-centric approach in application development than traditional object-relational mapping (ORM) frameworks. By providing a unique type-safe, object-oriented command API, Empire-db allows building highly efficient SQL statements that take full advantage of all database features while eliminating the need for error-prone string operations and literals.

This A-to-Z list of Apache tools will tempt you to learn more about open source tools. Whenever Apache hatches a new tool, our big data fanatics get excited; they lay their hands on it, work with it, and experiment with it, and they keep at it until they are confident enough to deploy the tool. And our VP of Technology always encourages and backs them up in these kinds of experiments.

Our previous post was about the latest trends in big data analytics, and this post is about the Apache tools that contribute to Big Data. Can you feel a connection? Yeah, that is what all these blogs are about. We have quite the expertise when it comes to Big Data, and we tend to keep our skills on their toes. If you are ever in need of advice or assistance in Big Data, please reach out to us. We will be happy to extend our help.