Hadoop legacy

In the previous blog post I explained the basic concepts of data lakes. Some core problems which can occur in data lakes were defined and I gave some hints to avoid them. Most of these pitfalls are caused by the traits of data lakes. Unfortunately, current Hadoop distributions can’t resolve them entirely. Additionally, the architecture itself is not the best for such solutions and Hadoop has a bad reputation for being extremely complex.
This time I would like to focus on the Hadoop environment problems in data lake implementations. Maybe they can be repaired?

Architecture

Hadoop has an irreparably fractured ecosystem in that all the responsibilities are only partially fulfilled by dozens of services. For example, for security there are Sentry and Ranger, for processing management there are NiFi and Falcon. The solution may be to select an existing distribution such as HDP, CDH or MEP.
Each one is different because the companies around Hadoop each have own unique vision.
Selecting the right distribution can be a real challenge, especially as each ships different Hadoop components, for example CDH contains Navigator and HDP Ambari.
Everyone wants to do it their own way and there is little cooperation between them. Even Facebook, probably the biggest Hadoop deployment in the world, forked Hadoop seven years ago and have since kept it closed source. 1.

Hadoop has been on the market for a long time. There are still some problems which have not yet been solved due to the fact that the Hadoop architecture and HDFS in particular do not allow good implementation of data lakes. For example, HDFS is a good fit for batch jobs without compromising on data locality. Right now, streaming is equally important and in the future it will be more important than batch processing.

There are three major categories of architectural problems in the Hadoop environment: security, versioning and lineage. First, I will point out the most important problems, then there will be an overview of existing alternatives.

Security

Hadoop clusters are used by many people with different backgrounds. Without a tool like Ranger, you can’t create access policies and you don’t have any audit logs in the Hadoop cluster services. Let’s look at the following example: you run a Spark job inside of which you use Hive. The job will work with the user who triggers the job and there is transfer of user identity. In most cases it is a user from some kind of workflow scheduler, which is why this scenario is not possible: I want to run Hive inside SparkSQL as a user other than the user who runs Spark SPARK-8659. As you can see, to provide consistent security over the cluster you have to integrate all services with a tool that provides fine-grained access control across cluster components.

Versioning

In a data lake you often run data ingestion and wrangling by pipelines that are developed by more than one developer. If you do not trace the changes in the pipeline, then bug hunting is extremely hard because it is not possible to disable a particular pipeline and read logs. In Hadoop, data versioning is not possible without an additional metadata layer (tools like Hive). If we can’t connect a pipeline to data without metadata, then guessing how the current pipeline works is extremely hard. Of course, you can give a unique name for pipeline deployment and then checkout a particular run with the data produced, but this is not the most efficient way to deal with errors during production. Even if you can find bugs and fix them, you still need to revert corrupted data; this isn’t possible without metadata.

Lineage

There is no lineage in Hadoop by default. Even with Atlas we need to integrate with other applications like Spark. Right now, Atlas provides limited integration with Hadoop applications. the rapidly growing Spark is not supported. Atlas could solve the problem, but it needs to provide integration for most Hadoop cluster components. Atlas provides REST and Kafka APIs for integration. In another post Navigation in the data lake using Atlas Bartłomiej describes how to integrate Spark via its REST API.

Operations

Creating a Hadoop cluster is not easy and maintenance is not a simple task at all. Implementing security measures is also a lot of work. Kerberos is not the latest technology and its concepts are certainly not easy to understand. Additionally, it must be well integrated with other technologies like LDAP and Active Directory. Kerberos deployment must be highly available and durable, but you need specialists.

With tools like Ambari and Impala you can easily manage services in the cluster. They have built-in monitoring and provide configuration tools for common components. This is one of the few parts of Hadoop that work properly.

Management

If you can’t provide a huge amount of unstructured data, then a data lake will only be a buzzword for your company. Finding a use case is one of the hardest implementation elements. A data lake is a long-term integration project that requires lot of resources. In some cases this may be an unprofitable investment. You need to invest in specialists as it is not possible to retrain your staff. The learning curve for Hadoop clusters is high, but the learning curve for security in Hadoop clusters is even harder. If you create a data lake, then you need to scale staff and hardware with your data very quickly, otherwise the data lake will prove to be a failure and you will lose money.

There is a huge difference between a DIY cluster and a production environment. You need to create availability architecture for the influx of new data. The security around the cluster should be bulletproof. All roles and access controls should be defined, but you should be aware that this takes a lot of time and resources to build.

Make sure you understand the hardware and networking requirements as these are the most important components in a data lake.

Amendments?

There are many organisations benefiting and improving the Hadoop ecosystem. For example, Hadoop has a single NameNode where the metadata about the Hadoop cluster is stored. Unfortunately, as there is only one of them, the NameNode was a single point of failure for the entire environment. Right now Hadoop can have two separate NameModes, one of which is active while the other is in standby mode. Unfortunately the changes are fragmentary and there is no collaboration between Hadoop stack providers. They focus on the products around HDFS, rather than the core.

Alternatives

There are lots of data lake implementations; however, I’m going to focus on one that I have tested – Kylo.

Kylo

Kylo is data lake management software produced by Teradata. It is open sourced by ThinkBig and built on top of NiFi and Spark. Its security module can be integrated with Ranger and Sentry. Navigator or Ambari are available for cluster management. Kylo is responsible for:
– Data ingestion – mix of streaming and batch processing,
– Data preparation – helps to cleanse data in order to improve its quality and to accredit data governance,
– Discovery – analysts and data scientists can search and find what data is available to them.
Kylo provides a consistent data access layer for data infrastructure staff and data scientists. You can provide metadata for entities and define policies for components. It can automatically build lineage for data from NiFi metadata. For DataOps there is monitoring for feeds and jobs running in a system. Additionally there is also a tool to monitor services in a cluster. Kylo provides statistics for data transformations which can help evaluate the quality of data. Data scientists can review data with the Visual Query tool.

Problems

Kylo UI as an NiFi template booster looks like a fine solution for data lakes as it supplies additional information about data on pipelines. The downside is that you are limited to Kylo, for which there is no easy way to integrate with other tools. Such a lack of opportunities is never good, especially in systems like data lakes, as someday you might find that there’s some component missing. There are also other problems in Kylo:
– Integration with NiFi is inconsistent: if someone with access to NiFi changes a feed then Kylo might not notice.
– Defining templates in NiFi and re-importing is pain in the butt as sometimes old feeds hang up after import.
– Little information about errors/warnings for flows in the Kylo UI. NiFi errors are sometimes not propagated to Kylo. You need to switch to NiFi and check all the processors in the flow.
– There is service monitoring, but it is not possible to manage services from the UI. You need to use Ambari.
– It is not possible to define fine-grained access policies for data directly from Kylo. You need to use Ranger, but even that won’t allow you to set policy rules for entities from Kylo UI.
– Custom flow does not integrate with default templates without additional work as it can’t provide metadata for Kylo (no data lineage); you need to integrate the custom NiFi flow with existing core templates.
– If you are using a Spark job in an NiFi processor, then you don’t have lineage from inside the job.
– There is no inheritance of identity. If you log in to Kylo UI and run the feed, then the feed will be run with NiFi’s identity, not yours.

Pachyderm

It’s very hard to manage cooperation between data infrastructure and data scientist teams in a Hadoop cluster. In the Hadoop ecosystem, the main players are trying to solve this problem by adding metadata everywhere. This looks reasonable and is not a problem when we are using metadata to keep business information for unstructured data. But we want to change this situation when we are using metadata for data governance and lineage. Pachyderm was created to give large organizations a better way to manage data infrastructure. Pachyderm uses Docker and Kubernetes, both of which are advantageous to resolving BigData problems.
I’m not going to delve into the details of Pachyderm right now. My last, third post from this series will be dedicated to it.

Conclusion

Hadoop is still struggling with a few problems in the current data lake implementation.
These are caused by bad system fragmentation which is a result of the lack of a common vision between the main distributors. There are ways to fix these problems but, unfortunately, some of the problems are caused by the Hadoop architecture, which cannot be changed easily.

In the next blog post, I will try to explain the core concepts of Pachyderm and we will test it on a Raspberry Pi cluster. There will be some Ansible playbook to run the cluster quickly and we will see the main difference between Hadoop and Pachyderm in a real example.