Guide to the New Features of Hortonworks DataFlow 2.0

We recently hosted a webinar on the newest features of Hortonworks DataFlow 2.0 highlighting:

the new user interface

new processors in Apache NiFi

Apache NiFi multi-tenancy

Apache NiFi zero master clustering architecture

Apache MiNiFi

One of the first things you may have noticed in Hortonworks DataFlow 2.0 is the new user interface, based on Apache NiFi 1.0. The new interface includes context-sensitive palettes tailored by job function, a global menu in the upper right that gathers the high-level management and settings options, a breadcrumb trail to help guide user actions, and more. For more info, check out the blog post Hortonworks DataFlow 2.0 Gets a Fresh Face.

Secondly, there are many new processors in HDF 2.0 – more than 170 now, 30% more than in HDF 1.2. Some of the new ones include new Kafka processors, TableFetch, Hive processors, Avro conversion, and MQTT.

Additionally, Hortonworks DataFlow supports Apache MiNiFi, a subproject of Apache NiFi designed for first-mile data collection. MiNiFi addresses the difficulty of managing and transmitting data feeds to and from their source of origin (often the first/last mile of digital signal), enabling edge intelligence that can adjust flow behavior and communicate bi-directionally.

1. When we click start on a processor, I understand processing will be continuous. How often, for example, does the Twitter processor poll the API for messages?

Every processor has a concept of how it is scheduled. When you right-click on a processor, one of the tabs in its configuration dialog is Scheduling. There are two main options: Timer driven and CRON driven. Timer driven is the default scheduling strategy, and its default run schedule is 0 seconds, which means run as fast as possible. Depending on the source, you typically want to poll on a certain interval, so you might set it to once a minute or every 5 minutes, for example. You can also choose CRON driven, use standard CRON expressions, and have it poll data at specific points in time. So it all depends on how that processor is configured to be scheduled. More info here: https://community.hortonworks.com/questions/63513/helping-setting-up-cron-based-nifi-processor.html#answer-63543
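Conceptually, timer-driven scheduling just triggers the processor again each time the configured run-schedule interval elapses. The sketch below is an illustration of that idea, not NiFi's actual scheduler; NiFi's CRON-driven strategy accepts Quartz-style cron expressions, which include a seconds field.

```python
from datetime import datetime, timedelta

def timer_driven_fire_times(start, run_schedule_seconds, count):
    """Illustrative only (not NiFi's real scheduler): list the times a
    timer-driven processor would be triggered. A run schedule of 0 seconds
    means 'as fast as possible'."""
    interval = timedelta(seconds=run_schedule_seconds)
    return [start + i * interval for i in range(count)]

# Poll a source once every 5 minutes, starting at midnight:
fires = timer_driven_fire_times(datetime(2016, 10, 1), 300, 3)

# The CRON-driven equivalent would be a Quartz-style expression such as
# "0 0/5 * * * ?" (every 5 minutes) entered in the Scheduling tab.
```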

2. Using HDF alone, can we do analytics on data in motion, or does the data have to get into Hadoop for analytics?

No, you do not need Hadoop – you can do analytics on data in motion. HDF is based on several technologies: NiFi, Kafka, and Storm. A common use case is to use NiFi to put data into Kafka and then do streaming analytics in Storm. From there you can do whatever you would like with the results – for example, you could put them back onto another Kafka topic, have NiFi listening for those results, and then send them somewhere else. Analytics for data in motion is a major use case for HDF. There is an important class of analytics, such as reference-lookup enrichments, that can be done in NiFi as the data flows, and a class of complex joins and windowing-based analytics that is done well in Storm.
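As a toy illustration of the reference-lookup style of enrichment mentioned above (the lookup table and field names are hypothetical, not a NiFi API), the work amounts to attaching reference data to each record as it flows by:

```python
# Reference data, e.g. loaded once and consulted per record in motion.
COUNTRY_BY_CODE = {"us": "United States", "de": "Germany"}

def enrich(event):
    """Attach a human-readable country name to a single in-flight event."""
    out = dict(event)
    out["country"] = COUNTRY_BY_CODE.get(event.get("country_code"), "unknown")
    return out
```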

Think of HDF as enabling data-in-motion analytics, turning data streams into perishable insights in real time. HDP, on the other side of the house, gives you a broader view of analytics based on analyzing historical data.

3. Have the HDF 2.0 UI changes provided a way to create completely separate dataflows/worksheets in order for different teams/departments to use HDF/NiFi without seeing the other teams/departments’ dataflows/processors? In other words, does it provide better multi-tenancy for larger enterprises?

Yes – this is definitely one of the major changes in HDF 2.0. Rather than an exclusive worksheets model, what we’ve unlocked is a real multi-tenant model: you know that other tenants exist, but unless they give you access you cannot see what they’re doing. This is all organized and controlled via a hierarchy, and you can isolate different teams within the canvas. Please refer to slide 41 and the on-demand video starting at 39:25.

4. In multi-tenancy with two users, A and B, sharing a single NiFi and both writing into a shared Hadoop/HDFS cluster, how is access controlled so that only A can read A’s data and only B can read B’s data?

From a NiFi perspective, you can certainly create two separate dataflows/process groups, one for A and the other for B. The two users would have their own access policies on the two process groups: Process Group A and the data coming out of it are accessible only to user A, and Process Group B and the data coming out of it are accessible only to user B. From an HDFS perspective, you can have two separate PutHDFS processors in NiFi that ingest data into different HDFS folders, and you can define access control over those folders within HDFS.
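A minimal sketch of that separation follows; the folder names and ownership table are hypothetical, and in practice HDFS permissions or Ranger policies would enforce the check:

```python
# Each process group's PutHDFS writes under its own folder; the HDFS side
# then only needs to map folders to owning users.
FOLDER_OWNER = {
    "/data/team_a": "userA",   # written by Process Group A
    "/data/team_b": "userB",   # written by Process Group B
}

def can_read(user, path):
    """A user may read only paths under the folder their team owns."""
    for folder, owner in FOLDER_OWNER.items():
        if path.startswith(folder + "/"):
            return owner == user
    return False
```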

5. How can I route API requests to a cluster when a coordinator, or any node, could go online/offline at any time? I understand that the cluster will stay up, but how do I get my API requests routed? Is this also part of what a Primary Node would be used for?

The great news is that you can simply route your request to another node in the cluster that is available. Cluster management will take care of routing the request to all other nodes for you and ensuring that no operations can occur that would create consistency issues. This doesn’t have any relationship to the primary-node concept, as that only concerns which node will execute tasks that must run on a single node.
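Client-side, that can be as simple as trying the next node when one is unreachable. A sketch under the assumption that every node exposes the same API (the node list and the probe/request callables are hypothetical):

```python
def send_request(nodes, is_up, do_request):
    """Try each cluster node in turn; any live node will accept the request,
    and the cluster replicates the change to the other nodes."""
    for node in nodes:
        if is_up(node):
            return do_request(node)
    raise RuntimeError("no cluster node reachable")
```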

6. What is the licensing model for NiFi?

Apache NiFi is an open source project licensed under the Apache Software License v2. Hortonworks offers support subscriptions for Apache NiFi as part of the Hortonworks DataFlow product.

7. Is HDF integrated with enterprise schedulers like ESP or Zeke? Can we get a small overview how we would go about doing that?

For people coming to NiFi from traditional ETL/ELT systems, we’re seeing a couple of patterns emerge. They tend to want to view each flow discretely, because they have used other systems that way, and they try to use NiFi the same way.

The model you’re talking about sounds like a ‘job-oriented’ model. We’ve had some people ask about the process-groups-per-job-type approach, and we’ve also had people ask about completely isolated canvases. We’ve effectively taken the excellent lessons learned in various parts of the ETL community over the past many years and combined them with the emerging needs of continuous-flow use cases, and that is what Apache NiFi is designed for.

NiFi has been used to handle hundreds or even thousands of different dataflows, all through the same system/cluster. But the flows weren’t built on a per-job basis. Common steps among the various flows were factored into process groups, and those groups were connected as necessary; the data from the many different flows was routed through the common sections/groups as appropriate. Given that NiFi lets you use context, content, and metadata, there is a lot of flexibility in how data is acquired, routed, transformed, and delivered. It is just a different way of approaching these cases.
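The idea of factoring common steps into shared groups can be sketched as attribute-based routing; the stage names and record fields below are made up purely for illustration:

```python
def stages_for(record):
    """Choose which shared processing groups a record passes through,
    based on its attributes rather than on a per-job pipeline."""
    stages = ["ingest"]
    if record.get("format") == "csv":
        stages.append("convert_to_avro")   # shared conversion group
    if record.get("sensitive"):
        stages.append("mask_fields")       # shared masking group
    stages.append("deliver")
    return stages
```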

We’d also like to talk with you about whether there are opportunities to approach the problem you’re trying to solve in a different way as well. We suspect the right answer is going to be some combination of these different perspectives.

8. Can you provide an example of how the per-processor authorization is expected to be used? Apart from processors, what else can be authorized?

Everything in NiFi can be authorized: process groups, including the root canvas, controller services, reporting tasks – even who can create policies. Pretty much everything that NiFi can do can be controlled. For example, one user may be able to change anything on the canvas but not modify policies for users and groups.

For per-processor authorization – yes, a single processor may have a specific policy applied to it, for instance a restriction so that only a few people can access it. It may contain sensitive information, like URLs or passwords, that you don’t want others to see, so you can completely lock down the processor so that no one else can see it.
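Conceptually, per-component policies behave like the following sketch; the policy table and component id are hypothetical and this is not NiFi's authorizer API:

```python
# A component with no 'view' policy granting the user stays invisible to them.
POLICIES = {
    "GetHTTP-1": {"view": {"alice"}},   # processor holding a sensitive URL
}

def can_view(user, component_id):
    """Only users named in a component's 'view' policy may see its details."""
    policy = POLICIES.get(component_id)
    return policy is not None and user in policy.get("view", set())
```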

9. What is the difference between NiFi vs Storm?

You use NiFi for simple event processing – transformations and enrichment, typically on an individual piece of data. On the other hand, you want to use Storm for complex event processing: situations that require windowing, involve merging multiple data sets, or perform analysis to detect patterns across aggregated data. While the tools are at times used for overlapping objectives, a good rule of thumb is to think about what each is intended for. NiFi’s job is to manage the flow of information across an enterprise, from every sensor and source to every consumer, be it a processing system or a storage and query system. Storm is focused on the data-processing challenge. So they are really complementary, and in fact this is why HDF leverages them both.
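To make the contrast concrete, here is a pure-Python sketch (neither NiFi’s nor Storm’s API): a per-record transformation of the kind NiFi does in-flow, next to a tumbling-window aggregate of the kind Storm is built for.

```python
def per_record(event):
    """Simple event processing: transform one record at a time."""
    return {**event, "celsius": (event["fahrenheit"] - 32) * 5 / 9}

def tumbling_window_avg(events, window_seconds):
    """Complex event processing: aggregate values per time window."""
    windows = {}
    for e in events:
        windows.setdefault(e["ts"] // window_seconds, []).append(e["value"])
    return {w: sum(vals) / len(vals) for w, vals in windows.items()}
```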

10. Can we use HDP Ambari to manage HDF?

Currently you need two separate Ambari instances. Integrated Ambari is planned. More info about HDF 2.0 and Ambari here.

11. Is there any requirement to use Ambari if we want to use Ranger for security, or are they totally disconnected?

No, there is no requirement – you can choose to use the NiFi internal authorizer or the Ranger authorizer. Apache Ranger is a project that can work with NiFi without Ambari. Please note that Ambari makes it much easier to manage the two technologies together: it helps you configure them, helps with secure connections between them, and removes a lot of the manual steps you would otherwise have to do.

12. Are there any known features/components planned in foreseeable future for HDF?
