Distributed Task Execution

This recipe demonstrates how task dependencies can be modeled using primitives provided by Helix. A task can be run with the desired amount of parallelism and starts only when its upstream dependencies are met. The demo executes the task DAG described below using 10 workers. Although the demo starts the workers as threads, the workers are not required to run in the same process. In a real deployment, they would run on many different machines in a cluster. When a worker fails, Helix reassigns the failed task partition to a new worker.

Redis is used as the result store. Any other suitable implementation of TaskResultStore can be plugged in.
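To illustrate what a pluggable result store could look like, here is a minimal in-memory sketch. The interface name ResultStoreSketch and its four methods are hypothetical and are not taken from the recipe's TaskResultStore. They only mirror the two kinds of data the recipe keeps in Redis: lists of events and hashes of counters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Small demo entry point for the hypothetical store below.
class ResultStoreDemo {
    public static void main(String[] args) {
        ResultStoreSketch store = new InMemoryResultStore();
        store.append("events", "e1");
        store.increment("counts", "a", 1);
        store.increment("counts", "a", 2);
        System.out.println(store.readAll("events") + " " + store.readCounts("counts"));
    }
}

// Hypothetical interface: names are illustrative, not the recipe's API.
interface ResultStoreSketch {
    void append(String listKey, String value);              // similar to a list push
    List<String> readAll(String listKey);                   // read a whole list
    long increment(String hashKey, String field, long by);  // similar to a hash increment
    Map<String, Long> readCounts(String hashKey);           // read a whole hash
}

// In-memory implementation; works only inside a single process.
class InMemoryResultStore implements ResultStoreSketch {
    private final Map<String, List<String>> lists = new HashMap<>();
    private final Map<String, Map<String, Long>> hashes = new HashMap<>();

    @Override
    public synchronized void append(String listKey, String value) {
        lists.computeIfAbsent(listKey, k -> new ArrayList<>()).add(value);
    }

    @Override
    public synchronized List<String> readAll(String listKey) {
        // Return a copy so callers cannot mutate the stored list.
        return new ArrayList<>(lists.getOrDefault(listKey, new ArrayList<>()));
    }

    @Override
    public synchronized long increment(String hashKey, String field, long by) {
        // merge() inserts the value for a new field, otherwise adds to the old value.
        return hashes.computeIfAbsent(hashKey, k -> new HashMap<>()).merge(field, by, Long::sum);
    }

    @Override
    public synchronized Map<String, Long> readCounts(String hashKey) {
        return new HashMap<>(hashes.getOrDefault(hashKey, new HashMap<>()));
    }
}
```

An in-memory store like this is only suitable when all workers are threads in one process, as in the demo. Workers spread across machines need a shared store such as Redis.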

Workflow

Input

10000 impression events and around 100 click events are pre-populated in the task result store (Redis).

ImpEvent: format: id,isFraudulent,country,gender

ClickEvent: format: id,isFraudulent,impEventId
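As an illustration of the two formats above, the following sketch parses one line of each into a small value class. It assumes that isFraudulent is encoded as the text true or false and that fields contain no embedded commas. The class names, the sample ids, and the sample values are made up for the example.

```java
class EventParsingSketch {
    // Impression event: id,isFraudulent,country,gender
    static final class ImpEvent {
        final String id;
        final boolean isFraudulent;
        final String country;
        final String gender;

        ImpEvent(String id, boolean isFraudulent, String country, String gender) {
            this.id = id;
            this.isFraudulent = isFraudulent;
            this.country = country;
            this.gender = gender;
        }
    }

    // Click event: id,isFraudulent,impEventId
    static final class ClickEvent {
        final String id;
        final boolean isFraudulent;
        final String impEventId;

        ClickEvent(String id, boolean isFraudulent, String impEventId) {
            this.id = id;
            this.isFraudulent = isFraudulent;
            this.impEventId = impEventId;
        }
    }

    static ImpEvent parseImp(String line) {
        String[] f = splitExact(line, 4);
        // Assumption: the flag is the literal text "true" or "false".
        return new ImpEvent(f[0], Boolean.parseBoolean(f[1]), f[2], f[3]);
    }

    static ClickEvent parseClick(String line) {
        String[] f = splitExact(line, 3);
        return new ClickEvent(f[0], Boolean.parseBoolean(f[1]), f[2]);
    }

    // Splits on commas and rejects lines with the wrong number of fields.
    private static String[] splitExact(String line, int expected) {
        String[] f = line.split(",", -1);
        if (f.length != expected) {
            throw new IllegalArgumentException("expected " + expected + " fields: " + line);
        }
        return f;
    }

    public static void main(String[] args) {
        ImpEvent imp = parseImp("imp-1,false,US,M");        // sample values are invented
        ClickEvent click = parseClick("click-1,true,imp-1");
        System.out.println(imp.country + " " + imp.gender + " " + click.impEventId);
    }
}
```

Note that only the impression carries country and gender. A click reaches those dimensions through its impEventId.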

Stages

FilterImps: Filters out impressions where isFraudulent=true.

FilterClicks: Filters out clicks where isFraudulent=true.

impCountsByGender: Generates impression counts grouped by gender. It does this by incrementing the count for ‘impression_gender_counts:<gender_value>’ in the task result store (redis hash). Depends on: FilterImps

impCountsByCountry: Generates impression counts grouped by country. It does this by incrementing the count for ‘impression_country_counts:<country_value>’ in the task result store (redis hash). Depends on: FilterImps

clickCountsByGender: Generates click counts grouped by gender. It does this by incrementing the count for click_gender_counts:<gender_value> in the task result store (redis hash). Depends on: impClickJoin

clickCountsByCountry: Generates click counts grouped by country. It does this by incrementing the count for click_country_counts:<country_value> in the task result store (redis hash). Depends on: impClickJoin
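The stage logic above can be sketched in memory, without Helix or Redis. This is a simplified illustration, not the recipe's implementation. It assumes the filter stages drop events whose isFraudulent flag is true. It also assumes that impClickJoin, which the text only names as a dependency, joins clicks to impressions on impEventId. The joined record layout clickId,country,gender is invented for the example, and plain maps stand in for the Redis hashes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StageSketch {
    // Filter stage: keep only events whose isFraudulent flag (field 1) is not true.
    static List<String> dropFraudulent(List<String> events) {
        List<String> kept = new ArrayList<>();
        for (String e : events) {
            if (!Boolean.parseBoolean(e.split(",", -1)[1])) {
                kept.add(e);
            }
        }
        return kept;
    }

    // Count stage: one increment per event, keyed by the value at fieldIndex.
    // The map plays the role of one hash in the result store.
    static Map<String, Long> countByField(List<String> events, int fieldIndex) {
        Map<String, Long> counts = new TreeMap<>();
        for (String e : events) {
            counts.merge(e.split(",", -1)[fieldIndex], 1L, Long::sum);
        }
        return counts;
    }

    // Assumed join stage: look up each click's impression by impEventId and
    // emit "clickId,country,gender" (hypothetical layout). Unmatched clicks are skipped.
    static List<String> joinClicksToImps(List<String> clicks, List<String> imps) {
        Map<String, String[]> impById = new HashMap<>();
        for (String i : imps) {
            String[] f = i.split(",", -1);
            impById.put(f[0], f);
        }
        List<String> joined = new ArrayList<>();
        for (String c : clicks) {
            String[] cf = c.split(",", -1);
            String[] imp = impById.get(cf[2]);
            if (imp != null) {
                joined.add(cf[0] + "," + imp[2] + "," + imp[3]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> imps = new ArrayList<>();
        imps.add("i1,false,US,M");
        imps.add("i2,true,US,F");
        imps.add("i3,false,IN,F");
        List<String> clicks = new ArrayList<>();
        clicks.add("c1,false,i1");
        clicks.add("c2,true,i3");
        clicks.add("c3,false,i3");

        List<String> goodImps = dropFraudulent(imps);
        List<String> joined = joinClicksToImps(dropFraudulent(clicks), goodImps);
        System.out.println(countByField(goodImps, 3)); // impressions by gender
        System.out.println(countByField(joined, 2));   // clicks by gender
    }
}
```

The join explains why the click count stages sit downstream of impClickJoin: a click record has no gender or country of its own, so those dimensions must come from the matching impression.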

Creating a DAG

Each stage is represented as a Node along with its upstream dependencies and desired parallelism. Each stage is modeled as a resource in Helix using the OnlineOffline state model. As part of the Offline to Online transition, a task watches the external view of its upstream resources and waits for them to transition to the Online state. See Task.java for additional info.
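The dependency gating described above can be sketched without Helix as follows. This is a simplified model, not the code in Task.java. A set of stage ids stands in for "every partition of this resource is Online in the external view", and no Helix API is used. The Node class here is a hypothetical stand-in, the parallelism values are made up, and the parents given for impClickJoin are an assumption. Only a subset of the stages is included.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DagSketch {
    // One stage: an id, a desired parallelism, and the ids of its upstream stages.
    static final class Node {
        final String id;
        final int parallelism; // would decide how many partitions the stage gets
        final List<String> parents;

        Node(String id, int parallelism, String... parents) {
            this.id = id;
            this.parallelism = parallelism;
            this.parents = Arrays.asList(parents);
        }
    }

    // A stage may start once every upstream stage is online.
    static boolean canStart(Node node, Set<String> onlineStages) {
        return onlineStages.containsAll(node.parents);
    }

    // Groups stage ids into waves; each wave depends only on earlier waves.
    static List<List<String>> startWaves(Map<String, Node> dag) {
        Set<String> online = new HashSet<>();
        List<List<String>> waves = new ArrayList<>();
        while (online.size() < dag.size()) {
            List<String> wave = new ArrayList<>();
            for (Node n : dag.values()) {
                if (!online.contains(n.id) && canStart(n, online)) {
                    wave.add(n.id);
                }
            }
            if (wave.isEmpty()) {
                throw new IllegalStateException("cycle or missing upstream stage");
            }
            online.addAll(wave); // the whole wave comes online before the next one is computed
            waves.add(wave);
        }
        return waves;
    }

    // Subset of the stages; parallelism values are illustrative only.
    static Map<String, Node> demoDag() {
        Node[] nodes = {
            new Node("FilterImps", 4),
            new Node("FilterClicks", 2),
            new Node("impClickJoin", 4, "FilterImps", "FilterClicks"), // assumed parents
            new Node("impCountsByGender", 4, "FilterImps"),
            new Node("clickCountsByGender", 2, "impClickJoin")
        };
        Map<String, Node> dag = new LinkedHashMap<>();
        for (Node n : nodes) {
            dag.put(n.id, n);
        }
        return dag;
    }

    public static void main(String[] args) {
        System.out.println(startWaves(demoDag()));
    }
}
```

In the recipe itself there is no central scheduler computing waves. Each task partition blocks inside its own Offline to Online transition until its upstream resources are Online, which yields the same ordering in a decentralized way.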
