atlas-dev mailing list archives

[jira] [Updated] (ATLAS-183) Add a Hook in Storm to post the topology metadata

Date: Thu, 08 Oct 2015 17:30:26 GMT

[ https://issues.apache.org/jira/browse/ATLAS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Ahn updated ATLAS-183:
-----------------------------
Description:
Apache Storm Integration with Apache Atlas (incubating)
Introduction
Apache Storm is a distributed real-time computation system. Storm makes it easy to reliably
process unbounded streams of data, doing for real-time processing what Hadoop did for batch
processing. A Storm computation is essentially a DAG of nodes, which is called a topology.
Apache Atlas is a metadata repository that enables end-to-end data lineage, search, and the
association of business classifications.
Overview
The goal of this integration is, at a minimum, to push the operational topology metadata along
with the underlying data source(s), target(s), derivation processes and any available business
context so that Atlas can capture the lineage for this topology.
It would also help to support custom user annotations per node in the topology.
There are two parts to this process, detailed below:
Data model to represent the concepts in Storm
Storm Bridge to update metadata in Atlas
Data Model
A data model is represented as a Type in Atlas. It contains the descriptions of the various nodes
in the DAG, such as spouts and bolts, and the corresponding source and target types. These
need to be expressed as Types in the Atlas type system. At a minimum, we need to create types
for:
Storm topology containing spouts, bolts, etc. with associations between them
Source (typically Kafka, etc.)
Target (typically Hive, HBase, HDFS, etc.)
You can take a look at the data model code for Hive. The Storm data model should be simpler
than the Hive one.
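As a rough illustration of the three kinds of types listed above, here is a minimal sketch in plain Python. It does not use the Atlas type system or any Atlas API; every class and field name (DataSet, Node, Topology, lineage) is a hypothetical stand-in chosen for this example. It only shows how a topology, its spouts and bolts, and their sources and targets could be related so that lineage can be derived.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# All names here are hypothetical; they are not Atlas type definitions.


@dataclass
class DataSet:
    """An external source or target, e.g. a Kafka topic or a Hive table."""
    name: str
    kind: str  # e.g. "kafka_topic", "hive_table", "hdfs_path"


@dataclass
class Node:
    """A spout or bolt in the topology DAG."""
    name: str
    node_type: str  # "spout" or "bolt"
    inputs: List[str] = field(default_factory=list)  # names of upstream nodes
    source: Optional[DataSet] = None  # set when the node reads external data
    target: Optional[DataSet] = None  # set when the node writes external data


@dataclass
class Topology:
    name: str
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        # Requiring upstream nodes to exist already keeps the graph acyclic.
        for upstream in node.inputs:
            if upstream not in self.nodes:
                raise ValueError(f"unknown upstream node: {upstream}")
        self.nodes[node.name] = node

    def sources_of(self, name: str) -> Set[str]:
        """Names of all external sources that feed the given node."""
        node = self.nodes[name]
        found = {node.source.name} if node.source else set()
        for upstream in node.inputs:
            found |= self.sources_of(upstream)
        return found

    def lineage(self) -> List[Tuple[str, str]]:
        """(source, target) dataset-name pairs connected through the DAG."""
        pairs = []
        for node in self.nodes.values():
            if node.target:
                for src in sorted(self.sources_of(node.name)):
                    pairs.append((src, node.target.name))
        return pairs


topology = Topology("clickstream")
topology.add(Node("kafka_spout", "spout", source=DataSet("clicks", "kafka_topic")))
topology.add(Node("parse_bolt", "bolt", inputs=["kafka_spout"]))
topology.add(Node("hive_bolt", "bolt", inputs=["parse_bolt"],
                  target=DataSet("clicks_raw", "hive_table")))
print(topology.lineage())
```

In an actual implementation these would be registered as Types in Atlas rather than Python classes; the sketch is only meant to make the associations concrete.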
Pushing Metadata into Atlas
There are two parts to the bridge:
Storm Bridge
This is a one-time import that lists all the active topologies in Storm and pushes their
metadata into Atlas, to cover cases where Storm deployments existed before Atlas.
You can refer to the bridge code for Hive.
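The one-time import could be sketched roughly as follows. This is not the Hive bridge code and it calls neither Storm nor Atlas: the list_active and push callables, the "storm_topology" type name, and the shape of the summary dictionaries are all assumptions made for illustration, with in-memory stand-ins for both systems.

```python
from typing import Callable, Dict, Iterable, List


def import_active_topologies(
    list_active: Callable[[], Iterable[Dict]],
    push: Callable[[Dict], None],
) -> List[str]:
    """Push every active topology once; return the names that were imported."""
    imported = []
    for summary in list_active():
        # Assumption: each summary carries "name", "id" and "status" keys.
        if summary.get("status") != "ACTIVE":
            continue
        entity = {
            "typeName": "storm_topology",  # hypothetical type name
            "name": summary["name"],
            "id": summary["id"],
        }
        push(entity)
        imported.append(summary["name"])
    return imported


# In-memory stand-ins for the Storm cluster and the Atlas repository.
fake_cluster = [
    {"name": "wordcount", "id": "wordcount-1", "status": "ACTIVE"},
    {"name": "old-job", "id": "old-job-7", "status": "KILLED"},
]
repository: List[Dict] = []

names = import_active_topologies(lambda: fake_cluster, repository.append)
print(names, len(repository))
```

Passing the two systems in as callables keeps the import logic testable without a running cluster.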
Post-execution Hook
Atlas needs to be notified when a new topology is registered successfully in Storm or when
someone changes the definition of an existing topology.
You can refer to the hook code for Hive.
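The hook behaviour described here (notify on a new topology, and again when an existing definition changes) might look like the sketch below. The TopologyHook class, the on_submit method, the CREATE/UPDATE event names and the JSON message layout are hypothetical; they are not Storm or Atlas interfaces, and the notify callable stands in for whatever transport Atlas would actually use.

```python
import json
from typing import Callable, Dict, Optional


class TopologyHook:
    """Sends a notification when a topology is new or its definition changed."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify
        # topology name -> fingerprint of the last definition seen
        self._known: Dict[str, str] = {}

    def on_submit(self, name: str, definition: Dict) -> Optional[str]:
        """Call after a successful submission; returns the event sent, if any."""
        fingerprint = json.dumps(definition, sort_keys=True)
        if name not in self._known:
            event = "CREATE"
        elif self._known[name] != fingerprint:
            event = "UPDATE"
        else:
            return None  # unchanged definition, nothing to report
        self._known[name] = fingerprint
        self._notify(json.dumps(
            {"event": event, "topology": name, "definition": definition},
            sort_keys=True,
        ))
        return event


messages = []
hook = TopologyHook(messages.append)
first = hook.on_submit("wordcount", {"spouts": ["sentences"], "bolts": ["split"]})
repeat = hook.on_submit("wordcount", {"spouts": ["sentences"], "bolts": ["split"]})
changed = hook.on_submit("wordcount", {"spouts": ["sentences"], "bolts": ["split", "count"]})
print(first, repeat, changed, len(messages))
```

Comparing a fingerprint of the definition is one simple way to tell a changed topology from a resubmission of the same one; a real hook may rely on Storm to signal this instead.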
Example use case:
Custom annotations associated with each node in the topology.
For example: Data Quality Rules, Error Handling, etc. One such set of annotations could enumerate
rules for handling nulls, e.g. all nulls for a column get filtered.
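To make the null-handling example concrete, the sketch below attaches a hypothetical annotation to a node and shows what the rule would mean for rows passing through it. The annotation layout, the "filter_nulls" rule name and the apply_rules helper are assumptions for illustration only; in Atlas the annotation itself would simply be stored as metadata on the node.

```python
from typing import Dict, List

# Hypothetical annotations: node name -> list of data quality rules.
annotations: Dict[str, List[Dict[str, str]]] = {
    "parse_bolt": [{"rule": "filter_nulls", "column": "user_id"}],
}


def apply_rules(node: str, rows: List[Dict], rules_by_node: Dict) -> List[Dict]:
    """Apply the rules annotated on a node to a batch of rows."""
    out = rows
    for rule in rules_by_node.get(node, []):
        if rule["rule"] == "filter_nulls":
            # Drop every row whose annotated column is missing or None.
            out = [row for row in out if row.get(rule["column"]) is not None]
    return out


rows = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": None, "page": "/cart"},
    {"page": "/help"},
]
kept = apply_rules("parse_bolt", rows, annotations)
print(kept)
```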
> Add a Hook in Storm to post the topology metadata
> -------------------------------------------------
>
> Key: ATLAS-183
> URL: https://issues.apache.org/jira/browse/ATLAS-183
> Project: Atlas
> Issue Type: Sub-task
> Affects Versions: 0.6-incubating
> Reporter: Venkatesh Seetharam
> Fix For: 0.6-incubating
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)