Bottom Line:
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

Mentions:
Figure 9 is a logical view of the data lineage that each agent captures in our running example from Figure 2. At the top of the figure, we show some example raw data corresponding to the HDFS input log, intermediate data, and final results. Recall that in the original running example we counted the number of occurrences of each error code. Here, we would like to trace back and see the actual log entries that correspond to a "Failure" (code = 4), as shown in the Spark program of Figure 10.
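
The idea of tracing a result back to its input records can be illustrated with a minimal, self-contained sketch in plain Python. This is not Titian's API and not the Spark program of Figure 10; the log format, the sample lines, and the helper names `count_with_lineage` and `trace_back` are all assumptions made for illustration. The sketch counts occurrences per error code while recording which input lines contributed to each count, and then uses that record to recover the log entries behind code 4.

```python
from collections import defaultdict

# Made-up log lines for illustration only; the "<level> <code> <message>"
# format is an assumption, not the format used in the paper.
log_lines = [
    "ERROR 4 disk unreachable",
    "INFO 0 heartbeat ok",
    "ERROR 2 timeout contacting peer",
    "ERROR 4 replica lost",
    "ERROR 2 timeout contacting peer",
]


def count_with_lineage(lines):
    """Count occurrences per error code and remember which input lines
    (by position) contributed to each count."""
    counts = defaultdict(int)
    lineage = defaultdict(list)  # error code -> list of input line ids
    for line_id, line in enumerate(lines):
        level, code, _message = line.split(" ", 2)
        if level != "ERROR":
            continue  # non-error lines are filtered out and leave no lineage
        counts[int(code)] += 1
        lineage[int(code)].append(line_id)
    return dict(counts), dict(lineage)


def trace_back(lines, lineage, code):
    """Return the original input lines that produced the result for `code`."""
    return [lines[line_id] for line_id in lineage.get(code, [])]


counts, lineage = count_with_lineage(log_lines)
failure_entries = trace_back(log_lines, lineage, 4)
print(counts)
print(failure_entries)
```

In this toy version the lineage table is a single in-memory dictionary; in a distributed setting the equivalent associations would have to be captured per stage and joined across stage boundaries, which is the part the figure describes.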
