With all of this talk about the Data Pipeline, you might think that we here at Yelp are like a kid with a new toy, wanting to keep it all to ourselves. But unlike most kids with new toys, we like to share, which is why we’ve decided to open-source the main components of our pipeline, just in time for the holiday season.

Without further ado, here are some stocking-stuffers for the upcoming holidays:

The MySQL Streamer tails MySQL binlogs for table changes. The Streamer is responsible for capturing individual database changes, enveloping those changes into Kafka messages (with an updated schema if necessary), and publishing to Kafka topics.
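To make the "enveloping" step concrete, here is a minimal sketch of what wrapping a row change into a Kafka-ready message body might look like. The field names and JSON encoding below are illustrative assumptions, not the Streamer's actual envelope format.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative only: these field names are assumptions, not the
# Streamer's real envelope schema.
@dataclass
class ChangeEvent:
    table: str       # table the row change came from
    operation: str   # "insert", "update", or "delete"
    schema_id: int   # identifier of the schema used to encode the row
    payload: dict    # the row values after the change

def envelope(event: ChangeEvent) -> bytes:
    """Serialize a change event into a Kafka-ready message body."""
    return json.dumps(asdict(event)).encode("utf-8")

msg = envelope(ChangeEvent("business", "insert", 42, {"id": 1, "name": "Cafe"}))
```

The real Streamer encodes payloads with Avro and a registered schema id rather than JSON, but the shape of the envelope is the same idea: the change plus enough metadata to decode it downstream.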

The Schematizer service tracks the schemas associated with each message. When a new schema is encountered, the Schematizer handles registration and generates migration plans for downstream tables.
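The core registration idea can be sketched with a toy in-memory registry: registering the same schema twice yields the same id, while a new schema gets a fresh one. This is an illustration of the concept only; the real Schematizer is a persistent service that also resolves Avro compatibility and plans migrations.

```python
# Toy in-memory registry illustrating schema registration; not the
# Schematizer's actual API or storage model.
class ToyRegistry:
    def __init__(self):
        self._by_schema = {}  # schema text -> id
        self._by_id = {}      # id -> schema text

    def register(self, schema: str) -> int:
        """Return the existing id for a known schema, or assign a new one."""
        if schema not in self._by_schema:
            new_id = len(self._by_schema) + 1
            self._by_schema[schema] = new_id
            self._by_id[new_id] = schema
        return self._by_schema[schema]

    def get(self, schema_id: int) -> str:
        return self._by_id[schema_id]

reg = ToyRegistry()
first = reg.register('{"type": "record", "name": "Biz", "fields": []}')
again = reg.register('{"type": "record", "name": "Biz", "fields": []}')
```

Because producers attach only the compact schema id to each message, consumers can look up the full schema on demand instead of shipping it with every message.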

The Data Pipeline clientlib provides an easy-to-use interface for producing and consuming Kafka messages. With the clientlib, you don’t have to worry about Kafka topic names, encryption, or partitioning your consumer. You can think at the level of tables and data sources, and forget the rest.
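The abstraction the clientlib provides can be sketched like this: callers produce and consume against a named data source, and the mapping to Kafka topics stays an internal detail. Everything below is a hypothetical illustration, not the clientlib's real API, and the topic-naming convention shown is invented.

```python
from collections import defaultdict, deque

# Hypothetical sketch of the clientlib's abstraction, backed by an
# in-memory queue instead of Kafka.
class InMemoryPipeline:
    def __init__(self):
        self._topics = defaultdict(deque)

    def _topic_for(self, namespace: str, source: str) -> str:
        # Invented naming convention; the real one is hidden from callers.
        return f"{namespace}.{source}"

    def produce(self, namespace: str, source: str, message: dict) -> None:
        self._topics[self._topic_for(namespace, source)].append(message)

    def consume(self, namespace: str, source: str) -> dict:
        return self._topics[self._topic_for(namespace, source)].popleft()

pipe = InMemoryPipeline()
pipe.produce("yelp", "business", {"id": 1})
row = pipe.consume("yelp", "business")
```

The point of the design is that application code never touches `_topic_for`: if topic names, encryption, or partitioning change, callers are unaffected.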

The Data Pipeline Avro utility package provides a Pythonic interface for reading and writing Avro schemas. It also provides an enum class for metadata like primary key info that we’ve found useful to include in our schemas.
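As a rough illustration of schema-level metadata, the sketch below parses an Avro record schema and pulls out fields annotated as primary keys. The `"pkey"` attribute is a made-up stand-in for the kind of metadata the package's enum names; it is not the package's actual convention.

```python
import json

# Sketch only: "pkey" is an invented attribute standing in for
# primary-key metadata carried in the schema.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "business",
  "fields": [
    {"name": "id", "type": "int", "pkey": 1},
    {"name": "name", "type": "string"}
  ]
}
""")

def primary_keys(schema: dict) -> list:
    """Return field names carrying primary-key metadata, in key order."""
    keyed = [f for f in schema["fields"] if "pkey" in f]
    return [f["name"] for f in sorted(keyed, key=lambda f: f["pkey"])]

pks = primary_keys(SCHEMA)
```

Avro permits extra attributes on fields, which is what makes carrying this kind of metadata inside the schema itself practical.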

The Yelp Kafka library extends the kafka-python package to provide features such as a multiprocessing consumer group. This helps us to interact with Kafka in a high-performance way. The library also allows our users to discover information about the multi-regional Kafka deployment at Yelp.

A diagram of different components in the Data Pipeline. Individual services are shown with square edges, while shared packages have rounded edges.

Each of these projects is a Dockerized service that can easily be adapted to fit your infrastructure. We hope that they’ll prove useful to anyone building a real-time streaming application with Python.

Happy hacking, and may your holidays be filled with real-time data!

Want to know more about the Data Pipeline? Check out our blog series, which explores how we stream MySQL updates in real-time with an exactly-once guarantee, how we automatically track & migrate schemas, how we process and transform streams, and finally how we connect all of this into datastores like Redshift and Salesforce.