Streaming data is becoming an essential part of nearly every data integration project; if not a core requirement, it is at least second nature. Real-time data streaming brings many advantages. To name a few: real-time analytics and decision making, better resource utilization, data pipelining, support for microservices, and much more.

Python has many modules that data engineers and data scientists use heavily to achieve different goals. While Scala is gaining a great deal of attention, Python is still favored by many, including myself. Apache Spark has a Python API, PySpark, which exposes the Spark programming model to Python, allowing fellow “pythoners” to use Python on the highly distributed and scalable Spark framework.

Persisting real-time data streams is often essential, and ingesting MapR Streams / Kafka data into MapR-DB / HBase is a very common use case. Both Kafka and HBase are built with two very important goals in mind: scalability and performance. In this blog post, I’m going to show you how to integrate both technologies using Python code that runs on Apache Spark (via PySpark). I searched the internet for this combination with no luck; I found Scala examples but not …