Meetup.com is a social site for organizing offline meetups. Every minute, tens to hundreds of people RSVP, that is, they accept or decline an invitation to a meetup by clicking the yes/no button on the meetup's page. Meetup offers a live stream of this RSVP data through its API. The data is available in JSON format over a WebSocket.

I developed a simple data pipeline consisting of the following big data technologies.

Apache Spark

Apache Kafka

Cassandra

In a series of posts I will explain how this process works. For now, let's look at the image below.

The image above shows how data flows through the various tools. To make the sequence easy to follow, the blocks are numbered from 1 to 6.

Data Source:

The data source consists of two parts.

meetup.com, which makes the live RSVP data available through WebSockets.

A WebSocket program written in Python, which collects the data from meetup.com.
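
As a rough illustration of the collector, the sketch below reads messages from a WebSocket and flattens each RSVP into a small record. It assumes the third-party `websocket-client` package. The stream URL is left as a parameter, and the field names (`rsvp_id`, `response`, `guests`, `event.event_name`, `group.group_city`, `group.group_country`) are assumptions about the JSON layout, not something stated above.

```python
import json


def parse_rsvp(raw):
    """Flatten one RSVP JSON message into a small dict.

    The field names used here are assumptions about the message layout.
    """
    msg = json.loads(raw)
    event = msg.get("event", {})
    group = msg.get("group", {})
    return {
        "rsvp_id": msg.get("rsvp_id"),
        "response": msg.get("response"),
        "guests": msg.get("guests", 0),
        "event_name": event.get("event_name"),
        "group_city": group.get("group_city"),
        "group_country": group.get("group_country"),
    }


def collect(stream_url, handle, limit=None):
    """Read messages from the WebSocket and pass each parsed RSVP to handle()."""
    # Imported lazily so parse_rsvp() works without the package installed.
    from websocket import create_connection  # pip install websocket-client

    ws = create_connection(stream_url)
    try:
        count = 0
        while limit is None or count < limit:
            handle(parse_rsvp(ws.recv()))
            count += 1
    finally:
        ws.close()
```

Calling `collect(stream_url, print, limit=5)` would print the first five parsed RSVPs. Keeping the parsing in its own function makes it easy to test without a live connection.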

Messages:

The message block consists of two parts.

Kafka producer - which publishes the data as messages to a Kafka topic.

Kafka consumer - which reads those messages from the topic.
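
The producer and consumer pair might look roughly like the sketch below. It assumes the third-party `kafka-python` package and a broker at `localhost:9092`. The topic name `meetup-rsvps` and the helper names are hypothetical, chosen only for illustration.

```python
import json

TOPIC = "meetup-rsvps"  # hypothetical topic name


def serialize(record):
    """Encode a record (dict) as UTF-8 JSON bytes for Kafka."""
    return json.dumps(record).encode("utf-8")


def deserialize(payload):
    """Decode UTF-8 JSON bytes from Kafka back into a dict."""
    return json.loads(payload.decode("utf-8"))


def publish(records, bootstrap_servers="localhost:9092"):
    """Send each record to the topic as one message."""
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=serialize,
    )
    for record in records:
        producer.send(TOPIC, record)
    producer.flush()  # block until buffered messages are delivered
    producer.close()


def consume(bootstrap_servers="localhost:9092"):
    """Yield decoded records from the topic, starting at the oldest offset."""
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=bootstrap_servers,
        value_deserializer=deserialize,
        auto_offset_reset="earliest",
    )
    for message in consumer:
        yield message.value
```

The producer and consumer never talk to each other directly. Both connect to the broker, which is what lets the collector and the storage step run and fail independently.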

Data Storage:

We store the messages received by the consumer in Cassandra, a column-oriented (wide-column) database.
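
A minimal sketch of the storage step follows, assuming the DataStax `cassandra-driver` package and a node at `127.0.0.1`. The keyspace `meetup`, the table `rsvps`, and its columns are hypothetical, and the table is assumed to exist already.

```python
# Hypothetical columns of the assumed `rsvps` table.
COLUMNS = ("rsvp_id", "response", "guests", "group_city", "group_country")

# The driver uses %s placeholders for positional parameters.
INSERT_CQL = "INSERT INTO rsvps ({}) VALUES ({})".format(
    ", ".join(COLUMNS),
    ", ".join(["%s"] * len(COLUMNS)),
)


def to_row(record):
    """Pick the table's columns out of a record, in column order."""
    return tuple(record.get(column) for column in COLUMNS)


def store(records, hosts=("127.0.0.1",), keyspace="meetup"):
    """Insert each record into the rsvps table."""
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(list(hosts))
    session = cluster.connect(keyspace)
    try:
        for record in records:
            session.execute(INSERT_CQL, to_row(record))
    finally:
        cluster.shutdown()
```

Passing the values as parameters, rather than formatting them into the CQL string, leaves quoting and type conversion to the driver.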

Data Analysis:

In the final step, we use Apache Spark to:
- Read the data from Cassandra.
- Perform data analysis.
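
The two Spark steps above can be sketched as follows. This assumes PySpark with the Spark Cassandra Connector available to the session. The keyspace `meetup`, the table `rsvps`, and the columns `group_city` and `response` are hypothetical, and counting responses per city is only one example of an analysis.

```python
def cassandra_options(keyspace="meetup", table="rsvps"):
    """Reader options identifying the (hypothetical) Cassandra table."""
    return {"keyspace": keyspace, "table": table}


def responses_by_city(spark, keyspace="meetup", table="rsvps"):
    """Read the table into a DataFrame and count responses per city.

    `spark` is an existing SparkSession that has the Spark Cassandra
    Connector on its classpath.
    """
    df = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(**cassandra_options(keyspace, table))
        .load()
    )
    return (
        df.groupBy("group_city", "response")
        .count()
        .orderBy("count", ascending=False)
    )
```

With a session created through `SparkSession.builder` and pointed at the cluster through the `spark.cassandra.connection.host` setting, `responses_by_city(spark).show(10)` would display the ten largest city and response groups.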