IT infrastructure monitoring: Real-time monitoring of databases, servers, and operating systems. Diverse sensors are deployed and collect massive volumes of data.

In these scenarios, the data volume often exceeds that generated during Alibaba's Double 11 Shopping Festival. How should we process such massive data, and how can we implement real-time streaming data processing?

In this article, we will look at how PostgreSQL can be used for stream processing in IoT applications to process trillions of data records per day in real time. PostgreSQL offers solid stream-based data processing, with real-time computing throughput of up to 100,000 records per second on a single commodity x86 server.

Setting up a Streaming Application with PipelineDB

Download and install PipelineDB, a PostgreSQL-based streaming data processing database.
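Once PipelineDB is installed, the basic workflow is to create a stream, define a continuous view over it, and write events into the stream with ordinary INSERTs. The following is a minimal sketch only: the stream, view, and column names are hypothetical, and the `CREATE CONTINUOUS VIEW` syntax is that of the standalone PipelineDB releases (later versions, shipped as a PostgreSQL extension, use slightly different syntax).

```sql
-- Hypothetical stream of sensor readings (names are illustrative).
CREATE STREAM sensor_stream (sensor_id int, val numeric);

-- Continuous view that aggregates readings per sensor in real time.
CREATE CONTINUOUS VIEW sensor_stats AS
  SELECT sensor_id,
         count(*) AS readings,
         avg(val) AS avg_val
  FROM sensor_stream
  GROUP BY sensor_id;

-- Events are written with ordinary INSERTs; the view updates incrementally.
INSERT INTO sensor_stream VALUES (1, 20.5), (1, 21.0), (2, 19.8);

SELECT * FROM sensor_stats ORDER BY sensor_id;
```

The continuous view stores only the aggregated state, not the raw events, which is what makes this model practical at very high ingest rates.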

Scenario 3: Vehicle Status Analysis

Collect traffic statistics from the probe at each intersection. Assume that one probe maps to one intersection.

The first requirement is the same as in the previous example for drawing vehicle trajectories. Traffic statistics for each intersection are aggregated by probe ID.

The methods for using PipelineDB are similar, so no more examples are given here.
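As a brief illustration of that similarity, per-probe traffic statistics could be expressed as a continuous view along these lines; the stream and column names below are hypothetical.

```sql
-- Hypothetical stream of probe events: one row per vehicle detection.
CREATE STREAM probe_stream (probe_id int, car_id text, crt_time timestamp);

-- Real-time pass count and distinct-vehicle count per probe (intersection).
CREATE CONTINUOUS VIEW probe_traffic AS
  SELECT probe_id,
         count(*)               AS passes,
         count(DISTINCT car_id) AS vehicles
  FROM probe_stream
  GROUP BY probe_id;
```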

Scenario 4: Predicting Stock Prices in Real Time

You can use MADlib or PL/R to run multiple rounds of regression, select the model with the best R², and predict the next group of stock prices from its intercept and slope.

You need to use UDFs in this scenario. For more information about how to use UDFs, see the previous articles.

No more examples are given here.
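For reference, a single round of linear regression with MADlib looks roughly like the sketch below. The table and column names are hypothetical; `madlib.linregr_train` and `madlib.linregr_predict` are standard MADlib functions, but check your installed version's documentation for exact signatures.

```sql
-- Hypothetical source table of observations: stock_ticks(tick, price).
-- Train a linear model: price ~ intercept + slope * tick.
SELECT madlib.linregr_train(
         'stock_ticks',      -- source table
         'stock_model',      -- output table (contains coef, r2, ...)
         'price',            -- dependent variable
         'ARRAY[1, tick]');  -- independent variables (the 1 is the intercept term)

-- Inspect goodness of fit and the fitted coefficients.
SELECT r2, coef FROM stock_model;

-- Predict the price at the next tick (e.g. tick = 101).
SELECT madlib.linregr_predict(coef, ARRAY[1, 101]::float8[]) FROM stock_model;
```

Repeating the training step over several candidate feature sets and keeping the model with the highest R² implements the multi-round selection described above.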

Scenario 5: Targeted Marketing through Wi-Fi Sensors

Collect real-time statistics from Wi-Fi sensors in shopping malls.

Count the number of people in each store in real time based on the location information provided by Wi-Fi. Statistical dimensions include each store's average stay duration and total stay duration.
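These per-store statistics map naturally onto a continuous view. In the sketch below, the stream and columns are hypothetical, and the stay duration is assumed to be computed upstream by the Wi-Fi positioning system before it reaches the stream.

```sql
-- Hypothetical stream: one row per completed visit, with the stay
-- duration already computed by the Wi-Fi positioning system.
CREATE STREAM wifi_stream (store_id int, person_id text, stay_seconds int);

-- Real-time visitor count plus average and total stay duration per store.
CREATE CONTINUOUS VIEW store_stats AS
  SELECT store_id,
         count(DISTINCT person_id) AS visitors,
         avg(stay_seconds)         AS avg_stay_seconds,
         sum(stay_seconds)         AS total_stay_seconds
  FROM wifi_stream
  GROUP BY store_id;
```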

Scenario 6: Miscellaneous

Assume that the existing functions of PostgreSQL cannot cope with your data processing scenario. What should you do? PostgreSQL provides a series of extension APIs, such as UDFs, custom data types, operators, and index access methods, that you can use to address your business requirements.

There are many other ways to use these APIs; they cannot all be listed here.
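As a small taste of the UDF API, here is an illustrative PL/pgSQL function; the name and logic are purely hypothetical examples, not part of the scenarios above.

```sql
-- Illustrative UDF: clamp a sensor reading into a valid range.
CREATE OR REPLACE FUNCTION clamp_reading(v numeric, lo numeric, hi numeric)
RETURNS numeric AS $$
BEGIN
  RETURN greatest(lo, least(hi, v));
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Usage:
SELECT clamp_reading(120, 0, 100);  -- returns 100
```

Such functions can then be used inside continuous views, so custom business logic runs as part of the streaming computation itself.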

Integrating Kafka with PipelineDB

The following example uses Kafka, a very popular message queue, from which PipelineDB can retrieve data and perform real-time computing.

Start a local NGINX server and use Siege to simulate HTTP requests. NGINX logs these requests in JSON format.

Start a local Kafka server and use kafkacat to continuously push NGINX access logs to Kafka.

Subscribe to Kafka messages in PipelineDB and convert the data into the desired statistical information in real time, for example, web page visitor count or latency.
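In the standalone PipelineDB releases, this subscription is handled by the pipeline_kafka extension. The sketch below assumes a local broker, a hypothetical `nginx_logs` topic, and JSON-formatted log messages; the JSON field names are illustrative and depend on your NGINX `log_format`.

```sql
CREATE EXTENSION pipeline_kafka;

-- Register the local broker.
SELECT pipeline_kafka.add_broker('localhost:9092');

-- Stream that receives the NGINX access-log messages as JSON.
CREATE STREAM nginx_stream (payload json);

-- Continuous view: hit count and average latency per URL
-- (the 'uri' and 'request_time' field names are hypothetical).
CREATE CONTINUOUS VIEW nginx_stats AS
  SELECT payload->>'uri'                          AS uri,
         count(*)                                 AS hits,
         avg((payload->>'request_time')::numeric) AS avg_latency
  FROM nginx_stream
  GROUP BY payload->>'uri';

-- Start consuming the topic into the stream.
SELECT pipeline_kafka.consume_begin('nginx_logs', 'nginx_stream',
                                    format := 'json');
```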

When PipelineDB is used together with Kafka, more application scenarios are available.

Conclusion

Many DBAs in the IoT field have probably wondered how to build a larger real-time message processing cluster with PipelineDB and Kafka. A common approach is to plan the data sharding rules carefully so as to avoid cross-node statistics; if cross-node access is unavoidable, use dimension tables on each node.

For example, how should you process one trillion messages every day? Based on the preceding stress test results, each server processes an average of 100,000 records per second, or about 8.64 billion records per day, so roughly 116 PostgreSQL servers (1 trillion ÷ 8.64 billion ≈ 116) are needed for the computation. Scaling out in this way is straightforward.

As shown in the following figure, every layer can be scaled out, including LVS, HAProxy, Kafka, PostgreSQL, and HAWQ.