My thoughts on the field of Complex Event Processing. Since I'm a technical guy, they go mostly on the technical side. And mostly about my OpenSource project Triceps. To read the Triceps documentation, please follow from the oldest posts to the newest ones.

Monday, March 5, 2012

Time-limited propagation, part 3

In the last example, if we keep aggregating the data from hours to days and from days to months, the arrival of each new packet updates the whole chain. Sometimes that's what we want, sometimes it isn't. The daily stats might be fed into some complicated computation, with nobody looking at the results until the next day. In that situation each packet would trigger these complicated computations for no good reason, since nobody cares about them until the day is closed.

These unnecessary computations can be prevented by disconnecting the daily data from the hourly data, and performing the manual aggregation only when the day changes. Then these complicated computations would happen only once a day, not many times per second.

Here is how the last example gets amended to produce the once-a-day daily summaries of all the traffic (as before, in multiple snippets, this time showing only the added or changed code):

# an hourly summary, now with the day extracted
our $rtHourly = Triceps::RowType->new(
  time => "int64", # hour's timestamp, microseconds
  day => "string", # in YYYYMMDD
  local_ip => "string", # string to make easier to read
  remote_ip => "string", # string to make easier to read
  bytes => "int64", # bytes sent in an hour
) or die "$!";
# a daily summary: just all traffic for that day
our $rtDaily = Triceps::RowType->new(
  day => "string", # in YYYYMMDD
  bytes => "int64", # bytes sent in a day
) or die "$!";

The hourly rows get an extra field, for convenient aggregation by day. This notion of the day is calculated from the hour's timestamp.
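The original snippet computing the day is not shown here; a plain-Perl sketch of such a calculation (the function name is hypothetical, and the real code may well use the local time zone instead of UTC) could look like this:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Extract the day in YYYYMMDD form from the hour's timestamp,
# which is in microseconds since the Unix epoch.
sub dayOfTimestamp # ($timestampMicroseconds)
{
    my $ts = shift;
    # gmtime() is used here for determinism of the example.
    return strftime("%Y%m%d", gmtime(int($ts / 1000000)));
}
```

For example, a timestamp anywhere within March 5, 2012 (UTC) would produce the day "20120305".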

This logic doesn't check whether any data for that day existed; if none did, it would just produce a row with a traffic of 0 bytes anyway. This is different from the normal aggregation, but here it may actually be desirable: it shows for sure that yes, the aggregation for that day really did happen.
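The shape of that manual aggregation can be sketched in plain Perl (a hypothetical helper over an array of hashrefs, not the actual Triceps table iteration): it sums the bytes of the hourly rows for one day, and produces a daily row even when no hourly rows matched.

```perl
use strict;
use warnings;

# Sum the traffic of all hourly rows belonging to one day.
# Each row is a hashref with the fields "day" and "bytes".
sub sumDay # ($day, @hourlyRows)
{
    my ($day, @rows) = @_;
    my $bytes = 0;
    for my $r (@rows) {
        $bytes += $r->{bytes} if $r->{day} eq $day;
    }
    # Even with no matching rows, a row with 0 bytes gets produced,
    # confirming that the aggregation for the day did happen.
    return { day => $day, bytes => $bytes };
}
```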

The main loop then gets extended with the day-keeping logic and with the extra command to dump the daily data.

When a packet for the next day arrives, it has three effects: (1) it inserts the packet data as usual, (2) it finds that the previous packet's data is obsolete and flushes it (without upsetting the hourly summaries), and (3) it finds that the day has changed and performs the manual aggregation of the last day's hourly data into the daily summary.
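The day-keeping part of that loop can be sketched in plain Perl (the function name and the global are hypothetical; the real code works through Triceps labels and tables): remember which day is currently being accumulated, and when a row for a later day arrives, report the finished day so that it can be aggregated before the new row is processed.

```perl
use strict;
use warnings;

our $currentDay; # YYYYMMDD of the day currently being accumulated

# Advance the notion of the current day. Returns the list of days
# that got closed by this advance (and thus need to be manually
# aggregated from hourly into daily); empty if the day didn't change.
sub raiseDay # ($day)
{
    my $day = shift;
    my @closed;
    if (defined $currentDay && $currentDay ne $day) {
        push @closed, $currentDay;
    }
    $currentDay = $day;
    return @closed;
}
```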

This shows the eventual contents of the daily summaries. The order of the rows is fairly random, because of the hashed index. Note that the hourly summaries weren't flushed either; they are all still there too. If you want them eventually deleted after some time, you would need to provide more manual logic for that.
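That extra cleanup logic could take a shape like this plain-Perl sketch (a hypothetical helper over an array of hashrefs; the real thing would delete rows from the Triceps hourly table instead): drop all hourly summaries older than a chosen day.

```perl
use strict;
use warnings;

# Remove from the array all hourly rows older than $oldestDayToKeep.
# Days in YYYYMMDD form compare correctly as plain strings.
sub pruneHourly # (\@hourlyRows, $oldestDayToKeep)
{
    my ($rows, $oldest) = @_;
    @$rows = grep { $_->{day} ge $oldest } @$rows;
}
```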