Exploring Event Driven Architectures with Esper

At JavaOne, Thomas Bernhardt and Alexandre Vasseur explained the concepts of event driven application servers and the Esper project.

Event driven application servers are a new category of server, providing a runtime and supporting infrastructure services (transport, security, event journaling, high availability, connectors, etc.) and designed to process over 100,000 events/sec. As well as processing events, event driven application servers can combine event information with long-lived historical data (usually obtained via relational database queries) and perform temporal correlation and matching on the event streams.

There are two concepts that make event systems different from messaging systems: event stream processing (ESP) and complex event processing (CEP).

Fully featured application servers are still a few years away, but developers can implement event driven architectures in standalone applications, Java EE applications and Spring applications today using Esper from Codehaus. Esper version 1.0 (as reported by InfoQ) was released in June 2006 and is a lightweight, embeddable open source implementation of both ESP and CEP.

Even though event driven application servers are a few years away, Esper is ready for production use today. Integrating Esper into your applications is easy, and will allow you to provide features that anticipate business and customer needs in real time.

I took a look at Esper's pattern matching implementation last week and was wondering whether anyone has tested Esper with a dataset of 500K or 1 million objects?

I should state that I work on rule engines and have been studying RETE for the last six years. Looking at Esper's node classes and design, my guess is Esper will likely run into scalability issues with 250K objects or more. The way joins are performed is going to lead to inefficient pattern matching, which will lead to huge CPU usage. A mature RETE engine can handle 1 million facts without any problems. This is from first hand experience.

Yep, we have a benchmark available from the RFID domain that tracks 3000 assets as they move from zone to zone, detecting when assets in a group split between zones. We have been able to get 110,000 events per second, sustained, on a laptop with a dual-core 2.4 GHz CPU. Details are on the JavaOne slides downloadable from our site.

We also have various microbenchmarks in different examples and regression tests that we run as part of our regular build that process large data sets. Note there is no industry standard benchmark at this time. We are planning to have a performance test suite available that can be used for capacity planning, hopefully soon.

Esper is designed to process large volumes of streaming data on a continuous basis. It indexes join fields and is therefore able to handle patterns involving joins rather well.
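To illustrate why indexing join fields helps, here is a minimal Python sketch. It is not Esper's implementation; the event shapes and the names `build_index` and `indexed_join` are hypothetical. It joins two event collections on a shared field by building a hash index on one side, so each probe is a dictionary lookup rather than a scan of the other collection.

```python
from collections import defaultdict

def build_index(events, key):
    """Group events by the value of their join field."""
    index = defaultdict(list)
    for event in events:
        index[event[key]].append(event)
    return index

def indexed_join(left, right, key):
    """Join two event lists on `key` using a hash index on the right side."""
    index = build_index(right, key)
    # Each left event probes the index instead of scanning all right events.
    return [(l, r) for l in left for r in index.get(l[key], [])]

# Hypothetical sample events.
ticks = [{"symbol": "BEA", "price": 12.5}, {"symbol": "IBM", "price": 101.0}]
news = [{"symbol": "BEA", "headline": "earnings"}, {"symbol": "SUN", "headline": "launch"}]
matches = indexed_join(ticks, news, "symbol")
```

With the index in place, the work per incoming event depends on the number of matching entries rather than on the total number of stored events.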

Even if the join nodes index the data, the order of the joins (aka the query plan) can make a huge difference. Depending on the query and execution plan, the difference between an optimized and a non-optimized query plan can be 3 to 4 orders of magnitude. This is from first hand experience comparing a non-RETE rule engine with a RETE rule engine on a dataset of over 500K objects. By rule engine, I mean an engine that performs pattern matching, regardless of the algorithm used to compile the query statements.
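The effect of join order can be seen even in a toy case. The following Python sketch uses made-up data and a hypothetical `join` helper (it is neither RETE nor Esper code) and runs the same three-way join under two plans. Both plans return identical rows, but the intermediate result is 100 times larger when the two big inputs are joined first. The gap here is only two orders of magnitude; it grows with data size and with the number of joins.

```python
def join(left_rows, right_rows, left_key, right_key):
    """Hash join of two lists of dict rows; returns the merged rows."""
    index = {}
    for r in right_rows:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[left_key], [])]

# Hypothetical data: many orders, many accounts, one flagged account.
orders = [{"order_id": i, "account": i % 100} for i in range(10_000)]
accounts = [{"account": a, "region": a % 5} for a in range(100)]
flagged = [{"account": 7}]

# Plan 1: join the two large inputs first, then apply the selective one.
step1 = join(orders, accounts, "account", "account")
plan1 = join(step1, flagged, "account", "account")

# Plan 2: apply the selective input first, then join the small result.
step2 = join(orders, flagged, "account", "account")
plan2 = join(step2, accounts, "account", "account")

print(len(step1), len(step2))  # intermediate sizes: 10000 vs 100
```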

Thanks to Ian and InfoQ for our JavaOne session coverage! If any JavaOne attendees want to send us feedback on our session, please email us (avasseur on codehaus).

I just want to stress that Esper and ESP/CEP are about continuous streams and real time filtering, aggregation and pattern detection among event streams. Esper is not a rule engine and I don't think we can really draw a comparison there (and this does not seem to be where the industry is at either).

At this stage I believe most users are more interested in the kind of problems EDA and ESP/CEP enable them to solve than in raw performance.

Peter, I thus assume all systems that have to deal with a 500K "object dataset" (unclear what we are talking about there) require some optimization. That said, when it comes to an ESP/CEP engine like Esper there are 4 things you need to consider, and I'd be pleased to hear your comments on whether the same are key performance evaluation criteria in a rule engine (which, again, I argue is a different beast for a different purpose):

number of statements configured

throughput of input events coming in for evaluation

matching or output ratio (or filtering ratio if you want)

and latency (usually in the order of a few ms or less and real-time JVM or pauseless GC start to help us a lot there)
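As a rough illustration of how these four criteria could be measured together, here is a small Python harness. It is only a sketch under stated assumptions: the "statements" are plain predicate functions, the workload is synthetic, and the name `run_benchmark` is hypothetical. It is not the Esper benchmark mentioned in this thread.

```python
import time

def run_benchmark(events, statements):
    """Feed events through a list of predicate 'statements' and collect metrics."""
    matched = 0
    worst_latency = 0.0
    start = time.perf_counter()
    for event in events:
        t0 = time.perf_counter()
        if any(stmt(event) for stmt in statements):
            matched += 1
        worst_latency = max(worst_latency, time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "statements": len(statements),
        "throughput_per_sec": len(events) / elapsed if elapsed > 0 else float("inf"),
        "matching_ratio": matched / len(events),
        "worst_latency_sec": worst_latency,
    }

# Hypothetical workload: 50 zone-filter statements over 10,000 synthetic events.
statements = [lambda e, z=z: e["zone"] == z for z in range(50)]
events = [{"asset": i % 3000, "zone": i % 1000} for i in range(10_000)]
report = run_benchmark(events, statements)
```

The throughput and latency figures depend entirely on the machine, so only the statement count and the matching ratio (5% for this synthetic data) are reproducible.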

I believe the RFID sample that Tom commented on illustrates some of this and gave our audience figures to remember at the end of a 1 hour session (2000 statements, 100K events/s on a utility laptop, less than 2% matching ratio). I'd be happy to report some more on performance if you want to submit a use case that you would like us to consider for a benchmark (we are also looking at growing the committer base if you want to bring some of your knowledge to that!). Did you wonder how (if at all) you could implement our RFID asset tracking example using something like Drools (a RETE based rule engine, I don't know about its optimization)?

It might be that you'll come up with another use case that can actually be solved by both a rule engine and an ESP/CEP engine. This is likely an overlap area (yes, there are some), and there at the borderline you may have a draw, or one solution may clearly outperform the other because it is just the right tool for the task. It'd be nice to start such an exercise.

You are right, the query plans are important for regular inner joins as well as outer joins, and they each require different execution plans. As our software is open source, you can look at the code, from query analysis down to execution, in the package net.esper.eql.join.plan. -Tom (N)Esper rocks

I was thinking of real-time trading systems, which handle a constant stream of transactions. The input varies between trading firms. Some firms handle over 10K transactions/second, others handle less than 1K/second. To me, CEP and ESP are just new terms for the same old stuff that event driven architecture (EDA) has been doing for decades. Say we have a system that handles 2K transactions/second.

Orders generally stay in an OMS system for several hours, from the time an order is sent to the time the order is closed. The RFID example isn't all that interesting to me. I've worked on real-time pre-trade compliance systems that have to handle thousands of diversification rules. They include government regulations, aggregations, calculating risk (exposure) and rating. These are real-time, with transactions which are about 3.5K messages. Given the constantly shifting data, calculating the risk of a mutual fund using price aggregates, ratings, and aggregate ratings is rather challenging. Each calculation is rather simple by itself. The hard part is keeping up with the constant stream. An example rule in CLIPS format might look like this.

If I wasn't so lazy, I'd translate it to SQL syntax, but hopefully the example provides some context.

I see quite a few CEP and ESP vendors tell people to use their system for algorithmic trading. Some try to make it sound like it's new, but it's basically what OMS systems have been doing for over a decade. Many of the existing systems have been using RETE to do real-time event processing. It's just that most people outside don't know about it. Then again, most of the firms using RETE to do EDA keep quiet, since they consider it a strategic advantage. The military has been using RETE to do event filtering for command control systems for over a decade. You can imagine how much data a military command control system handles per second with radar and satellite data streaming in 24/7.

Looking at the QueryGraph class, I get the impression that the query plan processes the SQL where clause in the order it was written. It also appears that a join node in the query graph can take a list of objects on either side. Is that an accurate interpretation?

No, the order of the expressions in the where clause does not really impact query plans; you got that wrong. Esper builds an internal model based on the where and outer join criteria and works from there. -Tom

I was just reading over the docs on the Esper website and had a question. In the "select avg(price) from StockTick.win:time(30 sec) where symbol='BEA'" example, does Esper run the query every 30 seconds? If so, does that create a thread to poll? The kinds of event processing I've dealt with didn't have a sliding window of x seconds. Instead, it's more like "when an optimal for the transaction is found, alert the trader". The other kinds of event processing I'm familiar with also don't use a sliding window. Should an event satisfy the conditions, the trigger is executed. If Esper is using a polling mechanism, I question whether "event processing" is accurate. A polling mechanism is more like a batch process with a short latency between executions.

Peter, you raise a good question, and again this is one of the key differences I am urging you to recognize when you put RETE rule engines and ESP/CEP engines like Esper in the same area while they are not. Time is a first class citizen in Esper, as are sliding windows (governed by time, number of events, etc.). For the statement you take from our doc, there is no polling at all. It is a continuous query over a sliding window, and you literally get something happening in the engine (which filters out, does the aggregation computation and/or produces an output event) every time an event flows in. Alex
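The difference between polling and a continuous query can be sketched in a few lines of Python. The class below is a simplified illustration, not how Esper is implemented: the name `TimeWindowAverage` is hypothetical, the symbol filter is applied before events enter the window, and expiry is only checked when an event arrives, whereas a real engine would also react to the passage of time. It mimics the "avg(price) over the last 30 seconds where symbol='BEA'" statement by updating the average on every incoming event, with no polling thread.

```python
from collections import deque

class TimeWindowAverage:
    """Running average of prices for one symbol over the last `window_sec` seconds."""

    def __init__(self, symbol, window_sec):
        self.symbol = symbol
        self.window_sec = window_sec
        self.window = deque()   # (timestamp, price) pairs, oldest first
        self.total = 0.0

    def on_event(self, timestamp, symbol, price):
        """Called once per arriving event; returns the current average or None."""
        # Expire events that have slid out of the window.
        while self.window and self.window[0][0] <= timestamp - self.window_sec:
            _, old_price = self.window.popleft()
            self.total -= old_price
        if symbol == self.symbol:
            self.window.append((timestamp, price))
            self.total += price
        return self.total / len(self.window) if self.window else None

avg = TimeWindowAverage("BEA", window_sec=30)
results = [
    avg.on_event(0, "BEA", 10.0),
    avg.on_event(10, "IBM", 99.0),   # ignored: wrong symbol
    avg.on_event(20, "BEA", 20.0),
    avg.on_event(40, "BEA", 30.0),   # the tick at t=0 has expired
]
# results == [10.0, 10.0, 15.0, 25.0]
```

Each call does a constant amount of work per arriving or expiring event, which is why no periodic re-execution of the query is needed.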

Thanks for clarifying that. I'm not sure I get it, but I'll attempt to summarize what I "think" I understand. Esper doesn't use a thread to execute queries periodically. Instead, when an event enters the system, it goes through a set of filters, which are the compiled SQL statements.

The sliding window then defines a condition which says "if x condition happens at a max/min of x time, then do something". The kind of real-time processes I've worked with are OMS related. This means there is no max/min sliding window.

There are thousands of transactions in the system and all of them have a different expiration time. A system like an OMS can't do win:time(30 sec) because that doesn't make any sense. A sell order might specify a minimum price of xx.xx dollars and x shares. If any buy matches that price and number of shares, it should go through immediately. Waiting for 30 seconds could mean someone else fills that order.
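The kind of matching described here, with a per-order expiration and no fixed window, can be sketched as follows in Python. This is a deliberately naive illustration with hypothetical names (`OrderBook`, `on_sell`, `on_buy`) and a linear scan over resting orders; it is not taken from any real OMS, rule engine or from Esper.

```python
class OrderBook:
    """Keeps resting sell orders and matches each buy the moment it arrives."""

    def __init__(self):
        self.sells = []  # resting sell orders as dicts

    def on_sell(self, order_id, min_price, shares, expires_at):
        self.sells.append({"id": order_id, "min_price": min_price,
                           "shares": shares, "expires_at": expires_at})

    def on_buy(self, now, price, shares):
        """Return the id of the first matching sell order, or None."""
        # Drop orders whose individual expiration time has passed.
        self.sells = [s for s in self.sells if s["expires_at"] > now]
        for sell in self.sells:
            if price >= sell["min_price"] and shares == sell["shares"]:
                self.sells.remove(sell)
                return sell["id"]
        return None

book = OrderBook()
book.on_sell("S1", 25.00, 100, expires_at=3600)
book.on_sell("S2", 24.00, 100, expires_at=5)
first = book.on_buy(now=10, price=25.50, shares=100)    # "S1" (S2 has expired)
second = book.on_buy(now=11, price=25.50, shares=100)   # None, nothing left
```

The match is evaluated on the arrival of the buy event itself, so there is no waiting period between the event and the reaction.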

The more I look at CEP/ESP, the less useful it becomes. When I compare RETE to Esper, I'm only looking at the compilation of the query. RETE provides one of the most efficient ways of compiling a query into an optimized query plan. Those who say RETE is not a good fit for EDA either A) have never bothered to study RETE or B) have misconceptions about what RETE is.

I see a lot of people saying "RETE is wrong for EDA, CEP, ESP" and go on to show a SQL-like query to prove their point. The first part of RETE is compiling a statement into an optimal query plan. The second part is the runtime indexing. One can apply RETE compilation and forget the runtime, or adapt the algorithm for temporal execution. RETE compilation has nothing to do with whether "time is a first class citizen". It is just an efficient method for compiling statements into optimal relational queries.

Just because some rule engines don't support temporal logic, does not mean RETE is wrong or inappropriate as some commercial vendors are saying. To my knowledge, Esper has never made such a claim, but I have seen some commercial CEP vendors make those kinds of statements. For those who forget, temporal logic is one of the areas that AI and expert systems have pushed. Many of the advances in temporal logic came from AI research.

I'm curious, how does a CEP system handle removal of events, or in rule engine terms, what happens when an event is retracted? I've tried to read up on CEP/ESP the last few days and the literature seems rather thin compared to the mountain of literature on pattern matching and rule engines.

There is some EDA reference architecture work going on at OMG, so you might be interested in adding your bits there. As I said, I think CEP and rules overlap some, ESP far less, and all are part of EDA - and I am biased as well.

Again, I'd be happy to consider a use case with real working code. If I ask an ESP/CEP engine to handle the OMS side of things, it might be that a rule engine will do a better job. If I ask a rule engine to detect a triple bottom pattern on a stock tick, it might be that an ESP/CEP engine will do a better job. It could be that the RFID sample in our Esper JavaOne slides can be solved both ways. If anyone wants to give it a try with their favorite rule engine, please do. The statements are in the slides and all the running code + demo GUI is in our SVN.

By the way, it happens that a very similar thread on RETE 'vs/for' CEP/ESP was started here. Could you confirm you posted there as well on May 7, i.e. Peter == woolfel? The posts under this pseudonym look very similar if not the same...

Yup, I posted several comments on the JBoss blog. In terms of OMS and real-time trading systems, I know of several that use RETE engines for that. One of my long term goals is to create a set of "realistic" standard benchmarks for rule engines. One of them is a pre-trade compliance scenario for processing transaction sets. I have some use cases in CLIPS format, but it's not something the average developer is going to understand. If you want a realistic use case, I would suggest looking at government regulations like 1940, 2A7 www.sec.gov/rules/final/33-7479.txt

I've built pre-trade compliance systems using JESS, so I have a little bit of experience building real-time compliance for OMS. Most diversification rules require the system to incrementally recalculate the aggregates as transactions come in. The aggregates are basically multi-dimensional aggregates and vary between 12 and 20 dimensions.
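Incremental recalculation of multi-dimensional aggregates can be sketched as follows. This Python example is purely illustrative: the class name `ExposureAggregator`, the field names and the two dimensions are made up (the systems described above use 12 to 20), and it is not based on JESS. Each transaction updates running totals in time proportional to the number of dimensions, instead of re-scanning all transactions.

```python
from collections import defaultdict

class ExposureAggregator:
    """Maintains running totals per dimension value as transactions arrive."""

    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.total = 0.0
        # totals[dimension][value] -> summed amount
        self.totals = {d: defaultdict(float) for d in dimensions}

    def apply(self, txn, sign=1):
        """Add a transaction (sign=1) or retract it (sign=-1) in O(#dimensions)."""
        amount = sign * txn["amount"]
        self.total += amount
        for d in self.dimensions:
            self.totals[d][txn[d]] += amount

    def share(self, dimension, value):
        """Fraction of total exposure held in one dimension value."""
        return self.totals[dimension][value] / self.total if self.total else 0.0

agg = ExposureAggregator(["sector", "rating"])
t1 = {"amount": 600.0, "sector": "tech", "rating": "AA"}
t2 = {"amount": 400.0, "sector": "energy", "rating": "AA"}
agg.apply(t1)
agg.apply(t2)
tech_share = agg.share("sector", "tech")       # 600 of 1000
agg.apply(t2, sign=-1)                          # retract the second transaction
tech_share_after = agg.share("sector", "tech")  # 600 of 600
```

A diversification rule such as "no sector above x% of the fund" then becomes a cheap check against `share` after every update.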

"On whether Rete fits CEP/ESP, I think making time and causality a first class citizen is likely to deeply impact any Rete algorithm implementation. Let's leave aside clustering and near real time performance requirements, or joins to relational databases and continuous joins. There has been extensive research in the CEP/ESP field and also around Rete, and I'd tend to argue that if researchers haven't come up with a common implementation so far, this is likely because some walls were hit. Both play a key role in an EDA but I don't believe in one size fits all there."

I'll be blunt here. Very few people understand RETE well enough to implement a high performance rule engine. There are maybe 2 dozen people who know RETE well enough to do so: Dr. Forgy, Gary Riley, Ernest Friedman-Hill, a few researchers at iLog, Paul Haley and a few guys who worked at ART.

The statement about enhancing RETE so that "time is a first class citizen" is untrue. In my blog, I provide several detailed descriptions of how one can enhance RETE to support temporal logic. The type of temporal logic used in business systems is only a tiny subset of the temporal logic used in AI. There isn't a consensus among AI researchers about the best way to handle temporal logic because the AI case is 100x harder to handle than the simple business cases. I have blog entries that attempt to describe temporal logic and how it differs between AI and business rules.

If you want an invite to my blog, email me at Woolfel AT gamil DOT com.

Yes, same Alex, and there should be no doubt about that for anyone, as my first name, last name, pseudonym and affiliation to Esper are explicit in both.

The fact that you don't point me to a real implementation or research papers but to some posts in a blog just makes my point: I would tend to argue that convergence has not happened yet.

In any case we are pretty happy with Esper's performance thus far and so are our users, and if at some point we get smart enough to understand RETE (i.e. find time to study it properly, which I haven't so far) and feel it would bring something to our users, we'll certainly work on changing our underlying implementation. We'd welcome contributions on that if you are interested.

By real implementation, what do you mean? JRules supports stream processing and so does JESS. If you want proof that RETE can handle stream processing in an application that uses Event Driven architecture, all you need to do is look at JRules and JESS.

If you want a real world application, sadly I can't provide any. I do know of systems using RETE, but that code belongs to those companies. Like I said in a previous comment, I do plan to build a "real world" compliance scenario for real-time trading systems, but I haven't finished it. More accurately, I don't have enough free time to implement a full application.

I do know some firms are experimenting with ESP/CEP products to do hedge fund stuff. Actually, many firms have been building these types of systems since mid 90's. They aren't general purpose solutions.

There's no point in re-implementing RETE when you can use JBossRules for the pattern matching. Not only do you get an efficient RETE implementation, you get support for many first order logic concepts like existential, negation, forall and collect. I haven't read the spec for StreamSql, but from the examples I've seen so far, I don't think it is expressive enough to support FOL.

Do you know if StreamSql supports existential, negation, forall and collection?

Just to put this in context, lots of us are really excited about Esper, for all the obvious reasons, but we need your help: help us convince management that this is a viable contender to the entrenched vendors... Somewhat like what JBoss did to penetrate the enterprise. I'm not talking about benchmarks, that's for marketing. My app is my benchmark.

Also, updating the license to something more friendly would be helpful.

So here's the question again: how does it stack up against the Coral8 guide and Stonebraker's rules?

(c) Handle Stream Imperfections (Delayed, Missing and Out-of-Order Data)
Some of the features that Esper provides particularly to deal with these issues are joins, outer joins, patterns, subqueries and data windows.

(d) Generate Predictable Outcomes
Yep, we have worked hard to get deterministic and predictable processing under multi-threaded conditions.

(e) Integrate Stored and Streaming Data
Esper allows SQL queries to be placed right within the query language and provides expiry-time or LRU caches.

(f) Guarantee Data Safety and Availability
Esper does not currently offer a persistence mechanism for events. We are working on an HA feature set.

(g) Partition and Scale Applications Automatically
We are working towards these goals in the HA feature set.

(h) Process and Respond Instantaneously
We achieve that through highly optimized filtering, query planning, indexing and execution and other optimizations.
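To make the caching idea in (e) concrete, here is a small Python sketch of a cache placed in front of a slow lookup such as a relational query. It is an assumption-laden illustration, not Esper's cache: the names `LookupCache` and `load_zone_owner` are hypothetical, the database call is a stand-in function, and the sketch combines expiry time and LRU eviction in one class, whereas the answer above describes them as alternatives.

```python
from collections import OrderedDict

class LookupCache:
    """LRU cache with a per-entry time-to-live in front of a slow lookup."""

    def __init__(self, loader, max_size, ttl_sec):
        self.loader = loader          # e.g. a function that runs a SQL query
        self.max_size = max_size
        self.ttl_sec = ttl_sec
        self.entries = OrderedDict()  # key -> (loaded_at, value), LRU first

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_sec:
            self.entries.move_to_end(key)      # mark as most recently used
            return entry[1]
        value = self.loader(key)               # miss or expired: reload
        self.entries[key] = (now, value)
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)   # evict least recently used
        return value

calls = []

def load_zone_owner(zone):
    """Stand-in for a relational database query."""
    calls.append(zone)
    return f"owner-of-{zone}"

cache = LookupCache(load_zone_owner, max_size=2, ttl_sec=60)
cache.get("A", now=0)    # miss, loads
cache.get("A", now=10)   # hit
cache.get("A", now=70)   # expired, reloads
cache.get("B", now=71)   # miss, loads
cache.get("C", now=72)   # miss, loads and evicts "A", the least recently used
```

The point of such a cache is that most events enriching themselves with stored data never touch the database, which keeps per-event latency low.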

Is it possible to elaborate on the scalability of Esper across multiple JVMs running on the same or different physical machines? If two Esper engines are running in two JVMs as part of existing applications, how do you ensure that only one Esper engine picks up the event? Also, how do you load balance events among multiple Esper engines that are subscribed to the same events? A typical example would be downloading a large document. If I want to build subscribers that listen for the document download completion event and act upon it, how do you ensure the events are load balanced across multiple Esper engines running on multiple JVMs?

I have read in this blog that Esper uses the Delta-Network algorithm from Caltech. The Delta-Network algorithm has many similarities with RETE; it's actually RETE++ and allows dynamic changes in the rules along with the facts. Is it safe to assume that Esper has RETE influences in its algorithm?