Data Aggregation and Bayes Classifiers

By Arunava Chatterjee, October 24, 2009

CEP can monitor business events and adapt business processes in a near real-time fashion.

Naive Bayes Classifiers

Bayes classifiers have recently been used to filter and classify data and have been leveraged by a variety of software [7,8,9,10]. In particular, naive Bayes classifiers have demonstrated surprising efficacy in addressing real-world problems [12]. A naive Bayes classifier assumes that any pair of events is statistically independent, given the class [11]. The naive independence assumption implies a probabilistic orthogonality among data points. With this assumption it can be shown that:

Equation 3

where Z is a scaling factor. This provides an explicit formula for calculating the probability that a collection of elements G corresponds to a security level hk for k = 1,…,m.
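
Since the equation itself is not reproduced in the text, the standard naive Bayes posterior is restated here as a sketch. The symbols g_i (the individual elements of G) and n (the number of elements) are assumed notation, as is the 1/Z convention for the scaling factor; hk, G, Z, and m come from the text.

```latex
P[h_k \mid G] \;=\; \frac{1}{Z}\, P[h_k] \prod_{i=1}^{n} P[g_i \mid h_k],
\qquad
Z \;=\; \sum_{l=1}^{m} P[h_l] \prod_{i=1}^{n} P[g_i \mid h_l]
```

Under this convention Z is the same for every security level, so it only rescales the scores so that they sum to one.
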
To classify a collection, we consider the ratio of probabilities r = P[hj|G]/P[hk|G]. In our scenario, hj may correspond to "Unclassified" data while hk corresponds to "Classified" data. This leads to the following decision rule:

If r < 1, the collection G is classified in security level hk.

If r = 1, there is no conclusion.

If r > 1, the collection G is classified in security level hj.
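
The ratio test above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in the article: the element names, priors, and per-element likelihoods are invented for the example, and the function names `log_score` and `classify` are hypothetical. Working with logarithms avoids underflow when many small probabilities are multiplied, and the scaling factor Z cancels in the ratio, so it never has to be computed.

```python
import math

def log_score(prior, likelihoods, elements):
    """Unnormalized log posterior: log P[h] + sum of log P[g | h] over elements."""
    return math.log(prior) + sum(math.log(likelihoods[g]) for g in elements)

def classify(elements, prior_j, lik_j, prior_k, lik_k, tol=1e-9):
    """Apply the ratio rule r = P[hj|G] / P[hk|G] in log space."""
    log_r = log_score(prior_j, lik_j, elements) - log_score(prior_k, lik_k, elements)
    if abs(log_r) <= tol:      # r = 1: no conclusion
        return "no conclusion"
    return "hj" if log_r > 0 else "hk"   # r > 1 -> hj, r < 1 -> hk

# Hypothetical numbers: hj = "Unclassified", hk = "Classified".
lik_unclassified = {"name": 0.6, "salary": 0.3, "clearance": 0.1}
lik_classified = {"name": 0.2, "salary": 0.3, "clearance": 0.5}

result = classify(["name", "salary"], 0.5, lik_unclassified, 0.5, lik_classified)
print(result)
```

In practice the r = 1 case is rarely hit exactly, so a tolerance band around 1 (here `tol`, an assumed parameter) is one way to route borderline collections to a human.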

Training of the classifier can be done in a number of ways. One typical approach is Maximum Likelihood Estimation (MLE), a method popularized by R. A. Fisher. In this approach the free parameters of the system are estimated using the sample mean and sample variance [7].
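
As a rough sketch of what MLE training looks like, the following assumes each security level is described by a single numeric feature with a Gaussian distribution; that assumption, the sample values, and the function names `fit_gaussian_mle` and `gaussian_log_pdf` are illustrative and do not come from the article. Under that assumption, the maximum likelihood estimates are the sample mean and the 1/n form of the sample variance.

```python
import math

def fit_gaussian_mle(values):
    """Return the MLE mean and variance of a Gaussian fitted to `values`."""
    n = len(values)
    mean = sum(values) / n
    # The MLE of the variance divides by n, not n - 1.
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_log_pdf(x, mean, var):
    """Log density of a Gaussian, usable as log P[g | h] in the classifier."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical training sample for one security level.
mean, var = fit_gaussian_mle([2.0, 4.0, 6.0])
print(mean, var)
```

One such (mean, variance) pair would be fitted per feature and per security level; the resulting log densities then take the place of the per-element probabilities in the ratio test.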

Coded examples of naive Bayes classifiers can be found in texts and on the Internet [7,8,9,10]. According to our CEP vendor, a Bayesian classifier is under development. We will therefore either leverage the CEP vendor's implementation of Bayes classifiers or create a plug-in [6,7,13].

Classification Process

Probabilistic techniques are imperfect and cannot replace a human, especially in a domain such as security. However, the intent here is not to replace human evaluation of potential security threats but rather to reduce the problem to one where a human is involved only in evaluating borderline scenarios. To achieve this, a combination of deterministic and probabilistic approaches is used.

Following the initial training of the classifier on sample data, the process is as follows:

As data aggregates are required (e.g., for reports), the CEP software checks to see if any rules exist for this aggregate.

If the rules indicate that a user cannot view the aggregate, the request is rejected and the reviewer is informed. The reviewer may allow or disallow on a case-by-case basis.

Otherwise, the aggregate is generated and the report is created.

If there are no existing rejection rules, the classifier is run against the metadata to see if the report corresponds to the expected security level.

If the report is not in the expected security level, the requester is denied the report and it is sent to a reviewer to address on a case-by-case basis.

Otherwise, the report is created and continues to its requester.

In either case, the classifier adds the data aggregate to a lookup of known aggregates.

Allow/deny decisions from the reviewer are used to update the classifier.
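
The steps above can be sketched as a small gatekeeper class. Everything here is hypothetical scaffolding rather than the CEP vendor's API: `AggregateGate`, its fields, and the stand-in `predict_level` function are invented for illustration, and the trained classifier is reduced to a callable that returns a security level for an aggregate's metadata.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Set, Tuple

Aggregate = FrozenSet[str]  # names of the metadata elements in a requested aggregate

@dataclass
class AggregateGate:
    predict_level: Callable[[Aggregate], str]  # stand-in for the trained classifier
    deny_rules: Set[Tuple[str, Aggregate]] = field(default_factory=set)
    known_aggregates: Set[Aggregate] = field(default_factory=set)
    review_queue: List[Tuple[str, Aggregate]] = field(default_factory=list)
    feedback: List[Tuple[Aggregate, bool]] = field(default_factory=list)

    def request(self, user: str, aggregate: Aggregate, expected_level: str) -> str:
        # Deterministic step: an explicit rule rejects the request outright.
        if (user, aggregate) in self.deny_rules:
            self.review_queue.append((user, aggregate))
            return "rejected"
        # Probabilistic step: compare the predicted level with the expected one.
        predicted = self.predict_level(aggregate)
        self.known_aggregates.add(aggregate)  # recorded in either case
        if predicted != expected_level:
            self.review_queue.append((user, aggregate))
            return "denied"
        return "created"

    def record_review(self, aggregate: Aggregate, allowed: bool) -> None:
        # Reviewer decisions are kept as labeled examples for retraining.
        self.feedback.append((aggregate, allowed))

# Toy classifier: any aggregate containing "clearance" is treated as Classified.
gate = AggregateGate(
    predict_level=lambda agg: "Classified" if "clearance" in agg else "Unclassified"
)
gate.deny_rules.add(("bob", frozenset({"name", "salary"})))

print(gate.request("alice", frozenset({"name", "salary"}), "Unclassified"))
print(gate.request("alice", frozenset({"name", "clearance"}), "Unclassified"))
print(gate.request("bob", frozenset({"name", "salary"}), "Unclassified"))
```

How reviewer feedback is folded back into the classifier (full retraining versus incremental updates) is left open here, as it is in the text.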

Figure 1 is an activity diagram for the process.

Figure 1: Process for incorporating Classifier

Conclusion

In this article, I've described a mechanism for using Bayes classifiers to reduce the security inferencing problem. The mechanism employs a combination of deterministic and nondeterministic elements -- the most interesting of these being the naive Bayes classifier. Note that this approach is fundamentally no different from document classification. Given the success of Bayesian techniques in document classification, they are worth considering as a mechanism for security classification. This implementation is in its nascent stages and we have yet to consider issues such as temporal constraints on metadata or staleness of the classifier's lookup. In a subsequent article, I hope to address these issues and report on the outcomes.
