Classification in Complex Event Processing
Following up on the excellent discussion in Predicting Events with Logistic Regression, I think it is time to talk a bit about the importance of classification in complex event processing. CEP is, by definition, about detecting business opportunities and threats in real time. It follows that, by definition, CEP is centered on classifying and discriminating complex events as either an opportunity or a threat.
In earlier posts, I have often mentioned the importance of Bayesian analytics in CEP/EP. The Apache Mahout development team specifically lists Support Vector Machines (SVM), logistic regression, Bayesian networks, Perceptron/Winnow, and neural networks as classification algorithms.
Note: The key concept of Bayes' theorem is that the true rates of false positives and false negatives are a function not of the accuracy of the test alone, but also of the actual rate, or frequency, of the condition within the test population; often, the base rate of the condition in the sample being tested is the more powerful factor.
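To make the base-rate point concrete, here is a minimal sketch of Bayes' theorem applied to a detector. The numbers (99% sensitivity, 1% false-positive rate, 1-in-1,000 prevalence) are illustrative assumptions, not figures from this post:

```python
def posterior_threat(prevalence, sensitivity, false_positive_rate):
    """P(threat | alarm) via Bayes' theorem.

    prevalence          -- P(threat), the base rate in the event population
    sensitivity         -- P(alarm | threat), the true positive rate
    false_positive_rate -- P(alarm | no threat)
    """
    p_alarm = (sensitivity * prevalence
               + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_alarm

# A detector that is 99% sensitive with only a 1% false-positive rate,
# applied where just 1 in 1,000 events is a real threat (assumed numbers):
p = posterior_threat(prevalence=0.001, sensitivity=0.99,
                     false_positive_rate=0.01)
print(round(p, 3))  # → 0.09
```

Even with a seemingly accurate detector, roughly nine out of ten alarms are false, because benign events vastly outnumber threats. This is exactly why the base rate, not the test accuracy alone, dominates.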
However, before diving into these methods (we will continue in future posts), let’s discuss classification a bit more.
Looking to our friend Wikipedia, statistical classification is a procedure in which individual objects are grouped based on quantitative analysis of one or more characteristics inherent in the objects, using a training set of previously grouped objects. We often see this form of classification in network-based intrusion detection, where neural networks are trained to baseline normal network traffic, and that training set is used to classify traffic as normal or abnormal. We see a similar application in spam detection, where Bayesian classifiers are used to label text as spam or ham. We like ham; spam is bad.
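The spam/ham example can be sketched in a few lines. This is a toy naive Bayes classifier with add-one smoothing; the tiny training set and word-splitting tokenizer are assumptions for illustration only:

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count word frequencies per class from (text, label) training pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in labeled_docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Naive Bayes with add-one smoothing; returns the more probable label."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_score = None, float("-inf")
    for label in counts:
        # log prior + sum of log likelihoods for each word
        score = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Illustrative training set (assumed):
training = [("win money now", "spam"), ("cheap money offer", "spam"),
            ("meeting notes attached", "ham"), ("lunch meeting today", "ham")]
model = train(training)
print(classify("free money", *model))  # → spam
```

The same pattern, trained on baselines of "normal" events instead of ham, underlies the intrusion-detection case: the classifier only knows what the training set taught it.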
There is no single classifier that works best on all problems, and various tests must be performed to compare classifier performance. In a statistical classification problem, precision is the number of true positives divided by the total number of elements labeled as belonging to the class. False positives are objects incorrectly labeled as belonging to the class (ham classified as spam, for example). False negatives are objects that were not labeled as belonging to the class but should have been (spam classified as ham, for example). Recall, in this case, is defined as the number of true positives divided by the total number of elements that actually belong to the class (i.e., the sum of true positives and false negatives). These ratios can be translated directly into probabilities.
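These two definitions reduce to simple arithmetic on the confusion counts. The spam-filter numbers below are made up for illustration:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# A filter that flags 8 real spams, mislabels 2 hams as spam,
# and lets 4 spams through (illustrative counts):
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r)  # precision 0.8, recall 2/3
```

Note that true negatives appear in neither formula: a filter can have perfect precision and recall on spam while its handling of ham is described only indirectly, which is why both metrics are usually reported together.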
It should be easy to see that CEP problems of detecting opportunities and threats in real time can be viewed as classification problems in which precision, recall, true positives, true negatives, false positives, and false negatives are key concepts. Most classes of CEP problems revolve around optimizing the tradeoff between falsely classifying an object as belonging to a group (for example, a false positive threat) and missing the threat altogether (a false negative). The terms Type I error (a false positive) and Type II error (a false negative) describe the possible detection errors in statistical decision processes.
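One common way this tradeoff is tuned is with a decision threshold on a classifier's score: raising the threshold suppresses Type I errors at the cost of more Type II errors, and vice versa. The scores below are invented purely to show the mechanism:

```python
# Illustrative classifier scores (assumed, not from any real detector):
scores_threats = [0.9, 0.7, 0.4]   # scores for events that ARE threats
scores_benign  = [0.6, 0.3, 0.1]   # scores for benign events

for threshold in (0.5, 0.8):
    # Type II errors: real threats scored below the alarm threshold
    fn = sum(s < threshold for s in scores_threats)
    # Type I errors: benign events scored at or above the threshold
    fp = sum(s >= threshold for s in scores_benign)
    print(f"threshold={threshold}: Type I={fp}, Type II={fn}")
```

At the lower threshold we raise one false alarm and miss one threat; at the higher threshold the false alarm disappears but two threats slip through. No threshold eliminates both error types at once on overlapping score distributions.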
In some classification problems, there can be zero tolerance for false negatives. One example would be the threat of a nuclear strike; in that case, a false negative has far greater impact on missile defense than a false positive. Well, that might not be true if the false positive results in a counter missile strike!! This outlines some of the core challenges of detecting opportunities and threats in real time. CEP is non-trivial, but let's not get into game theory this year.
Classification of events is critical for complex event processing. Complex events are harder to classify than simple events. This might be a good place to start when formulating a quantitative definition of a complex event.