Predicting Events with Logistic Regression

In earlier posts, CEP by Apache Mahout via the Google MapReduce Framework and Apache Mahout: Real-Time Decisioning in the MapReduce Framework, we started to look at the Google MapReduce framework and the planned analytics of the Apache Mahout development team.  In this post, we will look at the first algorithm mentioned by the Mahout team, logistic regression.

Analytics for analyzing and modeling event data can be divided into two main categories, supervised learning and unsupervised learning.   Supervised learning requires input data that has both predictor (independent) variables and a target (dependent) variable whose value will be estimated.  By various techniques, the process “learns” how to model (predict) the value of the target variable based on the predictor variables.

Decision trees, regression analysis and neural networks are all examples of supervised learning.   If the goal of an analysis is to predict the value of some variable, then supervised learning is the recommended approach.

Logistic regression (LR) belongs to a class of statistical models called generalized linear models. It predicts the probability that an event occurs by fitting data to a logistic curve.  LR makes use of several predictor variables that may be either numerical or categorical and is used to predict, for example, population growth, a customer’s likelihood to purchase a product or cease a subscription, market adoption of a new technology, the growth of tumors, and chemical reactions, to name a few.
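To make the "logistic curve" concrete, here is a minimal sketch in Python. The weights, bias and feature values are made up for illustration (a hypothetical two-feature churn model); they do not come from Mahout or from any fitted model.

```python
import math

def logistic(z: float) -> float:
    """Logistic (sigmoid) curve: maps any real number into (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(weights, bias, features):
    """Probability of the event given numeric predictor values."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -1.2]
bias = -0.5
p = predict_probability(weights, bias, [2.0, 1.0])
print(f"estimated probability of the event: {p:.3f}")
```

The linear combination of predictors can be any real number; the logistic curve squashes it into a value that can be read as a probability.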

The Apache Mahout team describes LR as follows:

Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categories.

Andrew H. Karp wrote an interesting paper, Using Logistic Regression to Predict Customer Retention.  In his paper, Andrew says:

[The LR] approach to “pattern recognition” or “data mining” is particularly well suited to applied statistical analyses of consumer behavior. Logistic regression models are frequently employed to assess the chance that a customer will: a) re-purchase a product, b) remain a customer, or c) respond to a direct mail or other marketing stimulus.

One question, of course, is how to integrate LR into real-time event processing to help predict interesting events, like fraud, system failure, or customer buying habits.   I will address this in a future post.

19 Responses to “Predicting Events with Logistic Regression”

  1. Hi Tim!

    Sorry for the shameless propaganda I’m doing on your blog, but this question has been addressed by SENACTIVE for a long time. I don’t know why the recognition is so low within the CEP community.

    But algorithms like the ones you described can be easily integrated into the event processing flow as an “event service”. What we can do, and actually do, is use incremental machine learning algorithms that are attached to the event flow. With each event passing into such a service, the classification model adapts itself.

    However, in most cases (let’s say 80%) the business requirements can easily be modeled by rules using advanced components like metrics or scores.

  2. Hello and Thanks for Visiting!

    To be frank, I did a google search with keywords

    Senactive logistic regression

    … and the results were ZERO useful hits.

    Perhaps you can post links or references in your future posts?

    Regarding your comment on rules, metrics and scores, my research and work indicates that rules, metrics and scores account for less than 30 percent of the overall event processing solution space. (I just made up that statistic…, but you get the idea!)

    CEP/EP is about “detecting opportunities and threats” in real time. Detection is often a problem of classification. Rules (alone) are an inefficient classification method for most complex data / event sets.

    PS: Google would be “not in business” if they used rules as their backbone classification technology, BTW.

    Thanks and Warm Regards,

    Tim

    I’ve been thinking about map reduce for a few years now. Getting map reduce to work for real-time is going to be a challenging task. As far as I understand it, Google and other search engines crawl websites in batches, so it’s not done in real-time. Retrieving the results is done in real-time, but a large part of the analysis is performed off-line.

    From a pattern matching theory perspective, the system would need to intelligently divide the work, so that existential and negated patterns work correctly. Any statistical analysis would also need to be divided intelligently. In some cases a given calculation may need to be translated to an alternate form, so it can be distributed across a cluster. In some cases, the calculation might require the entire dataset, which would make dividing the task impossible.

    It’s a fascinating domain. I’m hopeful that over the next 10-20 years, we will make significant progress solving these complex issues.

    peter

  4. Hi Peter!

    Great to see you.

    The issue is that event processing is not about “pattern matching” per se, generally speaking, it is about processing events, detecting opportunities and threats in real-time which is less about “pattern matching” and more about “classification”.

    One of the fallacies about CEP/EP is that it is about “pattern matching”. This fallacy is promoted by vendors selling software that does “pattern matching”, LOL.

    I have not checked with the Apache Mahout development team, but I cannot think of a reason, off hand, why processing massive datasets with a mapreduce core cannot happen in real time.

    Mapreduce is useful not only for Google’s batch classification, but also for the classification of massive datasets. Can this happen in real-time? I don’t have a good answer since I have not asked the Apache Mahout development team, but I will :-)

    Yours faithfully, Tim

    EDIT: I just posted this question to the Apache Mahout development team.

  5. With off-line learning systems like logistic regression, you can definitely work with real-time events, but the learning happens in a batch process, not in real-time.

    The part that Hadoop and Mahout can help with is the off-line portion. The on-line portion of logistic regression is generally so fast and so trivially parallelizable that you don’t need to worry about fancy stuff like map-reduce.

    So the answer is yes, but not the way you mean it in your question.
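The split described in this comment can be sketched in a few lines of Python. This is an illustration only, not Mahout or Hadoop code: the tiny dataset, the function names `train_offline` and `score_event`, and the learning rate are all assumptions. Training runs as a batch job over stored data, while scoring a live event is a single dot product against frozen weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_offline(examples, labels, learning_rate=0.5, epochs=2000):
    """Batch gradient descent over the full stored dataset (the off-line part)."""
    n = len(examples)
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(examples, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def score_event(model, event):
    """Score one incoming event with frozen weights (the on-line part)."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, event)))

# Hypothetical historical data: one scaled feature, label 1 = event occurred.
history = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
outcomes = [0, 0, 0, 1, 1, 1]
model = train_offline(history, outcomes)

# Each live event is scored independently, so scoring parallelizes trivially.
print(score_event(model, [0.95]), score_event(model, [0.05]))
```

Because `score_event` touches no shared state, any number of events can be scored concurrently against the same model.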

  6. Btw… in real production systems such as fraud detection, it is not generally acceptable for any model to adapt in an on-line fashion. Any model changes have to be extensively tested and verified to avoid disastrous surprises.

    This is a business requirement that is pretty robust even in the face of arguments of improved performance. The perceived risk is simply too large to stomach. This perception comes from a sober assessment of history where experience shows that even carefully vetted models built using off-line methods (which are easier to get right than on-line models) do not always improve performance and sometimes decrease performance when deployed on real decision traffic.

    In addition, it is common for there to be constraints on model behavior that are very difficult to encode into a learning algorithm whether off-line or on-line.

    There are many other applications where on-line learning can plausibly be used (think spam detection), but these are generally applications that do not have significant business rule or regulatory components. It is also surprisingly common for on-line learning to have little or no performance benefit compared to relatively frequent off-line updates.

    Off-line updates also have the advantage of being amenable to techniques such as map-reduce. The key benefits of map-reduce are not simply parallelism. The first benefit is that almost all access to disk or memory is highly sequential in nature. This can result in several orders of magnitude in performance improvement. A second benefit is that map-reduce programs are typically nearly scale-free. This means that higher performance can be dialed in at run-time. Off-line updates in many cases also provide better convergence properties, which leads directly to compute savings.

    Overall, then, the situations where on-line learning is clearly better are really pretty limited.
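One reason off-line logistic regression fits map-reduce is that the batch gradient is a sum over examples, so it can be computed shard by shard and then added up. The sketch below illustrates that idea in plain Python; it is an assumption about how such a job could be structured, not a description of Mahout's implementation, and the shard layout and function names are hypothetical.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def map_partial_gradient(shard, weights):
    """Map step: gradient contribution of one shard, read sequentially."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    return grad

def reduce_sum(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [ai + bi for ai, bi in zip(a, b)]

def full_gradient(shards, weights):
    partials = (map_partial_gradient(shard, weights) for shard in shards)
    return reduce(reduce_sum, partials)

# Hypothetical data: (features, label) pairs split across two shards.
data = [([1.0, 0.2], 0), ([1.0, 0.9], 1), ([1.0, 0.4], 0), ([1.0, 0.7], 1)]
weights = [0.1, -0.3]
sharded = full_gradient([data[:2], data[2:]], weights)
unsharded = full_gradient([data], weights)
print(sharded, unsharded)
```

Adding more shards changes only how the work is divided, not the result, which is the sense in which such a job scales by adding machines.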

  7. Hi Ted,

    Thanks for your reply. Excellent and very informative, illustrating why the event processing community needs people like you to help further the state-of-the-art.

    Basically, if I understand you correctly, mapreduce is useful for the off-line learning process that is required to build the models that would be used in a real-time event processing application.

    You further state that the economies of scale from mapreduce would be overkill for real-time processing; for example, you cite real-time logistic regression as being trivially parallelizable, so there would be no benefit in using Hadoop/Mahout for real-time, on-line logistic regression.

    Basically, you are saying that the off-line learning process (model building) from processing massive datasets with mapreduce provides the greatest benefits to on-line event processing, if I understand you correctly.

    Thanks! I think I got it!

    Yours faithfully, Tim

  8. I should clarify what I meant by pattern matching. I was thinking of a situation where there is a known pattern, which the system has to detect in real-time. That pattern could be described by a mathematical model, bayesian network, kalman filter, rules or a combination of all the above.

    Some patterns lend themselves to parallelization, while others don’t. The part that interests me is figuring out a formal approach for handling the situations where we can divide the work. If a particular model requires the entire dataset to classify the event, it’s difficult no matter what.

    From a high level perspective, processes that fit into the Single Instruction Multiple Data (SIMD) model should fit nicely into map-reduce. In some ways, the distributed collaborative agent approach uses the same ideas.

    peter

  9. Hi Peter,

    Thanks for the clarification.

    However, I would like to point out that most “interesting” classification problems are not really “pattern matching”, where there is a known pattern that is matched. Rather, the event-data is classified in real-time, with some probability of classification. I would not call this “pattern matching” per se. I think that is why the broad category is “classification”, where the process is often based on a statistical analysis of the data, not specific matches to pre-existing patterns.

    Perhaps we are simply discussing semantics?

    “Pattern matching” seems to be a term used more by “the rules crowd” or the “query crowd”, as I do not see this term being used so much by the “statistical analytics crowd”.

    For example, I would not call a Bayesian Classifier a “pattern matching algorithm” and I don’t think the literature refers to Bayesian Classifiers as “pattern matchers” either.

    (I need to check this, but Google says I’ve Googled too much and they have blocked me, LOL)

    Yours faithfully, Tim

  10. This is a minor clarification at best, so feel free to ignore…

    It might be worth noting the difference between (1) what Machine Learning people call an “on-line” algorithm, and (2) updating a model (e.g. logistic regression) in real-time. It sounded like the OP was talking about #2. An on-line ML algorithm is one with constant memory overhead which trains by seeing a single stream of data examples. Real-time updating of a batch algorithm (like logistic regression) will store the entire data set. One can achieve real-time updates with a batch algorithm like logistic regression by performing a few gradient-descent-like updates for each new example. However, to satisfy the “real-time” goal, it is likely necessary to be able to store the entire example set in memory. Even then, it could be slow if your number of examples is in the millions. It’s not Java, but Python’s scipy.optimize provides some nice, fast, general optimization routines (such as conjugate gradients and L-BFGS).

    Jason

    Jason Rennie
    Research Scientist, ITA Software
    http://www.itasoftware.com/
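As a rough illustration of the "gradient-descent-like update for each new example" mentioned in this comment, here is a single stochastic gradient step for logistic regression in plain Python. It is a sketch under assumed names (`sgd_update`) and an arbitrary learning rate, not code from the commenter; it shows only the per-example update and stores no example set.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(weights, bias, x, y, learning_rate=0.1):
    """One gradient step on the log-loss of a single example (x, y)."""
    p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    error = p - y  # gradient of the log-loss with respect to the linear score
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias

# Start from an untrained model and absorb one hypothetical positive example.
weights, bias = [0.0], 0.0
weights, bias = sgd_update(weights, bias, x=[1.0], y=1)
print(weights, bias)
```

Memory use here is constant in the number of examples seen; a real-time variant of a batch method would instead keep the stored examples and run a few such steps over them as each new one arrives.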

  11. Dear All,

    One of my takeaways from this discussion is the need to discuss the various problems in the event processing domain vis-a-vis on-line, off-line and hybrid approaches to learning and updates.

    This has been an excellent discussion and I would like to thank Ted, Peter and Jason for their comments.

    More discussion welcome!

    Yours faithfully, Tim

    I agree that many of the most interesting classifications don’t fit in the old pattern matching definition. In my studies, I’ve been looking at Kalman and Markov techniques, which are statistical/mathematical approaches. In the case of supervised training, many of the papers I’ve read have difficulty explaining how the system produced the model. In some of the case-based reasoning papers I’ve read, there’s a set of metarules that guides the system. As the system learns and adapts, the metarules are augmented and altered to improve the accuracy.

    I think all of these techniques “could” work together to make systems more adaptable, but a lot of research is still needed to make them practical. I’ve enjoyed the discussion.

    peter

  13. Hi Peter,

    I enjoyed this topic as well. I plan to write more on this topic, and similar topics, in the near future. Your comments are always warmly appreciated and very insightful.

    Yours faithfully, Tim

  14. Dear Tim,

    Fraud is a highly sensitive topic - officially it is not a problem ;-)

    Maybe you have noticed that Senactive has been doing fraud detection and prevention for some time now, and we just recently expanded into the financial transaction (credit-card) area, where we can show a fraud detection performance of 91% true positives. This is done in real-time, so there is no post-evaluation after your money has gone. And we are talking here about transaction sizes in the billions!

    There are three approaches that we take in fraud detection:

    a) Modeling situations with rules using scoring and metrics that are calculated on the fly in order to make some dynamic adaptations.

    b) Applying data mining techniques (various algorithms, depending on the use case) and extracting classification models that are then applied by “event services” to streams of events.

    The advantage of this approach is that you have a model whose performance can be statistically evaluated and predicted (with a certain probability of course). However, these classification models need to be reevaluated (i.e. recalculated and verified) in regular cycles.

    c) Applying machine learning algorithms on-line during event processing. This has the advantage that the classification adapts itself quickly and dynamically, but it might also drift in the wrong direction. It is also highly resource consuming.

  15. Thanks daSepp for sharing your experience. That’s very close to what I was thinking, but couldn’t find the right words. I know Mindbox has used similar techniques in the past with their CBR engine. One area that really interests me is finding a formal method of validating classification models. I’m curious, what kind of algorithm do you use to evaluate the classification models?

    peter

  16. I’m not sure if I understand your question right.

    What do you mean by validating a classification model?

    - Validating which algorithm to use?
    - or Validating the performance of a generated classification model applied on events (like financial transactions)?

  17. Sorry for being obtuse and unclear.

    Ignoring how the model is created, what method(s) are used to validate the model before deploying it to production? Say I take a finite set of data and use machine learning to produce a model. How is it validated to measure the effectiveness? Do you run a series of positive and negative cases through the system and measure false positives and false negatives?

    Once it’s deployed into production, as the system alters and adapts the model to changing input, how does the system validate the changes to the model are good? Does it run through a set of test data or do you use statistical model validation?

    In the past, I’ve mainly used predefined test cases with test data to validate the system. Clearly that doesn’t scale well for dynamic environments like the scenarios that Tim mentions.

    I imagine that in situations like network intrusion detection, the attack patterns change rapidly. If we use a Bayesian filter to train a system and produce a model, it would only be effective for a short period of time. Once the attackers change their strategy, the model could be invalid. For a system to work well, it would need to continuously adapt and update the model. If the system doesn’t validate the runtime model changes, the accuracy could decrease.

    peter

  18. Hi Peter,

    What we do before deploying a model to production depends on the domain and the algorithm that is used. These decisions are made by our specialists. ML algorithms tend to overfit training data, which leads to superb results in validation but can mean very poor performance on new events. Therefore we hold out data from training and run extensive cross-validation tests while generating models.

    What we usually measure is the accuracy, precision and recall of the model. That’s the standard in ML.
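For readers unfamiliar with these three measures, the sketch below computes them from held-out labels and model predictions in plain Python. The labels and predictions are invented for illustration, and the function name `classification_metrics` is hypothetical; this is not the commenter's tooling.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Precision: of the events flagged, how many were real?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real events, how many were flagged?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical held-out set: 3 fraudulent and 5 legitimate transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Low precision means many false positives and low recall means many false negatives, so which measure matters most depends on which kind of error the application can least afford.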

    However, we also do backtesting from a business perspective and analyze monetary attributes as well. There are certain cases, for instance, where you can’t risk having false positives.

    This whole topic of algorithm selection

  19. Dear Peter and daSepp,

    This is a great discussion that hints at some of the core issues.

    Production models must adapt (learn) in near-real-time for many applications. Hence, there is the process of learning in real-time, just as there is the process of validating the model and then updating the production systems.

    One size does not fit all, as daSepp mentions. Some applications cannot tolerate a single false positive; other applications cannot tolerate a single false negative. Many are somewhere in between.

    Hence, as mentioned in my event processing keynote in March 2006 (see below), we need to map the applications (and their requirements) to the appropriate analytics and architectural approach.

    This top-down approach of mapping applications to analytical requirements and architecture is critical if the state-of-the-art of event processing is to move forward. For example, see slides 26 - 28 of this March 2006 presentation:

    http://www.slideshare.net/TimBassCEP/processing-patterns-for-predictive-business-presentation

    Clearly, we need to map the solution space into the problem space, as illustrated in slides 26-28.

    Yours sincerely, Tim

Copyright © 2007-2008, The CEP Blog, All Rights Reserved.