Predicting Events with Logistic Regression
In earlier posts, CEP by Apache Mahout via the Google MapReduce Framework and Apache Mahout: Real-Time Decisioning in the MapReduce Framework, we started to look at the Google MapReduce framework and the planned analytics of the Apache Mahout development team. In this post, we will look at the first algorithm mentioned by the Mahout team, logistic regression.
Techniques for analyzing and modeling event data can be divided into two main categories: supervised learning and unsupervised learning. Supervised learning requires input data that has both predictor (independent) variables and a target (dependent) variable whose value will be estimated. From examples where the target value is already known, the process “learns” how to model (predict) the value of the target variable based on the predictor variables.
Decision trees, regression analysis and neural networks are all examples of supervised learning. If the goal of an analysis is to predict the value of some variable, then supervised learning is the recommended approach.
Logistic regression (LR) belongs to a class of statistical models called generalized linear models. It predicts the probability that an event will occur by fitting data to a logistic curve. LR makes use of several predictor variables that may be either numerical or categorical, and it is used to predict, for example, population growth, a customer’s likelihood to purchase a product or cancel a subscription, market adoption of a new technology, the growth of tumors, and chemical reactions, to name a few.
The Apache Mahout team describes LR as follows:
Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categories.
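To make the idea of fitting data to a logistic curve concrete, here is a minimal Python sketch. It is not Mahout's implementation. The customer data, the predictor names (monthly_visits and plan), and the helper functions are all made up for illustration, and plain batch gradient descent stands in for whatever optimizer a real library would use. The sketch combines one numerical predictor and one categorical predictor to estimate the probability that a customer is retained.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def encode(monthly_visits, plan):
    """Turn raw predictors into a feature vector (hypothetical predictors).

    [bias term, scaled numerical predictor, categorical predictor as 0/1]
    """
    return [1.0, monthly_visits / 10.0, 1.0 if plan == "premium" else 0.0]


def predict_proba(weights, features):
    """Estimated probability that the event (here: retention) occurs."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def fit(rows, labels, learning_rate=0.5, epochs=5000):
    """Fit the weights by batch gradient descent on the mean log loss."""
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict_proba(weights, features) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Made-up training data: (monthly visits, plan, retained?)
customers = [
    (1, "basic", 0), (2, "basic", 0), (3, "basic", 0), (6, "basic", 0),
    (5, "basic", 1), (8, "basic", 1),
    (3, "premium", 0), (4, "premium", 1), (7, "premium", 1), (9, "premium", 1),
]
rows = [encode(visits, plan) for visits, plan, _ in customers]
labels = [retained for _, _, retained in customers]

weights = fit(rows, labels)

# Score two customers the model has not seen.
p_low = predict_proba(weights, encode(1, "basic"))
p_high = predict_proba(weights, encode(9, "premium"))
print(f"P(retained | 1 visit, basic)    = {p_low:.2f}")
print(f"P(retained | 9 visits, premium) = {p_high:.2f}")
```

The output of the model is a probability rather than a yes/no answer, so turning it into a prediction means choosing a cutoff (0.5 is a common default, but the right value depends on the cost of each kind of mistake).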
Andrew H. Karp wrote an interesting paper, Using Logistic Regression to Predict Customer Retention. In his paper, Andrew says:
[The LR] approach to “pattern recognition” or “data mining” is particularly well suited to applied statistical analyses of consumer behavior. Logistic regression models are frequently employed to assess the chance that a customer will: a) re-purchase a product, b) remain a customer, or c) respond to a direct mail or other marketing stimulus.
One question, of course, is how to integrate LR into real-time event processing to help predict interesting events, like fraud, system failure, or customer buying habits. I will address this in a future post.