Real-Time Predictive Analytics for Web Servers
We recently made the decision to move to Zabbix to monitor one of our busy production Apache web servers. One of the things we need to do in the future is try to predict system outages and take corrective actions before the system actually goes down.
For example, a busy server recently experienced an outage that appeared to be caused by either a kernel bug or a cyberattack based on the Treason Uncloaked! TCP issue. The events leading up to the outage were so severe that our server logs and system stats halted before the outage occurred. The situation was complex and we still don’t know exactly what caused the problem.
This is a good opportunity for us to experiment with some real-time predictive analytics. So, after we get our agents and logfile monitoring extensions configured to gather the required event data, such as logfile entries, CPU stats, open file descriptor stats, open sockets, spiders on site stats, etc., we plan to move to the next step.
Our vision for the next step is to feed production web server and network events into either a neural or Bayesian network and build a baseline of normal patterns. We then want to see if we can use open source (free) predictive analytics to help us prevent future outages by alerting us ahead of time, so we can intervene.
Any ideas?
5 Comments
Otheus says:
Monday, March 2, 2009 at 9:06am
Don’t you need a *bunch* of servers to gather such a baseline?
Tim Bass says:
Monday, March 2, 2009 at 9:16am
Good question, Otheus.
Thanks for visiting and posting.
Yes, I agree that observing stats from a cluster of servers creates a type of “social baseline” for web servers, hopefully with similar traffic patterns. This is an optimal, future approach that has considerable merit, especially with shared analytics in the clouds.
On the other hand, creating a baseline with a single server with a lot of traffic, and updating this model over time, also has considerable merit.
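As a rough illustration of what a single-server baseline that is updated over time could look like, here is a minimal Python sketch that keeps a running mean and standard deviation for one metric using Welford's method. The class name, the metric (open sockets), and the sample values are all hypothetical and are not taken from our Zabbix setup.

```python
import math


class RunningBaseline:
    """Mean and sample standard deviation, updated one observation at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford's update: no need to store the full history of samples.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        # The sample standard deviation is undefined for fewer than two points.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


# Hypothetical open-socket counts, one per polling interval.
baseline = RunningBaseline()
for open_sockets in [120, 130, 125, 135, 128]:
    baseline.update(open_sockets)

print(baseline.mean, baseline.stddev)
```

A real baseline would probably need one such model per metric and per time-of-day bucket, since web traffic is strongly seasonal; that part is left out of the sketch.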
So, the question I have for readers is this:
Will the structure for the model in this particular application be supervised or unsupervised learning?
Yours faithfully, Tim
Roland Dobbins says:
Monday, March 2, 2009 at 11:26am
I highly recommend the export, collection, baselining, and analysis of NetFlow telemetry from your network infrastructure. There are commercial (and now one open-source) NetFlow-based anomaly-detection systems on the market, and one can correlate the network statistical and behavioral information with server and application telemetry in the service of traffic engineering, performance management, security, etc.
Tim Bass says:
Monday, March 2, 2009 at 11:32am
Hi Roland,
This project needs more than network info. In fact, most of the critical information comes from the server, not the network.
Questions:
(1) Does NetFlow have free agents for the server? (The last I checked, it did not.)
(2) Is the NetFlow management console free and open source? (The last I checked, it was not; it is a commercial Cisco product.)
(3) Can you provide links to the open source anomaly-detection engines that work with NetFlow so we can evaluate and comment?
Thanks,
Yours faithfully, Tim
Alberto says:
Monday, March 2, 2009 at 4:05pm
Real-time predictive analytics in a high-volume environment is a difficult issue. If you want to experiment with open source predictive analytics, I would suggest the R software (http://www.r-project.org/). Please be aware that this tool has limitations and is not designed for real-time predictive analytics. On the other hand, it would allow you to play out different scenarios that will determine which variables will be important in your real-time predictive analytics.
For an actual real-time implementation, my recommendation is to use an appliance like Netezza and build a series of control charts with alarm scripts using a predictive modeling technique like linear or logistic regression. Ultimately, what you want to predict (i.e., the definition of the problem) is the Z-score: how many standard deviations a measurement is from the mean (think in terms of dispersion). The predictive analytics system needs to alert you in advance, before the actual event occurs. This technique has been successfully used in real-time predictive analytics in the chemical and manufacturing industries for a long time.
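To make the control chart idea concrete, here is a minimal Python sketch under stated assumptions: it computes a Z-score against a fixed baseline, fits a least-squares line to the most recent samples, and estimates how many polling intervals remain before the fitted line crosses an alarm limit. The function names, the 3-sigma limit, and the sample data are hypothetical choices for illustration; this is not a description of any particular appliance or product.

```python
from statistics import mean, stdev


def z_score(value, baseline_mean, baseline_std):
    """Number of standard deviations that value lies from the baseline mean."""
    return (value - baseline_mean) / baseline_std


def linear_trend(samples):
    """Least-squares slope and intercept for samples taken at t = 0, 1, 2, ...

    Requires at least two samples.
    """
    n = len(samples)
    t_mean = (n - 1) / 2
    y_mean = mean(samples)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(samples))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def steps_until_alarm(recent, baseline_mean, baseline_std, z_limit=3.0):
    """Estimate the intervals left before the trend line exceeds z_limit.

    Returns 0 if the fitted value is already at or above the limit, and
    None if the trend is flat or falling.
    """
    slope, intercept = linear_trend(recent)
    limit_value = baseline_mean + z_limit * baseline_std
    current_fit = intercept + slope * (len(recent) - 1)
    if current_fit >= limit_value:
        return 0
    if slope <= 0:
        return None
    return (limit_value - current_fit) / slope


# Hypothetical data: a quiet baseline period, then a steadily rising metric.
history = [100, 102, 98, 101, 99]
recent = [100, 101, 102, 103]
eta = steps_until_alarm(recent, mean(history), stdev(history))
print("current Z-score:", z_score(recent[-1], mean(history), stdev(history)))
print("estimated intervals until alarm:", eta)
```

An alarm script could run this on each polling cycle and send a warning whenever the estimate drops below some lead time. A straight-line fit is only one option; logistic regression on labeled outage data would be the supervised alternative mentioned above.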