Zabbix

In A Review of Zabbix – Zabbix Rules! (Part 1) we provided a brief introduction to Zabbix in the context of network and security management.  In this post I will discuss Zabbix as an event processing platform.

Zabbix is like most event processing platforms: it supports both agent-initiated events and server-requested events.  In the network management world these capabilities are often referred to as “trapping” and “polling”.  No event processing software architecture is complete without both capabilities, and Zabbix has both.

What is interesting about Zabbix is that it provides a simple way to send event information to the Zabbix event server.  It does this with a small command-line program called zabbix_sender, which takes the hostname of the agent, the hostname (or IP address) of the event processing server, a unique “key” for the event (for example, apache2.totalworkers or linux_server.loadavg15), and the value of the event or event precursor, and sends that information to the event server.

usage: zabbix_sender [-Vhv] {[-zps] -ko | -i <file>} [-c <file>]

Options:
-c --config <File>  Specify configuration file
-z --zabbix-server <Server>  Hostname or IP address of ZABBIX Server
-p --port <Server port>  Specify port number of server trapper running on the server.
-s --host <Hostname>  Specify host name. Host IP address and DNS name will not work.
-k --key <Key>  Specify metric name (key) we want to send
-o --value <Key value>  Specify value of the key
-i --input-file <input_file>  Load values from input file. Each line of file contains:  <zabbix_server> <hostname> <port> <key> <value>
-v --verbose  Verbose mode, -vv for more details

Other options:
-h --help  Give this help
-V --version  Display version number
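
For the -i option, a batch file can be generated from whatever samples a program has collected. The sketch below is only an illustration: the function name and the example hostnames are hypothetical, and the field order is taken from the help text quoted above, so other zabbix_sender versions may expect a different layout.

```python
from typing import Iterable, Tuple


def format_batch_lines(server: str, host: str, port: int,
                       samples: Iterable[Tuple[str, object]]) -> str:
    """Render samples as input-file lines for `zabbix_sender -i <file>`.

    Each line follows the layout shown in the help text:
    <zabbix_server> <hostname> <port> <key> <value>
    """
    lines = []
    for key, value in samples:
        lines.append(f"{server} {host} {port} {key} {value}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical server and host names, used only for illustration.
    text = format_batch_lines(
        "zabbix.example.com", "web01", 10051,
        [("apache2.totalworkers", 42), ("linux_server.loadavg15", 0.75)],
    )
    print(text, end="")
```

Writing the returned text to a file and passing that file to -i sends all the values in one invocation instead of one process per value.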

In a nutshell, event processing architects can add a zabbix_sender call to just about any program to send events to the Zabbix server.  We use zabbix_sender in production to send over 170 MySQL events and KPIs to the event server.  In addition, we use zabbix_sender to send Apache2 events, as well as events and KPIs from the web application itself.
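
As a minimal sketch of what adding zabbix_sender to a program might look like, the Python wrapper below builds the command line from the flags listed in the usage text and runs it as a subprocess. The function names, the example hostnames, and the port value are assumptions made for illustration; they are not taken from any production setup.

```python
import subprocess
from typing import List, Optional


def build_sender_command(server: str, host: str, key: str, value,
                         port: Optional[int] = None,
                         binary: str = "zabbix_sender") -> List[str]:
    """Build the argument list for one zabbix_sender call.

    Only flags that appear in the usage text are used: -z, -p, -s, -k, -o.
    """
    cmd = [binary, "-z", server]
    if port is not None:
        cmd += ["-p", str(port)]
    cmd += ["-s", host, "-k", key, "-o", str(value)]
    return cmd


def send_event(server: str, host: str, key: str, value,
               port: Optional[int] = None) -> bool:
    """Run zabbix_sender once; return True if it exited with status 0."""
    cmd = build_sender_command(server, host, key, value, port)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the call hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical names; prints the command without contacting a server.
    print(build_sender_command("zabbix.example.com", "web01",
                               "apache2.totalworkers", 42, port=10051))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when a value contains spaces.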

On the server side, Zabbix event “receivers” (for lack of a better term) are defined by XML files.  These files are standard XML templates that specify a “Zabbix Trapper” and some basic parameters about that event (or event precursor), including the unique key that identifies the event.

Of course, there is also the standard “get the event” polling capability, specified in a Zabbix XML template with a corresponding event key listed in the zabbix_agentd.conf file.  The template specifies the event polling interval along with other parameters that tell the server how long to keep historical data.  Below is a small excerpt from a template I wrote for Apache2 events and KPIs.

<zabbix_export version="1.0" date="18.03.09" time="06.36">
  <hosts>
    <host name="Template_Spider_Monitors">
      <useip>1</useip>
      <ip>10.1.1.1</ip>
      <port>10050</port>
      <groups>
        <group>Templates</group>
      </groups>
      <items>
        <item type="0" key="spider.googlebot" value_type="0">
          <description>GoogleBot Hits Per Second (Delayed Two Minutes)</description>
          <delay>120</delay>
          <history>90</history>
          <trends>365</trends>
          <units>Per Second</units>
          <formula>1</formula>
          <applications>
            <application>Spiders</application>
          </applications>
        </item>
      </items>
    </host>
  </hosts>
</zabbix_export>
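
Before importing a template like this one, it can help to confirm that it parses and to list what each item will collect. The sketch below uses a trimmed, well-formed copy of the excerpt and Python's standard XML parser; the function name and the shape of the summary are assumptions for illustration and are not part of Zabbix.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the excerpt, kept inline so the sketch runs alone.
TEMPLATE = """<zabbix_export version="1.0" date="18.03.09" time="06.36">
  <hosts>
    <host name="Template_Spider_Monitors">
      <items>
        <item type="0" key="spider.googlebot" value_type="0">
          <description>GoogleBot Hits Per Second (Delayed Two Minutes)</description>
          <delay>120</delay>
          <history>90</history>
          <trends>365</trends>
        </item>
      </items>
    </host>
  </hosts>
</zabbix_export>"""


def summarize_items(xml_text):
    """Return one dict per <item>, with its host, key, and retention settings.

    ET.fromstring raises ParseError if the template is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    summary = []
    for host in root.iter("host"):
        for item in host.iter("item"):
            summary.append({
                "host": host.get("name"),
                "key": item.get("key"),
                "delay": int(item.findtext("delay", default="0")),
                "history": int(item.findtext("history", default="0")),
                "trends": int(item.findtext("trends", default="0")),
            })
    return summary


if __name__ == "__main__":
    for entry in summarize_items(TEMPLATE):
        print(entry)
```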

In the XML excerpt above, we are receiving event-precursor information derived from the Apache2 logfiles.  We preprocess the logfile to count all the hits by GoogleBot within a one-minute window to determine the GoogleBot hit rate per minute.  Then, on the Zabbix server, we configure a derived event that fires when the number of GoogleBot hits per minute exceeds a specified value.
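
The preprocessing step can be sketched roughly as follows. This is not the script we run; it assumes the access log carries the usual Apache bracketed timestamp (for example [18/Mar/2009:06:36:01 +0000]) and that GoogleBot identifies itself with "Googlebot" in the user-agent field. The function names and the threshold check are hypothetical, and in practice the threshold rule lives on the Zabbix server rather than in the preprocessor.

```python
import re
from collections import Counter
from datetime import datetime

# Captures the timestamp down to the minute, dropping seconds and the UTC offset.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def googlebot_hits_per_minute(lines):
    """Count log lines mentioning Googlebot, bucketed by minute."""
    counts = Counter()
    for line in lines:
        if "googlebot" not in line.lower():
            continue
        match = TIMESTAMP.search(line)
        if match:
            minute = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M")
            counts[minute] += 1
    return counts


def minutes_over_threshold(counts, threshold):
    """Return, in time order, the minutes whose hit count exceeds threshold."""
    return sorted(minute for minute, hits in counts.items() if hits > threshold)


if __name__ == "__main__":
    sample = [
        '66.249.66.1 - - [18/Mar/2009:06:36:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
        '66.249.66.1 - - [18/Mar/2009:06:36:40 +0000] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
        '10.0.0.5 - - [18/Mar/2009:06:36:41 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    ]
    for minute, hits in sorted(googlebot_hits_per_minute(sample).items()):
        print(minute.isoformat(), hits)
```

Each per-minute count could then be sent to the server under the spider.googlebot key.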

Why do we do this?  Because a production web server depends on GoogleBot (we also track other spiders, both specific and generic) to crawl the site for optimal search and retrieval visibility.  However, sometimes GoogleBot can jump from 50 hits per minute to thousands of hits per minute, and this can seriously affect CPU and network utilization on any production server.  We need real-time situational knowledge of what is happening in the “event cloud.”  We can also look at various graphs of this “spider data” along with myriad other graphs of event-related data (CPU and network load, users on the site, free memory, etc.) to continually optimize the server.

At this point in time, we are doing the same types of basic correlation as the current software that markets itself as CEP (well, actually the software does not market itself quite yet; people do the marketing!): we are running rules across a stream of events and creating derived events and actions based on conditions (typical ECA types of rules).  This is really basic stuff, of course, and we get all of this for free with Zabbix.

What we are missing, of course, are predictive analytics so we can detect complex events, for example, events that would certainly lead to a server outage.  We could build models and implement myriad rules; however, experience teaches us that a pure “expert systems” approach consumes too much time and too many resources (and the models change so often that it is like a cat chasing its tail).  Building a system that can predict outages based only on rules is inefficient and suboptimal.  To accomplish what we desire, we need software to baseline the “event characteristics” of the system: a machine learning algorithm or two that will “listen” to the myriad events in our “mini event cloud” and create a baseline of what are normal ebbs and flows, peaks and valleys, and other anomalies.  Then, we need to teach the system to sense and respond to abnormal events that require immediate action.  This is “the hard part” and, at least in my mind (as well as in the minds of others in this field), is what makes event processing “complex”.  The GoogleBot situational model we are using today is still too simple.
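
To make the baselining idea concrete, here is a deliberately simple stand-in: a rolling mean and standard deviation over recent samples, flagging any value that falls too far from that baseline. It is a statistical threshold, not a machine learning model, and the class name, window size, and sigma multiplier are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Track recent samples and flag values far from their rolling mean."""

    def __init__(self, window=60, sigmas=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # oldest samples fall off automatically
        self.sigmas = sigmas
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if value looks abnormal, then add it to the window."""
        abnormal = False
        if len(self.values) >= self.min_samples:
            centre = mean(self.values)
            spread = pstdev(self.values)
            if spread > 0:
                abnormal = abs(value - centre) > self.sigmas * spread
            else:
                # A perfectly flat baseline: any change counts as abnormal.
                abnormal = value != centre
        # Every sample, abnormal or not, becomes part of the baseline.
        self.values.append(value)
        return abnormal


if __name__ == "__main__":
    baseline = RollingBaseline(window=30)
    for hits in [50, 52, 48, 51, 49] * 4 + [5000]:
        if baseline.observe(hits):
            print("abnormal hit rate:", hits)
```

A real system would need to handle daily and weekly cycles, which a single rolling window does not capture.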

Zabbix already has a rule-based core.  We can look at various patterns of events over specified time windows and correlate these with other “event streams”.  However, with a database of events going back weeks, months, and even years, we need software to continually mine the database to build the network intelligence we need for prediction models.  One concept I have for this is to use MapReduce/Hadoop/Mahout for the offline processing (in the future, depending on the state of Mahout moving forward).

This is why, as I wrote in my 2000 ACM paper on multisensor data fusion and intrusion detection, a real-time (complex) event processing system needs both offline data mining (machine learning) and online processing (real-time event detection) working in tandem to detect complex events.

Zabbix is not yet at a stage of development where I would call it a “CEP engine”; however, neither are any of the other so-called “CEP engines” on the market.  Zabbix does a better job at basic network-centric event correlation, out of the box, than any “CEP engine” we tested.  That is why we picked Zabbix over other products.  It was easy to extend Zabbix to monitor any event (or event precursor) we wanted to monitor and to create basic rules as well as event correlation models.

And best of all, Zabbix is free!

2 COMMENTS

  1. Like so many great FLOSS products, Zabbix has everything except usable, task-oriented documentation. You can find a dozen “howtos” that show you how to install the thing. But that’s the part that’s so easy it doesn’t need a write-up. Then you’re confronted with a GUI with dozens of tabs with one-word labels, and a 350-page manual translated from the Latvian, with a million details and no real overview. FLOSS authors, please stop writing new code for a year, and give the writers a chance to catch up with what you’ve done. HOWTO bloggers, let’s have fewer screenshots and more explanation of what to look for and what it means.
