The Bot Hunter: An Event Processing Challenge (Bot or Not)
Recently we penned The Attack of the Spiders from the Clouds, where we mentioned how cloud computing infrastructures can be used to stage malicious or accidental network attacks.
Today I challenge our CEP/ESP/EP vendors (or SIs) to create the following solution to detect and block rogue bots on Apache web sites. I will install and test each submitted solution on The UNIX Forums and post the results here.
Here are some basic requirements:
- Your solution must run on Linux and be installable and configurable remotely with SSH or HTTP. There will be no physical access to the server. No exceptions.
- Preferably, the configuration can be done with a Web-Based Interface (WBI), that is, a browser.
- Your solution will listen to continuous updates to the Apache2 access log (exact location configurable in your solution) and identify robots (bots), also known as spiders, from the log.
- Your solution will provide a confidence metric, key indicator (KI), for each bot detected, from 0 to 10, where 10 indicates “absolutely a bot,” 0 is “absolutely not a bot.”
- Your solution will update the IP address of each bot and KI you identify in a file/table called, for example, ./bot_scorecard.txt where each line is an IP address of a bot, followed by a semicolon (or other delimiter of your choice) and the confidence factor, for example, 10.0.0.1;10 means that 10.0.0.1 is a bot, 100% sure.
- Your solution must compare the bots it detects against two files/tables called, for example, ./bots_allowed.txt and ./bots_denied.txt, whose entries are in the format IP address/mask, for example 10.0.0.1/24 or 10.0.0.1/32.
- If the KI “confidence factor” of the IP address of your detected bot is higher than the tunable “is a bot” KI, then your solution should update the tables/files and then call iptables and block the bot.
- It should send an email to one or more email addresses with a message, for example: “New Bot Detected - Confidence 8” with IP address, etc. in the message. Another example would be an email, “Bot Blocked” - with details, etc.
- You cannot automatically block any traffic that is not a bot. Blocking one “non-bot” results in failure, no exceptions.
- The Prize: The winner will get their logo (w/link) on this site in a block called “Bot Hunter Winner” (or something like that.)
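To make the file formats and the blocking rule above concrete, here is a minimal Python sketch. It is only an illustration, not a reference solution: the function names are hypothetical, the semicolon delimiter follows the example in the requirements, and the precedence (allowed list first, then denied list, then the KI threshold) is an assumption that the requirements do not spell out. The iptables rule shown is one common way to drop traffic from a source address; the sketch only builds the command and does not run it.

```python
import ipaddress

def parse_scorecard_line(line, delimiter=";"):
    """Split a scorecard line such as '10.0.0.1;10' into (address, KI)."""
    ip_text, ki_text = line.strip().split(delimiter)
    ki = int(ki_text)
    if not 0 <= ki <= 10:
        raise ValueError(f"KI out of range: {ki}")
    return ipaddress.ip_address(ip_text), ki

def parse_networks(lines):
    """Parse 'address/mask' entries from bots_allowed.txt or bots_denied.txt.

    strict=False accepts entries with host bits set, such as 10.0.0.1/24.
    """
    return [ipaddress.ip_network(l.strip(), strict=False)
            for l in lines if l.strip()]

def decide(ip, ki, allowed, denied, threshold):
    """Return 'allow', 'block', or 'watch' (assumed precedence, see above)."""
    if any(ip in net for net in allowed):
        return "allow"
    if any(ip in net for net in denied):
        return "block"
    # "Higher than the tunable KI" is read here as strictly greater.
    return "block" if ki > threshold else "watch"

def block_command(ip):
    """Build (but do not execute) an iptables rule dropping traffic from ip."""
    return ["iptables", "-I", "INPUT", "-s", str(ip), "-j", "DROP"]
```

A caller would pass the list from block_command to something like subprocess.run only after decide returns "block", and only with the tunable threshold set conservatively, given the rule that no non-bot may ever be blocked.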
These are some basic requirements; I don’t want to restrict your thinking or solution, so be creative! Feel free to ask any questions in the comment section of this thread.
Remember, sometimes you may have to manage the state of an IP address for hours, or even days, before you can accurately determine whether it is a bot based on behavior alone. So, you will need to work with both long and short time windows. Latency is not important. Detection accuracy is.
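As a rough illustration of keeping per-IP state across both a short and a long window, here is a small Python sketch. The class name, the default window lengths, and the assumption that timestamps arrive in non-decreasing order (as they normally do when tailing a log) are all mine, not part of the challenge. A real event processing engine would likely offer its own windowing constructs.

```python
from collections import defaultdict, deque

class WindowedCounter:
    """Track request timestamps per IP and count them over two windows."""

    def __init__(self, short_seconds=60, long_seconds=86400):
        self.short = short_seconds
        self.long = long_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps, oldest first

    def record(self, ip, timestamp):
        """Store one request and evict anything older than the long window."""
        q = self.hits[ip]
        q.append(timestamp)
        while q and q[0] <= timestamp - self.long:
            q.popleft()

    def counts(self, ip, now):
        """Return (requests in short window, requests in long window)."""
        q = self.hits.get(ip, ())
        short_count = sum(1 for t in q if t > now - self.short)
        long_count = sum(1 for t in q if t > now - self.long)
        return short_count, long_count
```

The short count can catch bursts, while the long count can catch a slow crawler that never exceeds any per-minute rate. Since latency does not matter here, the simple linear scan in counts is acceptable for a sketch.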
Note: Rogue bots do not necessarily set the User Agent field correctly. So please note that your solutions must analyze the behavior of the transactions to determine “Bot or Not”, not simply read the User Agent field from the log file.
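To show what scoring on behavior rather than on the User Agent field might look like, here is a deliberately simple Python sketch. Everything in it is an assumption made for illustration: the regular expression targets the usual Apache common/combined log layout, and the three signals (fetching /robots.txt, requesting pages but never static assets, a high share of 404 responses) and their weights are hypothetical and untested against real traffic. A real solution would need far more evidence before scoring anything near 10.

```python
import re

# Matches the leading fields of an Apache common/combined log line.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')
# Static assets that browsers typically fetch alongside a page.
STATIC_RE = re.compile(r'\.(css|js|png|jpe?g|gif|ico)(\?|$)', re.IGNORECASE)

def parse_line(line):
    """Return (ip, method, path, status) or None if the line does not match."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ip, method, path, status = m.groups()
    return ip, method, path, int(status)

def behavior_score(requests):
    """Score one IP's requests, given as (method, path, status), from 0 to 10.

    The User Agent field is never consulted. Weights are illustrative only.
    """
    if not requests:
        return 0
    total = len(requests)
    paths = [path for _, path, _ in requests]
    static = sum(1 for p in paths if STATIC_RE.search(p))
    not_found = sum(1 for _, _, status in requests if status == 404)
    score = 0
    if any(p.startswith("/robots.txt") for p in paths):
        score += 4  # browsers rarely ask for robots.txt
    if total >= 5 and static == 0:
        score += 4  # many pages, no CSS/JS/images
    if not_found / total > 0.5:
        score += 2  # mostly probing for missing URLs
    return min(score, 10)
```

Feeding such a score into the scorecard only after it has been observed over a long window would reduce the chance of blocking a human visitor.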
Anyone care to submit a solution for testing?
Filed under: Adapters, Agents, Complex Event Processing, Cybersecurity, Detection Theory, Development and Evaluation, Event Processing, Event Stream Processing, Event-Driven Architecture, False Positives and Negatives, Threats and Vulnerabilities, Use Cases

Hi Tim,
It just so happens that botslist is working on a solution that is pretty close to what you describe here, except that our KI is a bit mask (check out the botcaps value description on our FAQ page). In addition, we allow filtering by custom header fields that contain other pieces of very useful information (check out the custom headers on our search results page).
If you are still up to the challenge :), I would like for you to try out our mod-botslist solution when it’s ready in exchange for a fair (positive or negative) review.
Shoot me an email if you are interested.
Cheers
Mike