Following up on KMeans Clustering Now Running on Elastic MapReduce, Stephen Green has generously documented, on the Apache Lucene Mahout wiki, the steps that were necessary to get an example of k-Means clustering up and running on Amazon’s Elastic MapReduce (EMR).

Mahout on Elastic MapReduce by Stephen Green

As a side note, there has been considerable discussion about how MapReduce is primarily useful for processing batch data. However, considering how easy it is to upload data to S3, it is a small leap of the imagination to picture uploading real-time event data from myriad sources in small batches and processing that data, including complex events, in near real-time using EMR.
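One way to picture the upload side of that idea is micro-batching: events are grouped into short time windows, and each window becomes one object that a MapReduce job could later read. The following is only a minimal sketch under stated assumptions. The 60-second window, the key layout, and the function names are all hypothetical, and neither the upload to S3 nor the EMR job itself is shown.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 60  # assumed micro-batch window; tune to the latency you need


def window_start(ts, window=WINDOW_SECONDS):
    """Round an epoch timestamp down to the start of its window."""
    return int(ts) - int(ts) % window


def batch_key(prefix, start):
    """Build a hypothetical object key for the batch that begins at `start`."""
    stamp = datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y/%m/%d/%H%M")
    return f"{prefix}/{stamp}.jsonl"


def group_events(events, prefix="events"):
    """Group (timestamp, payload) pairs into newline-delimited JSON batches.

    Returns a dict mapping each object key to the text body that would be
    uploaded under that key.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        key = batch_key(prefix, window_start(ts))
        batches[key].append(json.dumps({"ts": ts, **payload}, sort_keys=True))
    return {key: "\n".join(lines) + "\n" for key, lines in batches.items()}
```

Each key/body pair returned by `group_events` could then be written to a bucket with whatever S3 client you prefer, and a job pointed at the resulting prefix. The end-to-end latency would be bounded below by the window length plus the job start-up time.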

On the other hand, if Amazon’s EMR implementation proves to be overly restrictive for a CEP-type application, it might be necessary to build our own Mahout/Hadoop/MapReduce Amazon Machine Image (AMI). Stay tuned.

Maybe some of our FSI readers/gurus can port (install) some event handlers over to EC2 and provide us with a public AMI to experiment with?

Note: Amazon Elastic MapReduce Developer Guide (API Version 2009-03-31)