Following up on KMeans Clustering Now Running on Elastic MapReduce, Stephen Green has generously documented, on the Apache Lucene Mahout wiki, the steps that were necessary to get an example of k-Means clustering up and running on Amazon’s Elastic MapReduce (EMR).
As a side note, there has been considerable discussion about MapReduce being primarily useful for batch processing. However, considering how easy it is to upload data to S3, it takes only a small leap of the imagination to see how we could upload real-time event data from myriad sources and use EMR to process that data, including complex events, in near real-time.
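One way to picture the idea is as micro-batching: events are grouped into short time windows, each window is written under its own S3 key prefix, and a batch job is pointed at one prefix at a time. The sketch below covers only the grouping step and uses nothing beyond the Python standard library. The window length, the `events/` prefix, and the function names are assumptions made for illustration; the actual upload to S3 and the EMR job submission are left out.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Assumed batch window: a smaller value lowers latency but means more jobs.
WINDOW_SECONDS = 60


def batch_key(ts: float, prefix: str = "events") -> str:
    """Return a hypothetical S3 key prefix for the window containing ts.

    ts is a Unix timestamp in seconds; the key layout is an assumption.
    """
    window_start = int(ts) - int(ts) % WINDOW_SECONDS
    stamp = datetime.fromtimestamp(window_start, tz=timezone.utc)
    return f"{prefix}/{stamp:%Y/%m/%d/%H%M}/"


def group_events(events):
    """Group (timestamp, payload) pairs by the key prefix of their window."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[batch_key(ts)].append(payload)
    return dict(batches)


if __name__ == "__main__":
    sample = [(0.0, "a"), (59.9, "b"), (60.0, "c")]
    for key, payloads in sorted(group_events(sample).items()):
        print(key, payloads)
```

Each resulting prefix could then serve as the input location for one batch run, so the achievable latency is roughly the window length plus the job's startup and run time.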
On the other hand, if Amazon’s EMR implementation proves to be overly restrictive for a CEP-type application, it might be necessary to build our own Mahout/Hadoop/MapReduce Amazon Machine Image (AMI). Stay tuned.
Maybe some of our FSI readers/gurus can port (install) some event handlers over to EC2 and provide us with public AMIs to experiment with?