KMeans Clustering Now Running on Elastic MapReduce
Stephen Green, blogger and principal investigator of the AURA project in Sun Labs, has moved the state of the art in analytics-as-a-service a few steps forward with the first documented working Mahout application on Amazon’s Elastic MapReduce (EMR).
EMR was announced on April 1st, and on April 15th Stephen told the Mahout users group that he was going to “give a talk on Mahout for our reading group and decided that I would use it as an opportunity to try Amazon’s Elastic MapReduce (EMR).” Three days later, after working out a few kinks in the S3 I/O subsystem, Stephen glowingly announced the first “successful run on EMR of the KMeans clustering of the synthetic control data.”
Stephen’s laudable integration work comes only a week after Grant Ingersoll and Sean Owen, core members of the Apache Lucene project, announced the release of Apache Mahout 0.1. This is a noteworthy accomplishment by some very talented people.
One of the things I like about the Mahout team is that they are not only talented but also a nice group of people, moving the state of the art in predictive analytics forward in an open, collaborative and exciting way. I am sure we will be seeing more good things from this team.
In fact, I feel a bit guilty, as I have been lurking and cheerleading on the Mahout lists but have not got off my lazy “you know what” and made a contribution. I did manage to hit so many golf balls that I got a terrible blister on my hand while Stephen worked hard on debugging the I/O links in the Mahout/Hadoop/EMR configuration. Although my golf swing has improved (not to mention my lap swimming), I feel like I need to contribute more to this very worthwhile effort and talented team.