Wednesday, July 11, 2012

MapReduce Programming

I'm just starting to get into working with Hadoop and MapReduce, and I surprised myself with how quickly I was able to put together a real first MapReduce program.  In about four hours of programming today, I built a program that takes as input a log of events for different subjects and computes a frequency table of the time intervals between consecutive events tied to the same subject.

That is, what are the typical intervals between events for the same subject?  Is there a natural point at which activity tapers off and is sure not to return for an extended period of time?

The idea is to be able to identify how long the minimum gap is between two separate experiences or interactions.

Here's my MapReduce implementation:

  1. Map the incoming data, partition by subject and do a secondary sort on timestamp.
  2. Reduce that down by looping through the series of timestamps, computing the interval between the current timestamp and the previous one, and counting up the occurrences of each interval (in seconds).
  3. Have the output keyed by interval.
  4. Reduce that again, summing up the separate per-subject counts for a given interval into a single count of occurrences for each interval, across all subjects.
  5. Map that all into a single partition and write out the data in a meaningful way.
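The steps above can be sketched in plain Python, without Hadoop.  This is not the actual job, just an in-memory illustration of the two passes.  The function names, the `(subject, timestamp)` tuple layout, and the use of integer-second timestamps are all assumptions made for the example, and the `shuffle` helper only stands in for what Hadoop's partitioning and sorting would do.

```python
from collections import Counter, defaultdict


def map_events(events):
    """Pass 1 map: key each (subject, timestamp) record by subject."""
    for subject, timestamp in events:
        yield subject, timestamp


def shuffle(pairs):
    """Stand-in for Hadoop's shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_intervals(subject, timestamps):
    """Pass 1 reduce: emit (interval_seconds, count) pairs for one subject."""
    ordered = sorted(timestamps)  # stands in for the secondary sort
    counts = Counter(later - earlier for earlier, later in zip(ordered, ordered[1:]))
    yield from counts.items()


def reduce_sum(interval, counts):
    """Pass 2 reduce: total the per-subject counts for one interval."""
    yield interval, sum(counts)


def interval_frequencies(events):
    """Run both passes and return {interval_seconds: occurrences}."""
    per_subject = []
    for subject, timestamps in shuffle(map_events(events)).items():
        per_subject.extend(reduce_intervals(subject, timestamps))

    totals = {}
    for interval, counts in shuffle(per_subject).items():
        for key, total in reduce_sum(interval, counts):
            totals[key] = total
    return dict(sorted(totals.items()))  # single sorted output, as in step 5


# Hypothetical sample log: (subject, timestamp in seconds)
events = [("a", 0), ("a", 30), ("a", 60), ("b", 100), ("b", 130), ("b", 400), ("c", 5)]
table = interval_frequencies(events)
for interval, count in table.items():
    print(f"{interval}\t{count}")
```

In this sample, subject "c" has only one event and so contributes no intervals.  In a real Hadoop job the in-reducer `sorted` call would typically be replaced by a composite key with a custom partitioner and grouping comparator, so the timestamps arrive already ordered.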

With some existing template code, I cranked that out using MapReduce and Protocol Buffers, figured out some Maven dependencies, set up my environment to support Protocol Buffers, and built and tested the code.  Not bad for a half-day of work.  I credit my iced mocha from Starbucks, and maybe the fact that I was sitting at Starbucks for much of that time.

Now on to something more sophisticated: machine learning.


