Wednesday, July 11, 2012

MapReduce Programming

I'm just starting to work with Hadoop and MapReduce, and I surprised myself with how quickly I was able to put together a real first MapReduce program.  In about four hours of programming today, I built a program that takes as input a log of events for different subjects and computes a frequency table of the time intervals between consecutive events tied to the same subject.

That is, what are the typical intervals between events for the same subject?  Is there a natural point at which activity tapers off and is sure not to return for an extended period of time?

The idea is to identify how long the minimum gap is between two separate experiences or interactions.

Here's my MapReduce implementation:

  1. Map the incoming data, partition by subject and do a secondary sort on timestamp.
  2. Reduce that down by looping through the series of timestamps, computing the interval between the current timestamp and the previous one, and counting up the occurrences of each interval (in seconds).
  3. Have the output keyed by interval.
  4. Reduce that again, summing up the separate per-subject counts for a given interval into a single count of occurrences for each interval, across all subjects.
  5. Map that all into a single partition and write out the data in a meaningful way.
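
The steps above can be sketched in plain Python, without Hadoop at all. This is only an in-memory illustration of the data flow, not the code I actually wrote: the function names, the (subject, timestamp) record shape, and the sample events are all assumptions made up for the example, and the sort-within-group stands in for Hadoop's partitioning and secondary sort.

```python
from collections import Counter, defaultdict


def map_events(records):
    """Pass 1 map: emit (subject, timestamp) pairs."""
    for subject, timestamp in records:
        yield subject, timestamp


def shuffle_sorted(pairs):
    """Stand-in for partitioning by subject plus a secondary sort on timestamp."""
    groups = defaultdict(list)
    for subject, timestamp in pairs:
        groups[subject].append(timestamp)
    for subject, timestamps in groups.items():
        yield subject, sorted(timestamps)


def reduce_intervals(subject, timestamps):
    """Pass 1 reduce: count gaps between consecutive events, keyed by interval."""
    counts = Counter(b - a for a, b in zip(timestamps, timestamps[1:]))
    for interval, count in counts.items():
        yield interval, count


def reduce_totals(interval_counts):
    """Pass 2 reduce: sum the per-subject counts into one count per interval."""
    totals = Counter()
    for interval, count in interval_counts:
        totals[interval] += count
    # Sorting by interval plays the role of the final single-partition write.
    return dict(sorted(totals.items()))


def interval_frequencies(records):
    """Chain both passes over an iterable of (subject, timestamp) records."""
    per_subject = (
        pair
        for subject, timestamps in shuffle_sorted(map_events(records))
        for pair in reduce_intervals(subject, timestamps)
    )
    return reduce_totals(per_subject)


# Hypothetical sample log: (subject, timestamp in seconds), deliberately unordered.
events = [("a", 100), ("b", 50), ("a", 130), ("a", 190), ("b", 80), ("b", 200)]
frequencies = interval_frequencies(events)
print(frequencies)
```

In a real job the per-subject lists would never be sorted in memory like this; the framework's shuffle would deliver each subject's timestamps to the reducer already in order.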

With some existing template code, I cranked that out using MapReduce and Protocol Buffers, figured out some Maven dependencies, set up my environment to support Protocol Buffers, and built and tested the code.  Not bad for a half-day of work.  I credit my iced mocha from Starbucks, and maybe the fact that I was sitting at Starbucks for much of that time.

Now onto something more sophisticated.  Machine Learning.



Saturday, July 7, 2012

On Executive Sponsorship

My main priority right now is the establishment of data stewardship groups. On the one hand, my gut tells me that these will not be effective if they are built up by mid-level functional leaders rather than through executive mandate. On the other hand, executive sponsorship sometimes gets things accomplished rapidly.

What's the right balance? Here are my rules of thumb.

1. Never say "because such-and-such a VP said this is a priority" or, worse yet, "because so-and-so's bonus is riding on this." I actually heard that once!

2. Always include the people who will be doing the work in the decision-making process. It'll lead to better long-term success.

3. Make sure the people doing the work know what old habits or old processes can (and need to) stop. If stewardship is just more work, then it won't create the efficiencies it is predicated on.

What do you think?