Sharpening Stones: MapReduce Programming

Wednesday, July 11, 2012

MapReduce Programming

I'm just starting to get into working with Hadoop and MapReduce, and I surprised myself with how quickly I was able to put together a real first MapReduce program. In about four hours of programming today, I built a program that takes as input a log of events for different subjects and computes a frequency table for the time interval between consecutive events tied to the same subject.

That is, what are the typical intervals between events for the same subject? Is there a natural point at which activity tapers off and is sure not to return for an extended period of time?

The idea is to identify how long the minimum gap is between two separate experiences or interactions.

Here's my MapReduce implementation:

Map the incoming data, partition by subject and do a secondary sort on timestamp.
Reduce that down by looping through the series of timestamps, computing the interval between the current timestamp and the previous one, and counting up the occurrences of each interval (in seconds).
Have the output keyed by interval.
Reduce that again, summing up the separate per-subject counts for a given interval into a single count of occurrences for each interval, across all subjects.
Map that all into a single partition and write out the data in a meaningful way.
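To make the steps above concrete, here is a minimal in-memory sketch in plain Python. It is not the Hadoop and Protocol Buffers code described in this post; it only mimics the data flow. The function names are hypothetical, events are assumed to be (subject, timestamp) pairs, and timestamps are assumed to be integer seconds.

```python
from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter


def shuffle_by_subject(events):
    """Stand-in for partitioning by subject with a secondary sort on timestamp."""
    ordered = sorted(events, key=itemgetter(0, 1))
    for subject, group in groupby(ordered, key=itemgetter(0)):
        yield subject, [timestamp for _, timestamp in group]


def reduce_intervals(timestamps):
    """First reduce: count each interval between consecutive timestamps
    for one subject, and emit (interval, count) pairs keyed by interval."""
    counts = Counter(b - a for a, b in zip(timestamps, timestamps[1:]))
    yield from counts.items()


def reduce_totals(interval_counts):
    """Second reduce: sum the per-subject counts for each interval."""
    totals = defaultdict(int)
    for interval, count in interval_counts:
        totals[interval] += count
    return totals


def interval_frequency_table(events):
    """Run all stages and return (interval, count) rows sorted by interval,
    as if everything had been mapped into a single partition."""
    per_subject = []
    for _subject, timestamps in shuffle_by_subject(events):
        per_subject.extend(reduce_intervals(timestamps))
    return sorted(reduce_totals(per_subject).items())


if __name__ == "__main__":
    # Made-up sample log: (subject, timestamp in seconds).
    sample = [("a", 100), ("b", 50), ("a", 130), ("a", 160), ("b", 80), ("b", 200)]
    for interval, count in interval_frequency_table(sample):
        print(interval, count)
```

In a real Hadoop job the sort and grouping would be handled by the framework's shuffle rather than by an in-memory `sorted` call, so this sketch is only useful for checking the logic on small inputs.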

With some existing template code, I cranked that out using MapReduce and Protocol Buffers, figured out some Maven dependencies, set up my environment to support Protocol Buffers, and built and tested the code. Not bad for a half-day of work. I credit my iced mocha from Starbucks, and maybe the fact that I was sitting at Starbucks for much of that time.

Now on to something more sophisticated: machine learning.


paulboal & Sharpening Stones

I'm a data professional who's worked in both consulting and corporate positions, and I'm passionate about making the business world a better contributor to society through the intelligent use of business information. Welcome to Sharpening Stones, a blog about information management, data, business intelligence, and data warehousing. The title and domain name for this blog are an homage to the REM song, Exhuming McCarthy: You're sharpening stones, walking on coals, To improve your business acumen. You can see my interpretation of those lines in the first posts by following the links from here.
