Monday, June 2, 2014

Stop Judging My Data

Data are.  Period.  Hard stop.

Data quality and data governance are powerful discussions and can have a huge impact on the ways in which data can be used in analysis.  It certainly makes things clearer when we have consistency of data entry, code sets, workflow, etc.  It occurred to me today, though, that the big data movement gives us at least one technique for dealing with suboptimal data quality.  (This is especially great for those of us who work in healthcare and like to complain about how contaminated our data environments are.)

At StampedeCon this past week, Kilian Weinberger described machine learning (a key technique in big data analysis) this way:

  • Traditional computer science takes input data and program instructions to generate output data.
  • Machine learning takes input data and output data to generate inferred program instructions.
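That inversion can be sketched in a few lines of plain Python. This is a toy illustration of the contrast, not anything from Weinberger's talk: the "traditional" side hand-codes a rule, while the "machine learning" side recovers the same rule from input/output pairs using ordinary least squares. The linear relationship and the numbers are made up for the example.

```python
# Traditional computing: we write the instructions, supply input, get output.
def program(x):
    return 3 * x + 1  # the rule is hand-coded

outputs = [program(x) for x in [0, 1, 2]]  # -> [1, 4, 7]

# Machine learning: we supply inputs and outputs, and infer the rule.
# Ordinary least squares recovers the slope and intercept from the data.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def inferred_program(x):
    return slope * x + intercept  # the "inferred instructions"
```

The fit recovers slope 3 and intercept 1 exactly here, because the toy data is noise-free; real training data would not be so tidy.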

This approach raises the question: who are we to judge the data?  If we have no program instructions with which the data is expected to either comply or produce particular results, then who are we to judge the so-called data quality?  Let the machine be the judge of the data and infer from it what there is to infer, regardless of what our preconceived notions of quality might be.

Even the quality of the predictive models that come out of machine learning isn't really a judgment of data quality.  In most cases, the input and output used to train machine learning algorithms are the input and output of some other human process or more complex workflow.  If a good machine learning algorithm can't create a highly predictive model from that input and output, it's probably an indication that the existing process is somewhat indeterminate.  That represents a measure of process quality, not data quality.

I may be over-reaching a bit on my desire to throw data quality arguments out the window.  It's just that I've heard data quality used as an excuse too many times in my career, when many of those cases were just a matter of not trying hard enough to understand what was really going on with the data.

Wednesday, August 21, 2013

Try Did You?

This afternoon, I got to my meeting a bit early and noticed that there was still a crowd of people in the conference room that I was scheduled to be meeting in.  So, I waited in the hallway for a few minutes.  It wasn't long until the crowd of what seemed like fifty people started filing out of a conference room designed for fifteen!  I didn't notice anything in particular about them, but when I walked into the room they had all vacated, I noticed a bit of their presence remaining.  The room was definitely warmer than the hallway and had a touch of body odor.

I thought that maybe I could cool the room down before my meeting started.  I assumed the thermostat would be covered by a clear locked box and wasn't surprised when I found it that way.  The counter-culture hacker in me said, "maybe you can pry it off or pick the lock."  So, I figured I should at least give it a try and wiggle the plastic cover.

It came right off the wall.

And I turned the thermostat down a few degrees.

And everyone in my meeting was happier for the not-so-stuffy-anymore room.

All because I questioned my assumptions and ignored my expectations of failure.  Or maybe it was my Jedi mind-powers that unlocked the plastic cover over the thermostat.  Who knows.

Saturday, August 17, 2013

Data Modelers Model, What do Data Scientists Do?

Steve Hoberman (of various forms of data modeling fame) had an article on Information Management recently that poses the question: "what is the difference between a data modeler and a data scientist?"  I think that most people who have been hearing the term "data scientist" going around realize there are certainly differences, but the article does a nice job spelling that out in a gap analysis.

This got me thinking, though, about how rarely people on BI projects play multiple roles.  It's been one of the biggest challenges we've had adopting an agile methodology, I think.  The data modeler becomes a bottleneck for developers.  Analysts have to finish mapping documents before data modelers can finalize attribute names.  If a user decides that something really needs to be many-to-many instead of one-to-many, it trickles down through data modeler, developer, reporting, and testing.

Why don't we put stronger emphasis on one person having the breadth of skills to play multiple roles on a given project?  If a requirement changes, the number of system components impacted may not be any less, but the number of people who have to understand the nature of the change could be significantly reduced.  I'm not saying that business analyst, data modeler, ETL developer, and UI developer aren't different and equally valuable skills to develop; but I am suggesting that a single person should be able to play more than one role on any given project.  I think the result would usually be leaner and more efficient projects.

The challenges: leadership trusting that team members can take on multiple roles and develop those skills, and helping team members break out of their shells and prioritize the development of new skills.

Back to Steve Hoberman's discussion -- can we change it to "what's the difference between data modeling and data science?"

Tuesday, March 19, 2013


I've been reading Macrowikinomics by Don Tapscott. Great read.

One of my favorite things is the connection I see between his description of "platforms" that people are building to leverage and focus social collaboration and the Unix Philosophy of 40 years ago. Where I am in the book, he's only just now started to reference Linux and the Open Source movement. (Shocking realization: I got into Linux in 1996, when it was only 5 years old. Now Linux is 22 years old.)

I'm starting to draw connections into my own professional life in healthcare. I think there are huge opportunities to leverage "the commons" to improve the business of healthcare. My opinion: healthcare technology and business services vendors are a huge part of the inefficiency and cost of healthcare. If we take their value proposition and deliver it through open collaboration instead, we have the opportunity to eliminate that waste, improve the value being delivered, and extend that value beyond only those health systems that can afford the exorbitant fees.

Let's give that a try and see how it goes. Watch for updates...

Friday, November 2, 2012

IT Strategy by Tinker Bell

In our house, Fridays are "Pizza and a Movie Night."  Tonight, my girls chose the 2008 Tinker Bell movie.  Spoiler Alert!  The plot goes like this: Tinker Bell discovers that her fairy talent is tinkering - building the tools that other fairies use to do their work and change the seasons on the mainland.  Tinker Bell struggles with accepting her talent for a while.  She rejects her own talent and tries to learn the other fairies' talents instead.  Eventually, she discovers that she's really a very innovative tinker fairy.  Her lack of faith in her own talent causes her to put the arrival of Spring at risk.  Then, by embracing her talent and flare for innovation, she's able to lead the other fairies with her inventions and save Spring!  As a reward, she also discovers a way that tinker fairies can perform a valuable service in returning lost things to children on the mainland during the changing seasons.

Hopefully you already see the comparison between tinker fairies and typical IT co-workers.  The tinker fairies are an invaluable support team that works behind the scenes to make sure that all the other talent fairies can do their work effectively.  They deliver a great service, but they rarely drive real change in how work is done by other fairies... until Tinker Bell comes along, that is.  She spends time understanding what the other fairies do, what their challenges are, and applies her inherent tinker-talent to fundamentally change how they do their work.  Wow!  That's an innovative IT co-worker!!

At the same time the kids were watching Tinker Bell, I was reading Business / IT Fusion by Peter Hinssen.  I can't speak for Peter, but I think Tink would make a great IT 2.0 co-worker.  She's got the talent to see how her tools and materials can come together to solve problems, and she has a great ability to understand the challenges that the business of the fairies is confronted with.  Her innovations don't just make the existing processes more efficient, but drive them to be altogether different processes.

My recommendation:
Read Business / IT Fusion.
Watch Tinker Bell.
Don't tell anyone that you got your IT strategy from a Disney movie!

Wednesday, July 11, 2012

MapReduce Programming

I'm just starting to get into working with Hadoop and MapReduce and surprised myself with how quickly I was able to put together a real first MapReduce program.  In about four hours of programming today, I built a program that takes as input a log of events for different subjects and computes a frequency table for the time interval between consecutive events tied to the same subject.

That is, what are the typical intervals between events for the same subject?  Is there a natural point at which activity tapers off and is sure not to return for an extended period of time?

The idea is to be able to identify how long the minimum gap is between two separate experiences or interactions.

Here's my MapReduce Implementation

  1. Map the incoming data, partition by subject and do a secondary sort on timestamp.
  2. Reduce that down by looping through the series of timestamps, computing the interval between the current timestamp and the previous one, and counting up the occurrences of each interval (in seconds).
  3. Have the output keyed by interval.
  4. Reduce that again, summing up the separate counts by subject for a given interval into a single count of occurrences for each interval, across all subjects.
  5. Map that all into a single partition and write out the data in a meaningful way.
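The steps above can be simulated in plain Python to show the logic end to end. This is a sketch, not the Hadoop job itself: the map/reduce phases run in memory, and the subject names and epoch-second timestamps are made up for illustration.

```python
from collections import defaultdict

# Made-up event log: (subject, timestamp in epoch seconds).
events = [
    ("subject-a", 100), ("subject-b", 40),
    ("subject-a", 160), ("subject-a", 220),
    ("subject-b", 100),
]

# Step 1: "map" phase -- partition by subject, secondary sort on timestamp.
by_subject = defaultdict(list)
for subject, ts in events:
    by_subject[subject].append(ts)
for timestamps in by_subject.values():
    timestamps.sort()

# Steps 2-3: first "reduce" -- interval between consecutive timestamps,
# counted per subject and keyed by the interval (in seconds).
per_subject_counts = defaultdict(int)  # (interval, subject) -> count
for subject, timestamps in by_subject.items():
    for prev, curr in zip(timestamps, timestamps[1:]):
        per_subject_counts[(curr - prev, subject)] += 1

# Step 4: second "reduce" -- sum the per-subject counts for each interval
# into one count per interval, across all subjects.
frequency_table = defaultdict(int)
for (interval, _subject), count in per_subject_counts.items():
    frequency_table[interval] += count

# Step 5: a single, sorted output.
for interval in sorted(frequency_table):
    print(interval, frequency_table[interval])
```

In the real job, steps 1-3 would be one MapReduce pass (with the partitioner and secondary sort handling the grouping) and steps 4-5 a second pass; here each phase is just a loop.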

With some existing template code, I cranked that out using MapReduce and Protocol Buffers, figured out some Maven dependencies, set up my environment to support Protocol Buffers, and built and tested the code.  Not bad for a half-day of work.  I credit my iced mocha from Starbucks, and maybe the fact that I was sitting at Starbucks for much of that time.

Now onto something more sophisticated.  Machine Learning.

Saturday, July 7, 2012

On Executive Sponsorship

My main priority right now is the establishment of data stewardship groups. On the one hand, my gut tells me that these will not be effective if they are built up by mid-level functional leaders rather than through executive mandate. On the other hand, executive sponsorship sometimes gets things accomplished rapidly.

What's the right balance? Here are my rules of thumb.

1. Never say "because such-and-such a VP said this is a priority" or worse yet "because so-and-so's bonus is riding on this." I actually heard that once!

2. Always include the people who will be doing the work in the decision making process. It'll lead to better long term success.

3. Make sure the people doing the work know what old habits or old processes can (and need to) stop. If stewardship is just more work then it won't create the efficiencies it is predicated on.

What do you think?