Monday, December 7, 2009

Data Quality - A Family Affair

Grandma's lesson about taking responsibility for data quality.

When I was a young child, we spent every Thanksgiving with my paternal grandparents in Denver.  There are two particularly memorable things about those visits.  First, even into the late 1980s, my grandparents didn't own their own telephone.  They rented their phone from the telephone company.  It was the same rotary dial phone they'd had for years, hanging in their kitchen, with an extra long handset cord attached so they could stretch across the dining room or kitchen while still talking on the phone.  Second was the important lesson that I learned about doing dishes by hand.

Doing dishes by hand is ideally a three person job: one to wash, one to rinse, and one to dry.  The lesson that my grandmother taught me about washing dishes was that the drier is the person accountable for making sure the dishes are clean when they go back into the cupboard.

As data warehousing professionals, we spend a fair amount of time and energy arguing that data quality is something that has to be fixed upstream, by applications.  My grandmother would insist that sending the dishes back to the washer is not our only option.

If a dish comes to the drier not quite clean, there are three options:
  • send the dish back to the washer to be cleaned again from the beginning with soap;
  • send the dish back to the rinser to have the mess rinsed off with some hot water; or
  • use a little extra effort and wipe off the mess with your dish cloth.
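
The three options above can be sketched as a tiny routing function.  This is just an illustration of the analogy; the issue categories and field names are invented for the example, not taken from any real system:

```python
# A hedged sketch of the three options for handling a "dish" (record)
# that arrives downstream not quite clean.  The categories are invented.

def handle_dirty_record(record, issue):
    """Decide where a data quality issue gets fixed."""
    if issue == "structural":
        # Option 1: send it back to the source system ("the washer")
        # to be reprocessed from the beginning.
        return ("reject_to_source", record)
    elif issue == "formatting":
        # Option 2: send it back to an intermediate staging step
        # ("the rinser") for a lighter-weight fix.
        return ("return_to_staging", record)
    else:
        # Option 3: wipe it off yourself -- correct the small defect
        # at the point of use ("the drier").
        cleaned = dict(record, name=record["name"].strip().title())
        return ("fixed_locally", cleaned)

result = handle_dirty_record({"name": "  alice smith "}, "cosmetic")
```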

Ideally the dishes come to us clean and ready to dry.  It's a lot less work to dry off some steaming droplets of water and put a nice clean warm dish away in the cupboard than it is to notice that little bit of bread from the stuffing that didn't quite get cleaned and have to use the tip of your fingernail through a dish cloth to get the crumb off.

What are the downsides of sending the dish back through to be rewashed from the beginning?
  • the washer has to stop in the middle of scrubbing that big pan to rewash the plate;
  • the plate takes longer, overall, to be rewashed, rerinsed, and redried;
  • both the washer and rinser have to redo work.
Perhaps the same is true of data quality.  If a transaction moves from system to system and doesn't come out the other end quite clean, because some of the business processes in the middle aren't quite flawless, is it always the best choice to go back to the beginning to find just where things went wrong and correct them there?

I'm not suggesting that any application should be intentionally lazy about data quality, or should skip correcting issues that are identified.  Rather, I'm suggesting that we all continue to see data quality as our responsibility and not merely blame upstream systems when there is something that could be done at various points in the chain to ensure quality information is used for decision making.

1 comment:

  1. Excellent dish washing analogy for data quality!

    All too often, data quality is posed as a binary problem, meaning we choose between only two options:
    (1) Find and fix the problems with existing data after it has been extracted from its sources (i.e., data cleansing)

    (2) Prevent errors at the sources where data is entered or received, and before it is extracted for use (i.e., defect prevention)

    This binary problem is further complicated by the fact that, in reality, data rarely moves from just one source to one target. As it flows from system to system throughout the entire information architecture, quality issues can be introduced as data is constantly being manipulated by each business process to serve its particular needs.

    Going all the way back to the beginning might not make any sense, especially when the data quality problem was created “in transit.” And even pushing “bad data” back to the previous link in the information chain might not make any sense, because what is “bad” from the perspective of the receiving application is perfectly usable to the sending application.

    This is one of the reasons that I always argue it is not an “either/or” (back to the two choices above), but always a “both” when it comes to data quality – and data quality has to happen wherever it is needed.

    Best Regards,