Friday, January 22, 2010

Slow Processing

When I was in grade school and junior high school, I was considered to be pretty good at math.  I went to math competitions all across the state and won 1st place more often than I didn't.  In most of the competitions, time didn't matter to the score.  It was a timed test rather than a race.  In my head, though, I needed to be one of the first to finish.  Being right was fine.  Being right and first?  That was really winning!  As you'd expect, that got me in trouble every once in a while.  I'd get sloppy on a couple of questions and not win 1st place.  Going back over the tests, I'd kick myself for not simply reading each question a second time.

My mother was the one who kept coaching me to slow down.  Read the question a couple of times.  Check your work.  I don't know if I've ever really learned the lesson to slow down and double-check my work as much as I should.  If you've been reading many of my blog posts, I'm sure you've caught a plethora of typos.  Lucky for me, Firefox usually catches my misspellings.

The same lesson applies to how we process data for data warehouses.  Consider a fairly typical batch ETL process:
  • Process input files:
    • Surrogate key lookups
    • Field level transformations
    • Calculations
    • Mapping to target tables
  • Load files into staging tables.
  • Apply updates and inserts.
Then, after the loads are complete, run a series of validation routines to ensure that all of the data that was received was actually applied to the target tables.  Total the input by some grouping; total the output by some grouping; and compare the results.  If things don't match, then send someone off to start digging through the data.
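That after-the-fact balancing step can be sketched in a few lines of Python.  This is only an illustration of the idea, not any particular tool's implementation; the `region` and `amount` column names are hypothetical stand-ins for whatever grouping and measure your warehouse actually uses.

```python
from collections import defaultdict

def total_by_group(rows, group_key, amount_key):
    """Sum a measure column by some grouping key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[amount_key]
    return dict(totals)

def validate_batch(input_rows, target_rows, group_key="region", amount_key="amount"):
    """Compare input totals to target totals; return the groups that don't match.

    Note: this tells you *which group* is off, not which rows failed --
    someone still has to go dig through the data.
    """
    in_totals = total_by_group(input_rows, group_key, amount_key)
    out_totals = total_by_group(target_rows, group_key, amount_key)
    mismatches = {}
    for group in set(in_totals) | set(out_totals):
        expected = in_totals.get(group, 0.0)
        actual = out_totals.get(group, 0.0)
        if expected != actual:
            mismatches[group] = (expected, actual)
    return mismatches
```

In practice the two totals would come from SQL aggregates over the input staging tables and the target tables, but the shape of the check is the same: two group-by sums and a comparison.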

Certainly, that load process is very efficient.  Doing all of those block-level operations and applying large chunks of updates at once, rather than processing one record at a time, gets the data loaded into the warehouse much more quickly.  Bulk operations, by reducing transaction management overhead and improving I/O throughput, definitely do that.  Don't forget, though, that in order to feel truly confident that the processes completed successfully, you have to read all of that data a second time to balance out the totals.

A slower alternative is to process the data more sequentially and validate that a single row makes it to each intermediate staging area or directly into the target table.  Create a process that watches every row get transformed and loaded, and reports on exactly which records failed to load rather than merely telling you that the final insert/update somehow came out 10 rows short of what was expected.
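A minimal sketch of that row-at-a-time approach, again assuming hypothetical `transform` and `load` callables and an `id` key field rather than any specific ETL tool's API:

```python
def load_row_by_row(rows, transform, load, key_field="id"):
    """Push each row through transform and load individually,
    recording exactly which rows fail and why."""
    loaded, failed = [], []
    for row in rows:
        try:
            staged = transform(row)   # surrogate keys, field transforms, calculations
            load(staged)              # insert/update against the target table
            loaded.append(row[key_field])
        except Exception as exc:
            # The payoff for going slow: a named row and a reason,
            # not just a total that came out short.
            failed.append((row[key_field], str(exc)))
    return loaded, failed
```

The trade is explicit: every row pays the per-record overhead, but a failure report reads "row 2 failed: negative amount" instead of "the insert came out 10 rows short."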

So, it seems to me the options are these:
  • "Fast" - load in bulk very quickly; do batch validation; if totals don't match, kick out for research and troubleshooting.
  • "Slow" - process each row more methodically and kick out exactly which rows fail any piece of the process.
There are complexities to this choice, clearly, and they depend on things such as the level of complexity in the ETL, the size of the batches, the available windows for processing, system usability during processing, etc.  The most important "slow" lesson is to examine the situation you're in and make a rational decision about how to process data and validate that it is loaded correctly.  Don't make snap assumptions about the right way to ensure data integrity on any particular process.

1 comment:

  1. The conclusion is spot on; however, like your childhood eagerness, I still tend to occasionally skim through the decision-making process, and I know it's wrong. Thanks for reminding me again :)