Wednesday, December 9, 2009

assert(datawarehouse.data.is_correct());

If a man begins with certainties, he shall end in doubts;
But if he will be content to begin with doubts,
He shall end in certainties.
[Francis Bacon 1561-1626]


When I was learning to program in C and studying algorithms, the assert() macro was one of my favorite debugging tools.  Assert can be used to validate that something isn't going wrong that could send your program into left field during the execution of some procedure.  For instance, a balanced binary search tree should never be more than about log2(n) levels deep (the exact bound depends on the insertion algorithm), where n is the number of items in the tree.  After a new item is inserted in the tree, you can assert(tree.depth() <= log2(tree.count()) + 1).  If that assertion fails, then you know the tree isn't staying balanced and the search performance guaranteed by a balanced tree is no longer valid.
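To make that concrete, here's a minimal sketch in Python.  The tree object, with its count() and depth() methods, is a hypothetical stand-in for whatever implementation you're testing, and the depth bound is the perfectly-balanced case with one level of slack:

    import math

    def assert_balanced(tree):
        # Sanity-check that a search tree is still roughly balanced.
        # The exact bound depends on the balancing scheme (an AVL tree,
        # for example, allows roughly 1.44 * log2(n) levels).
        n = tree.count()
        if n == 0:
            return
        max_depth = math.ceil(math.log2(n + 1)) + 1  # slack for rounding
        assert tree.depth() <= max_depth, (
            f"tree has {n} items but is {tree.depth()} levels deep; "
            f"expected no more than {max_depth}")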

If that's too much computer science for you, hold on and see where this is going.  There's relevance to this idea beyond low-level programming and computer science theory.

I've been in many conversations with data warehouse sponsors that focused on the question of "how can you be sure that the data in the warehouse loads correctly every night?"  One of the better ways I've found to approach this kind of data integrity assurance is to think about what assertions can be made throughout the batch ETL processes that I create.

For this example, suppose a fairly traditional ETL process that proceeds in the following steps:
  1. Copy or extract raw data from source system
  2. Detect changes from last pull
  3. Lookup surrogate keys and other translations
  4. Apply deletes (as soft-deletes, setting exp_date = current_date())
  5. Apply inserts
  6. Apply updates
Now, read the Wikipedia article about assertions and test-driven development and think about what kinds of things you may want to be able to assert during each phase of the ETL.  (Excuse my sort of object-oriented pseudo-code; a more concrete Python sketch follows the list.)
  1. After the raw data is copied from the source system, you can test that you've captured all of those records in your own work area, or perform a checksum on the data to verify the integrity of the copy.
    • assert(source.count() == mycopy.count())
    • assert(source.checksum() == mycopy.checksum())
  2. If you have to detect changes in the source data manually, rather than relying on source system audit information, there are other assertions you can make during that change detection process.  The current row count should equal the previous count plus inserts minus deletes:
    • assert((source_current.count() - source_previous.count()) == changes.insert_count() - changes.delete_count())
  3. After looking up dimension or reference surrogate keys, you can check that you haven't lost any records and that all the reference surrogate keys have valid values (if that's appropriate for your implementation).
    • assert(record.dimension_key > 0)
    • assert(input.count() == output.count())
  4. After deletes are applied, you should be able to verify that the number of rows with a last-updated audit timestamp of today and an expiration date of today matches the number of rows marked for delete.
    • assert(changes.delete_count() == target.count(where last_update==today() and exp_dt==today()))
  5. After the inserts are complete, the insert count can be measured using the insert audit timestamp.
    • assert(changes.insert_count() == target.count(where insert_audit==today()))
  6. And the updates can be measured using the last-updated audit timestamp and an expiration date greater than today.
    • assert(changes.update_count() == target.count(where last_update==today() and exp_dt > today()))
  We may also choose to assert that we aren't trying to update any records that the source system says were deleted; the count of rows updated today but expired before today should be zero.
    • assert(target.count(where last_update==today() and exp_dt < today()) == 0)
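Here's a minimal sketch of what a batch-level assertion harness for these checks might look like in Python.  Everything in it is an assumption for illustration: the DB-API style connections (with "%s" parameter placeholders), the table names, the audit column names (insert_audit, last_update, exp_dt), and the changes dictionary produced by change detection.

    from datetime import date

    def scalar(conn, sql, params=()):
        # Run a single-value query (e.g. SELECT COUNT(*)) and return that value.
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchone()[0]

    def assert_copy_complete(src_conn, stage_conn, table):
        # Step 1: the staged copy should hold exactly the rows the source has.
        src = scalar(src_conn, f"SELECT COUNT(*) FROM {table}")
        stg = scalar(stage_conn, f"SELECT COUNT(*) FROM {table}")
        assert src == stg, f"{table}: staged {stg} rows but source has {src}"

    def assert_change_detection(prev_count, curr_count, changes):
        # Step 2: current = previous + inserts - deletes, so the change set
        # must reconcile with the raw snapshot counts.
        assert curr_count - prev_count == changes["inserts"] - changes["deletes"], (
            "change detection does not reconcile with source row counts")

    def assert_surrogate_keys(conn, work_table, in_count):
        # Step 3: no records lost during key lookup, and every record
        # resolved to a valid (positive) dimension surrogate key.
        out_count = scalar(conn, f"SELECT COUNT(*) FROM {work_table}")
        assert in_count == out_count, "records lost during surrogate key lookup"
        bad = scalar(conn,
            f"SELECT COUNT(*) FROM {work_table} "
            f"WHERE dimension_key IS NULL OR dimension_key <= 0")
        assert bad == 0, f"{bad} records have invalid dimension keys"

    def assert_apply_phase(conn, target, changes):
        today = date.today()
        # Step 4: soft-deletes touched today should match the delete set.
        deleted = scalar(conn,
            f"SELECT COUNT(*) FROM {target} "
            f"WHERE last_update = %s AND exp_dt = %s", (today, today))
        assert deleted == changes["deletes"], "soft-delete count mismatch"
        # Step 5: rows inserted today, by the insert audit timestamp
        # (assuming inserts set insert_audit but not last_update).
        inserted = scalar(conn,
            f"SELECT COUNT(*) FROM {target} WHERE insert_audit = %s", (today,))
        assert inserted == changes["inserts"], "insert count mismatch"
        # Step 6: rows updated today that are still current.
        updated = scalar(conn,
            f"SELECT COUNT(*) FROM {target} "
            f"WHERE last_update = %s AND exp_dt > %s", (today, today))
        assert updated == changes["updates"], "update count mismatch"
        # Extra check: nothing should be updated today yet already expired.
        zombies = scalar(conn,
            f"SELECT COUNT(*) FROM {target} "
            f"WHERE last_update = %s AND exp_dt < %s", (today, today))
        assert zombies == 0, f"{zombies} expired records were updated today"

The same checks could just as easily live in SQL scripts or in your ETL tool's validation steps; the point is that each load phase ends with an explicit statement of what must be true, and fails loudly when it isn't.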
Your ability to create reasonably easy-to-execute assertions at the batch level and at the individual record level depends, of course, on how your data warehouse and ETL are designed.  Whether working on an existing system or starting something new, look at data integrity assurance from the perspective of traditional asserts and see if that changes how you consider building those ever-important data quality assurance jobs.
