But if he will be content to begin with doubts,
He shall end in certainties.
[Francis Bacon 1561-1626]
When I was learning to program in C and studying algorithms, the assert() assertion macro was one of my favorite debugging tools. Assert can be used to validate that something isn't going wrong that could send your program into left field during the execution of some procedure. For instance, a balanced binary search tree should never be more than log2(n) levels deep (or something similar to that based on the exact insertion algorithm), where n is the number of items in the tree. After a new item is inserted in the tree, you can assert(tree.depth() == log2(tree.count())). If that assertion fails, then you know the tree isn't staying balanced and the search performance guaranteed by a balanced tree isn't valid any more.
If that's too much computer science for you, hold on and see where this is going. There's relevance to this idea beyond low-level programming and computer science theory.
I've been in many conversations with data warehouse sponsors that focused on the question of "how are you sure that the data in the warehouse loads correctly every night?" One of the better ways I've found to approach this kind of data integrity assurance is to think about what kinds of assertions can be found throughout the batch ETL processes that I create.
For this example, suppose a somewhat traditional sort of ETL process that happens in the following steps:
- Copy or extract raw data from source system
- Detect changes from last pull
- Lookup surrogate keys and other translations
- Apply deletes (as soft-deletes with setting exp_date = current_date())
- Apply inserts
- Apply updates
- Clearly after the raw data is copied from the source system, you can test to see if you've captured all of those records in your own work area, or possibly perform a sort of checksum on the data to ensure the integrity of the copy.
- assert(source.count() == mycopy.count())
- assert(source.checksum() == mycopy.checksum())
- assert((source_previous.count() - source_current.count()) == changes.insert_count() - changes.delete_count())
- assert(record.dimension_key > 0)
- assert(input.count() == output.count())
- assert(changes.delete_count() == target.count(where last_update==today() and exp_dt==today())
- assert(changes.insert_count() == target.count(where insert_audit==today)
- assert(changes.update_count() == target.count(where last_update==today() and exp_dt > today())
- assert(target.count(where last_update==today() and exp_dt < today())