My mother was the one who kept coaching me to slow down. Read the question a couple of times. Check your work. I don't know if I've ever really learned the lesson to slow down and double check my work as much as I should. If you've been reading many of my blog posts, I'm sure you've caught a plethora of typos. Lucky for me, Firefox usually catches my misspellings.
Same lesson applies to how we process data for data warehouses. Consider a somewhat typical batch ETL process:
- Process input files:
  - Surrogate key lookups
  - Field-level transformations
  - Mapping to target tables
- Load files into staging tables.
- Apply updates and inserts.
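The per-file steps above can be sketched in a few lines of Python. Everything here is illustrative: the `DIM_CUSTOMER` lookup table, the field names, and the `-1` "unknown member" key are hypothetical stand-ins for whatever your warehouse actually uses.

```python
# Hypothetical surrogate key map: natural key -> warehouse surrogate key.
DIM_CUSTOMER = {"C001": 1, "C002": 2}

def transform_row(row):
    """Surrogate key lookup + field-level transformations for one row."""
    return {
        # Unmatched natural keys map to -1, a common "unknown member" convention.
        "customer_sk": DIM_CUSTOMER.get(row["customer_id"], -1),
        "amount": round(float(row["amount"]), 2),
        "order_date": row["order_date"].strip(),
    }

def process_file(rows):
    """Map every input row to the target-table layout."""
    return [transform_row(r) for r in rows]

staged = process_file([
    {"customer_id": "C001", "amount": "19.990", "order_date": " 2011-05-01 "},
    {"customer_id": "C999", "amount": "5.00", "order_date": "2011-05-02"},
])
```

In a real batch process, `staged` would be written back out to a flat file for the bulk loader rather than held in memory.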
Certainly, that load process is very efficient. Block-level operations and applying large chunks of updates all at once get the data into the warehouse much more quickly than one-record-at-a-time processing; bulk operations reduce transaction-management overhead and improve I/O throughput. Don't forget, though, that in order to feel truly confident the process completed successfully, you have to read all of that data a second time to balance out the totals.
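That second read is a balancing pass: compare row counts and amount totals between source and target. A minimal sketch, assuming hypothetical `source_rows`/`loaded_rows` row lists and an `amount` field:

```python
def balance(source_rows, loaded_rows, amount_field="amount"):
    """Compare row counts and totals; return (ok, detail) for troubleshooting."""
    src_count, tgt_count = len(source_rows), len(loaded_rows)
    src_total = sum(r[amount_field] for r in source_rows)
    tgt_total = sum(r[amount_field] for r in loaded_rows)
    # Small tolerance guards against float rounding; use Decimal in production.
    ok = src_count == tgt_count and abs(src_total - tgt_total) < 0.005
    return ok, {"count_diff": src_count - tgt_count,
                "total_diff": round(src_total - tgt_total, 2)}

src = [{"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0}]
ok, detail = balance(src, src[:2])  # simulate one row failing to load
```

Note what this check can and can't tell you: `ok` is false and the diffs say how far off you are, but nothing here says *which* row went missing.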
A slower alternative is to process the data more sequentially, validating that each individual row makes it to each intermediate staging area or directly into the target table. Create a process that watches every row get transformed and loaded, and reports on exactly which records failed to load, rather than merely telling you that the final insert/update somehow came out 10 rows short of what was expected.
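The "slow" path might look something like this sketch, where each row runs through a transform and a single-row insert, and every failure is recorded with its row number and reason (the `transform` and `insert` callables here are hypothetical placeholders):

```python
def load_row_by_row(rows, transform, insert):
    """Process rows one at a time, collecting exact failure details."""
    loaded, failures = 0, []
    for i, row in enumerate(rows):
        try:
            staged = transform(row)   # e.g. key lookup + field transforms
            insert(staged)            # single-row insert into the target
            loaded += 1
        except Exception as exc:
            # Record which row failed and why, instead of a bare totals mismatch.
            failures.append({"row": i, "error": str(exc)})
    return loaded, failures

target = []
def transform(row):
    if row["amount"] < 0:
        raise ValueError("negative amount")
    return row

loaded, failures = load_row_by_row(
    [{"amount": 5}, {"amount": -1}, {"amount": 7}],
    transform, target.append)
```

The payoff is in `failures`: you learn that row 1 was rejected for a negative amount, not just that the load came up one row short.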
So, it seems to me the options are these:
- "Fast" - load in bulk very quickly; do batch validation; if totals don't match, kick out for research and troubleshooting.
- "Slow" - process each row more methodically and kick out exactly which rows fail any piece of the process.