Data quality and data governance are important topics, and they can have a huge impact on how data can be used in analysis. Things certainly get clearer when we have consistency of data entry, code sets, workflow, and so on. It occurred to me today, though, that the big data movement gives us at least one technique for dealing with suboptimal data quality. (This is especially welcome for those of us who work in healthcare and like to complain about how contaminated our data environments are.)
At StampedeCon this past week, Kilian Weinberger described machine learning (a key technique in big data analysis) this way:
- Traditional computer science takes input data and program instructions to generate output data.
- Machine learning takes input data and output data to generate inferred program instructions (a sketch of this inversion follows below).
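To make that inversion concrete, here is a minimal sketch in Python using scikit-learn. The toy "risk rule," the features, and the data are my own invented illustration, not anything from the talk; the point is only that the rule is inferred from input/output pairs rather than written by hand.

```python
# A minimal sketch of the inversion described above, using scikit-learn.
# The rule, features, and data below are hypothetical illustrations.
from sklearn.tree import DecisionTreeClassifier

# Traditional computer science: we write the program instructions ourselves.
def handwritten_rule(age, num_visits):
    # A made-up rule: flag patients over 65 with 3+ visits as high risk.
    return 1 if age > 65 and num_visits >= 3 else 0

# Machine learning: we supply input data and output data,
# and the algorithm infers the "program instructions" (here, a decision tree).
inputs = [[70, 4], [72, 1], [50, 5], [68, 3], [45, 0], [80, 6]]
outputs = [handwritten_rule(age, visits) for age, visits in inputs]

model = DecisionTreeClassifier().fit(inputs, outputs)

# The inferred program can now be applied to new inputs.
print(model.predict([[66, 3], [30, 2]]))  # e.g. [1 0]
```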
This approach raises the question: who are we to judge the data? If there are no program instructions the data is expected to comply with, and no particular results it is expected to produce, then who are we to judge its so-called data quality? Let the machine be the judge of the data and infer from it whatever there is to infer, regardless of our preconceived notions of quality.
Even the quality of the predictive models that come out of machine learning isn't really a judgement of data quality. In most cases, the input and output used to train machine learning algorithms are the input and output of some other human process or more complex workflow. If a good machine learning algorithm can't produce a highly predictive model from that input and output, it's probably an indication that the existing process is somewhat indeterminate. That is a measure of process quality, not data quality.
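One rough way to see this in code: cross-validate a strong learner on the process's recorded inputs and outputs, and treat the accuracy ceiling as a hint about how deterministic the process itself is. This is a hedged sketch with entirely synthetic data, not a recommended measurement procedure.

```python
# A hedged sketch: cross-validation accuracy as a rough proxy for how
# deterministic the underlying process is. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulate a process whose outputs only partly depend on its inputs:
# about 30% of the recorded outcomes are effectively arbitrary.
X = rng.normal(size=(1000, 5))
true_rule = (X[:, 0] + X[:, 1] > 0).astype(int)
noise = rng.random(1000) < 0.3
y = np.where(noise, rng.integers(0, 2, size=1000), true_rule)

scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
# An accuracy ceiling well below 1.0 here reflects indeterminacy in the
# process that generated the outputs, not "dirty" input data.
```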
I may be overreaching a bit in my desire to throw data quality arguments out the window. It's just that I've heard data quality used as an excuse too many times in my career, when many of those cases were simply a matter of not trying hard enough to understand what was really going on with the data.