Friday, November 6, 2009

Null Valued Fields in a Data Warehouse

Whether or not null-valued fields should be allowed in a relational database is a classical debate.  In data warehousing, null-valued fields should be prohibited as a rule.  Not because of the open / closed world assumption in relational controversies but because of what that implies for business applications and how that impacts the usability of a data warehouse.

As Codd states, the value of NULL by itself lacks sufficient specificity of meaning.  Does it imply that a field value is simply missing for a particular record, but might yet take on value or might have previously had value; or does it imply that the field value is not applicable for that record and never will be.  (This lack of clarity creates that open/closed debate.)  For the sake of clarity in a data warehouse, where data values are expected to provide a clear, complete, and accurate representation of business information (closed world), allowing null values create inappropriate ambiguity.

Four different types of default values should be considered in place of a null value:

  • A logically accurate value
  • Well-defined "unknown" (Codd's A-value)
  • Well-defined "not applicable" (Codd's I-value)
  • Source specific default

The biggest challenges in applying this rule will be to understand the underlying business operations and make a determination of what will be the most appropriate and most intuitive values for users to interact with.  Be very careful to consider what impact those choices will have on how users will access information in the data warehouse.  If it makes simple questions more complciated to answer, then it is probably a poor choice.

Word Count: 9,401

2 comments:

  1. Raleigh's Warehouse District is a kind of crossroads, literally and figuratively. One of downtown's popular entertainment districts, Raleigh's Warehouse District. l4d survival warehouse

    ReplyDelete