Thursday, June 18, 2009

Aggregate / Summary

One of my wife's pet peeves is when technologists take a word that already has a clear meaning in the English language and then twist the definition to mean something not-quite-close-enough in a technical context. A great case in point from the world of data warehousing:

What is the difference between an aggregate and a summary?

I did a Google search and came up with a lot of junk, plus what I would consider some not very good answers (albeit from 2002) from some industry analysts. In fact, David Marco goes so far as to say, "Summarized and aggregation are the same thing."


In my gut, there's always been a difference between an aggregate and a summary. So, I decided to try to articulate what the difference is. In doing so, my wife's advice of "just look in the dictionary" came in very handy.


Aggregate:
Comes from the Latin word for "to flock together" or "to flock or group."
Note that nothing in that definition implies reducing the specificity of something or hiding the fact that the flock or group is made up of individuals. Rather, the idea is that there is a group of individual things acting together.

Summary:
From the Latin summa, one meaning of which is "total" or "sum," also the "principal or main thing."
Clearly, the idea with this root word is that a level of detail is being removed when the summary of something is presented. Rather than still being individual things, the summary of things is another layer of abstraction that represents the underlying detail.


So, my conclusion is that, in a data warehousing application, the term aggregate can be used to represent an object that brings various other ideas together in one place, regardless of any change or lack of change in the level of granularity. A summary, on the other hand, has to imply a reduction in detail, either by changing the granularity through mathematical means or by reducing the precision in a series of events.


Example 1:
If I have two fact tables, one representing purchase orders and the other invoices, I can create an aggregate that still contains all the same detail, but pre-joins those fact tables together in a new kind of fact. In order for that to be a summary, I also have to roll it up to something higher than the transaction grain.
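To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the column names (`po_id`, `vendor`, `po_amount`, `invoice_amount`), the sample rows, and the simplification that each purchase order has exactly one invoice. It is not how any particular warehouse would build these tables, just one way to show that the pre-joined aggregate keeps the transaction grain while the summary rolls up above it.

```python
from collections import defaultdict

# Hypothetical transaction-grain fact rows (names and values are illustrative).
# Assumption: every purchase order has exactly one invoice.
purchase_orders = [
    {"po_id": 1, "vendor": "Acme", "po_amount": 100},
    {"po_id": 2, "vendor": "Acme", "po_amount": 250},
    {"po_id": 3, "vendor": "Globex", "po_amount": 75},
]
invoices = [
    {"invoice_id": 10, "po_id": 1, "invoice_amount": 100},
    {"invoice_id": 11, "po_id": 2, "invoice_amount": 250},
    {"invoice_id": 12, "po_id": 3, "invoice_amount": 75},
]


def build_aggregate(pos, invs):
    """Pre-join the two facts on po_id; the grain stays one row per transaction."""
    po_by_id = {po["po_id"]: po for po in pos}
    return [
        {**po_by_id[inv["po_id"]], **inv}
        for inv in invs
        if inv["po_id"] in po_by_id
    ]


def build_summary(aggregate_rows):
    """Roll the pre-joined rows up to vendor grain, which is above the transaction."""
    totals = defaultdict(lambda: {"po_amount": 0, "invoice_amount": 0})
    for row in aggregate_rows:
        totals[row["vendor"]]["po_amount"] += row["po_amount"]
        totals[row["vendor"]]["invoice_amount"] += row["invoice_amount"]
    return dict(totals)


po_invoice_aggregate = build_aggregate(purchase_orders, invoices)
vendor_summary = build_summary(po_invoice_aggregate)
```

With this sample data the aggregate has as many rows as there are transactions, each carrying both the purchase order and invoice columns, while the summary has only one row per vendor.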

Example 2:
If I have a workflow event table that I use to track how long it takes an order to go from ordered to fulfilling to packaging to shipping to billing, I can create a summary that only has ordered and shipping status records. In this case, the level of granularity hasn't really changed, but the level of detail has, so it is a summary.
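A rough Python sketch of this second case follows. The event rows, the `day` column, and the `summarize_events` helper are all hypothetical, made up only to illustrate the idea: the summary keeps the same columns and the same one-row-per-event shape, but drops the intermediate statuses.

```python
# Hypothetical workflow event rows for a single order (values are illustrative).
workflow_events = [
    {"order_id": 1, "status": "ordered", "day": 0},
    {"order_id": 1, "status": "fulfilling", "day": 1},
    {"order_id": 1, "status": "packaging", "day": 2},
    {"order_id": 1, "status": "shipping", "day": 4},
    {"order_id": 1, "status": "billing", "day": 5},
]

# Only these statuses survive in the summary.
KEPT_STATUSES = {"ordered", "shipping"}


def summarize_events(rows, kept=KEPT_STATUSES):
    """Keep each surviving event row unchanged, but drop the other statuses."""
    return [row for row in rows if row["status"] in kept]


event_summary = summarize_events(workflow_events)
```

Each surviving row is still a single event with the same columns, so the grain is unchanged; what has been lost is the detail about the fulfilling, packaging, and billing steps.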


I'm open to comments on this, but it seems very straightforward given the dictionary definitions of the terms. I know people may use them differently, but we're a group of data-driven experts. Shouldn't clear and precise definitions of terms be something that we strive for?
