Wednesday, December 30, 2009

Whose data is it? (Part 2)

In the first part of this series on whose data it is, I talked about the black-box mentality that some teams have with regard to vendor applications.  In that scenario, the hurdle to overcome is primarily one of convincing the right authorities to grant you database access.  That's the biggest hurdle, assuming the application data is actually understandable once you get at the underlying database.  Unless the application team is more loyal to the vendor than to your company, you haven't burned any important bridges in that scenario.  The vendor won't support your efforts at extracting and integrating data from the application this way, but they're the ones who put out a ridiculously high $250,000 bid anyway.  The next scenario is more politically charged, however...

It's MY data!

The "it's my data" culture sees the information behind an application as something that needs to be closely held and funneled through a controlled group of experts.  Unlike the vendor-data perspective, the my-data group is interested in sharing information outside of the application interface, but often with point-to-point interfaces that they can tightly control (as opposed to any kind of pub-sub or hub-spoke model that creates a more open access model).  They're interaction often sounds like: "You provide exact specifications with which specific fields you want and we'll send you an extract with exactly that information."

Another common response from application teams to a request for access to an application database might be something along the lines of "what are you going to do with that data?  We have responsibility for that data and it can be very complicated to understand."  This kind of response is indicative of a belief that the application team has some kind of exclusive ownership of the data and should be the only group allowed to parcel information out to users individually through reports or extracts.  At the extreme, it's reminiscent of early reporting days where a user would call someone in IT for a report; that IT person would build the program to generate the report; print the entire report on green-bar paper; and then send it through interoffice mail to the requester.  If the application team isn't already on board with a service oriented approach to exposing data to other applications or integrated reporting through data warehousing, then their natural tendency will be to create a plethora of individual point-to-point solutions across the application landscape.

This kind of architecture may lead to well-defined standards and definitions of data, but either only in pockets across the organization or with an ever-increasing cost of maintenance as the number of interfaces expands and the complexity of that interaction increases.  There are some positive aspects to a culture that wants to protect the purity of its data and have a strong understanding of how other groups are using the same data.  Part 3 of this series will touch on that data purist counter-point.

Whose data is it? (Part 1)

The problem of data ownership:
I want to thank Jim Harris for his great post about The First Law of Data Quality from earlier this month. It's certainly an excellent read.  One of the things it reminded me of is the experiences that I've had working with various other business and IT teams when trying to get access to and understand data from particular systems.  The issue of "data ownership" always seems to play an antagonistic role in any desire to acquire and integrate information from multiple systems.  Generalizing, I think there are three different negative responses you can receive when looking to a new source system for access to their data for integration into a data warehouse:
  • That's the vendor's data.  We wouldn't be able to understand it.
  • That's my data.  I'll tell you what you need.
  • That's not my data.  I don't know anything about it.
This series of blog posts will explore each of those in more detail:

It's the vendor's data...
The "it's the vendor's data" culture has a black-box view of applications: there is no separate understanding of application and information.  So, there's a belief that the only way to interact with the data in the application is have a vendor or partner create new application features or vendor supported interfaces.

Individuals who have experience in supporting the end users of applications, and sometimes the administrative configuration of applications, often put significant weight on the features and functionality of the application itself.  From an integration perspective, this significantly limits the kind of integration that can be done with many applications.  For instance, most applications provide users with various types of "report writing" functionality.  In many applications, this is merely an extract-generation feature that allows users to select the fields they're interested in and export that data to a CSV or Excel file.  Application support analysts who are used to supporting users in this way, and who may not have recently been involved in technical integration work, may default to a view that the only way to get information out of that particular application is through these manual extract tools.  Of course, having individuals manually run extracts from a system on a daily basis is not an ideal pattern for sourcing data into a data movement process.

In situations of heavy application reliance, application teams might never think of doing integration directly from the database; or the team might feel that any such endeavor would require many hours of training from the vendor.  Of course, the vendor would appreciate the professional services or training engagement.  In most applications, however, that is typically unnecessary.  Reverse engineering a data model is a fairly easy exercise, depending on the complexity of the system.  Reverse engineering the underlying data typically takes more analysis, but having access to both the application and the database (in several environments) provides a powerful way to understand both the application and the data.
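If the vendor database exposes the ANSI-standard INFORMATION_SCHEMA views (most relational databases do, in one form or another), a catalog query like the sketch below is often enough to start that reverse engineering; the schema name here is purely hypothetical.

-- List every table and column in the vendor application's schema
-- to begin mapping out the underlying data model.
SELECT table_name, column_name, data_type
FROM   information_schema.columns
WHERE  table_schema = 'vendor_app'   -- hypothetical schema name
ORDER BY table_name, ordinal_position;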

Enterprise-scale applications typically provide some type of standard interface, whether that's X12, EDI, HL7, or a proprietary way of moving data in and out of the system.  Many modern applications provide a service oriented API to allow external applications to do read or write operations against the application.  Standard interfaces and services are an excellent way to retrieve data for a data warehouse.  In lieu of sufficient interfaces, though, database-to-data-warehouse integration is still a common, powerful, and appropriate integration model.

Convincing application teams that you can connect directly to an application database without engaging and paying for professional services from the vendor can be challenging.  One effective way to break through part of that cultural difference is to do a proof of concept and show how the technology and analysis can work to create custom integration solutions without necessarily needing vendor intervention.  A quick win to prove that the mysterious data behind some application really is just letters and numbers goes a long way toward gaining confidence for more substantial integration work.

Saturday, December 26, 2009

Where to Put Business Logic


Dave Rathburn has a great blog about SAP / Business Objects and posted in October about choices in where to put business calculations in the data warehouse / reporting stack.  I thought I'd share a couple of quick thoughts about some experiences I've had with project teams trying to decide where to put certain pieces of business logic, specifically in the ETL workflow.

For this example, the project involves the integrated reporting of accounts receivable history from about a dozen separate legacy systems.  The design question came up: how much business logic should be done by the legacy source system teams, who are writing the extracts using proprietary tools, and how much should be done by the ETL team, who will be integrating and merging the data into the warehouse?

In the beginning of the discussion, we had two polarized and protectionist perspectives:

Argument 1: Since the source system teams are subject matter experts, they'll know quickest how to understand and translate source-system-specific domain values and logic into data warehouse values.

Argument 2: It's the data warehouse team's job to do the integration work, so the logic should all go into the ETL.

The discussion degenerated for a while until I noticed that what we were really circling was a straightforward operational argument, the most convincing one I've heard on this subject.  Here's how we summarized it at the end of our discussion.
 

General data warehouse design goal: put as little business logic as possible in the source system extracts themselves.  Rationale...
This helps to reduce the number of places you have to look to troubleshoot when something goes wrong.  If most of the business logic is applied after the data is retrieved from the source system, then it's also most likely that any errors that occur happen after that initial extract step.  If there is significant business logic in both the source system extract as well as the integration ETL, then both places have a reasonable likelihood of introducing errors into the data.  Some business logic must exist in the integration ETL to be able to merge common dimensions and do the work of actually integrating the data. Therefore, having limited business logic in the source extract means that you don't have to focus on both the legacy code AND the integration ETL code when troubleshooting.
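As a minimal sketch of that split (table, column, and code values here are hypothetical), the legacy extract stays a plain column dump while the domain-value translation lives in the integration ETL:

-- Legacy extract: no business logic, just the raw columns.
SELECT acct_no, cust_no, txn_cd, txn_amt, txn_dt
FROM   ar_history;

-- Integration ETL: the domain-value translation happens here, once,
-- where it can be maintained and troubleshot alongside the other rules.
SELECT s.acct_no,
       s.txn_amt,
       CASE s.txn_cd
         WHEN 'P' THEN 'PAYMENT'
         WHEN 'A' THEN 'ADJUSTMENT'
         ELSE 'UNKNOWN'
       END AS txn_type
FROM   stg_ar_history s;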

 In the end, we all agreed that we'd keep the logic in the legacy extracts as simple as possible, and put as much business logic as possible in the ETL.

Tuesday, December 15, 2009

Fun with Recursive SQL (Part 3)


This blog post, Part 1, and Part 2 are derived from an article that I wrote for Teradata Magazine and for the Teradata Partners conference about fun uses for recursive SQL.

Finally, on to the really interesting example of how to use recursive SQL productively.  In this example, we're dealing with segments of time that are defined with a start and end time.  Imagine staggered work shifts or a work schedule that specifies when a worker is scheduled to take breaks.

Imagine a series of time segments, like those shown in the picture below.  Each segment has a start and end time as well as a priority level.




The desired output from this data is a new, single, continuous time line that denotes only the highest priority segment for any particular clock time.



Here's a simple summary of the recursive algorithm that we're going to apply to solve this challenge:
  • For each segment at a given priority level
  • Join to the first segment that appears from a higher priority level that overlaps in any way
  • Calculate the results of the overlap.  This may result in:
    • A head segment
    • A tail segment
    • Or two new segments

We'll start off with a table that has the following definition and sample data:
create table tl (
  seg_num integer not null,  -- segment identifier
  p       integer not null,  -- priority (a higher value takes precedence)
  st_tm   integer not null,  -- segment start time
  end_tm  integer not null   -- segment end time
);

insert into tl values (1,1,100,400);
insert into tl values (2,1,445,600);
insert into tl values (3,3,130,200);
insert into tl values (4,3,315,415);
insert into tl values (5,2,345,500);


You can find the first layer of overlaps that will cause splicing of any given segment using a self-join that matches every time segment with any time segments that:
  • have a higher priority; and
  • either:
    • have a start time between the current segment's start/end time; or
    • have an end time between the current segment's start/end times.
You can see the three conditions in the WHERE clause below:

select a.seg_num, 
       row_number() over(order by b.st_tm) rnk,
       b.seg_num, b.p, b.st_tm, b.end_tm
from   tl a,
       tl b
where  a.p < b.p
  and  ((a.st_tm  < b.st_tm  and a.end_tm > b.st_tm ) or
        (a.st_tm  > b.st_tm  and a.st_tm  < b.end_tm))

The trick is that we need to do the same kind of operation on the output of this first statement to determine if there are other, higher priority segments that need to splice these further. So, we recursively call the same kind of SQL on the output. In the first attempt at the recursive version, we also add some CASE logic to output exactly what the newly created slices are.

with recursive tl_rec (
  seq_num, seg_num, p, st_tm, end_tm, b_seg_num
) as (
  select 1, seg_num, p, st_tm, end_tm, -1
  from   tl 
  union all
  select a.seq_num + 1, a.seg_num, a.p,
         case when a.st_tm  < b.st_tm and 
                   a.end_tm < b.end_tm then a.st_tm  
              when a.st_tm  < b.st_tm and 
                   a.end_tm > b.end_tm then a.st_tm
              when a.st_tm  > b.st_tm and 
                   a.end_tm > b.end_tm then b.end_tm end,
         case when a.st_tm  < b.st_tm and 
                   a.end_tm < b.end_tm then b.st_tm  
              when a.st_tm  < b.st_tm and 
                   a.end_tm > b.end_tm then b.st_tm 
              when a.st_tm  > b.st_tm and 
                   a.end_tm > b.end_tm then a.end_tm end,
         b.seg_num
  from   tl_rec a,
         tl b
  where  a.p < b.p
    and  a.b_seg_num <> b.seg_num
    and  ((a.st_tm  < b.st_tm  and a.end_tm > b.st_tm ) or
          (a.st_tm  > b.st_tm  and a.st_tm  < b.end_tm))
)
select * from (
  select 
    row_number() over(partition by seg_num 
                      order by seq_num desc) rnk,
    seg_num, p, st_tm, end_tm
  from tl_rec ) z
where rnk = 1
;

And you get the following output records:
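The result-set image is gone, but tracing the sample rows through by hand, the surviving row for each segment works out to:

SEG_NUM  P  ST_TM  END_TM
      1  1    100     130
      3  3    130     200
      4  3    315     415
      5  2    415     500
      2  1    500     600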



But wait!  There's something missing.  There should be a left-over LOW segment there between the two HIGHs.  We lost something along the way... Hmm...  The join we've defined can only output one record for each overlap it finds.  So, when a higher priority segment is contained within the start/end of a lower priority segment, we're only returning the leading piece of the LOW segment.  There's no way for that join to output two records with different logic, though.  Luckily, with recursive SQL, we can have more than one recursive statement.  So, we just need to add another UNION ALL to output the trailing LOW segment.

with recursive tl_rec (
  seq_num, seg_num, seg_split_num, p, 
  st_tm, end_tm, b_seg_num
) as (

  ... repeat the seed and first recursive branch from the previous SQL,
      adding a seg_split_num column (0 in the seed and carried through
      unchanged in that first recursive branch) ...
  
  union all
  select a.seq_num + 1, a.seg_num, a.seg_split_num+1, 
         a.p, b.end_tm, a.end_tm, b.seg_num
  from   tl_rec a,
         tl b
  where  a.p < b.p
    and  a.b_seg_num <> b.seg_num
    and  (a.st_tm  < b.st_tm  and a.end_tm > b.end_tm)
)
select * from
(
  select 
    row_number() over(partition by seg_num, seg_split_num 
                      order by seq_num desc) rnk,
    seg_num, seg_split_num, p, st_tm, end_tm
  from tl_rec
) z
where rnk = 1
;

Phew! With that combined SQL, now we finally get the expected output!
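Traced by hand from the sample data (the original screenshot isn't reproduced here), the consolidated time line now includes both pieces of segment 1:

SEG_NUM  P  ST_TM  END_TM
      1  1    100     130
      3  3    130     200
      1  1    200     315
      4  3    315     415
      5  2    415     500
      2  1    500     600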




Personally, I recommend studying that a few times until you've really internalized the strategy. It certainly took me a while to actually get it to work correctly. Then look around at your existing projects or challenges you've faced in the past and see where else you can apply this same kind of recursive technique.

Good luck!

Saturday, December 12, 2009

Fun with Recursive SQL (Part 2)


This blog post and Part 1 are derived from an article that I wrote for Teradata Magazine about fun uses for recursive SQL.

Part 1 of this series described a way to use recursive SQL to build up a single comma-separated field from a set of rows that were all part of the same grouping. Part 2 describes how to do just the inverse: take a field that has a comma-separated list of values and produce one row for each of the items in that list, which is what I refer to as "string normalization."

In the code below, the seed statement is simply a select * from our source table. The recursive statement takes the field to be pivoted into multiple rows and uses the POSITION() function to repeatedly pull off just the first item in the list. The recursion continues until our POSITION() calculation in the "dpos" field tells us the field is blank.

The context for this example is a set of survey responses in a table that contains a RESPONSE_KEY (just some primary key) and RESPONSE_TXT (a comma-separated list of responses). The SQL below parses that RESPONSE_TXT field into individual elements and outputs one row per element along with the associated RESPONSE_KEY.

WITH RECURSIVE parse_list
       (response_key, delim_pos, item_num, element, remainder) AS 
(
    SELECT response_key, 0, 0, CAST('' AS VARCHAR(100)), response_txt
    FROM   response
    UNION ALL
    SELECT response_key,
           CASE WHEN POSITION(',' IN remainder) > 0
             THEN POSITION(',' IN remainder)
             ELSE CHARACTER_LENGTH(remainder) END dpos,
           item_num + 1,
           TRIM(BOTH ',' FROM SUBSTR(remainder, 0, dpos+1)),
           TRIM(SUBSTR(remainder, dpos+1))
    FROM   parse_list
    WHERE  dpos > 0 
) 
SELECT response_key, element 
FROM   parse_list p
WHERE  item_num > 0
ORDER BY response_key, item_num;

If the input RESPONSE table had the following data:
RESPONSE_KEY   RESPONSE_TXT
643            eggs, milk, toast
872            cereal, juice, bagel, cream cheese

Then the output would look like this:
RESPONSE_KEY  ELEMENT
643           eggs
643           milk
643           toast
872           cereal
872           juice
872           bagel
872           cream cheese

If you're new to recursive SQL, take the time to read, try, and experiment with these first two uses. Part 3 of this series will describe a way to consolidate multiple overlapping time segments, using some specified order or precedence, to create a single consolidated time line. That sounds confusing and potentially worthless until you see the real world example.

Friday, December 11, 2009

The "Santa Intelligence" Team


Thanks to Henrik L Sørensen for his great commentary about the challenges that Santa must experience with regard to data quality (and inspiring this response).

It turns out, little did we all know until this year, that Santa does indeed have a Business Intelligence team.  This year, Disney released a new Christmas special called Prep and Landing that explains some of the secrets behind his success with 3 billion customers.  Warning - spoilers ahead.

In the story of Prep and Landing, we learn first that Santa has a tactical team of elves that go out to houses before he gets there to prepare them for his arrival.  They check to make sure everyone is asleep, stockings are prepared, etc.  Our main protagonist, Wayne, is aspiring for a promotion to Director of Naughty List Intelligence.  We never learn much about this highly prized position (except that Wayne is more suited to operations than analytics), but I think it's easy to speculate based on some of the things we see in the movie:
  • Create ideal flight plans for Santa to optimize his travel around the globe to hit all the Nice List houses most efficiently, taking into account not only the location of the houses, but also the weight of presents to deliver as many high-weight presents as early as possible.
  • Ensure quality of the Naughty List by cross-checking names, addresses, family members, and friends to guarantee that no one is accidentally misidentified as being on the list.
  • Develop new Naughty/Nice algorithms and expand the base of behavioral information used in determining which children belong on which list.
  • Identifying "might also like" gifts to add as special surprises for children who reached the top decile on the Nice List.
  • Optimize the priority order for gift assembly throughout the year so that all of the presents for the predicted Nice List are ready by 12/24, without making too many gifts for kids who might end up jumping to the Naughty List at the last minute.
  • Create an optimal work-out schedule for Santa to ensure that he can burn off, over the course of the year, all the extra cookie and milk calories he consumes on Christmas Eve.
 I can only imagine what kind of data warehouse platform Santa must be running.  I'm sure his data center is there at the North Pole... to keep the cooling bills down to a minimum.

Fun with Recursive SQL (Part 1)


This blog post is derived from an article that I wrote for Teradata Magazine about fun uses for recursive SQL.

Show of hands: how many of you have used recursive SQL in the past six months? For something useful?  Using ANSI syntax rather than Oracle's CONNECT BY?  If you said "yes" to all three, then congratulations!  You've won a prize: 15 minutes of your day back.  You can jump down to the bottom of this post and just review my fun example.  This first part is just an introduction to recursive SQL.

Introduction to Recursive SQL
First let's just review the definition for "recursion."  Handy-dandy dictionary here.... recourse... rectify.... recursion:
See: recursion
Not very helpful.  How about this instead: recursion is a method of solving problems in which the algorithm or function applies itself in the execution of the solution.  For instance, if I were to define a recursive algorithm for summing a list of numbers, I could say something like "the sum of a list is determined by adding the first number in the list to the sum of all the remaining numbers in the list; if the list is empty, then the sum is zero."  Notice how the definition of "sum of a list" includes the use of the "sum" operation, as in "sum of all the remaining numbers."  You'll also note that there's an explicit terminal condition to define what happens when you have nothing in the list.

So, recursive SQL works the same way.  A recursive SQL statement is one in which the determination of the result set requires the execution of SQL on the result set itself.  In recursive SQL, there has to be a seed statement, as well as the recursive call, and a termination condition.

WITH RECURSIVE [temp table] [column list] AS
(
  [seed statement]
  UNION ALL
  [recursive statement]
)

A Simple Example
For the sake of comfort, a simple example might involve an employee hierarchy table where each row has the employee ID and associated manager ID.  A standard query could retrieve a list of direct reports for a given employee ID:  SELECT * FROM employee WHERE mgr_id = 123.  A recursive query, however, could return the list of all employees anywhere in the hierarchy below that same manager.  An example might be:
WITH RECURSIVE emp_hier (emp_id, mgr_id, level) AS
(
SELECT a.emp_id, a.mgr_id, 0 
FROM   employee a
WHERE  a.emp_id = 123
UNION ALL
SELECT b.emp_id, b.mgr_id, c.level+1 
FROM   employee b,
       emp_hier c
WHERE  b.mgr_id = c.emp_id
)
SELECT e.emp_title, e.emp_id, e.mgr_id, h.level
FROM   employee e,
       emp_hier h
WHERE  e.emp_id = h.emp_id
  AND  e.emp_id <> 123;

In this query, the seed statement is a simple select that pulls the manager record for which we want all of the descendants.  The recursive statement selects all records from the employee table where the employee is managed by someone already in the emp_hier temporary table, hence the join on employee.mgr_id and emp_hier.emp_id.

On to some more exciting examples...


Example 1: String Concatenation
Ever notice that there are no string aggregation functions in SQL?  Something like a SUM() for character fields that would take two parameters: the field or expression from each row and a delimiter.  Usually, programmers end up having to write code outside of the database to do that kind of string manipulation. Recursive SQL can achieve that kind of character aggregation quite nicely if used appropriately.

Consider a list of email addresses for several members of the same family.  The table has a FAMILY_ID, PERSON_ID, SEQuence number, and EMAIL_ADDR.  The desired output is a single row that contains a comma-separated list of all the email addresses for the entire family (for a mailing list perhaps).

WITH RECURSIVE email_list (family_id, emails, seq) AS
(
  SELECT m.family_id, m.email_addr, m.seq
  FROM   members m
  WHERE  m.seq = 1
  UNION ALL
  SELECT e.family_id, e.emails || ', ' || m.email_addr, m.seq
  FROM   email_list e, members m
  WHERE  m.seq = e.seq + 1
    AND  m.family_id = e.family_id
)
SELECT r.family_id, r.emails
FROM   email_list r
WHERE  r.seq = (SELECT MAX(h.seq) FROM email_list h 
                WHERE h.family_id = r.family_id 
                GROUP BY h.family_id)
;
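For concreteness, with hypothetical rows like these in MEMBERS:

FAMILY_ID  SEQ  EMAIL_ADDR
       17    1  mom@example.com
       17    2  dad@example.com
       17    3  kid@example.com

the query returns a single row per family:

FAMILY_ID  EMAILS
       17  mom@example.com, dad@example.com, kid@example.com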
Let's walk through the SQL piece by piece.  First, the seed statement starts the result set off by picking just the first person from each family (that is, the people with a SEQ=1).  That sequence number is admittedly a bit artificial and contrived to make the recursive SQL easier to write.  If you needed to, you could generate the SEQ on the fly using a RANK() of people within each family and then leverage the RANK() as a sequence number.  Hopefully the example isn't lost because of that bit of confusion.

The recursive portion of the SQL simply takes what was seeded in through the first query (i.e. people with SEQ=1) and joins it to the records in the MEMBERS table with the next sequence number (SEQ = SEQ + 1 from the result set).  The EMAILS output field is created by concatenating the email address from the seed to the email address of the record with the next highest sequence number.

Then that result set is taken and the same thing is done.  And so on.  And so forth.

The query terminates when the result set contains no records.  That is, when we exhaust all the SEQ numbers in the table.

The outer query refines the results down to exactly what we're looking for by keeping the final record and throwing away all the intermediate records that were used to build up the results.

Next Time
Look ahead for additional posts on how to use recursive SQL to parse a field with comma-separated values and how to split and reconcile overlapping time segments into a single, flat time line.

Wednesday, December 9, 2009

assert(datawarehouse.data.is_correct());

If a man begins with certainties, he shall end in doubts;
But if he will be content to begin with doubts,
He shall end in certainties.
[Francis Bacon 1561-1626]


When I was learning to program in C and studying algorithms, the assert() assertion macro was one of my favorite debugging tools.  Assert can be used to validate that something isn't going wrong that could send your program into left field during the execution of some procedure.  For instance, a balanced binary search tree should never be more than log2(n) levels deep (or something similar to that based on the exact insertion algorithm), where n is the number of items in the tree.  After a new item is inserted in the tree, you can assert(tree.depth() == log2(tree.count())).  If that assertion fails, then you know the tree isn't staying balanced and the search performance guaranteed by a balanced tree isn't valid any more.

If that's too much computer science for you, hold on and see where this is going.  There's relevance to this idea beyond low-level programming and computer science theory.

I've been in many conversations with data warehouse sponsors that focused on the question of "how are you sure that the data in the warehouse loads correctly every night?"  One of the better ways I've found to approach this kind of data integrity assurance is to think about what kinds of assertions can be found throughout the batch ETL processes that I create.

For this example, suppose a somewhat traditional sort of ETL process that happens in the following steps:
  1. Copy or extract raw data from source system
  2. Detect changes from last pull
  3. Lookup surrogate keys and other translations
  4. Apply deletes (as soft-deletes, setting exp_date = current_date())
  5. Apply inserts
  6. Apply updates
Now, read the wikipedia article about assertions and test-driven development and think about what kinds of things you may want to be able to assert during each phase of the ETL.  (Excuse my sort of object oriented pseudo-code.)
  1. Clearly, after the raw data is copied from the source system, you can test whether you've captured all of those records in your own work area, or possibly perform a sort of checksum on the data to ensure the integrity of the copy.
    • assert(source.count() == mycopy.count())
    • assert(source.checksum() == mycopy.checksum())
  2. If you have to manually detect changes in the source data, rather than relying on source system audit information, there are other assertions that you can perform during that change detection process.
    • assert((source_current.count() - source_previous.count()) == changes.insert_count() - changes.delete_count())
  3. After looking up dimension or reference surrogate keys, you can check that you haven't lost any records and that all the reference surrogate keys have valid values (if that's appropriate for your implementation).
    • assert(record.dimension_key > 0)
    • assert(input.count() == output.count())
  4. After deletes are applied, you should be able to test that the number of rows with a current last-updated audit timestamp and an expiration date of today matches the number of rows marked for delete.
    • assert(changes.delete_count() == target.count(where last_update==today() and exp_dt==today()))
  5. After the inserts are complete, the insert count can be measured using the inserted audit timestamp.
    • assert(changes.insert_count() == target.count(where insert_audit==today()))
  6. The updates can be measured using the last-updated audit timestamp and an expiration date greater than today.
    • assert(changes.update_count() == target.count(where last_update==today() and exp_dt > today()))
  7. We may also choose to assert that we aren't trying to update any records that the source system says were deleted.
    • assert(target.count(where last_update==today() and exp_dt < today()) == 0)

Your ability to create reasonably easy-to-execute assertions at the batch level and at the individual record level depends on how your data warehouse and ETL are designed, of course.  Whether working on an existing system or starting something new, look at data quality integrity assurance from the perspective of traditional asserts and see if that changes how you consider building those ever important data quality assurance jobs.
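For instance, the step 1 row-count assertion can be expressed directly in SQL against hypothetical staging tables; a scheduler job that treats any returned row as a failure turns the query into an assert:

-- Returns a row (failing the assertion) only when the copied row count
-- does not match the source row count.
SELECT 'source/copy row count mismatch' AS assertion_failure,
       src.row_cnt AS source_rows,
       cpy.row_cnt AS copied_rows
FROM   (SELECT COUNT(*) AS row_cnt FROM src_orders) src,
       (SELECT COUNT(*) AS row_cnt FROM stg_orders) cpy
WHERE  src.row_cnt <> cpy.row_cnt;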

            Tuesday, December 8, 2009

            Whiteboard Fail


            This is a story about how the overuse of whiteboards can be a sign of something worse and create a risk to project success...


I once worked with a project manager who was leading a particularly difficult data warehouse project.  The complexity came from both the number of different source systems that were sending information into the same data warehouse tables and the data marts that were used to answer specific questions based on some (but not all) of the source systems.  When I had my first meeting with this lead to talk about the state of affairs, he'd been on the project for nearly twelve months and explained that they were a couple of months behind schedule, but working lots of overtime to catch up.  I asked him to explain the nature of the data and the solution his team was building.  He spent ninety minutes diagramming on his whiteboard, starting from a general overview of the solution and drilling down to various nuances and complexities of the data itself.  It was a superb series of diagrams.  He used color and gestures and inflection in his voice to emphasize various points.

When he was wrapping up and erasing the last diagram, I asked him if there were any electronic copies of these diagrams he'd been reproducing so deftly on the board.  "No.  We've never gotten around to actually drawing those," he responded.  "We've been too busy with the implementation."

Fourteen months later, when he burned out and left the company, there were still no electronic versions of those system diagrams and the project was more than six months behind schedule.  Within days of his appointment, the new project manager had electronic versions of those same images that had once only been in the former lead's head, and within two months the project was beginning to recover some lost ground.

While everyone on the project team had the same general concept for the project in their heads, the fact that the lead never put the diagrams down electronically was an implicit way of not sharing with the rest of the team.  The rest of the team didn't feel a comfortable connection or trust with the fundamental designs they were working with and spent more time than necessary questioning their design and implementation decisions.  As soon as the diagrams were on paper and published throughout the project team's workspace, decisions became more obvious, team members were more confident, and work began taking less time.

            Obviously, I'm not saying that whiteboards aren't second only to whiteboard paint on the list of greatest inventions of all time.  Still, there are some ways to make whiteboarding more effective:
            • Purchase smart boards
• Capture whiteboard contents with a digital camera and an image clean-up tool like scanr
            • Use a tablet PC and overhead
            • Learn to diagram in Visio (or other tool), live during a meeting

            Monday, December 7, 2009

            Data Quality - A Family Affair

            Grandma's lesson about taking responsibility for data quality.

             
            When I was a young child, we spent every Thanksgiving with my paternal grandparents in Denver.  There are two particularly memorable things about those visits.  First, even into the late 1980's, my grandparents didn't own their own telephone.  They rented their phone from the telephone company.  It was the same rotary dial phone they'd had for years, hanging in their kitchen, with an extra long handset cord attached so they could stretch across the dining room or kitchen while still talking on the phone.  Second was the important lesson that I learned about doing dishes by hand.

            Doing dishes by hand is ideally a three person job: one to wash, one to rinse, and one to dry.  The lesson that my grandmother taught me about washing dishes was that the drier is the person accountable for making sure the dishes were clean when they went back into the cupboard.

As data warehousing professionals, we spend a fair amount of time and energy arguing that data quality is something that has to be fixed upstream, by applications.  My grandmother would insist that sending the dishes back to the washer is not our only option.

            If a dish comes to the drier not quite clean, there are three options:
            • send the dish back to the washer to be cleaned again from the beginning with soap;
            • send the dish back to the rinser to have the mess rinsed off with some hot water; or
            • use a little extra effort and wipe off the mess with your dish cloth.

            Ideally the dishes come to us clean and ready to dry.  It's a lot less work to dry off some steaming droplets of water and put a nice clean warm dish away in the cupboard than it is to notice that little bit of bread from the stuffing that didn't quite get cleaned and have to use the tip of your fingernail through a dish cloth to get the crumb off.

            What are the downsides of sending the dish back through to be rewashed from the beginning:
• the washer has to stop in the middle of scrubbing that big pan to rewash the plate;
            • the plate takes longer, overall, to be rewashed, rerinsed, and redried;
            • both the washer and rinser have to redo work.
            Perhaps the same is true in terms of data quality.  If a transaction moves from system to system and doesn't come out the other end quite exactly clean, because some of those business processes in the middle aren't quite exactly flawless, is it always the best choice to go back to the beginning to find just where things went wrong and correct them there?

I'm not suggesting that any application is allowed to be intentionally lazy about data quality, or should not correct issues that are identified.  Rather, I'm suggesting that we make sure we all continue to see data quality as our responsibility and not merely blame upstream systems when there is something that could be done at various points in the chain to ensure quality information is used for decision making.

            Sunday, December 6, 2009

            The Unix Philosophy Applied

            Applied to Application Glut...
            I've always been a big fan of the Unix Philosophy.  It describes a lot of my thinking about systems development and enterprise architecture.  There are times that I wish that application implementers could see the same principle in their own work, too.  Here's an example:

            Company X does a major ERP implementation, standing up an enterprise wide GL, AP, HR, PR, and SCM solution.  A couple years later, the philanthropy group decides that they need a consolidated view of the several different applications that are being used to manage prospective donors.  They decide to buy a finance package from the same vendor that supports their prospect management solutions.  When questioned about why they didn't just use the enterprise ERP solution, they explained that it was "too complicated.  We'd have to set up a new company and a new account structure to support the philanthropy work."
            It's too hard to take my clothes downstairs, figure out how to use the washing machine, wash them, dry them, fold them, and put them away.  I'll just go to the store and buy some new clothes.
ESR, in his description of the Unix Philosophy, declares the Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do [link].

            I think the same applies to an enterprise environment: build or buy new software only when it is clear by demonstration that no existing application will do.

Company Y has a central, enterprise-wide, standard ERP system.  It also has a separate budget application.  Separate applications for AR.  Separate applications for capital management and capital budgeting.  Separate applications for contract management.  And lots of duct tape and glue holding all of these things together.  Sometimes the glue has to be held on with duct tape because one system is made of a material that glue won't stick to very well.  I'm stretching the analogy.

I think the Unix Philosophy applies in these situations, though; or maybe it's the MacGyver philosophy, where you take along a basic set of tools and then work with whatever you have on hand.

            Applied to Data Warehouses...
            Dan Linstedt wrote a blog post a couple of years ago that describes one way that this applies to data warehousing as well: Federated Star Schemas Going Super-Nova.  I've used Dan's argument more than once to describe the risks of using a strict dimensional model for a transaction grain enterprise data warehouse.  The end result is a data warehouse in which stars only loosely relate to each other at high levels and the same concepts get duplicated inappropriately.


            I think that this philosophy is especially relevant during tough economic times.  More often than we think, we can use what we have, even if it isn't tailored exactly to what we would ideally like to have.  If you can use a little bit of customization to make things work, that's better, in the whole, than buying or building a whole new system that will require lots of glue to integrate into your enterprise suite of solutions.

            Friday, December 4, 2009

            PragProWriMo Reflection


            My Gratitude:
            I want to thank @praglife and The Pragmatic Bookshelf for the spur to participate in NaNoWriMo this past month.  I've always wanted to write a technical book and now I can certainly say that I have, whether I ever publish it or not.  So, "thank you."

The bigger thanks goes to my wife, a 4-year veteran NaNoWriMo writer.  It was fun to write together this year rather than feel like I was merely in a support role while she wrote a novel over the course of a month.  We really enjoyed writing together, and she provided me with great encouragement.  "Thank you!"

            My Experience:
            I'm not sure what goals other PragProWriMo writers set for themselves this past month, but I (perhaps ambitiously) followed the NaNoWriMo goal of 50,000 words.  In our house, the goal was 2,000 words per day with one day off per week.  That still keeps you ahead of the standard 1,667 word per day pace, even with the day off.  Our house is also occupied by two wonderful, but often time consuming toddlers.  So, writing took place mostly after 9:00 PM or maybe a quick couple hundred words during the day on the weekend.  2,000 words per night, for me, was about two hours per night.  Depending on your own sleep habits, experience with small children that get up early, and penchant for sympathy, respond to that description of the time commitment as you see fit.

            Beginning:
            I did start my endeavor with some topics that I wanted to cover within the scope of The Practical Data Warehouse.  I had 47 ideas that needed to be covered throughout the book, and felt like that was probably a dozen or so short, but I could pick those up along the way.  So, I started writing at midnight on November 1st.

Despite the fact that I tend to start most of what I write with the introduction and executive summary sections, to lay the groundwork in my head for what the rest of the paper should convey, my PragProWriMo writing started right in with the first chapter of true technical content.  Things moved fairly quickly at first.  Most of what I wanted to write was already in my head, just waiting to transfer via keyboard.

            Organizing:
As I started writing, various non-sequitur ideas would float through my head, too.  All of those went into a backlog list or as a note in some other skeleton section somewhere else in the book.  I wrote just enough in those notes to make sure I could remember what the topic really was supposed to be about.

That said, I certainly didn't compose linearly.  I spent the first couple of days in section 2, then jumped to section 5, back to 3, then off to section 1 for a little while.

            Motivating:
            Most of the energy to keep writing came from the experiences my wife has had writing novels for NaNoWriMo in the past.   Sometimes, other tools were necessary, too: chips and very hot salsa, candy, Starbucks.  Things like that.

            The internal motivation, though, came from the fact that I have accumulated a lot of thoughts and ideas about data warehousing, in particular, throughout my career; and I thought there was something more to be said beyond the books and the blog posts that I read.  My reading and experience tells me that many data warehouse projects struggle because people don't know what works and what doesn't.  That comes in primarily two different types: either people are technically unqualified to elegantly address the challenges of data warehouse implementation or they don't understand the purpose and potential value of data warehouse solutions.  So, I'm hoping that what I have to say can fill a gap there and bring people from either of those extremes toward a better understanding of data warehousing.

            The Challenges:
My biggest writing challenge has always been coming up with an idea that I think someone else will think is worth writing.  I've written a few magazine articles before, but only when I thought I truly had a novel idea that I wanted to share with the world.  I'd have written more, but that's a pretty high standard, I think.  So, in the course of writing the first draft of The Practical Data Warehouse, I had to fight that demon that kept telling me what I had to say wasn't worth reading.

            The other major challenge was exhaustion (see previous section about Toddlers that get up early).  There were several nights where I nodded off at the keyboard, and woke to find a few garbled sentences and 1,652 letter "P"s in the middle of my document.

            The Next Steps:
My plan from here out is to let The Practical Data Warehouse rest for another couple of days, and then start revising.  One topic a night for an hour.  I think that most nights, I'll end up with a blog post that is related to the topic in the book.

            As I go through topic by topic, I expect that I'll need to do a few more experiments to create examples for some of the content ideas.  That should be fun.  As a current PHB, I don't get to write nearly as much code as I used to.

            In a month or two... off to a publisher!!

            Wednesday, December 2, 2009

            The Agility of Touch-It / Take-It

            A parable about the agility that "Touch It, Take It" adds to data warehousing; and the extra work that a misuse of "You Aren't Going to Need It" creates.

Once, there was a great chief called Yagni.  Chief Yagni's village was very prosperous and had grown much during his rule.  Eventually, Chief Yagni decided that it was time for the village to have a new gathering space, as the old one had been well outgrown.  So, Chief Yagni recruited 2 strong men, Gwyn and Titi, to bring stones from the quarry to the village center so that a group of builders could stack them into the new gathering space.

            Gwyn and Titi both arrived on Monday morning to receive their directions from Chief Yagni.  Yagni reviewed the building plans and told Gwyn and Titi that he needed 20 large flat square stones for the base of the building.  Gwyn and Titi took their push carts down to the quarry, gathered rocks, and returned to the village center.  They emptied their rocks together in piles for Chief Yagni's review.  Gwyn's pile had exactly 10 flat square stones.  Titi's had 10 flat square stones and 3 smaller angled stones.

            "Why are these here?" asked Yagni.

            "I had to pick them up off of the flat stones in the quarry," replied Titi, "so I thought I would just bring them along in case there was a use for them."

            "Get rid of them!" shouted Yagni angrily.  "You've wasted time in gathering those worthless rocks I did not ask you to collect.  We need 10 tall narrow stones for the doorway.  Go back to the quarry and bring me those.  Only those, Titi!  When you see the other rocks, tell yourself that You Aren't Going to Need It."


            Gwyn smiled at Titi's scolding, feeling proud that he'd followed the chief's directions so precisely.  Titi still believed that the angled stones might eventually come in handy, though.  Gwyn and Titi began pushing their carts back to the quarry.  Gwyn's light and empty.  Titi's partly full of unwanted rocks.

            Frustrated, not wanting to push the angled rocks all the way back to the quarry, Titi dumped the extra rocks in a small pile just outside of chief Yagni's sight.

            Gwyn and Titi gathered the tall narrow stones the chief asked for.  Titi, again, had to clear angled rocks from on top of the narrow stones, and added them to his cart.  This time, Titi paused and dropped the extra angled stones on his pile just outside of Chief Yagni's sight.

Gwyn and Titi returned to the chief with their carts full of only tall narrow stones, and the chief was pleased with them both.  This continued for several more trips until the new meeting place was nearly complete: Gwyn following directions exactly and Titi always bringing back more than asked, but leaving the extra angled rocks on his pile.  By this time, Titi had accumulated a large enough pile of angled stones to fill his entire cart.

            On their last delivery to Chief Yagni, the chief looked over the plans again and stroked his chin in thought.  "Gwyn, Titi," he said.  "I need you to bring some angled stones for the roof.  Like those that you brought back on the first trip, Titi.  Go back to the quarry and bring me two carts full of those."

            Gwyn and Titi hurried back toward the quarry.  Gwyn went to the quarry and began collecting his cart full of angled rocks, but Titi had the large pile he had been accumulating during his other trips.  He stopped just outside of the chief's sight, collected all of his angled rocks into his cart, and returned to the chief well before Gwyn had even loaded half of his own rocks.

            "Titi.  How did you gather these rocks so quickly, when Gwyn hasn't returned yet?"

            Titi explained to his chief that when he had to pick up a stone anyway, he decided that if he had to Touch It, then he should Take It.

The chief was pleased with Titi's foresight and promoted him to lead rock collector.

            Sunday, November 29, 2009

            #PragProWriMo Complete!


In my #PragProWriMo writing marathon, I've completed most of what I set out to say as well as a few more topics. After 50,058 words (around 200 pages), I know there are still some significant holes in The Practical Data Warehouse: research to do, annotations and references to add, several sections to completely scrap, and many sections to rework.

I think it does have a good solid outline, though; something that I can work from now and make a go of.

            Consume: The 4th "C"

            A data warehouse, as with other technology solutions, is a partnership of those who build it and those who use it.  Both have a responsibility for the success of the data warehouse.  Therefore, consume is the fourth "C" of data warehousing.

            If the sound of a falling tree occurs in the woods and no one is there, did the tree fall?

The data warehouse is fundamentally a communication tool.  The information that it conveys through its data and the representation it provides of business activities can be incredibly valuable.  All the work that goes into collecting, correlating, and conveying information in the data warehouse is wasted if the right decision makers are not consuming the information.  Consumption is not only about the technical ability for some abstract decision maker to use the data warehouse for analysis.  Rather, it is about ensuring that each enhancement to the data warehouse helps to drive adoption and increase both strategic and operational use of the system.  As users have a positive experience that aligns to their immediate business need, they're more likely to come back to the data warehouse with their next question, rather than striking off on a journey to collect information from various systems, correlate that data together, and then convey that information to decision makers and senior leaders.

            #PragProWriMo word count: 50,058 - SUCCESS!!

            Saturday, November 28, 2009

            Convey: The 3rd "C"

            If a tree falls in the woods and no one is there, does it make a sound?

One of the primary responsibilities of the data warehouse is to take information that would otherwise not be available and deliver it to decision makers and analysts who can do something valuable with the information and make the world a better place.  Perhaps you might think that's a bit of a stretch, or perhaps I'm being overly optimistic.  However, my position is that we are more likely to make decisions that move society forward if we have more knowledge about the context in which we're making those decisions.  So, even if we don't always use better information to make better decisions, we wouldn't even have the opportunity to try without tools like the data warehouse.

Effectively conveying information is a function of the format of the information, the performance of the system, the accuracy of the content, and the ease of use of the tools that will be used to access the information.  Most importantly, the data warehouse has to be able to convey to users, in a natural way, how to find the information they need and what aspects of the physical implementation impact how the system can and can't be used.

            #PragProWriMo word count: 48,061

            Thursday, November 26, 2009

            Correlate: The 2nd "C"

Collecting information from a variety of sources is powerful in and of itself, but having an integrated collection of data makes analysis across business processes, subject areas, and source systems many times more efficient.  The data warehouse must provide an access layer that allows users to easily and naturally merge together pieces of business data that are naturally related in the business model.  There are some aspects of correlation that are business driven and that the data warehouse will simply be a victim or beneficiary of: for instance, the consistent assignment of employee numbers across applications, an enforced uniqueness of cost centers across multiple financial systems, or the corporate data governance decision that only one system can be considered the authoritative source for any particular data element.

            Other correlation activities are ones that can be supported directly within the data warehouse environment:
• Code set translations ensure that a user can reference any data source using the approved corporate standard code set for some particular attribute (a sketch of this follows the list).
• Cross-reference translations allow two different identifiers for the same entity to be combined together.
• Transformation of one representation of a particular identifier into a matching representation somewhere else: for instance, if one system uses a 15 digit cost center that includes accounting unit, account, and sub account, the data warehouse should take that 15 digit string, break it into components, and match those against the information from the finance system.
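As a minimal sketch of the code set translation idea (all table, column, and code names here are hypothetical), a cross-reference table maps each source system's local codes onto the corporate standard code set:

-- Translate a source system's local gender codes to the corporate standard.
SELECT f.patient_id,
       f.local_gender_cd,
       x.std_gender_cd
FROM   stg_registration f
JOIN   gender_code_xref x
  ON   x.source_system   = 'REG_SYS'          -- hypothetical source system id
 AND   x.local_gender_cd = f.local_gender_cd;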

Having a data warehouse to pre-execute these kinds of correlation activities once and store the results for convenient use in tens of thousands of queries every month, or even every day, is a huge benefit to analysts through clear communication and improved performance.

            #PragProWriMo word count: 46,141

            Tuesday, November 24, 2009

            Collect: The 1st "C"

The 4 C's of Practical Data Warehousing take a very open approach to describing what a data warehouse is responsible for doing.  Perhaps the function that can be interpreted most broadly from a technical perspective is the first C: Collect.

The first job of a data warehouse is to bring together, in one business tool, the information that an analyst, knowledge worker, or decision maker needs to understand business operations.  Whether that collection of information is a literal copy between data stores, a federation of databases, or a more complicated transformation of transactional data into a dimensional model is merely an implementation detail.  Some of those choices are more or less practical depending on the underlying systems and data (which we'll discuss later), but to be of any value, the data warehouse has to do the collecting of information together so that users don't have to.  One of the key benefits of data warehousing is the ability of an analyst to have a one-stop shop for most of the information they need to do their analysis.  In analytically immature organizations, analysts will typically spend 80% of their time collecting data and putting it into some kind of local data store (which might be anything from flat files to MS Access to small databases) and only 20% of their time doing analysis on that data.  One of the goals of the data warehouse is to flip that ratio so that knowledge workers are able to spend 80% of their time analyzing business operations and only 20% of their time retrieving data from as few different sources as necessary.

            When various analysts from different departments (marketing, strategic planning, sales, finance, etc) all ask the same people for the same data on a continual basis, it also prevents the application teams from having time to make improvements or plan upgrades to the applications themselves.  There are still organizations that have multiple staff members dedicated to the work of fulfilling data extract requests to support internal analytical needs.  A data warehouse that collects the information together once, into a common enterprise model of business activities, satisfies all of those individual departments with one extract from each source system and a consolidated environment in which to do their analysis.

            Saturday, November 21, 2009

            The 4 C's of a Data Warehouse

            Anyone shopping for an engagement ring is familiar with the 4 C's of diamond quality: cut, clarity, color, and carats.  While there are other things that make one diamond better or worse than another, these are the most commonly used.

So, what are the 4 C's of data warehouse quality?
            • Collect
            • Correlate
            • Convey
            • Consume
            These are the 4 things that a data warehouse has to be most effective at to achieve success.  Over the next four posts, I'll describe what each of these represents and ways that you can measure the quality of your own data warehouse implementation against these.


            #PragProWriMo word count: 36,152

            Wednesday, November 18, 2009

            Analysis of Algorithms for Data Warehousing

            This topic was covered briefly in a previous post, but understanding algorithms in SQL can be very powerful.  I hope to shed some additional light on this concept with this post.

Contrary to what many programmers will tell you, SQL can be a very powerful language.  Like any language, it has its limitations and idiosyncrasies, and despite what database vendors might tell you, the database doesn't always know the best way to execute a particular query.  Database query optimizers are limited by their imperfect knowledge of what a user wants via SQL and the imperfect statistics they have about the underlying data.

One option is to use a hinting feature that some database management systems provide.  Hinting gives the SQL developer a means to instruct the database which order to perform joins in, or which indexes to use or avoid.  It can be powerful, but it also depends greatly on the demographics of the underlying data.  If the data changes dramatically, then the same hints may no longer produce the optimal execution plan.
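As a rough illustration, here is what a hinted version of the admit/discharge query discussed below might look like using Oracle-style hint syntax; the event_pk index name is hypothetical, and the available hints and their syntax vary by database vendor.

  -- Drive from a full scan of the admit rows into indexed lookups of the
  -- matching discharge rows (Oracle-style hints; event_pk is hypothetical).
  SELECT /*+ LEADING(a) FULL(a) USE_NL(d) INDEX(d event_pk) */
    a.patient_id, a.visit_id,
    a.event_ts AS admit_ts, d.event_ts AS discharge_ts
  FROM
    event a,
    event d
  WHERE
    a.patient_id = d.patient_id AND
    a.visit_id = d.visit_id AND
    a.event_type = 'admit' AND
    d.event_type = 'discharge'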

Another way to improve SQL performance is to reconsider the algorithm you write as a developer.  Like any programming language, SQL typically provides more than one way to solve any particular problem.  A developer's ability to choose the optimal implementation depends on knowledge of the language and experience with analysis techniques like Big-O notation.  Data warehouses can be particularly challenging environments for this type of analysis of algorithms because the queries being posed can be very complex.  Unraveling the business logic and reimplementing it with another SQL algorithm can become very complicated.

Consider the following typical example, though.  You have a transaction table that records all of the individual states that a particular patient may go through over the course of a hospital stay:

            PATIENT_ID
            VISIT_ID
            EVENT_ID
            EVENT_TYPE
            EVENT_TS
            

            A typical question would be "What is the duration, start, and end time for each patient's visit?"  You might choose to write this by joining the event table to itself, on the one side filtering for "admit" event types and on the other side filtering for "discharge" event types.

SELECT
  a.patient_id, a.visit_id,
  a.event_ts AS admit_ts, d.event_ts AS discharge_ts, d.event_ts - a.event_ts AS duration
FROM
  event a,
  event d
WHERE
  a.patient_id = d.patient_id AND
  a.visit_id = d.visit_id AND
  a.event_type = 'admit' AND
  d.event_type = 'discharge'
            

This SQL is a straightforward algorithm that says I want to match up all the admit records with their corresponding discharge records and return the matched pairs.  Assuming that most of the records are either admits or discharges, the database is likely to do a full table scan and filter for the admits, then use an index lookup to retrieve each corresponding discharge row using the patient_id / visit_id primary key.  The Big-O notation for that would be something like O( n log n ).  That's O( n ) for the table scan times O( log n ) for the index tree search to find each matching record.

A completely different SQL approach to solving the same problem might be to aggregate the admit and discharge rows together in a single pass through the table and then compute the results from that.  That algorithm might be written something like this:

SELECT
  patient_id, visit_id,
  MIN(CASE WHEN (event_type='admit') THEN event_ts END) as admit_ts,
  MIN(CASE WHEN (event_type='discharge') THEN event_ts END) as discharge_ts,
  -- most databases won't let one select-list alias reference another,
  -- so the duration repeats the two expressions
  MIN(CASE WHEN (event_type='discharge') THEN event_ts END)
    - MIN(CASE WHEN (event_type='admit') THEN event_ts END) as duration
FROM
 event
GROUP BY
  patient_id, visit_id
            

This query will scan through the table once and build the patient_id/visit_id groups in a single stream along the way.  Computing these results is an O( n ) operation -- significantly more efficient than the common O( n log n ) algorithm.

While SQL is not a particularly robust or flexible language, there are still many inherent opportunities to write either very efficient or very inefficient code.

            Some powerful tools of the trade to learn in SQL optimization include:
            • Single-pass alternatives (as in the example)
            • Recursive SQL (e.g. with recursive or connect by)
• Analytical functions and options (e.g. aggregates with their own partition by / order by; see the sketch after this list)
            • Multi-grouping extensions (e.g. cube and rollup)
            • How these are affected by different database technologies (MPP, columnar, associative)
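As a sketch of that analytical function bullet, here is the admit/discharge question answered with windowed aggregates instead of a GROUP BY.  The window version still computes the visit-level values in a single pass, and dropping the DISTINCT lets the admit and discharge timestamps ride along on every detail row, which is often exactly what an analyst wants.

  SELECT DISTINCT
    patient_id,
    visit_id,
    MIN(CASE WHEN event_type = 'admit'     THEN event_ts END)
      OVER (PARTITION BY patient_id, visit_id) AS admit_ts,
    MIN(CASE WHEN event_type = 'discharge' THEN event_ts END)
      OVER (PARTITION BY patient_id, visit_id) AS discharge_ts
  FROM event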

            #pragprowrimo word count: 30,137

            Sunday, November 15, 2009

            What is The Practical Data Warehouse?

There are other books about data warehousing, many written by individuals with very insightful and wise techniques that can be used to build data warehouses.  Why did I start working on a new book?  Many data warehouses have been built with a strict methodical approach in mind: "Follow these steps and we'll achieve success."  In my experience, the result of those data warehouse initiatives may meet initial user expectations but often fails to support long term growth, expansion, and maintenance.

I believe that one of the key gaps in these implementations is the creativity required by an engineering design approach.  This book is designed for thoughtful and creative technicians who are looking to apply engineering principles and creative solutions to data warehousing challenges and deliver solutions that achieve short term success as well as long term sustainability.  This book is about principles and practices.

            #pragprowrimo word count: 24,908

            Friday, November 13, 2009

            Avoiding Self-Joins on Large Tables

With data warehousing solutions, I've run into the need to do self-joins on large tables a surprisingly large number of times.  A typical example might look something like this:

            • Customer Addresses Table
              • One row in the table per address
              • Addresses are categorized by a type of home, work, or other
              • Question is to know how many of your customers work and live in different zip codes
              • If either address is not present, then include the record
            The first solution that probably comes to mind is something like the following join between all of the home addresses and all of the work addresses.


SELECT
  COALESCE(h.customer_id, w.customer_id) AS customer_id,
  h.zip AS home_zip,
  w.zip AS work_zip
FROM
  (SELECT customer_id, zip FROM customer_addr WHERE addr_type = 'home') h
  FULL OUTER JOIN
  (SELECT customer_id, zip FROM customer_addr WHERE addr_type = 'work') w
    ON h.customer_id = w.customer_id
WHERE
  (h.zip <> w.zip) OR h.zip IS NULL OR w.zip IS NULL

Here is another, more efficient way to get the same answer in a single pass of the table:


SELECT
  customer_id,
  MIN(CASE WHEN addr_type = 'home' THEN zip END) AS home_zip,
  MIN(CASE WHEN addr_type = 'work' THEN zip END) AS work_zip
FROM customer_addr
GROUP BY customer_id
HAVING
  MIN(CASE WHEN addr_type = 'home' THEN zip END) <> MIN(CASE WHEN addr_type = 'work' THEN zip END)
  OR MIN(CASE WHEN addr_type = 'home' THEN zip END) IS NULL
  OR MIN(CASE WHEN addr_type = 'work' THEN zip END) IS NULL;


In the new code, the query makes only a single pass over the table, builds one row per customer with both zip codes, and keeps just the customers whose home and work zips differ or are missing.



            #pragprowrimo word count: 18,002

            Tuesday, November 10, 2009

            Two Dimensional Time

One of the trickier data warehousing situations I've encountered is the need to track changes in a time line, over time.  The prototypical situation is something like contract terms, which can be renegotiated and corrected.  The contract provides a specific set of terms that are valid for particular time ranges.  The collection of those constitutes a particular time line.  The desire is to track the changes to that time line, over time.  This satisfies an audit requirement as well as providing for analysis like "if we did a better job understanding and enforcing our contracts, what impact would that have on the bottom line?"

            This concept of tracking changes to a time line, over time, introduces what you might think about as a two dimensional view of time itself.  There is what I would refer to as observed time (the movement of time itself) and perceived time (what particular points in time look like given some point in observed time).

            You can imagine a picture that looks like this:

            Perceived Time
                 ^
               9 | C.CA..AB.
               8 | A.....AB.
               7 | A.AB.....
               6 | A.AB.....
               5 | A........
                 +----------->  Observed Time
                   123456789
            

The horizontal axis represents a view into time.  Movement along the vertical axis represents a movement in time and potential restatement of history.

It's a fascinating way of looking at time two dimensionally.  I suppose that there could be additional dimensionality to this model as well.  In two dimensions it can describe "what I think happened."  With a third dimension, it could represent "what I think I thought happened."  And so on.

            I'm sure a deeper study of philosophy and mathematics could give me a stronger academic and theoretical background for this conversation, but in the meantime, I think that modeling a two dimensional perception of time lines is sufficient for most uses:

            ID
            Attributes
            Effective_Date
            Expiration_Date
            Perception_Start_Date
            Perception_End_Date
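As a minimal sketch, assuming a hypothetical contract_terms table with those columns (and an open-ended perception_end_date stored as a well-defined far-future default rather than NULL), reconstructing the time line as it was perceived at a given point in observed time looks like this:

  SELECT
    id,
    attributes,
    effective_date,
    expiration_date
  FROM contract_terms
  WHERE
    perception_start_date <= DATE '2009-06-30'
    AND perception_end_date > DATE '2009-06-30'
  ORDER BY effective_date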
            

            And the round-up for today...
            #pragprowrimo word count: 16,063

            Monday, November 9, 2009

            The PICA code

            I'm a member of the IS Board of Advisors at the University of Missouri St. Louis.  One of the things that this university's IS program struggles with (as do most) is how to graduate better analysts.  I've always seen strong analytical skills as a personality trait rather than something that can be learned.  As I recall my training as a consultant, though, it occurs to me that there were several lessons about how to be a good analyst, how to ask open ended interview questions, how to perform experiments and data analysis to validate assumptions, etc.

The PICA code was a great example of poor analysis.  The project was to add a new field to the pharmacy claim table.  We went and talked with the team that was adding this field to the claim processing system to understand how it was really going to be used and what it really was.  The answer we got was "it's called the PICA code because it can have only one of 4 values... P, I, C, and A."  There was no harm done when we discovered that there were also values of N in that column, though there were lots of jokes about needing to change the column name every time a new value was discovered.

            In this case, the weak analysis didn't cause much harm, but there are certainly other cases where poor root assumptions can lead to poor decision making downstream.  I hope that other universities catch up with what UMSL hopes to do in graduating students who are stronger analysts.

            Word Count: 14,015

            Sunday, November 8, 2009

            Believing What You're Told

            One of the biggest challenges I've experienced in data warehousing is that analysis has to take place from three sometimes very divergent and contradictory perspectives:
            • How does your user perceive the nature of the business?
            • How do users of the source systems perceive their own activities?
            • How do source systems record the activities of their users?
I've run into many situations where there are varying degrees of disagreement among those three perspectives.  It's the challenge of a good analyst to get to the bottom of these discrepancies, understand why perceptions are what they are, and be able to believe that there is complete truth in each of them... even if the source system only says one thing.

            Sometimes even harder than the philosophical challenge of reconciling those divergent perspectives is the act of reconciling them into a common model of understanding in the form of a reporting system or data warehouse.  Woe to any architect or modeler who takes one of the stories as gospel and disregards the others.

            #pragprowrimo word count: 12,005

            Friday, November 6, 2009

            Null Valued Fields in a Data Warehouse

Whether or not null-valued fields should be allowed in a relational database is a classical debate.  In data warehousing, null-valued fields should be prohibited as a rule, not because of the open / closed world assumption in the relational controversies, but because of what nulls imply for business applications and how they impact the usability of a data warehouse.

As Codd states, the value NULL by itself lacks sufficient specificity of meaning.  Does it imply that a field value is simply missing for a particular record, but might yet take on a value or might have previously had one; or does it imply that the field value is not applicable for that record and never will be?  (This lack of clarity creates that open/closed debate.)  For the sake of clarity in a data warehouse, where data values are expected to provide a clear, complete, and accurate representation of business information (closed world), allowing null values creates inappropriate ambiguity.

            Four different types of default values should be considered in place of a null value:

            • A logically accurate value
            • Well-defined "unknown" (Codd's A-value)
            • Well-defined "not applicable" (Codd's I-value)
            • Source specific default

The biggest challenge in applying this rule will be to understand the underlying business operations and determine the most appropriate and most intuitive values for users to interact with.  Be very careful to consider what impact those choices will have on how users will access information in the data warehouse.  If it makes simple questions more complicated to answer, then it is probably a poor choice.
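As a sketch of what this looks like in practice, the defaults get applied once, during the load; the staging and warehouse table names here are hypothetical.

  INSERT INTO dw_patient (patient_id, gender_cd, discharge_disposition_cd)
  SELECT
    patient_id,
    -- well-defined "unknown": the value exists but wasn't captured
    COALESCE(gender_cd, 'UNK'),
    CASE
      -- well-defined "not applicable": the patient hasn't been discharged yet
      WHEN discharge_ts IS NULL THEN 'N/A'
      -- source-specific default: discharged, but the source sent nothing
      ELSE COALESCE(discharge_disposition_cd, 'SRC_MISSING')
    END
  FROM stg_patient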

            Word Count: 9,401

            Thursday, November 5, 2009

            Macro Data Modeling

            One of the challenges with data warehousing is building the right data model.  There are trade-offs between normalization and denormalization, usability of dimensional models versus transactional 3NF models, data marts, and complications of slowly changing dimensions and auditability.

            One technique that I think helps build data warehouse models that are intuitive, easy to populate and tune, and also extensible as new information requirements emerge is something I refer to as "macro data modeling."

Like the contrast between macroeconomics and microeconomics, macro data modeling examines an entity and its relationships from the perspective of the whole enterprise rather than from the perspective of an individual business process.  Think of it as modeling something from a customer's perspective rather than the perspective of someone in the finance department, or modeling something from an enterprise perspective rather than a departmental perspective.

One example of this thought process might be the tracking of packages by the USPS.  On the one hand, the system that is used for scanning packages at each handoff point in the shipping chain might store SCAN transactions.  From a micro-perspective, this makes sense because the SCAN is what is actually being done in each location.  The macro-perspective of this same operation is a model that represents PACKAGE_DELIVERY.  The PACKAGE_DELIVERY record has the current status and location of a given package, key measures about how long it has spent in various statuses, key dates and times, etc.; and it also has a series of history records that imply every one of the SCANs that happened along the way.  That macro data model will be easier to use for most analysis purposes partly because it is focused around a concept that will exist forever (a PACKAGE_DELIVERY is still a valid concept even after delivery is complete) rather than a purely transient transaction (SCAN) that only exists at a point in time.
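A sketch of that macro-level model might look like the following; the table and column names are illustrative rather than drawn from any real USPS system.

  -- One row per delivery, with current status and key measures maintained
  -- for the life of the concept (even after delivery is complete).
  CREATE TABLE package_delivery (
    package_delivery_id  INTEGER       NOT NULL PRIMARY KEY,
    tracking_number      VARCHAR(30)   NOT NULL,
    current_status_cd    VARCHAR(20)   NOT NULL,
    current_location_cd  VARCHAR(20)   NOT NULL,
    shipped_ts           TIMESTAMP     NOT NULL,
    delivered_ts         TIMESTAMP,    -- set on delivery (or a well-defined default)
    hours_in_transit     DECIMAL(9,2)  NOT NULL
  );

  -- Each micro-level SCAN becomes a history row describing one status interval.
  CREATE TABLE package_delivery_history (
    package_delivery_id  INTEGER       NOT NULL REFERENCES package_delivery,
    status_cd            VARCHAR(20)   NOT NULL,
    location_cd          VARCHAR(20)   NOT NULL,
    status_start_ts      TIMESTAMP     NOT NULL,
    status_end_ts        TIMESTAMP
  );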

            Another way to think about that particular example is through the metaphor of a state machine.  More on that in my #pragprowrimo (NaNoWriMo) endeavour.

            Word Count: 7704

            Wednesday, November 4, 2009

            Data Model Extensibility

One of the major benefits of creating abstract tiers in a data model is the ability to change the physical representation of the data independently of the downstream applications that consume that data and the upstream applications that deliver it.

            Rules of the road:
• Users never access base tables directly.  They always go through synonyms or views, even if those views are simply "select *" views (see the sketch below).
• ETL always uses staging tables that are source-specific.  Use default column values.
• Normalize base models, but not to excess.  Normalize them to a sound degree that aligns somewhere between the source system and the downstream data.
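Here is a minimal sketch of the first rule; the schema and table names (dw_base, dw) are just placeholders.

  -- The physical base table lives in its own schema...
  CREATE TABLE dw_base.customer_addr_t (
    customer_id  INTEGER      NOT NULL,
    addr_type    VARCHAR(10)  NOT NULL,
    zip          VARCHAR(10)  DEFAULT 'UNK' NOT NULL
  );

  -- ...and users only ever see the presentation view, so the base table can be
  -- renamed, repartitioned, or restructured without breaking their queries.
  CREATE VIEW dw.customer_addr AS
    SELECT customer_id, addr_type, zip
    FROM dw_base.customer_addr_t;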
            --Paul

Word Count: 5,201

            Tuesday, November 3, 2009

            Layers of Data Warehouse Data Architecture

            A very effective way to develop the data architecture for a data warehouse is to think about the situation from four different angles:

            Data Storage - This layer is the actual physical data model for base data warehouse tables.  The purpose of this model is to provide a clear and concise representation of the entities, attributes, and relationships present in the data warehouse.

Data Presentation - This layer of the data architecture is accessed by users or user tools that provide data access.  It may be as simple as a series of synonyms or "select *" views, but it provides a strict layer of abstraction between data presentation and physical data representation.  This abstraction layer, decoupling the presentation of data from the underlying storage of data, allows for changes to be made independently on either side of that boundary.

Data Staging - Like the data presentation layer, a data staging layer creates an abstraction boundary between the processes that deliver data to the data warehouse and the way in which data is represented internally within the data storage layer.  This separation allows new and expanded data sources to feed information to the data warehouse without necessarily impacting existing data sources on the same schedule.  It helps avoid the "if I add that new source, then I'll have to rewrite another 20 ETL jobs at the same time" situation.

            Data Marts - The final layer is the business-problem-centric data mart.  Arguably, data marts can be considered solutions that are independent of a more comprehensive data warehouse architecture, but I've included it here for the sake of contrast.  In this model, a data mart is a grouping of entities and business rules that represent the information necessary to address a particular and well defined business problem.  In many situations, a data mart will be a subset of data in a data warehouse that is filtered and interpreted through specific business rules.  In some situations, data marts can be represented entirely through a logical abstraction layer (rather than through additional physical tables and the duplication of data).  Data marts created without significantly duplicating underlying information are referred to as virtual data marts.
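For instance, a virtual data mart can be nothing more than a set of views that apply the mart's business rules on top of the warehouse's presentation layer; the names below are illustrative.

  -- A business-problem-centric view: inpatient visits only, with the mart's
  -- business rule applied in the view definition instead of a copied table.
  CREATE VIEW mart_readmissions.inpatient_visit AS
    SELECT
      v.patient_id,
      v.visit_id,
      v.admit_ts,
      v.discharge_ts
    FROM dw.visit v
    WHERE v.visit_type_cd = 'INPATIENT';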

            Word Count: 4882

            Monday, November 2, 2009

            Context Diagrams

Admittedly, I'm a whiteboard and diagramming addict.  I'm moving between office buildings after hours tomorrow, and for the move, I've marked 5 individual whiteboards to be delivered to my office.  While I love whiteboards and the ability to capture content using the 3 MP digital camera in my phone, I've been working for the past year to train myself to diagram electronically in real time during meetings; to use MS Visio as a live design tool.  (It has worked effectively a few times.)

            In the early phases of analysis and solution design, two of the most valuable pieces of documentation are a current state system context diagram and a future state system context diagram.  Some of the critical decision points in documenting systems analysis results and design plans include:
            • level of detail
            • accuracy assurance
            • shapes, colors, and sizes of boxes
            • labels
            • relationships
• how many times to ask "and then what?"
• keeping the diagram simple
As you begin to get into the weeds of detailed analysis, there can be a tendency to increase the level of detail in one small area of an existing diagram.  That's why it can be very valuable to lay out a diagramming and documentation plan ahead of time that includes a clearly stated goal and objectives for each piece of documentation.

            Diagramming is easy to do.  Documenting business systems context diagrams is hard to get right.

            Word count: 3,290

            Sunday, November 1, 2009

            And Then What?

            The 5-Whys principle made famous by Sakichi Toyoda and its application at Toyota has a very handy parallel in data warehousing analysis: "and then what?"  There hasn't been any formal statistical or mathematical analysis to determine how many times "and then what" typically needs to be asked before truly understanding a complete business process, but 5 is probably a good starting point here, too.  One of the key metaphors in data warehousing analysis is the Information Supply Chain.  In order to get a complete, end-to-end view of that supply chain, you have to keep asking the knowledge workers in that supply chain "and then what?"

            Word Count: 777

            Saturday, October 31, 2009

            T-Minus 27 Hours (to PragProWriMo)

            NaNoWriMo starts tomorrow and the folks from The Pragmatic Bookshelf are part of my inspiration to participate this year.  Over the course of the next 30 days, my wife and I will both be spending late nights up writing together.  She'll be exercising her passion and creativity through a new novel.  I'll be evangelizing my passion for the inherent value of business information and sharing some powerfully practical techniques for building data warehousing solutions that draw on and emphasize business insight -- emphasis on the word "practical."

            As I sprint through 2,000 words a day, I'll also summarize those topics and the experience here on Sharpening Stones.

            If you're writing along with PragProWriMo or NaNoWriMo, visit my user page and be my writing buddy!

            Thursday, June 18, 2009

            Aggregate / Summary

            One of my wife's pet peeves is when technologists take a word that already has clear meaning in the English language and then twist the definition to mean something not-quite-close-enough in a technical context. Great case-in-point in the world of data warehousing:

            What is the difference between an aggregate and a summary?

I did a Google search and came up with a lot of junk, along with what I would consider some not very good answers (albeit from 2002) from some industry analysts.  In fact, David Marco goes so far as to say "Summarized and aggregation are the same thing."


            In my gut, there's always been a difference between an aggregate and a summary. So, I decided to try to articulate what the difference is. In doing so, my wife's advice of "just look in the dictionary" came in very handy.


            Aggregate:
            Comes from the Latin word for "to flock together" or "to flock or group."
Just to point out that there's nothing in that definition about reducing the specificity of something or losing the individuals that make up the flock or group.  Rather, the idea is that there is a group of individual things acting together.

            Summary:
From the Latin summa, one meaning of which is "total" or "sum," also the "principal or main thing."
            Clearly, the idea with this root word is that a level of detail is being removed when the summary of something is presented. Rather than still being individual things, the summary of things is another layer of abstraction that represents the underlying detail.


So, my conclusion is that, in data warehousing applications, the term aggregate can be used to represent an object that brings together various other ideas and combines them in one place, regardless of any change or lack of change in the level of granularity.  A summary, on the other hand, has to imply a change in granularity, either through mathematical means or by eliminating the amount of precision in a series of events.


            Example 1:
            If I have two fact tables, one representing purchase orders and the other invoices, I can create an aggregate that still contains all the same detail, but pre-joins those fact tables together in a new kind of fact. In order for that to be a summary, I also have to roll it up to something higher than the transaction grain.
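A sketch of that kind of aggregate might look like the following; the names are illustrative, and create-table-as-select syntax varies by database.

  -- Still at the transaction grain: purchase order lines pre-joined to their
  -- invoice lines, so no detail has been lost.
  CREATE TABLE agg_po_invoice AS
    SELECT
      po.po_line_id,
      po.product_key,
      po.ordered_qty,
      po.ordered_amt,
      inv.invoice_line_id,
      inv.invoiced_qty,
      inv.invoiced_amt
    FROM po_line_fact po
    LEFT OUTER JOIN invoice_line_fact inv
      ON inv.po_line_id = po.po_line_id;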

            Example 2:
            If I have a workflow event table that I use to track how long it takes an order to go from ordered to fulfilling to packaging to shipping to billing, I can create a summary that only has ordered and shipping status records. In this case, the level of granularity hasn't really changed, but the level of detail has, so it is a summary.


I'm open to comments on this, but it seems very straightforward given the dictionary definitions of the terms.  I know people may use them differently, but we're a group of data-driven experts.  Shouldn't clear and precise definition of terms be something that we strive for?