Wednesday, December 30, 2009

Who's data is it? (Part 2)

In the first part of this series on who's data is it, I talked about the black-box mentality that some teams have with regard to vendor applications.  In that scenario, the hurdle to overcome is primarily one of convincing the right authorities to grant you database access.  That's the biggest hurdle, assuming the application data is actually understandable once you get at the underlying database.  Unless the application team is more loyal to the vendor than to your company, you haven't burned any important bridges in that scenario.  The vendor won't support your efforts at extracting and integrating data from the application this way, but they're the one's who put out a ridiculously high $250,000 bid anyway.  The next scenario is more politically charged, however...

It's MY data!

The "it's my data" culture sees the information behind an application as something that needs to be closely held and funneled through a controlled group of experts.  Unlike the vendor-data perspective, the my-data group is interested in sharing information outside of the application interface, but often with point-to-point interfaces that they can tightly control (as opposed to any kind of pub-sub or hub-spoke model that creates a more open access model).  They're interaction often sounds like: "You provide exact specifications with which specific fields you want and we'll send you an extract with exactly that information."

Another common response from application teams to a request for access to an application database might be something along the lines of "what are you going to do with that data?  We have responsibility for that data and it can be very complicated to understand."  This kind of response is indicative of a belief that the application team has some kind of exclusive ownership of the data and should be the only group allowed to parcel information out to users individually through reports or extracts.  At the extreme, it's reminiscent of early reporting days where a user would call someone in IT for a report; that IT person would build the program to generate the report; print the entire report on green-bar paper; and then send it through interoffice mail to the requester.  If the application team isn't already on board with a service oriented approach to exposing data to other applications or integrated reporting through data warehousing, then their natural tendency will be to create a plethora of individual point-to-point solutions across the application landscape.

This kind of architecture may lead to well defined standards and definitions of data, but either only in pockets across the organization or with an ever increasing cost of maintenance as the number of interfaces expands and complexity of that interaction increases.  There are some positive aspects to a culture that wants to protect the purity of its data and have a strong understanding of how other groups are using the same data.  Part 3 of this series will touch on that data purist counter-point.

Who's data is it? (Part 1)

The problem of data ownership:
I want to thank Jim Harris for his great post about The First Law of Data Quality from earlier this month. It's certainly an excellent read.  One of the things it reminded me of is the experiences that I've had working with various other business and IT teams when trying to get access to and understand data from particular systems.  The issue of "data ownership" always seems to play an antagonistic role in any desire to acquire and integrate information from multiple systems.  Generalizing, I think there are three different negative responses you can receive when looking to a new source system for access to their data for integration into a data warehouse:
  • That's the vendor's data.  We wouldn't be able to understand it.
  • That's my data.  I'll tell you what you need.
  • That's not my data.  I don't know anything about it.
This series of blog posts will explore each of those in more detail:

It's the vendor's data...
The "it's the vendor's data" culture has a black-box view of applications: there is no separate understanding of application and information.  So, there's a belief that the only way to interact with the data in the application is have a vendor or partner create new application features or vendor supported interfaces.

Individuals who have experience in supporting the end users of applications and sometimes the administrative configuration of applications often put significant weight in the features and functionality of the application itself.  From an integration perspective, this significantly limits the kind of integration that can be done with many applications.  For instance, most applications provide users with various types of "report writing" functionality.  In many applications, this is merely an extract-generation feature that allows users to select the fields they're interested in and exporting that data to a CSV or Excel file.  Application support analysts who are used to supporting users in this way and have, maybe, not recently been involved in technical integration work may default to a view that the only way to get information out that particular application is through these manual extract tools.  Of course, having individuals manually run extracts from a system on a daily basis is not an ideal pattern for sourcing data into a data movement process.

In situations of heavy application reliance, application teams might never think of doing integration directly from the database; or the team might feel that any such endeavor would require many hours of training from the vendor.  Of course, the vendor would appreciate the professional services or training for that.  In most applications, however, that is typically unnecessary.  Reverse engineering a data model is an easy exercise and, depending on the complexity of system.  Reverse engineering the underlying data typically takes more analysis, but having access to both the application and database (in several environments) provides a powerful way to understand both the application and the data.

Enterprise scale applications typically provide some time of standard interfaces, whether those are X12, EDI, HL7, or a proprietary way of moving data in and out of the system.  Many modern applications provide a service oriented API to allow external applications to do read or write operations to the application.  Standard interfaces and services are an excellent way to retrieve data for a data warehouse.  In lieu of sufficient interfaces, though, database to data warehouse integration is still a common, powerful, and appropriate integration model.

Convincing application teams that connecting directly to an application database without engaging and paying for professional services from the vendor can be challenging.  One effective way to break through part of that cultural difference is to do a proof of concept and show how the technology and analysis can work to create custom integration solutions without necessarily needing vendor intervention. A quick win to prove that the mysterious data behind some application really is just letters and numbers goes a long way to gaining confidence more substantial integration work.

Saturday, December 26, 2009

Where to Put Business Logic

Dave Rathburn has a great blog about SAP / Business Objects and posted in October about choices in where to put business calculations in the data warehouse / reporting stack.  I thought I'd share a couple of quick thoughts about some experiences I've had with project teams trying to decide where to put certain pieces of business logic, specifically in the ETL workflow.

For this example, the project involves the integrated reporting of accounts receivable history from about a dozen separate legacy systems.  The design question came up: how much business logic should be done by the legacy source systems who are writing the extracts using proprietary tools, and how much should be done by the ETL team who will be integrating and merging the data into the warehouse.

In the beginning of the discussion, we had two polarized and protectionist perspectives:

Argument 1: Since the source system teams are subject matter experts, they'll know quickest how to understand and translate from source system specific domain value and logic into data warehouse values.

Argument 2: It's the data warehouse team's job to do the integration work, so the logic should all go into the ETL.

The discussion degenerated for a while until I noticed that what we were really revolving around was a straightforward operational argument that is the most convincing that I've ever heard.  Here's how we summarized it at the end of our discussion.

General data warehouse design goal: put as little business logic as necessary in the source system extracts themselves.  Rationale...
This helps to reduce the number of places you have to look to troubleshoot when something goes wrong.  If most of the business logic is applied after the data is retrieved from the source system, then it's also most likely that any errors that occur happen after that initial extract step.  If there is significant business logic in both the source system extract as well as the integration ETL, then both places have a reasonable likelihood of introducing errors into the data.  Some business logic must exist in the integration ETL to be able to merge common dimensions and do the work of actually integrating the data. Therefore, having limited business logic in the source extract means that you don't have to focus on both the legacy code AND the integration ETL code when troubleshooting.

 In the end, we all agreed that we'd keep the logic in the legacy extracts as simple as possible, and put as much business logic as possible in the ETL.

Tuesday, December 15, 2009

Fun with Recursive SQL (Part 3)

This blog post, Part 1, and Part 2 are derived from an article that I wrote as an article for Teradata Magazine and for Teradata Partners conference about fun uses for recursive SQL.

Finally, to the really interesting example of how to use recursive SQL productively.  In this example, we're dealing with segments of time that are defined with a start and end time.  Imagine staggered work shifts or a work schedule that specifies when a worker is scheduled to take breaks.

Imagine a series of time segments, like those shown in the picture below.  Each segment has a start and end time as well as a priority level.

The desired output from this data is a new, single, continuous time line that denotes only the highest priority segment for any particular clock time.

Here's a simple summary of the recursive algorithm that we're going to apply to solve this challenge:
  • For each segment at a given priority level
  • Join to the first segment that appears from a higher priority level that overlaps in any way
  • Calculate the results of the overlap.  This may result in:
    • A head segment
    • A tail segment
    • Or two new segments

We'll start off with a table that has the following definition and sample data:
create table tl (
  seg_num integer not null,
  p integer not null,
  st_tm integer not null,
  end_tm integer not null

insert into tl values (1,1,100,400);
insert into tl values (2,1,445,600);
insert into tl values (3,3,130,200);
insert into tl values (4,3,315,415);
insert into tl values (5,2,345,500);

You can find the first layer of overlaps that will cause splicing of any given segment using a self-join that matches every time segment with any time segments that:
  • have a higher priority; and
  • either:
    • have a start time between the current segment's start/end time; or
    • have an end time between the current segment's start/end times.
You can see the three conditions in the WHERE clause below:

select a.seg_num, 
       row_number() over(order by b.st_tm) rnk,
       b.seg_num, b.p, b.st_tm, b.end_tm
from   tl a,
       tl b
where  a.p < b.p
  and  ((a.st_tm  < b.st_tm  and a.end_tm > b.st_tm ) or
        (a.st_tm  > b.st_tm  and a.st_tm  < b.end_tm))

The trick is that we need to do the same kind of operation on the output of this first statement to determine if there are other, higher priority segments that need to splice these further. So, we recursively call the same kind of SQL on the output. In the first attempt at the recursive version, we also add some CASE logic to output exactly what the newly created slices are.

with recursive tl_rec (
  seq_num, seg_num, p, st_tm, end_tm, b_seg_num
) as (
  select 1, seg_num, p, st_tm, end_tm, -1
  from   tl 
  union all
  select a.seq_num + 1, a.seg_num, a.p,
         case when a.st_tm  < b.st_tm and 
                   a.end_tm < b.end_tm then a.st_tm  
              when a.st_tm  < b.st_tm and 
                   a.end_tm > b.end_tm then a.st_tm
              when a.st_tm  > b.st_tm and 
                   a.end_tm > b.end_tm then b.end_tm end,
         case when a.st_tm  < b.st_tm and 
                   a.end_tm < b.end_tm then b.st_tm  
              when a.st_tm  < b.st_tm and 
                   a.end_tm > b.end_tm then b.st_tm 
              when a.st_tm  > b.st_tm and 
                   a.end_tm > b.end_tm then a.end_tm end,
  from   tl_rec a,
         tl b
  where  a.p < b.p
    and  a.b_seg_num <> b.seg_num
    and  ((a.st_tm  < b.st_tm  and a.end_tm > b.st_tm ) or
          (a.st_tm  > b.st_tm  and a.st_tm  < b.end_tm))
select * from (
    row_number() over(partition by seg_num 
                      order by seq_num desc) rnk,
    seg_num, p, st_tm, end_tm
  from tl_rec ) z
where rnk = 1

And you get the following output records:

But wait!  There's something missing.  There should be a left-over LOW segment there between the two HIGH's.  We lost something along the way... Hmm...  The output of the join we've defined can only be one record.  So, when a higher priority segment is contained within the start/end of a lower priority segment, we're only returning the leading piece of the LOW segment.  There's no way for that join to output two records with different logic, though.  Luckily, with recursive SQL, we can have more than one recursive statement.  So, we just need to add another UNION ALL to output the trailing LOW segment.

with recursive tl_rec (
  seq_num, seg_num, seg_split_num, p, 
  st_tm, end_tm, b_seg_num
) as (

  ... repeat from previous SQL ...
  union all
  select a.seq_num + 1, a.seg_num, a.seg_split_num+1, 
         a.p, b.end_tm, a.end_tm, b.seg_num
  from   tl_rec a,
         tl b
  where  a.p < b.p
    and  a.b_seg_num <> b.seg_num
    and  (a.st_tm  < b.st_tm  and a.end_tm > b.end_tm)
select * from
    row_number() over(partition by seg_num, seg_split_num 
                      order by seq_num desc) rnk,
    seg_num, seg_split_num p, st_tm, end_tm
  from tl_rec
) z
where rnk = 1

Phew! With that combined SQL, now we finally get the expected output!

Personally, I recommend studying through that a few times so that you've really internalized the strategy. It certainly took me a while to actually get it to work correctly. Then look around at your existing projects or challenges you've faced in the past and see where else you can apply this same kind of recursive technique.

Good luck!

Saturday, December 12, 2009

Fun with Recursive SQL (Part 2)

This blog post and Part 1 are derived from an article that I wrote as an article for Teradata Magazine about fun uses for recursive SQL.

Part 1 of this series described a way to use recursive SQL to build up a single comma-separated field from a set of rows that were all part of the same grouping. Part 2 describes how to do just the inverse: take a field that has a comma-separated list of values and produce one row for each of the items in that list. What I refer to as "string normalization."

In the code below, the seed statement is simply a select * from our source table. The recursive statement takes the field to be pivoted into multiple rows and uses the POSITION() function to repeatedly pull off just the first item in the list. The recursion continues until our POSITION() calculation in the "dpos" field tells us the field is blank.

The context for this example is a set of survey responses in a table that contains a RESPONSE_KEY (just some primary key) and RESPONSE_TXT (a comma-separated list of responses). The SQL below parses that RESPONSE_TXT field into a single elements and outputs one row per element along with the associated RESPONSE_KEY.

       (response_key, delim_pos, item_num, element, remainder) AS 
    SELECT response_key, 0, 0, CAST('' AS VARCHAR(100)), response_txt
    FROM   response
    SELECT response_key,
           CASE WHEN POSITION(',' IN remainder) > 0
             THEN POSITION(',' IN remainder)
             ELSE CHARACTER_LENGTH(remainder) END dpos,
           item_num + 1,
           TRIM(BOTH ‘,’ FROM SUBSTR(remainder, 0, dpos+1)),
           TRIM(SUBSTR(remainder, dpos+1))
    FROM   parse_list
    WHERE  dpos > 0 
SELECT response_key, element 
FROM   parse_list p
WHERE  item_num > 0
ORDER BY response_key, item_num;

If the input RESPONSE table had the following data:
643            eggs, milk, toast
872            cereal, juice, bagel, cream cheese

Then the output would look like this:
643           eggs
643           milk
643           toast
872           cereal
872           juice
872           bagel
872           cream cheese

If you're new to recursive SQL, take the time to read, try, and experiment with these first two uses. Part 3 of this series will describe a way to consolidate multiple overlapping time segments, using some specified order or precidence, to create a single consolidated time line. That sounds confusing and potentially worthless until you see the real world example.

Friday, December 11, 2009

The "Santa Intelligence" Team

Thanks to Henrik L Sørensen for his great commentary about the challenges that Santa must experience with regard to data quality (and inspiring this response).

It turns out, little did we all know until this year, that Santa does indeed have a Business Intelligence team.  This year, Disney released a new Christmas special called Prep and Landing that explains some of the secrets behind his success with 3 billion customers.  Warning - spoilers ahead.

In the story of Prep and Landing, we learn first that Santa has a tactical team of elves that go out to houses before he gets there to prepare them for his arrival.  They check to make sure everyone is asleep, stockings are prepared, etc.  Our main protagonist, Wayne, is aspiring for a promotion to Director of Naughty List Intelligence.  We never learn much about this highly prized position (except that Wayne is more suited to operations than analytics), but I think it's easy to speculate based on some of the things we see in the movie:
  • Create ideal flight plans for Santa to optimize his travel around the globe to hit all the Nice List houses most efficiently, taking into account not only the location of the houses, but also the weight of presents to deliver as many high-weight presents as early as possible.
  • Ensure quality of the Naughty List cross-checking names, addresses, family members, and friends to guarantee that no one is accidentally misidentified to be on the list.
  • Develop new Naughty/Nice algorithms and expand the base of behavioral information used in determining which children belong on which list.
  • Identifying "might also like" gifts to add as special surprises for children who reached the top decile on the Nice List.
  • Optimize the priority order for gift assembly throughout the year so that all of the presents of the predicted Nice List are ready by 12/24, without making too many gifts for kids who might end up jumping to the Naughty List as the last minute.
  • Create an optimal work-out schedule for Santa to ensure that he can burn off, over the course of the year, all the extra cookie and milk calories he consumes on Christmas Eve.
 I can only imagine what kind of data warehouse platform Santa must be running.  I'm sure his data center is there at the North Pole... to keep the cooling bills down to a minimum.

Fun with Recursive SQL (Part 1)

This blog post is derived from an article that I wrote as an article for Teradata Magazine about fun uses for recursive SQL.

Show of hands: how many of you have used recursive SQL in the past six months? For something useful?  Using ANSI syntax rather than Oracle's CONNECT BY?  If you said "yes" to all three, then congratulations!  You've won a prize: 15 minutes of your day back.  You can jump down to the bottom of this post and just review my fun example.  This first part is just an introduction to recursive SQL.

Introduction to Recursive SQL
First let's just review the definition for "recursion."  Handy-dandy dictionary here.... recourse... rectify.... recursion:
See: recursion
Not very helpful.  How about this instead.  A recursion is a method of solving problems in which the algorithm or function applies itself in the execution of the solution.  For instance, if I were to define a recursive algorithm for summarizing a list of numbers, I could say something like "the sum of a list is determined by adding the first number in the list to the sum of all the remaining numbers in the list; if the list is empty, then the sum is zero."  You notice how the definition of "sum of a list" includes the use of the "sum" operation, as in "sum of all the remaining numbers."  You'll also note that there's an explicit terminal condition to define what happens when you have nothing in the list.

So, recursive SQL works the same way.  A recursive SQL statement is one in which the determination of the result set requires the execution of SQL on the result set itself.  In recursive SQL, there has to be a seed statement, as well as the recursive call, and a termination condition.

WITH RECURSIVE [temp table] [column list] AS
  [seed statement]
  [recursive statement]

A Simple Example
For the sake of comfort, a simple example might involve an employee hierarchy table where each row had the employee ID and associated manager ID.  A standard query could retrieve a list of direct reports for a given employee ID:  SELECT * FROM employee WHERE mgr_id = 123.  A recursive query, however, could return the list of all employees anywhere under the hierarchy below that same manager.  An example that might be:
WITH RECURSIVE emp_hier (emp_id, mgr_id, level) AS
SELECT a.emp_id, a.mgr_id, 0 
FROM   employee a
WHERE  a.emp_id = 123
SELECT b.emp_id, b.mgr_id, c.level+1 
FROM   employee b,
       emp_hier c
WHERE  b.mgr_id = c.emp_id
SELECT e.emp_title, e.emp_id, e.mgr_id, h.level
FROM   employee e,
       emp_hier h
WHERE  e.emp_id = h.emp_id
  AND  e.emp_id <> 123;

In this query, the is a simple select that pulls the manager record for which we want all of the descendants.  The selects all records from the employee table where the employee is managed by someone already in the emp_hier temporary table – hence the join on employee.mgr_id and emp_hier.emp_id.

On to some more exciting examples...

Example 1: String Concatenation
Ever notice that there are no string aggregation functions in SQL?  Something like a SUM() for character fields that would take two parameters: the field or expression from each row and a delimiter.  Usually, programmers end up having to write code outside of the database to do that kind of string manipulation. Recursive SQL can achieve that kind of character aggregation quite nicely if used appropriately.

Consider a list of email addresses for several members of the same family.  The table has a FAMILY_ID, PERSON_ID, SEQuence number, and EMAIL_ADDR.  The desired output is a single row that contains a comma-separated list of all the email addresses for the entire family (for a mailing list perhaps).

WITH RECURSIVE email_list (coverage_id, emails, seq) AS
  SELECT m.family_id, m.email_addr, m.seq
  FROM   members m
  WHERE  m.seq = 1
  SELECT e.emails ||’, ’|| m.email_addr, m.seq
  FROM   email_list e, members m
  WHERE  m.seq = e.seq + 1
    AND  m.family_id = e.family_id
SELECT r.family_id, r.emails
FROM   email_list r
WHERE  r.seq = (SELECT MAX(h.seq) FROM email_list h 
                WHERE h.family_id = r.family_id 
                GROUP BY h.family_id)
Let's walk through the SQL piece by piece.  First, the seed statement starts the result set off by picking just the first person from each family (that is, the people with a SEQ=1).  That sequence number is admittedly a bit artifical and contrived to make the recursive SQL easier to write.  If you needed to, you could generate the SEQ on the fly using a RANK() of people within each family and then leverage the RANK() as a sequence number.  Hopefully the example isn't lost because of that bit of confusion.

The recursive portion of the SQL simply takes what was seeded in through the first query (i.e. people with SEQ=1) and joins that to all the records in the EMAIL_ADDR table with a SEQ = SEQ + 1 (from result set).  The EMAILS output field is created by concatenating the email address from the seed to record with the next highest sequence number.

Then that result set is taken and the same thing is done.  And so on.  And so forth.

The query terminates when the result set contains no records.  That is, when we exhaust all the SEQ numbers in the table.

The outer query refines the results down to exactly what we're looking for by keeping the final record and throwing away all the intermediate records that were used to built up the results.

Next Time
Look ahead for additional posts on how to use recursive SQLto parse a field with comma separated values and how to split and reconcile overlapping time segments into a single, flat time line.

Wednesday, December 9, 2009


If a man begins with certainties, he shall end in doubts;
But if he will be content to begin with doubts,
He shall end in certainties.
[Francis Bacon 1561-1626]

When I was learning to program in C and studying algorithms, the assert() assertion macro was one of my favorite debugging tools.  Assert can be used to validate that something isn't going wrong that could send your program into left field during the execution of some procedure.  For instance, a balanced binary search tree should never be more than log2(n) levels deep (or something similar to that based on the exact insertion algorithm), where n is the number of items in the tree.  After a new item is inserted in the tree, you can assert(tree.depth() == log2(tree.count())).  If that assertion fails, then you know the tree isn't staying balanced and the search performance guaranteed by a balanced tree isn't valid any more.

If that's too much computer science for you, hold on and see where this is going.  There's relevance to this idea beyond low-level programming and computer science theory.

I've been in many conversations with data warehouse sponsors that focused on the question of "how are you sure that the data in the warehouse loads correctly every night?"  One of the better ways I've found to approach this kind of data integrity assurance is to think about what kinds of assertions can be found throughout the batch ETL processes that I create.

For this example, suppose a somewhat traditional sort of ETL process that happens in the following steps:
  1. Copy or extract raw data from source system
  2. Detect changes from last pull
  3. Lookup surrogate keys and other translations
  4. Apply deletes (as soft-deletes with setting exp_date = current_date())
  5. Apply inserts
  6. Apply updates
Now, read the wikipedia article about assertions and test-driven development and think about what kinds of things you may want to be able to assert during each phase of the ETL.  (Excuse my sort of object oriented pseudo-code.)
  1. Clearly after the raw data is copied from the source system, you can test to see if you've captured all of those records in your own work area, or possibly perform a sort of checksum on the data to ensure the integrity of the copy.
    • assert(source.count() == mycopy.count())
    • assert(source.checksum() == mycopy.checksum())
    1. If you have to manually detect changes in the source data, rather than relying on source system audit information, there are other assertions that you can perform during that change detection process.
      • assert((source_previous.count() - source_current.count()) == changes.insert_count() - changes.delete_count())
      1. After looking up dimension or reference surrogate keys, you can check that you haven't lost any records and that all the reference surrogate keys have valid values (if that's appropriate for your implementation.)
        • assert(record.dimension_key > 0)
        • assert(input.count() == output.count())
        1. After deletes are applied, you should be able to test that the number of rows with a current last-updated audit timestamp and expiration date of today should match the number of rows marked for delete.
          • assert(changes.delete_count() == target.count(where last_update==today() and exp_dt==today())
          1. After the inserts are complete, the insert count can be measured using the inserted audit timestamp.
            • assert(changes.insert_count() == target.count(where  insert_audit==today)
          2. And the updates can be measured using the last-updated audit timestamp and an expiration date greater than today.
            • assert(changes.update_count() == target.count(where last_update==today() and exp_dt > today())
          3. We may also choose to assert that we aren't trying to update any records that the source systems says were deleted.
            • assert(target.count(where last_update==today() and exp_dt < today())
          Your ability to create reasonably easy to execute assertions at the batch level and at the individual record level  depends on how your data warehouse and ETL are designed, of course.  Whether working on an existing system or starting something new, look at data quality integrity assurance from the perspective of traditional asserts and see if that changes how you consider building those ever important data quality assurance jobs.

            Tuesday, December 8, 2009

            Whiteboard Fail

            This is a story about how the overuse of whiteboards can be a sign of something worse and create a risk to project success...

            I once worked with a project manager who was leading a particularly difficult data warehouse project.  The complexity came form both the number of different source systems that were sending information into all of the same data warehouse tables as well as the data marts that were used to answer specific questions based on some (but not all) of the source systems.  When I had my first meeting with this lead to talk about the state of affairs, he'd been on the project for nearly twelve months and explained that they were a couple of months behind schedule, but working lots of overtime to catch up.  I asked him to explain the nature of the data and the solution his team was building.  He spent ninety minutes diagramming on his white board.  Starting from a general overview of the solution and drilling down to various nuances and complexities of the data itself.  It was a superb series of diagrams.  He used color and gestures and inflection in his voice to emphasize various points.

            When he was wrapping up and erasing the last diagram, I asked him if there where any electronic copies of these diagrams he'd been reproducing so deftly on the board.  "No.  We've never gotten around to actually drawing those," he responded.  "We've been too busy with the implementation."

            Fourteen months later, when he burned out and left the company, there were still no electronic versions of those system diagrams and the project was more that six months behind schedule.  Within days of his appointment, the new project manager had electronic versions of those same images that had once only been in the former lead's head, and within two months the project was beginning to recover some lost ground.

            While everyone on the project team had in their heads the same general concept for the project, the fact that he never put them down electronically was an implicit way of not sharing with the rest of the team.  The rest of the team didn't feel a comfortable connection or trust with the fundamental designs they were working with and spent more time than necessary questioning their design and implementation decisions.  As soon as the diagrams were on paper and published throughout the project team's workspace, decisions became more obvious, team members were more confident, and work began taking less time.

            Obviously, I'm not saying that whiteboards aren't second only to whiteboard paint on the list of greatest inventions of all time.  Still, there are some ways to make whiteboarding more effective:
            • Purchase smart boards
            • Capture whiteboard contents with a digital cameras and an image clean-up tool like scanr
            • Use a tablet PC and overhead
            • Learn to diagram in Visio (or other tool), live during a meeting

            Monday, December 7, 2009

            Data Quality - A Family Affair

            Grandma's lesson about taking responsibility for data quality.

            When I was a young child, we spent every Thanksgiving with my paternal grandparents in Denver.  There are two particularly memorable things about those visits.  First, even into the late 1980's, my grandparents didn't own their own telephone.  They rented their phone from the telephone company.  It was the same rotary dial phone they'd had for years, hanging in their kitchen, with an extra long handset cord attached so they could stretch across the dining room or kitchen while still talking on the phone.  Second was the important lesson that I learned about doing dishes by hand.

            Doing dishes by hand is ideally a three person job: one to wash, one to rinse, and one to dry.  The lesson that my grandmother taught me about washing dishes was that the drier is the person accountable for making sure the dishes were clean when they went back into the cupboard.

            As data warehousing professionals, we spend a fair amount of time and energy arguing that data quality is something that has to be fixed up stream, by applications.  My grandmother would insist that sending the dishes back to the washer is not our only option.

            If a dish comes to the drier not quite clean, there are three options:
            • send the dish back to the washer to be cleaned again from the beginning with soap;
            • send the dish back to the rinser to have the mess rinsed off with some hot water; or
            • use a little extra effort and wipe off the mess with your dish cloth.

            Ideally the dishes come to us clean and ready to dry.  It's a lot less work to dry off some steaming droplets of water and put a nice clean warm dish away in the cupboard than it is to notice that little bit of bread from the stuffing that didn't quite get cleaned and have to use the tip of your fingernail through a dish cloth to get the crumb off.

            What are the downsides of sending the dish back through to be rewashed from the beginning:
            • the washer has to stop in middle of scrubbing that big pan to rewash the plate;
            • the plate takes longer, overall, to be rewashed, rerinsed, and redried;
            • both the washer and rinser have to redo work.
            Perhaps the same is true in terms of data quality.  If a transaction moves from system to system and doesn't come out the other end quite exactly clean, because some of those business processes in the middle aren't quite exactly flawless, is it always the best choice to go back to the beginning to find just where things went wrong and correct them there?

            I'm not suggesting that any application is allowed to be intentionally lazy about data quality, or should not correct issues that are identified.  Rather, I'm suggesting that we make sure we all continue to see data quality as our responsibility and not merely blame up stream systems when there is something that could be done at various points in the chain to ensure quality information is used for decision making.

            Sunday, December 6, 2009

            The Unix Philosophy Applied

            Applied to Application Glut...
            I've always been a big fan of the Unix Philosophy.  It describes a lot of my thinking about systems development and enterprise architecture.  There are times that I wish that application implementers could see the same principle in their own work, too.  Here's an example:

            Company X does a major ERP implementation, standing up an enterprise wide GL, AP, HR, PR, and SCM solution.  A couple years later, the philanthropy group decides that they need a consolidated view of the several different applications that are being used to manage prospective donors.  They decide to buy a finance package from the same vendor that supports their prospect management solutions.  When questioned about why they didn't just use the enterprise ERP solution, they explained that it was "too complicated.  We'd have to set up a new company and a new account structure to support the philanthropy work."
            It's too hard to take my clothes downstairs, figure out how to use the washing machine, wash them, dry them, fold them, and put them away.  I'll just go to the store and buy some new clothes.
            ESR, in his description of the Unix Philosophy, declares the Rules of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do [link].

            I think the same applies to an enterprise environment: build or buy new software only when it is clear by demonstration that no existing application will do.

            Company Y has an central, enterprise-wide, standard ERP system.  It also has a separate budget application.  Separate applications for AR.  Separate applications for capital management and capital budgeting.  Separate applications for contract management.  And lots of duct tape and glue holding all of these things together.  Sometimes the glue has to be held on with duct tape because one system is made of a material that glue won't stick to very well.  I'm stretching the analogy.

            I think the Unix Philosophy applies in these situations, though; or maybe it's the MacGyver philosophy where you take a long a basic set of tools and then work with whatever you have on hand.

            Applied to Data Warehouses...
            Dan Linstedt wrote a blog post a couple of years ago that describes one way that this applies to data warehousing as well: Federated Star Schemas Going Super-Nova.  I've used Dan's argument more than once to describe the risks of using a strict dimensional model for a transaction grain enterprise data warehouse.  The end result is a data warehouse in which stars only loosely relate to each other at high levels and the same concepts get duplicated inappropriately.

            I think that this philosophy is especially relevant during tough economic times.  More often than we think, we can use what we have, even if it isn't tailored exactly to what we would ideally like to have.  If you can use a little bit of customization to make things work, that's better, in the whole, than buying or building a whole new system that will require lots of glue to integrate into your enterprise suite of solutions.

            Friday, December 4, 2009

            PragProWriMo Reflection

            My Gratitude:
            I want to thank @praglife and The Pragmatic Bookshelf for the spur to participate in NaNoWriMo this past month.  I've always wanted to write a technical book and now I can certainly say that I have, whether I ever publish it or not.  So, "thank you."

            The bigger thanks goes to my wife, a 4-year veteran NaNoWriMo writer.  It was fun to write together this year rather than feel like I was merely in a support role while she wrote a novel over the course of a month.  We really enjoyed writing together, and she provide me with great encouragement.  "Thank you!"

            My Experience:
            I'm not sure what goals other PragProWriMo writers set for themselves this past month, but I (perhaps ambitiously) followed the NaNoWriMo goal of 50,000 words.  In our house, the goal was 2,000 words per day with one day off per week.  That still keeps you ahead of the standard 1,667 word per day pace, even with the day off.  Our house is also occupied by two wonderful, but often time consuming toddlers.  So, writing took place mostly after 9:00 PM or maybe a quick couple hundred words during the day on the weekend.  2,000 words per night, for me, was about two hours per night.  Depending on your own sleep habits, experience with small children that get up early, and penchant for sympathy, respond to that description of the time commitment as you see fit.

            I did start my endeavor with some topics that I wanted to cover within the scope of The Practical Data Warehouse.  I had 47 ideas that needed to be covered throughout the book, and felt like that was probably a dozen or so short, but I could pick those up along the way.  So, I started writing at midnight on November 1st.

            Despite the fact that I tend to start most of what I write with the introduction and executive summary sections, to lay the ground work in my head for what the rest of the paper should convey, PragProWriMo started right in with the first chapter of true techncial content.  Things moved fairly quickly at first.  Most of what I wanted to write was already in my head, just waiting to transfer via keyboard.

            As I started writing, various non-sequitor ideas would float through my head, too.  All of those went into a backlog list or as a note in some other skeleton section somewhere else in the book.  I wrote just enough in those notes to make sure I could remember what the topic really was supposed to be about.

            That said, I certainly didn't compose linearly.  I spend the first couple of days in section 2, then jumped to section 5, back to 3, then off of section 1 for a little while.

            Most of the energy to keep writing came from the experiences my wife has had writing novels for NaNoWriMo in the past.   Sometimes, other tools were necessary, too: chips and very hot salsa, candy, Starbucks.  Things like that.

            The internal motivation, though, came from the fact that I have accumulated a lot of thoughts and ideas about data warehousing, in particular, throughout my career; and I thought there was something more to be said beyond the books and the blog posts that I read.  My reading and experience tells me that many data warehouse projects struggle because people don't know what works and what doesn't.  That comes in primarily two different types: either people are technically unqualified to elegantly address the challenges of data warehouse implementation or they don't understand the purpose and potential value of data warehouse solutions.  So, I'm hoping that what I have to say can fill a gap there and bring people from either of those extremes toward a better understanding of data warehousing.

            The Challenges:
            My biggest writing challenge has always been coming up with an idea that I think someone else will think is worth writing.  I've written a few magazine articles before, but only when I thought I truly had a novel idea that I wanted to share with the world.  I'd have written more, but that's a pretty high standard, I think.  So, in the course of writing the first draft of The Practial Data Warehouse, I had to fight that demon that kept telling me what I had to say wasn't worth reading.

            The other major challenge was exhaustion (see previous section about Toddlers that get up early).  There were several nights where I nodded off at the keyboard, and woke to find a few garbled sentences and 1,652 letter "P"s in the middle of my document.

            The Next Steps:
            My plan from here out is to rest on The Practical Data Warehouse for another couple of days, and then start revising.  One topic a night for an hour.  I think that most nights, I'll end up with a blog post that is related to the topic in the book.

            As I go through topic by topic, I expect that I'll need to do a few more experiments to create examples for some of the content ideas.  That should be fun.  As a current PHB, I don't get to write nearly as much code as I used to.

            In a month or two... off to a publisher!!

            Wednesday, December 2, 2009

            The Agility of Touch-It / Take-It

            A parable about the agility that "Touch It, Take It" adds to data warehousing; and the extra work that a misuse of "You Aren't Going to Need It" creates.

            Once, there was a great chief called Yagni.  Chief Yagni's village was very prosperous and had grown much during his rule.  Eventually, Chief Yagni decided that it was time for the village to have a new gathering space as the old one had been well out grown.  So, Chief Yagni recruited 2 strong men, Gwyn and Titi, to bring stones from the quarry to the village center so that a group of builders could stack them into the new gathering space.

            Gwyn and Titi both arrived on Monday morning to receive their directions from Chief Yagni.  Yagni reviewed the building plans and told Gwyn and Titi that he needed 20 large flat square stones for the base of the building.  Gwyn and Titi took their push carts down to the quarry, gathered rocks, and returned to the village center.  They emptied their rocks together in piles for Chief Yagni's review.  Gwyn's pile had exactly 10 flat square stones.  Titi's had 10 flat square stones and 3 smaller angled stones.

            "Why are these here?" asked Yagni.

            "I had to pick them up off of the flat stones in the quarry," replied Titi, "so I thought I would just bring them along in case there was a use for them."

            "Get rid of them!" shouted Yagni angrily.  "You've wasted time in gathering those worthless rocks I did not ask you to collect.  We need 10 tall narrow stones for the doorway.  Go back to the quarry and bring me those.  Only those, Titi!  When you see the other rocks, tell yourself that You Aren't Going to Need It."

            Gwyn smiled at Titi's scolding, feeling proud that he'd followed the chief's directions so precisely.  Titi still believed that the angled stones might eventually come in handy, though.  Gwyn and Titi began pushing their carts back to the quarry.  Gwyn's light and empty.  Titi's partly full of unwanted rocks.

            Frustrated, not wanting to push the angled rocks all the way back to the quarry, Titi dumped the extra rocks in a small pile just outside of chief Yagni's sight.

            Gwyn and Titi gathered the tall narrow stones the chief asked for.  Titi, again, had to clear angled rocks from on top of the narrow stones, and added them to his cart.  This time, Titi paused and dropped the extra angled stones on his pile just outside of Chief Yagni's sight.

            Gwyn and Titi returned to the chief with their carts full of only tall narrow stones and the chief was pleased with them both.  This continued for several more trips until the new meeting place was nearly complete.  Gwyn following directions exactly and Titi always bring back more than asked, but leaving the extra angled rocks on his pile.  By this time, Titi has accumulated a large enough pile of angled stones to fill his entire cart.

            On their last delivery to Chief Yagni, the chief looked over the plans again and stroked his chin in thought.  "Gwyn, Titi," he said.  "I need you to bring some angled stones for the roof.  Like those that you brought back on the first trip, Titi.  Go back to the quarry and bring me two carts full of those."

            Gwyn and Titi hurried back toward the quarry.  Gwyn went to the quarry and began collecting his cart full of angled rocks, but Titi had the large pile he had been accumulating during his other trips.  He stopped just outside of the chief's sight, collected all of his angled rocks into his cart, and returned to the chief well before Gwyn had even loaded half of his own rocks.

            "Titi.  How did you gather these rocks so quickly, when Gwyn hasn't returned yet?"

            Titi explained to his chief that when he had to pick up a stone anyway, he decided that if he had to Touch It, then he should Take It.

            The chief was please with Titi's foresight and promoted him to lead rock collector.