Sunday, November 29, 2009

#PragProWriMo Complete!

In my #PragProWriMo writing marathon, I've completed most of what I set out to say as well as a few more topics. After 50,058 words (around 200 page), I know there are still some significant holes in The Practical Data Warehouse.; Research to do. Annotations and references to add. Several sections to completely scrap.  Many sections to rework.

I think it does have a good solid outline, though; Something that I can work from now and make a go of.

Consume: The 4th "C"

A data warehouse, as with other technology solutions, is a partnership of those who build it and those who use it.  Both have a responsibility for the success of the data warehouse.  Therefore, consume is the fourth "C" of data warehousing.

If the sound of a falling tree occurs in the woods and no one is there, did the tree fall?

The data warehouse is fundamentally a communication tool.  The information that it conveys through its data and the representation it provides of business activities can be incredibly valuable.  All the work that goes into collecting, correlating, and conveying information in the data warehouse is wasted if the right decision makers are not consuming the information.  Consumption is not only about the technical ability for some abstract decision maker to use the data warehouse for analysis.  Rather, it is about ensuing that each enhancement to the data warehouse helps to drive adoption and increase both strategic and operational use of the system.  As users have a positive experience that aligns to their immediate business need, they're more likely to come back to the data warehouse with their next question, rather than striking off on a journey to collect information from various systems, correlate that data together, and then convey that information to decision makers and senior leaders.

#PragProWriMo word count: 50,058 - SUCCESS!!

Saturday, November 28, 2009

Convey: The 3rd "C"

If a tree falls in the woods and no one is there, does it make a sound?

One of the primary responsibilities of the data warehouse is to take information that would otherwise not be available and deliver it to decision makers and analysts who can do something valuable with the information and make the world a better place.  Perhaps you might think that's a bit of a stretch, or perhaps I'm being overly optimistic.  However, my position is that we are more likely to make decisions that move society forward if we have more knowledge about the context in which we're making those decisions.  So, even if we don't always use better information to make better decisions, we couldn't even have the opportunity to try to make decisions if not for tools like the data warehouse.

Effectively conveying information is a factor of the format of the information, performance of the system, the accuracy of the content, and the ease of use with tools that will be used to access the information.  Most importantly, the data warehouse has to be able convey to users, in a natural way, how to find the information they need and what aspects of the physical implementation impact how the system can and can't be used.

#PragProWriMo word count: 48,061

Thursday, November 26, 2009

Correlate: The 2nd "C"

Collecting information from a variety of sources is powerful in and of itself, but having an integrated collection of data makes analysis across business processes, subject areas, and source systems many times more efficient.  The data warehouse must provide an access layer that allows users to easily and naturally merge together pieces of business data that are naturally related in the business model.  There are some aspects of correlation that are business driven and the data warehouse will be a victim or beneficiary of.  For instance, the consistent assignment of employee numbers across applications or an enforced uniqueness of cost centers across multiple financial systems or the corporate data governance decision that only one systems can be considered the authoritative source for any particular data element. 

Other correlation activities are ones that can be supported directly within the data warehouse environment:
  • Code set translations ensure that a user can reference any data source using the approved corporate standard code set for some particular attribute.
  • Cross-reference translates that allow two different identifiers for the same particular entity to be combined together.
  • Transformation of one representation of a particular identifier into a matching representation somewhere else.  For instance, if one system uses a 15 digit cost center that includes accounting unit, account, and sub account, the data warehouse should take that 15 digit string, break it into components, and match those against the information from the finance system.

Having a data warehouse to preexecute these kinds of correlation activities once and store the results for convenient use in tens of thousands of queries every month or even every day is a huge benefit to analysts through clear communication and improved performance.

#PragProWriMo word count: 46,141

Tuesday, November 24, 2009

Collect: The 1st "C"

The 4 C's of Practical Data Warehousing take a very open approach to describing what a data warehouse is responsible for doing.  Perhaps, the function that can be interpretted most broadly from a technical perspective is the first C: Collect.

The first job of a data warehouse is to bring together, in one business tool, the information that an analyst, knowledge worker, or decision maker needs to understand business operations.  Whether that collection of information is a literal copy between data stores, a federation of databases, or a more complicated transformation of transactional data into a dimensional model is merely an implementation detail.  Some of those choices are more or less practical depending on the underlying systems and data (which we'll discuss later), but to be of any value, the data warehouse has to do th collecting of information together so that users don't have to.  One of the key benefits of data warehousing is the ability of an analyst to have a one-stop-shop for most of the information they need to do their analysis.  In analytically immature organizations, analyst will typically spend 80% of their time collecting data and putting it into some kind of local data store (which might anything from flat files to MS Access to small databases) and only 20% of their time doing analysis on that data.  One of the goals of the data warehouse is to flip that ratio so that knowledge workers are able to spend 80% of their time analyzing business operations and only 20% of their time retrieving data from as few different sources as necessary.

When various analysts from different departments (marketing, strategic planning, sales, finance, etc) all ask the same people for the same data on a continual basis, it also prevents the application teams from having time to make improvements or plan upgrades to the applications themselves.  There are still organizations that have multiple staff members dedicated to the work of fulfilling data extract requests to support internal analytical needs.  A data warehouse that collects the information together once, into a common enterprise model of business activities, satisfies all of those individual departments with one extract from each source system and a consolidated environment in which to do their analysis.

Saturday, November 21, 2009

The 4 C's of a Data Warehouse

Anyone shopping for an engagement ring is familiar with the 4 C's of diamond quality: cut, clarity, color, and carats.  While there are other things that make one diamond better or worse than another, these are the most commonly used.

So, what are the 4 C's of data warehouse quality:
  • Collect
  • Correlate
  • Convey
  • Consume
These are the 4 things that a data warehouse has to be most effective at to achieve success.  Over the next four posts, I'll describe what each of these represents and ways that you can measure the quality of your own data warehouse implementation against these.

#PragProWriMo word count: 36,152

Wednesday, November 18, 2009

Analysis of Algorithms for Data Warehousing

This topic was covered briefly in a previous post, but understanding algorithms in SQL can be very powerful.  I hope to shed some additional light on this concept with this post.

Contrary to what many programmers will tell you, SQL can be a very powerful language.  Like any language, it has it's limitations and idiosyncrasies, and despite what database vendors might tell you, the database doesn't always know the best way to execute a particular query.  Database query optimizers are limited by their imperfect knowledge of what a user wants via SQL and the imperfect statistics that it has about the underlying data.

One option is to use a hinting feature that some database management systems provide.  Hinting gives the SQL developer a means to instruct the database what order to perform joins or which indexes to use or not use.  It can be powerful, but it also depends greatly on the demographics of the underlying data.  If the data changes dramatically, then the same hints may no longer provide the optimal execution plan.

Another way to improve SQL performance is to reconsider the algorithm you write as a developer.  Like any programming language, SQL typically provides you more than one way to solve any particular problem.  A developer's ability to choose the most optimal implementation depends on knowledge of the language and experience with mathematical techniques like Big-O notation.  Data warehouses can be particularly challenging environments to do this type of analysis of algorithms because the queries being posed can be very complex.  Unraveling the business logic and reimplementing it with another SQL algorithm can become very complicated.

Consider the following typical example, though.  You have a transaction table that records all of he individual states that a particular patient may go through as they stay their course in a hospital:


A typical question would be "What is the duration, start, and end time for each patient's visit?"  You might choose to write this by joining the event table to itself, on the one side filtering for "admit" event types and on the other side filtering for "discharge" event types.

  a.patient_id, a.visit_id,
  a.event_ts, d.event_ts, d.event_ts - a.event_ts
  event a,
  event b
  a.patient_id = b.patient_id AND
  a.visit_id = b.visit_id AND
  a.event_type = 'admit' AND
  d.event_type = 'discharge'

This SQL is a straight forward algorithm that says I want to match up all the admit records with their appropriate discharge records and return the matched records.  Assuming that most of the records are either admit or discharge, the database is likely to do a full table scan and filter for the admits, and use an index lookup to retreive all the corresponding discharge rows using the patient_id / visit_id primary key.  The Big-O notation for that would be something like O( n log n ).  That's O( n ) for the table scan times O( log n ) for the index tree search to find the matching record.

A completely different SQL approach to solving the same problem might be to aggregate the admin and discharge rows together in a single pass through the table and then compute the results from that.  That algorithm might be written something like this:

  patient_id, visit_id,
  MIN(CASE WHEN (event_type='admin') THEN event_ts END) as admit_ts,
  MIN(CASE WHEN (event_type='discharge') THEN event_ts END) as discharge_ts,
  discharge_ts - admit_ts as duration
  patient_id, visit_id

This query will scan through the table once and build the patient_id/visit_id groups in stream along the way.  Computing these results is an O( n ) operation -- significantly more efficient than the common O( n log n ) algorithm.

While SQL is not a particular robust or flexible language, there are still many inherent opportunities to write either very efficient or very inefficient code.

Some powerful tools of the trade to learn in SQL optimization include:
  • Single-pass alternatives (as in the example)
  • Recursive SQL (e.g. with recursive or connect by)
  • Analytical functions and options (e.g. aggregates with their own partition by / order by)
  • Multi-grouping extensions (e.g. cube and rollup)
  • How these are affected by different database technologies (MPP, columnar, associative)

#pragprowrimo word count: 30,137

Sunday, November 15, 2009

What is The Practical Data Warehouse?

There are other books about data warehousing, many written by individuals with very insightful and wise techniques that can be used to build data warehouses.  Why did I start working on a new book?  Many data warehouses have been built with a strict methodical approach in mind.  "Follow these steps and we'll achieve success."  In my experience, the result of those data warehouse initiatives may meet initial user expectations but often fail to be able to support long term growth, expansion, and maintenance.

I believe that one of the key gaps in these implementation is the creativity required by an engineering design approach.  This book is designed for thoughtful and creative technicians who are looking to apply engineering principles and creative solutions to data warehousing challenges and deliver solutions that achieve short term success as well as long term sustainability.  This book is about principles and practices.

#pragprowrimo word count: 24,908

Friday, November 13, 2009

Avoiding Self-Joins on Large Tables

With data warehousing solutions, I've run into the case of needing to do self-joins on large tables, a surprisingly large number of times.  A couple of examples might include something like:

  • Customer Addresses Table
    • One row in the table per address
    • Addresses are categorized by a type of home, work, or other
    • Question is to know how many of your customers work and live in different zip codes
    • If either address is not present, then include the record
The first solution that probably comes to mind is something like the following join between all of the home addresses and all of the work addresses.

  COALESCE(h.customer_id, w.customer_id) AS customer_id, AS home_zip, AS work_zip
  customer_addr h LEFT OUTER JOIN
  customer_addr w ON 
    h.customer_id = w.customer_id AND
    h.addr_type = 'home' AND
    w.addr_type = 'work'


Another, more efficient way to execute the same query in a single pass of the database.

  MIN( CASE( WHEN ADDR_TYPE_CD = 'h' THEN zip END)) AS home_zip,
  MIN( CASE( WHEN ADDR_TYPE_CD = 'w' THEN zip END)) AS work_zip
FROM cust_addr
GROUP BY customer_id;

In the new code, the query will perform a single set of operations against the table once.

#pragprowrimo word count: 18,002

Tuesday, November 10, 2009

Two Dimensional Time

One of the trickier data warehousing situations I've encountered is the need to track changes in a time line, over time.  The prototypical situation being something like contract terms, which can be renegotiated and corrected.  The contract provides a specific set of terms that are valid for particular time ranges.  The collection of those constitute a particular time line.  The desire is to track the changes to that time line, over time.  This satisfies an audit point as well provide for the analysis of "if we did a better job understanding and enforcing our contracts, what impact would that have on the bottom line?"

This concept of tracking changes to a time line, over time, introduces what you might think about as a two dimensional view of time itself.  There is what I would refer to as observed time (the movement of time itself) and perceived time (what particular points in time look like given some point in observed time).

You can imagine a picture that looks like this:

Perceived Time
   9 | C.CA..AB.
   8 | A.....AB.
   7 | A.AB.....
   6 | A.AB.....
   5 | A........
     +----------->  Observed Time

The horizontal access represents a view into time.  Movement along the vertical access represents a movement in time and potential restatement of history.

It's a fascinating way of looking at time two dimensionally.  I suppose that their could be additional dimensionality to this model as well.  In two dimensions it can describe "what I think happened."  With a third dimension, it could represent "what I think I thought happened."  And so on.

I'm sure a deeper study of philosophy and mathematics could give me a stronger academic and theoretical background for this conversation, but in the meantime, I think that modeling a two dimensional perception of time lines is sufficient for most uses:


And the round-up for today...
#pragprowrimo word count: 16,063

Monday, November 9, 2009

The PICA code

I'm a member of the IS Board of Advisors at the University of Missouri St. Louis.  One of the things that this university's IS program struggles with (as do most) is how to graduate better analysts.  I've always seen strong analytical skills as a personality trait rather than something that can be learned.  As I recall my training as a consultant, though, it occurs to me that there were several lessons about how to be a good analyst, how to ask open ended interview questions, how to perform experiments and data analysis to validate assumptions, etc.

The PICA code was a great example of poor analysis.  The project was to add a new field to the pharmacy claim table.  We went and talked with the team that was adding this field to the claim processing system to understand how it was really going to be used and what it really was.  The answer we got was "it's called the PICA code because it can have only one of 4 values... P, I, C, and A.  There was no harm done when we discovered that there were also values of N in that column, though there were lots of jokes about needing the change the column name every time a new value was discovered.

In this case, the weak analysis didn't cause much harm, but there are certainly other cases where poor root assumptions can lead to poor decision making downstream.  I hope that other universities catch up with what UMSL hopes to do in graduating students who are stronger analysts.

Word Count: 14,015

Sunday, November 8, 2009

Believing What You're Told

One of the biggest challenges I've experienced in data warehousing is that analysis has to take place from three sometimes very divergent and contradictory perspectives:
  • How does your user perceive the nature of the business?
  • How do users of the source systems perceive their own activities?
  • How do source systems record the activities of their users?
I've run into many times where there are various degrees of disagreement between those three perspectives.  It's the challenge of a good analyst to get to the bottom of these discrepancies, understand why perceptions are what they are, and be able to believe that there is complete truth in each of them... even if the source system only says one thing.

Sometimes even harder than the philosophical challenge of reconciling those divergent perspectives is the act of reconciling them into a common model of understanding in the form of a reporting system or data warehouse.  Woe to any architect or modeler who takes one of the stories as gospel and disregards the others.

#pragprowrimo word count: 12,005

Friday, November 6, 2009

Null Valued Fields in a Data Warehouse

Whether or not null-valued fields should be allowed in a relational database is a classical debate.  In data warehousing, null-valued fields should be prohibited as a rule.  Not because of the open / closed world assumption in relational controversies but because of what that implies for business applications and how that impacts the usability of a data warehouse.

As Codd states, the value of NULL by itself lacks sufficient specificity of meaning.  Does it imply that a field value is simply missing for a particular record, but might yet take on value or might have previously had value; or does it imply that the field value is not applicable for that record and never will be.  (This lack of clarity creates that open/closed debate.)  For the sake of clarity in a data warehouse, where data values are expected to provide a clear, complete, and accurate representation of business information (closed world), allowing null values create inappropriate ambiguity.

Four different types of default values should be considered in place of a null value:

  • A logically accurate value
  • Well-defined "unknown" (Codd's A-value)
  • Well-defined "not applicable" (Codd's I-value)
  • Source specific default

The biggest challenges in applying this rule will be to understand the underlying business operations and make a determination of what will be the most appropriate and most intuitive values for users to interact with.  Be very careful to consider what impact those choices will have on how users will access information in the data warehouse.  If it makes simple questions more complciated to answer, then it is probably a poor choice.

Word Count: 9,401

Thursday, November 5, 2009

Macro Data Modeling

One of the challenges with data warehousing is building the right data model.  There are trade-offs between normalization and denormalization, usability of dimensional models versus transactional 3NF models, data marts, and complications of slowly changing dimensions and auditability.

One technique that I think helps build data warehouse models that are intuitive, easy to populate and tune, and also extensible as new information requirements emerge is something I refer to as "macro data modeling."

Like the contrast between macroeconomics and microeconomics, macro data modeling examines an entity and its relationships from the perspective of the whole enterprise rather from the perspective of an individual business process.  Think of it as modeling something from a customer's perspective rather than the perspective of someone in the finance department.  Or modeling something from an enterprise perspective rather than a departmentmal perspective.

One example of this thought process might be the tracking of packages by the USPS.  On the one hand, the system that is used for scanning packages at each handoff point in the shipping chain might store SCAN transactions.  From a micro-perspective, this makes sense because the SCAN is what is actually being done in each location.  The macro-perspective of this same operation is a model that represents PACKAGE_DELIVERY.  The PACKAGE_DELIVERY record has the current status and location of a given package, key measure about how long it has spent in various statuses, key dates and times, etc; and also has a series of history records that imply every one of the SCANs that happened along the way.  That macro data model will be easier to use for most analysis purposes partly because it is focussed around a concept that will exist forever (a PACKAGE_DELIVERY is still a valid concept even after delivery is complete) rather a purely transient transaction (SCAN) that only exists at a point in time.

Another way to think about that particular example is through the metaphor of a state machine.  More on that in my #pragprowrimo (NaNoWriMo) endeavour.

Word Count: 7704

Wednesday, November 4, 2009

Data Model Extensibility

One of the major benefits that creating abstract tiers to a data model provides is that ability to then change the physical representation of the data independently from down-stream applications that are consuming that same data and upstream applications that are delivering the data.

Rules of the road:
  • Users never access base tables directly.  They always go through synonyms or views even if those views are simply "select *" views.
  • ETL always uses staging tables that are source-specific.  Use default column valeues.
  • Normalize base models, but not too excess.  Normalize them to a sound degree that aligns somewhere between the shource system and the downstream data.

Word Count :5,201

Tuesday, November 3, 2009

Layers of Data Warehouse Data Architecture

A very effective way to develop the data architecture for a data warehouse is to think about the situation from four different angles:

Data Storage - This layer is the actual physical data model for base data warehouse tables.  The purpose of this model is to provide a clear and concise representation of the entities, attributes, and relationships present in the data warehouse.

Data Presentation - This layer of the data architecture is accessed by users or user tools that provide data access.  It may be as simple as a series of synonyms or "select *" views, but it provides a strict layer of abstraction between data presentation and physical data representation.  This abstraction layer, decoupling the presentation of data from the underlying storage of data, allows for changes to made independently on either side of that boundary.

Data Staging - Like the data presentation layer, a data staging layer creates an abstraction boundary between the processes that deliver data to the data warehouse and the way in which data is represented internally within the data storage layer.  This separation allows for new and expanded data sources to feed information to the data warehouse with necessarily impacting existing data sources on the same schedule.  It helps avoid the "if I add that new source, then I'll have to rewrite another 20 ETL jobs at the same time" situation.

Data Marts - The final layer is the business-problem-centric data mart.  Arguably, data marts can be considered solutions that are independent of a more comprehensive data warehouse architecture, but I've included it here for the sake of contrast.  In this model, a data mart is a grouping of entities and business rules that represent the information necessary to address a particular and well defined business problem.  In many situations, a data mart will be a subset of data in a data warehouse that is filtered and interpreted through specific business rules.  In some situations, data marts can be represented entirely through a logical abstraction layer (rather than through additional physical tables and the duplication of data).  Data marts created without significantly duplicating underlying information are referred to as virtual data marts.

Word Count: 4882

Monday, November 2, 2009

Context Diagrams

Admittedly, I'm a whiteboard and diagramming addict.  I'm moving between office buildings after hour tomorrow.  For the move, I've marked 5 individual whiteboards to be delivered to my office.  While I love whiteboards and the ability to capture content using the 3 MP digital camera in my phone, I've been working for the past year to train myself to diagram electronically in real-time during meetings; to use MS Visio as a live design tool during meetings.  (It has worked effectively a few times.)

In the early phases of analysis and solution design, two of the most valuable pieces of documentation are a current state system context diagram and a future state system context diagram.  Some of the critical decision points in documenting systems analysis results and design plans include:
  • level of detail
  • accuracy assurance
  • shapes, colors, and sizes of boxes
  • labels
  • relationships
  • how many to times to ask "and then what?"
  • keep the diagram simple
As you begin to get into the weeds of detailed analysis, there can be a tendency to increase the level of detail in one small area of an existing diagram.  That's why it can be very valuable to layout a diagramming and documentation plan ahead of time that includes the clearly stated goal and objectives for each piece of documentation.

Diagramming is easy to do.  Documenting business systems context diagrams is hard to get right.

Word count: 3,290

Sunday, November 1, 2009

And Then What?

The 5-Whys principle made famous by Sakichi Toyoda and its application at Toyota has a very handy parallel in data warehousing analysis: "and then what?"  There hasn't been any formal statistical or mathematical analysis to determine how many times "and then what" typically needs to be asked before truly understanding a complete business process, but 5 is probably a good starting point here, too.  One of the key metaphors in data warehousing analysis is the Information Supply Chain.  In order to get a complete, end-to-end view of that supply chain, you have to keep asking the knowledge workers in that supply chain "and then what?"

Word Count: 777