Showing posts with label PragProWriMo.

Tuesday, November 2, 2010

Information Portfolio Components

My model for the Information Portfolio includes four components:
  • People
  • Applications
  • Processes
  • Data



People represent the individuals or teams that use data to execute business processes.

Applications are either traditional end-user applications or integration solutions that move data through a process, between processes, or to/from people.

Processes are the business activities that leverage data to fulfill the operational objectives of the business.

Data is the cornerstone of the Information Portfolio.  It is the stuff that moves through a process, between people and applications, to act as fuel in the execution of business services.

The Information Portfolio is a knowledge base or collection of metadata that links together these four concepts in meaningful ways that transform bare data, through context and meaning, into information that can be used in the delivery of business services. [Wisdom Hierarchy]
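As a rough illustration (not from the original model), the linkage among these four components might be sketched as a tiny metadata structure; every name below (claims_analyst, claim_intake, and so on) is a hypothetical example.

```python
# A sketch of an Information Portfolio as metadata linking the four
# components.  All of the component names here are invented examples.

portfolio = {
    "people": {"claims_analyst": {"uses": ["claim_fact"]}},
    "applications": {"claim_intake": {"produces": ["claim_fact"]}},
    "processes": {"claim_adjudication": {"reads": ["claim_fact"],
                                         "performed_by": ["claims_analyst"]}},
    "data": {"claim_fact": {"description": "one row per adjudicated claim"}},
}

def consumers_of(data_element):
    """Return the people linked to a data element through 'uses' metadata."""
    return [person for person, meta in portfolio["people"].items()
            if data_element in meta["uses"]]

print(consumers_of("claim_fact"))  # ['claims_analyst']
```

The point of the sketch is only that the portfolio is a set of typed links between the four components, which is what turns bare data into information with context.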

More on each of these components and how they relate to each other in the Information Portfolio as the month continues...

    Thursday, May 27, 2010

    Reading List

    Last week, I got to hear John Ladley present on Master Data Management.  I'm excited to get his new book, Enterprise Information Management, which publishes tomorrow!  I followed up with an email conversation with him and suddenly increased my reading list by about 10 new books he suggested I pick up and read -- and I thought my shelf was pretty well stocked!

    On Tuesday, I walked into my office to find a present for me on my desk.  It was Business Intelligence, a newly published book by local UMSL professor Rajiv Sabherwal.  I know Rajiv through my membership on the UMSL IS Board of Advisors, and it turns out that his neighbor works with me, too.

    Speaking of books, I carry around a printed out draft of my PragProWriMo book from last November.  It's a good reminder about the importance of communication.  I originally had the goal of revising a couple of chapters and submitting them in January, but I've adjusted that to be revise during PragProWriMo'11 and submit after that.

    Friday, January 15, 2010

    The Power of Slow

    I spent part of my work day today doing mid-year reviews with my staff.  For us, it's merely a mile marker in the year, not any large event.  I meet with each member of my staff on a weekly basis formally as well as in ad hoc drive-bys throughout the week.  It works well to maintain that consistent and continuous communication.


    During one of these 1-on-1 meetings today, one of my architects was voicing some frustration about the attitude a few project leads were taking with him.  They were pushing the team to just hurry up and build something -- whatever they could do quickly that worked.  He said he had a hard time describing to these project leads how a touch more patience, and even a different attitude about the definition of progress, would help them see the risk in their approach.

    I mentioned to him that I've been slowly reading Carl Honore's The Power of Slow since last summer.  Only in that moment did I realize there was a way of expressing "slow" to addictively "fast" people.  Slow isn't about taking more time to get things done.  Slow is about the internal pace at which things are done.  From a software architecture perspective, slow doesn't mean spending months in design before anything is built; it means taking the time, as you write each class and method, to think through how best to write that particular class or method.  It isn't about wastefully over-engineering the solution to be something that will "be scalable into the future for unknown other uses" while the immediate problem is left unsolved.  It's about being methodical and intentional in each action we take.  That only creates the appearance of slowness, because the process becomes more efficient and there's less waste moving up and down the rate-of-progress curve.




    The "Slow" line represents a steady and consistent stream of work.
    You can think of the dips in the "Fast" line as any number of things:
    • Hurry up, then wait
    • Work fast, then go back to fix your mistakes
    To reach the 100% mark at the end of the chart, both approaches may get there in the same amount of time, but the "Fast" approach consumed a lot more effort -- effort being the length of the line itself.


    One of the fears that "fast" people have, especially "fast" managers, is that "slow" implies developing ultra-sophisticated solutions, most of which will never actually be used; over-engineering; or architecture creep.  For some "slow" people it probably does, but the right kind of "slow" is merely a thoughtful, methodical implementation that gives everyone enough time to make sure they're doing the next step in the project correctly.  It may feel slow, but that's only because it's efficient.


    Ironically, we see this in the world around us all the time.
    • When starting your car from a stop on slick snow or ice, you have to accelerate slowly or the wheels will merely spin.  In both cases, you'll move forward some, but when accelerating too quickly, you'll burn a lot more fuel and put a lot more wear on your tires.
    • In cartoons, we see the antagonist rushing around in a hurry from place to place, while the protagonist slowly and thoughtfully moves through the chaos and reaches the goal first.
    • In action/comedy movies, we see a brash martial artist execute a flurry of activity in a dramatic attack, only to be stopped by a few small simple motions of the hero.
    So many people incorrectly think that looking productive is the same as being productive.

    In my own writing, my wife points out to me that there is power in brevity.  Sentences should say all that they need to say and no more.  I tend to add too many unnecessary adjectives, flourishes, and sentence complexities.  More words often don't imply more meaning.

    Friday, December 4, 2009

    PragProWriMo Reflection


    My Gratitude:
    I want to thank @praglife and The Pragmatic Bookshelf for the spur to participate in NaNoWriMo this past month.  I've always wanted to write a technical book and now I can certainly say that I have, whether I ever publish it or not.  So, "thank you."

    The bigger thanks goes to my wife, a 4-year veteran NaNoWriMo writer.  It was fun to write together this year rather than feel like I was merely in a support role while she wrote a novel over the course of a month.  We really enjoyed writing together, and she provided me with great encouragement.  "Thank you!"

    My Experience:
    I'm not sure what goals other PragProWriMo writers set for themselves this past month, but I (perhaps ambitiously) followed the NaNoWriMo goal of 50,000 words.  In our house, the goal was 2,000 words per day with one day off per week.  That still keeps you ahead of the standard 1,667-words-per-day pace, even with the day off.  Our house is also occupied by two wonderful, but often time-consuming, toddlers.  So, writing took place mostly after 9:00 PM, or maybe a quick couple hundred words during the day on the weekend.  2,000 words per night, for me, was about two hours per night.  Depending on your own sleep habits, experience with small children who get up early, and penchant for sympathy, respond to that description of the time commitment as you see fit.

    Beginning:
    I did start my endeavor with some topics that I wanted to cover within the scope of The Practical Data Warehouse.  I had 47 ideas that needed to be covered throughout the book, and felt like that was probably a dozen or so short, but I could pick those up along the way.  So, I started writing at midnight on November 1st.

    Despite the fact that I tend to start most of what I write with the introduction and executive summary sections, to lay the groundwork in my head for what the rest of the paper should convey, PragProWriMo started right in with the first chapter of true technical content.  Things moved fairly quickly at first.  Most of what I wanted to write was already in my head, just waiting to transfer via keyboard.

    Organizing:
    As I started writing, various non sequitur ideas would float through my head, too.  All of those went into a backlog list or became a note in some other skeleton section somewhere else in the book.  I wrote just enough in those notes to make sure I could remember what the topic really was supposed to be about.

    That said, I certainly didn't compose linearly.  I spent the first couple of days in section 2, then jumped to section 5, back to 3, then off to section 1 for a little while.

    Motivating:
    Most of the energy to keep writing came from the experiences my wife has had writing novels for NaNoWriMo in the past.   Sometimes, other tools were necessary, too: chips and very hot salsa, candy, Starbucks.  Things like that.

    The internal motivation, though, came from the fact that I have accumulated a lot of thoughts and ideas about data warehousing, in particular, throughout my career; and I thought there was something more to be said beyond the books and the blog posts that I read.  My reading and experience tells me that many data warehouse projects struggle because people don't know what works and what doesn't.  That comes in primarily two different types: either people are technically unqualified to elegantly address the challenges of data warehouse implementation or they don't understand the purpose and potential value of data warehouse solutions.  So, I'm hoping that what I have to say can fill a gap there and bring people from either of those extremes toward a better understanding of data warehousing.

    The Challenges:
    My biggest writing challenge has always been coming up with an idea that I think someone else will find worth reading.  I've written a few magazine articles before, but only when I thought I truly had a novel idea that I wanted to share with the world.  I'd have written more, but that's a pretty high standard, I think.  So, in the course of writing the first draft of The Practical Data Warehouse, I had to fight that demon that kept telling me what I had to say wasn't worth reading.

    The other major challenge was exhaustion (see previous section about Toddlers that get up early).  There were several nights where I nodded off at the keyboard, and woke to find a few garbled sentences and 1,652 letter "P"s in the middle of my document.

    The Next Steps:
    My plan from here out is to rest on The Practical Data Warehouse for another couple of days, and then start revising.  One topic a night for an hour.  I think that most nights, I'll end up with a blog post that is related to the topic in the book.

    As I go through topic by topic, I expect that I'll need to do a few more experiments to create examples for some of the content ideas.  That should be fun.  As a current PHB, I don't get to write nearly as much code as I used to.

    In a month or two... off to a publisher!!

    Wednesday, December 2, 2009

    The Agility of Touch-It / Take-It

    A parable about the agility that "Touch It, Take It" adds to data warehousing; and the extra work that a misuse of "You Aren't Going to Need It" creates.

    Once, there was a great chief called Yagni.  Chief Yagni's village was very prosperous and had grown much during his rule.  Eventually, Chief Yagni decided that it was time for the village to have a new gathering space, as the old one had been well outgrown.  So, Chief Yagni recruited two strong men, Gwyn and Titi, to bring stones from the quarry to the village center so that a group of builders could stack them into the new gathering space.

    Gwyn and Titi both arrived on Monday morning to receive their directions from Chief Yagni.  Yagni reviewed the building plans and told Gwyn and Titi that he needed 20 large flat square stones for the base of the building.  Gwyn and Titi took their push carts down to the quarry, gathered rocks, and returned to the village center.  They emptied their rocks together in piles for Chief Yagni's review.  Gwyn's pile had exactly 10 flat square stones.  Titi's had 10 flat square stones and 3 smaller angled stones.

    "Why are these here?" asked Yagni.

    "I had to pick them up off of the flat stones in the quarry," replied Titi, "so I thought I would just bring them along in case there was a use for them."

    "Get rid of them!" shouted Yagni angrily.  "You've wasted time in gathering those worthless rocks I did not ask you to collect.  We need 10 tall narrow stones for the doorway.  Go back to the quarry and bring me those.  Only those, Titi!  When you see the other rocks, tell yourself that You Aren't Going to Need It."


    Gwyn smiled at Titi's scolding, feeling proud that he'd followed the chief's directions so precisely.  Titi still believed that the angled stones might eventually come in handy, though.  Gwyn and Titi began pushing their carts back to the quarry.  Gwyn's light and empty.  Titi's partly full of unwanted rocks.

    Frustrated, not wanting to push the angled rocks all the way back to the quarry, Titi dumped the extra rocks in a small pile just outside of chief Yagni's sight.

    Gwyn and Titi gathered the tall narrow stones the chief asked for.  Titi, again, had to clear angled rocks from on top of the narrow stones, and added them to his cart.  This time, Titi paused and dropped the extra angled stones on his pile just outside of Chief Yagni's sight.

    Gwyn and Titi returned to the chief with their carts full of only tall narrow stones, and the chief was pleased with them both.  This continued for several more trips until the new meeting place was nearly complete: Gwyn following directions exactly, and Titi always bringing back more than asked but leaving the extra angled rocks on his pile.  By this time, Titi had accumulated a large enough pile of angled stones to fill his entire cart.

    On their last delivery to Chief Yagni, the chief looked over the plans again and stroked his chin in thought.  "Gwyn, Titi," he said.  "I need you to bring some angled stones for the roof.  Like those that you brought back on the first trip, Titi.  Go back to the quarry and bring me two carts full of those."

    Gwyn and Titi hurried back toward the quarry.  Gwyn went to the quarry and began collecting his cart full of angled rocks, but Titi had the large pile he had been accumulating during his other trips.  He stopped just outside of the chief's sight, collected all of his angled rocks into his cart, and returned to the chief well before Gwyn had even loaded half of his own rocks.

    "Titi.  How did you gather these rocks so quickly, when Gwyn hasn't returned yet?"

    Titi explained to his chief that when he had to pick up a stone anyway, he decided that if he had to Touch It, then he should Take It.

    The chief was pleased with Titi's foresight and promoted him to lead rock collector.

    Sunday, November 29, 2009

    Consume: The 4th "C"

    A data warehouse, as with other technology solutions, is a partnership of those who build it and those who use it.  Both have a responsibility for the success of the data warehouse.  Therefore, consume is the fourth "C" of data warehousing.

    If the sound of a falling tree occurs in the woods and no one is there, did the tree fall?

    The data warehouse is fundamentally a communication tool.  The information that it conveys through its data and the representation it provides of business activities can be incredibly valuable.  All the work that goes into collecting, correlating, and conveying information in the data warehouse is wasted if the right decision makers are not consuming the information.  Consumption is not only about the technical ability of some abstract decision maker to use the data warehouse for analysis.  Rather, it is about ensuring that each enhancement to the data warehouse helps to drive adoption and increase both strategic and operational use of the system.  As users have a positive experience that aligns with their immediate business need, they're more likely to come back to the data warehouse with their next question, rather than striking off on a journey to collect information from various systems, correlate that data together, and then convey that information to decision makers and senior leaders.

    #PragProWriMo word count: 50,058 - SUCCESS!!

    Saturday, November 28, 2009

    Convey: The 3rd "C"

    If a tree falls in the woods and no one is there, does it make a sound?

    One of the primary responsibilities of the data warehouse is to take information that would otherwise not be available and deliver it to decision makers and analysts who can do something valuable with the information and make the world a better place.  Perhaps you might think that's a bit of a stretch, or perhaps I'm being overly optimistic.  However, my position is that we are more likely to make decisions that move society forward if we have more knowledge about the context in which we're making those decisions.  So, even if we don't always use better information to make better decisions, we wouldn't even have the opportunity to try if not for tools like the data warehouse.

    Effectively conveying information is a function of the format of the information, the performance of the system, the accuracy of the content, and the ease of use of the tools that will be used to access the information.  Most importantly, the data warehouse has to be able to convey to users, in a natural way, how to find the information they need and what aspects of the physical implementation impact how the system can and can't be used.

    #PragProWriMo word count: 48,061

    Thursday, November 26, 2009

    Correlate: The 2nd "C"

    Collecting information from a variety of sources is powerful in and of itself, but having an integrated collection of data makes analysis across business processes, subject areas, and source systems many times more efficient.  The data warehouse must provide an access layer that allows users to easily and naturally merge together pieces of business data that are naturally related in the business model.  Some aspects of correlation are business driven, and the data warehouse will be either their victim or their beneficiary: for instance, the consistent assignment of employee numbers across applications, the enforced uniqueness of cost centers across multiple financial systems, or the corporate data governance decision that only one system can be considered the authoritative source for any particular data element.

    Other correlation activities are ones that can be supported directly within the data warehouse environment:
    • Code set translations ensure that a user can reference any data source using the approved corporate standard code set for some particular attribute.
    • Cross-reference tables that allow two different identifiers for the same particular entity to be combined together.
    • Transformation of one representation of a particular identifier into a matching representation somewhere else.  For instance, if one system uses a 15 digit cost center that includes accounting unit, account, and sub account, the data warehouse should take that 15 digit string, break it into components, and match those against the information from the finance system.

    Having the data warehouse pre-execute these kinds of correlation activities once, and store the results for convenient use in tens of thousands of queries every month or even every day, is a huge benefit to analysts through clear communication and improved performance.
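As an illustration of the cost-center transformation described above, here is a minimal sketch.  The 7 + 5 + 3 digit layout is an assumption for illustration only; the post says the 15 digits break into accounting unit, account, and sub account, but not the actual widths.

```python
# Hypothetical sketch of the cost center transformation.  The component
# widths (7 + 5 + 3) are assumptions, not taken from any real system.

def split_cost_center(cc: str) -> dict:
    if len(cc) != 15 or not cc.isdigit():
        raise ValueError("expected a 15-digit cost center string")
    return {
        "accounting_unit": cc[0:7],   # assumed width: 7 digits
        "account": cc[7:12],          # assumed width: 5 digits
        "sub_account": cc[12:15],     # assumed width: 3 digits
    }

print(split_cost_center("012345678901234"))
# {'accounting_unit': '0123456', 'account': '78901', 'sub_account': '234'}
```

Once split, each component can be matched against the corresponding reference data from the finance system.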

    #PragProWriMo word count: 46,141

    Tuesday, November 24, 2009

    Collect: The 1st "C"

    The 4 C's of Practical Data Warehousing take a very open approach to describing what a data warehouse is responsible for doing.  Perhaps the function that can be interpreted most broadly from a technical perspective is the first C: Collect.

    The first job of a data warehouse is to bring together, in one business tool, the information that an analyst, knowledge worker, or decision maker needs to understand business operations.  Whether that collection of information is a literal copy between data stores, a federation of databases, or a more complicated transformation of transactional data into a dimensional model is merely an implementation detail.  Some of those choices are more or less practical depending on the underlying systems and data (which we'll discuss later), but to be of any value, the data warehouse has to do the collecting of information so that users don't have to.  One of the key benefits of data warehousing is the ability of an analyst to have a one-stop shop for most of the information they need to do their analysis.  In analytically immature organizations, analysts will typically spend 80% of their time collecting data and putting it into some kind of local data store (which might be anything from flat files to MS Access to small databases) and only 20% of their time doing analysis on that data.  One of the goals of the data warehouse is to flip that ratio so that knowledge workers are able to spend 80% of their time analyzing business operations and only 20% of their time retrieving data from as few different sources as necessary.

    When various analysts from different departments (marketing, strategic planning, sales, finance, etc) all ask the same people for the same data on a continual basis, it also prevents the application teams from having time to make improvements or plan upgrades to the applications themselves.  There are still organizations that have multiple staff members dedicated to the work of fulfilling data extract requests to support internal analytical needs.  A data warehouse that collects the information together once, into a common enterprise model of business activities, satisfies all of those individual departments with one extract from each source system and a consolidated environment in which to do their analysis.

    Saturday, November 21, 2009

    The 4 C's of a Data Warehouse

    Anyone shopping for an engagement ring is familiar with the 4 C's of diamond quality: cut, clarity, color, and carats.  While there are other things that make one diamond better or worse than another, these are the most commonly used.

    So, what are the 4 C's of data warehouse quality?
    • Collect
    • Correlate
    • Convey
    • Consume
    These are the 4 things that a data warehouse has to be most effective at to achieve success.  Over the next four posts, I'll describe what each of these represents and ways that you can measure the quality of your own data warehouse implementation against these.


    #PragProWriMo word count: 36,152

    Wednesday, November 18, 2009

    Analysis of Algorithms for Data Warehousing

    This topic was covered briefly in a previous post, but understanding algorithms in SQL can be very powerful.  I hope to shed some additional light on this concept with this post.

    Contrary to what many programmers will tell you, SQL can be a very powerful language.  Like any language, it has its limitations and idiosyncrasies, and despite what database vendors might tell you, the database doesn't always know the best way to execute a particular query.  Database query optimizers are limited by their imperfect knowledge of what a user wants via SQL and by the imperfect statistics they have about the underlying data.

    One option is to use a hinting feature that some database management systems provide.  Hinting gives the SQL developer a means to instruct the database in what order to perform joins or which indexes to use or not use.  It can be powerful, but it also depends greatly on the demographics of the underlying data.  If the data changes dramatically, then the same hints may no longer produce the optimal execution plan.

    Another way to improve SQL performance is to reconsider the algorithm you write as a developer.  Like any programming language, SQL typically provides you more than one way to solve any particular problem.  A developer's ability to choose the most optimal implementation depends on knowledge of the language and experience with mathematical techniques like Big-O notation.  Data warehouses can be particularly challenging environments to do this type of analysis of algorithms because the queries being posed can be very complex.  Unraveling the business logic and reimplementing it with another SQL algorithm can become very complicated.

    Consider the following typical example, though.  You have a transaction table that records all of the individual states that a particular patient may go through during a hospital stay:

    PATIENT_ID
    VISIT_ID
    EVENT_ID
    EVENT_TYPE
    EVENT_TS
    

    A typical question would be "What is the duration, start, and end time for each patient's visit?"  You might choose to write this by joining the event table to itself, on the one side filtering for "admit" event types and on the other side filtering for "discharge" event types.

    SELECT
      a.patient_id, a.visit_id,
      a.event_ts, d.event_ts, d.event_ts - a.event_ts
    FROM
      event a,
      event d
    WHERE
      a.patient_id = d.patient_id AND
      a.visit_id = d.visit_id AND
      a.event_type = 'admit' AND
      d.event_type = 'discharge'
    

    This SQL is a straightforward algorithm that says: I want to match up all the admit records with their appropriate discharge records and return the matched pairs.  Assuming that most of the records are either admits or discharges, the database is likely to do a full table scan to filter for the admits, and then use an index lookup to retrieve each corresponding discharge row via the patient_id / visit_id primary key.  The Big-O notation for that would be something like O( n log n ): O( n ) for the table scan times O( log n ) for the index tree search to find each matching record.

    A completely different SQL approach to solving the same problem is to aggregate the admit and discharge rows together in a single pass through the table and compute the results from that.  That algorithm might be written something like this:

    SELECT
      patient_id, visit_id,
      MIN(CASE WHEN event_type = 'admit' THEN event_ts END) AS admit_ts,
      MIN(CASE WHEN event_type = 'discharge' THEN event_ts END) AS discharge_ts,
      MIN(CASE WHEN event_type = 'discharge' THEN event_ts END)
        - MIN(CASE WHEN event_type = 'admit' THEN event_ts END) AS duration
    FROM
      event
    GROUP BY
      patient_id, visit_id
    

    This query will scan through the table once, building the patient_id / visit_id groups in a stream along the way.  Computing these results is an O( n ) operation -- significantly more efficient than the common O( n log n ) algorithm.
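To make the comparison concrete, here is a small runnable sketch (using Python's sqlite3 with made-up sample rows) that runs both queries, restated with consistent aliases and event type values, and confirms they return the same results:

```python
import sqlite3

# Check that the self-join and the single-pass GROUP BY agree.  The
# schema follows the post; the sample rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event (
  patient_id INTEGER, visit_id INTEGER,
  event_id INTEGER, event_type TEXT, event_ts INTEGER);
INSERT INTO event VALUES
  (1, 10, 1, 'admit', 100), (1, 10, 2, 'discharge', 175),
  (2, 20, 3, 'admit', 120), (2, 20, 4, 'discharge', 300);
""")

self_join = con.execute("""
  SELECT a.patient_id, a.visit_id, a.event_ts, d.event_ts,
         d.event_ts - a.event_ts
  FROM event a, event d
  WHERE a.patient_id = d.patient_id AND a.visit_id = d.visit_id
    AND a.event_type = 'admit' AND d.event_type = 'discharge'
  ORDER BY a.patient_id""").fetchall()

single_pass = con.execute("""
  SELECT patient_id, visit_id,
         MIN(CASE WHEN event_type = 'admit' THEN event_ts END),
         MIN(CASE WHEN event_type = 'discharge' THEN event_ts END),
         MIN(CASE WHEN event_type = 'discharge' THEN event_ts END)
           - MIN(CASE WHEN event_type = 'admit' THEN event_ts END)
  FROM event
  GROUP BY patient_id, visit_id
  ORDER BY patient_id""").fetchall()

assert self_join == single_pass
print(single_pass)  # [(1, 10, 100, 175, 75), (2, 20, 120, 300, 180)]
```

SQLite won't exhibit the same optimizer behavior as a warehouse platform, but the two result sets matching is the point of the rewrite.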

    While SQL is not a particularly robust or flexible language, there are still many inherent opportunities to write either very efficient or very inefficient code.

    Some powerful tools of the trade to learn in SQL optimization include:
    • Single-pass alternatives (as in the example)
    • Recursive SQL (e.g. with recursive or connect by)
    • Analytical functions and options (e.g. aggregates with their own partition by / order by)
    • Multi-grouping extensions (e.g. cube and rollup)
    • How these are affected by different database technologies (MPP, columnar, associative)
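As a taste of the "recursive SQL" item on that list, here is a sketch that walks a hypothetical cost-center hierarchy with WITH RECURSIVE (SQLite syntax; the table and rows are invented for illustration):

```python
import sqlite3

# Recursive SQL sketch: walk an invented cost-center hierarchy from the
# root down, tracking depth, using a recursive common table expression.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cost_center (id INTEGER, parent_id INTEGER, name TEXT);
INSERT INTO cost_center VALUES
  (1, NULL, 'Corporate'), (2, 1, 'Operations'), (3, 2, 'Claims');
""")

rows = con.execute("""
  WITH RECURSIVE tree(id, name, depth) AS (
    SELECT id, name, 0 FROM cost_center WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.name, t.depth + 1
    FROM cost_center c JOIN tree t ON c.parent_id = t.id
  )
  SELECT name, depth FROM tree ORDER BY depth""").fetchall()

print(rows)  # [('Corporate', 0), ('Operations', 1), ('Claims', 2)]
```

The same pattern (seed query, then a self-referencing step) covers org charts, bill-of-materials explosions, and date-series generation.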

    #pragprowrimo word count: 30,137

    Friday, November 13, 2009

    Avoiding Self-Joins on Large Tables

    With data warehousing solutions, I've run into the need to do self-joins on large tables a surprising number of times.  An example might look something like this:

    • Customer Addresses Table
      • One row in the table per address
      • Addresses are categorized by a type of home, work, or other
      • Question is to know how many of your customers work and live in different zip codes
      • If either address is not present, then include the record
    The first solution that probably comes to mind is something like the following join between all of the home addresses and all of the work addresses.


    SELECT
      COALESCE(h.customer_id, w.customer_id) AS customer_id,
      h.zip AS home_zip,
      w.zip AS work_zip
    FROM
      (SELECT customer_id, zip FROM customer_addr WHERE addr_type = 'home') h
      FULL OUTER JOIN
      (SELECT customer_id, zip FROM customer_addr WHERE addr_type = 'work') w
        ON h.customer_id = w.customer_id
    WHERE
      h.zip <> w.zip OR h.zip IS NULL OR w.zip IS NULL

    Another, more efficient approach executes the same logic in a single pass of the table.


    SELECT
      customer_id,
      MIN(CASE WHEN addr_type = 'home' THEN zip END) AS home_zip,
      MIN(CASE WHEN addr_type = 'work' THEN zip END) AS work_zip
    FROM customer_addr
    GROUP BY customer_id;


    In the new code, the query performs a single pass over the table.
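Here is a quick, self-contained check of the single-pass version (Python's sqlite3 with made-up rows; the post's two queries vary on table name and addr_type codes, so this sketch standardizes on customer_addr with 'home' / 'work'):

```python
import sqlite3

# Verify the single-pass pivot of home/work addresses on sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_addr (customer_id INTEGER, addr_type TEXT, zip TEXT);
INSERT INTO customer_addr VALUES
  (1, 'home', '63101'), (1, 'work', '63102'),  -- different zips
  (2, 'home', '63105'), (2, 'work', '63105'),  -- same zip
  (3, 'home', '63110');                        -- no work address
""")

rows = con.execute("""
  SELECT customer_id,
         MIN(CASE WHEN addr_type = 'home' THEN zip END) AS home_zip,
         MIN(CASE WHEN addr_type = 'work' THEN zip END) AS work_zip
  FROM customer_addr
  GROUP BY customer_id
  ORDER BY customer_id""").fetchall()

# Keep only customers whose home and work zips differ or are missing.
mismatched = [r for r in rows if r[1] != r[2]]
print(mismatched)  # [(1, '63101', '63102'), (3, '63110', None)]
```

The mismatch filter could equally live in a HAVING clause; it is shown in Python here just to keep the SQL identical to the query above.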



    #pragprowrimo word count: 18,002

    Monday, November 9, 2009

    The PICA code

    I'm a member of the IS Board of Advisors at the University of Missouri St. Louis.  One of the things that this university's IS program struggles with (as do most) is how to graduate better analysts.  I've always seen strong analytical skills as a personality trait rather than something that can be learned.  As I recall my training as a consultant, though, it occurs to me that there were several lessons about how to be a good analyst, how to ask open ended interview questions, how to perform experiments and data analysis to validate assumptions, etc.

    The PICA code was a great example of poor analysis.  The project was to add a new field to the pharmacy claim table.  We went and talked with the team that was adding this field to the claim processing system to understand how it was really going to be used and what it really was.  The answer we got was "it's called the PICA code because it can have only one of 4 values... P, I, C, and A."  There was no harm done when we discovered that there were also values of N in that column, though there were lots of jokes about needing to change the column name every time a new value was discovered.

    In this case, the weak analysis didn't cause much harm, but there are certainly other cases where poor root assumptions can lead to poor decision making downstream.  I hope that other universities catch up with what UMSL hopes to do in graduating students who are stronger analysts.

    Word Count: 14,015

    Sunday, November 8, 2009

    Believing What You're Told

    One of the biggest challenges I've experienced in data warehousing is that analysis has to take place from three sometimes very divergent and contradictory perspectives:
    • How does your user perceive the nature of the business?
    • How do users of the source systems perceive their own activities?
    • How do source systems record the activities of their users?
    I've run into many cases where there are various degrees of disagreement between those three perspectives.  It's the challenge of a good analyst to get to the bottom of these discrepancies, understand why perceptions are what they are, and be able to believe that there is complete truth in each of them... even if the source system only says one thing.

    Sometimes even harder than the philosophical challenge of reconciling those divergent perspectives is the act of reconciling them into a common model of understanding in the form of a reporting system or data warehouse.  Woe to any architect or modeler who takes one of the stories as gospel and disregards the others.

    #pragprowrimo word count: 12,005

    Friday, November 6, 2009

    Null Valued Fields in a Data Warehouse

    Whether or not null-valued fields should be allowed in a relational database is a classic debate.  In data warehousing, null-valued fields should be prohibited as a rule -- not because of the open/closed world assumption in relational controversies, but because of what null values imply for business applications and how they impact the usability of a data warehouse.

    As Codd states, the value of NULL by itself lacks sufficient specificity of meaning.  Does it imply that a field value is simply missing for a particular record, but might yet take on a value or might previously have had one?  Or does it imply that the field value is not applicable for that record and never will be?  (This lack of clarity creates that open/closed debate.)  For the sake of clarity in a data warehouse, where data values are expected to provide a clear, complete, and accurate representation of business information (closed world), allowing null values creates inappropriate ambiguity.

    Four different types of default values should be considered in place of a null value:

    • A logically accurate value
    • Well-defined "unknown" (Codd's A-value)
    • Well-defined "not applicable" (Codd's I-value)
    • Source specific default

    The biggest challenge in applying this rule will be to understand the underlying business operations and determine the most appropriate and most intuitive values for users to interact with.  Be very careful to consider what impact those choices will have on how users will access information in the data warehouse.  If a choice makes simple questions more complicated to answer, then it is probably a poor one.
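    The four default types above can be sketched as a small ETL helper.  This is a minimal illustration, not a standard: the sentinel strings, the function name, and the discharge-date example are all hypothetical.

    ```python
    # Hypothetical ETL helper: substitute well-defined defaults for NULLs
    # before loading a row into the warehouse.  Sentinels are illustrative.

    UNKNOWN = "UNKNOWN"          # Codd's A-value: applicable, value missing
    NOT_APPLICABLE = "N/A"       # Codd's I-value: value will never apply

    def apply_default(value, *, applicable=True, source_default=None):
        """Return a non-null value following the four-default rule."""
        if value is not None:
            return value              # 1. a logically accurate value
        if not applicable:
            return NOT_APPLICABLE     # 3. well-defined "not applicable"
        if source_default is not None:
            return source_default     # 4. source-specific default
        return UNKNOWN                # 2. well-defined "unknown"

    # Example: a discharge date is inapplicable for a still-admitted patient.
    row = {"patient_id": 42, "discharge_date": None}
    row["discharge_date"] = apply_default(row["discharge_date"], applicable=False)
    ```

    The point of the sketch is that each NULL is resolved deliberately, record by record, rather than left ambiguous for every downstream query to interpret.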

    Word Count: 9,401

    Tuesday, November 3, 2009

    Layers of Data Warehouse Data Architecture

    A very effective way to develop the data architecture for a data warehouse is to think about the situation from four different angles:

    Data Storage - This layer is the actual physical data model for base data warehouse tables.  The purpose of this model is to provide a clear and concise representation of the entities, attributes, and relationships present in the data warehouse.

    Data Presentation - This layer of the data architecture is accessed by users or user tools that provide data access.  It may be as simple as a series of synonyms or "select *" views, but it provides a strict layer of abstraction between data presentation and physical data representation.  This abstraction layer, decoupling the presentation of data from the underlying storage of data, allows for changes to be made independently on either side of that boundary.

    Data Staging - Like the data presentation layer, a data staging layer creates an abstraction boundary between the processes that deliver data to the data warehouse and the way in which data is represented internally within the data storage layer.  This separation allows new and expanded data sources to feed information to the data warehouse without necessarily impacting existing data sources on the same schedule.  It helps avoid the "if I add that new source, then I'll have to rewrite another 20 ETL jobs at the same time" situation.

    Data Marts - The final layer is the business-problem-centric data mart.  Arguably, data marts can be considered solutions that are independent of a more comprehensive data warehouse architecture, but I've included it here for the sake of contrast.  In this model, a data mart is a grouping of entities and business rules that represent the information necessary to address a particular and well-defined business problem.  In many situations, a data mart will be a subset of data in a data warehouse that is filtered and interpreted through specific business rules.  In some situations, data marts can be represented entirely through a logical abstraction layer (rather than through additional physical tables and the duplication of data).  Data marts created without significantly duplicating underlying information are referred to as virtual data marts.
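    The decoupling that the presentation layer buys you can be demonstrated in a few lines.  This is a toy sketch using Python's built-in sqlite3 module; the table, view, and column names are invented for illustration.  Consumers query the view, so the physical table underneath can be versioned without breaking them.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")

    # Data storage layer: the physical base table (version 1).
    con.execute("CREATE TABLE dw_claim_v1 (claim_id INTEGER, amt REAL)")
    con.execute("INSERT INTO dw_claim_v1 VALUES (1, 100.0)")

    # Data presentation layer: a simple "select *"-style view over the base table.
    con.execute("CREATE VIEW claim AS "
                "SELECT claim_id, amt AS claim_amount FROM dw_claim_v1")

    # Physical storage changes: a new version of the base table adds a column.
    con.execute("CREATE TABLE dw_claim_v2 (claim_id INTEGER, amt REAL, source TEXT)")
    con.execute("INSERT INTO dw_claim_v2 "
                "SELECT claim_id, amt, 'legacy' FROM dw_claim_v1")

    # Only the view definition is repointed; its shape stays the same.
    con.execute("DROP VIEW claim")
    con.execute("CREATE VIEW claim AS "
                "SELECT claim_id, amt AS claim_amount FROM dw_claim_v2")

    # Consumers are unaffected: the same query works before and after.
    rows = con.execute("SELECT claim_id, claim_amount FROM claim").fetchall()
    ```

    The same idea applies on the staging side in reverse: loads target a staging abstraction rather than physical tables, so either side of the boundary can change independently.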

    Word Count: 4882

    Monday, November 2, 2009

    Context Diagrams

    Admittedly, I'm a whiteboard and diagramming addict.  I'm moving between office buildings after hours tomorrow, and for the move I've marked 5 individual whiteboards to be delivered to my office.  While I love whiteboards and the ability to capture content using the 3 MP digital camera in my phone, I've been working for the past year to train myself to diagram electronically in real time during meetings; to use MS Visio as a live design tool during meetings.  (It has worked effectively a few times.)

    In the early phases of analysis and solution design, two of the most valuable pieces of documentation are a current state system context diagram and a future state system context diagram.  Some of the critical decision points in documenting systems analysis results and design plans include:
    • level of detail
    • accuracy assurance
    • shapes, colors, and sizes of boxes
    • labels
    • relationships
    • how many times to ask "and then what?"
    • keep the diagram simple
    As you begin to get into the weeds of detailed analysis, there can be a tendency to increase the level of detail in one small area of an existing diagram.  That's why it can be very valuable to lay out a diagramming and documentation plan ahead of time that includes a clearly stated goal and objectives for each piece of documentation.

    Diagramming is easy to do.  Documenting business systems context diagrams is hard to get right.

    Word count: 3,290

    Sunday, November 1, 2009

    And Then What?

    The 5-Whys principle made famous by Sakichi Toyoda and its application at Toyota has a very handy parallel in data warehousing analysis: "and then what?"  There hasn't been any formal statistical or mathematical analysis to determine how many times "and then what" typically needs to be asked before truly understanding a complete business process, but 5 is probably a good starting point here, too.  One of the key metaphors in data warehousing analysis is the Information Supply Chain.  In order to get a complete, end-to-end view of that supply chain, you have to keep asking the knowledge workers in that supply chain "and then what?"

    Word Count: 777

    Saturday, October 31, 2009

    T-Minus 27 Hours (to PragProWriMo)

    NaNoWriMo starts tomorrow, and the folks from The Pragmatic Bookshelf are part of my inspiration to participate this year.  Over the course of the next 30 days, my wife and I will both be staying up late writing together.  She'll be exercising her passion and creativity through a new novel.  I'll be evangelizing my passion for the inherent value of business information and sharing some powerfully practical techniques for building data warehousing solutions that draw on and emphasize business insight -- emphasis on the word "practical."

    As I sprint through 2,000 words a day, I'll also summarize those topics and the experience here on Sharpening Stones.

    If you're writing along with PragProWriMo or NaNoWriMo, visit my user page and be my writing buddy!