Tuesday, August 31, 2010

Spare Some Change?

There's a classic joke about the difference between the IT person and the developer.  Here's my version:
The boss comes into the IT guy's office and says "Hey, Joe, we've got a new line of business starting up and we really need to be able to do this new thing, X."  

Joe simply says, "Sorry, Boss, that can't be done."  

So, the boss goes down the hall to the development manager and says "Hey, Nancy, we've got a new line of business starting up and we really need to be able to do this new thing, X."  

Nancy says, "Absolutely, Boss.  We can do anything.  It'll take us six months to plan, two years to develop, and another several months of user acceptance testing."  

The boss goes back to his office and begins writing his resignation.

The best quote I've ever heard about change isn't the old adage that "the only thing constant is change."  I'm a futurist at heart, so there's no deep insight for me in the constancy of change.  The best quote I've heard about change is that "change is great because every time something changes it means you're one step closer to getting it right."  That's a bit of paraphrasing and it assumes that we're primarily concerned with "good" change, of course.  The point stands: if you believe that change is a good thing, then it's natural to embrace it rather than fight it.

Classic waterfall methodologies focus on defining specifications ahead of implementation so that the risk of change can be avoided.  That level of contractual thinking drives unnecessary conflict, especially in business intelligence projects.  One of the things we've learned through experience is that we can't know what we don't know.  Naturally, the needs being addressed by a business intelligence solution will change over time as more insight is delivered to decision makers.  If we already knew what the outcome was, then there wouldn't be a need for the project.  Business intelligence is about discovery.

Two of the core values from the Agile Manifesto are "customer collaboration over contract negotiation" and "responding to change over following a plan."

I've worked with a number of teams that grow increasingly frustrated over changes that end users make in business logic.  ETL developers sometimes get so fed up with change and being forced to rewrite code over and over that they feel they should stop writing any code until "the users decide what they want!"  Those developers haven't recognized that every time they change their code, they're enabling the users to understand what it is that they want.  They're facilitators, not just implementers.

So, agile BI is at least as natural a fit as agile application development.  Probably even more so.  For BI developers to be agile, though, they have to embrace change.  They have to facilitate rather than resist change.

Friday, August 27, 2010

Your Job? My Job? Our Job!

I've been trying to figure out agile data warehousing for several years now.  I'm a computer scientist by training and a programmer by hobby, so I've always kept my eye on trends in traditional software development.  What I tell myself, professionally, is that it helps me have alternative perspectives on BI solutions.  (It's really just 'cause I like programming, even if what I hack together isn't typically that elegant.)

Several years ago, I was introduced to one of the founders of the St. Louis XP (Extreme Programming) group, Brian Button, and decided to sit down and have lunch with him.  I explained what kind of work we do to build data warehouses, and he listened very politely.  At the time, he was thinking mostly about test driven development and pair programming.  One of the things he asked me was "can you get your data modeler, ETL developer, and report developer all in a room together working on the same components all at once?"  It occurred to me, then, that separation of development responsibilities might be a serious impediment to agile BI development.

As a former consultant, I've personally done a little bit of everything.  I can data model well enough to build something workable; I've spent a lot of time writing ETL both by hand and with GUI tools; I'm a raw SQL hacker for sure; and I can even create a reasonable report or two that some VP wouldn't just use as a coaster.  How often have I ever asked my staff to do that breadth of work, though?  In larger organizations, I usually see and hear about separation of teams: data modelers, ETL developers, reporting people.  They're separate teams.  That's always been under the guise of "developing technical expertise" within the team and driving consistency across projects.  (Important goals for sure.)

However, when I look at successful agile software teams that I know about, that same level of separation isn't typically present.  A single developer might do part of the UI, some of the dependent service modules, and the persistence layer.  They're focused on delivering a particular function: not some internal component of the overall application, but an externally visible function of the application.  This goes back to the previous conversation about sashimi, too [1] [2].

Of course there are some developers that are better at UI and others better at ORM, just like there are some BI folks better at data modeling and others better at data presentation.  Enabling more agile development, though, requires developers who are more willing and able to cross over those traditional boundaries in the work that they do.  One of the leaders I work with today articulates this very well when he says that we just want "developers."  What this does for agile is that it minimizes context switching and the spin between different developers working on the interrelated but different pieces of the same component.  If an ETL developer decides she needs five extra fields on a table because she just learned that some previous assumption about the cardinality of a relationship was flawed, should that change require the synchronous work of:

Step  Modeler                ETL Developer          Report Developer
 1    Changes data model     wait/other work        wait/other work
 2    Deploys changes to DB  wait/other work        wait/other work
      -- hand off --
 3    wait/other work        Imports new metadata   wait/other work
 4    wait/other work        Updates ETL jobs       wait/other work
 5    wait/other work        Unit tests ETL         wait/other work
      -- hand off --
 6    wait/other work        wait/other work        Runs test queries
 7    wait/other work        wait/other work        Updates semantic layer
 8    wait/other work        wait/other work        Updates reports
 N    ...and loop through all of the above for every change

There's a lot of opportunity for optimization there if one person works on the tasks instead of several people. For about 66% of the team's time, each member is working on something other than this one objective, and there's latency in each handoff between developers. (If you can create a parallel scheduling algorithm that, all things being equal, gets the overall workload done faster than "one resource dedicated to completing all the steps in implementing each particular piece of functionality" plus "helping each other out when there's nothing in the queue", please let me and every computer science professor on earth know.)
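The arithmetic behind that 66% can be sketched with a tiny back-of-the-envelope calculation.  The step durations and handoff latency below are made-up numbers purely for illustration, not measurements from any real team:

```python
# Hypothetical hours for the 8 steps in the table above (illustrative only).
steps = [1, 1, 2, 4, 2, 1, 2, 2]
owners = ["modeler", "modeler", "etl", "etl", "etl",
          "report", "report", "report"]
handoff_latency = 4  # assumed delay each time work changes hands

total_work = sum(steps)
handoffs = sum(1 for a, b in zip(owners, owners[1:]) if a != b)

# Specialist pipeline: work is strictly sequential, plus handoff latency,
# and each of the three developers is idle whenever another owns the step.
specialist_elapsed = total_work + handoffs * handoff_latency

# Generalist: one developer does every step back to back, no handoffs.
generalist_elapsed = total_work

# Fraction of the three developers' combined time spent off this objective.
idle_fraction = 1 - total_work / (3 * specialist_elapsed)
```

Even with zero handoff latency, three specialists on a strictly sequential chain are each busy only a third of the elapsed time, which is where the roughly-two-thirds idle figure comes from; any latency in the handoffs only makes it worse.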

I think that for some teams this will be a challenge to their skill set and require developers to grow beyond their existing comfort zone.  I'll argue that they'll be the better for it!  For some teams it might be more of a challenge to ego than to actual skills: "you're going to let anyone data model?!"

The answer is "yes" and "we require an appropriate level of quality from everyone."  That's why agile teams are ones with pair programming, peer reviews, and an approach that not just accepts but welcomes change.

For a team that isn't agile today, these things can't come piecemeal.  If you want to be agile, you have to go all-in.

Thursday, August 26, 2010

Growing a Tree versus Building a House

When we say that we want to build "good" software, we tend to use terms that come from other engineering fields: foundation, framework, scaffolding, architecture.  One of the things that the agile software movement has shown us is that good solutions can come from evolutionary models as well as construction models.  The difference comes from the fact that code is far easier to manipulate than physical raw materials.

When building a data warehouse, we often draw traditional, stacked-tier pictures of the data architecture: data warehouse tables, semantic layer, data marts, etc.  If we start our design discussions with an assumption that anything we "build on top of" has to be "solid" then we quickly drive the overall solution away from agility.  "Solid" conjures an image of a concrete foundation that has to be built to withstand floods and earthquakes.  If we find a crack in our foundation, it has to be patched so the things on top don't come crumbling down.

If, instead, we try to imagine a conceptual architecture that has in mind goals of adaptability to purpose (rather than firmness) and loose coupling (rather than high contact), you can begin to imagine a higher level of agility.  Look at the picture from the start of this post (from webecoist).  The trees are being shaped and molded into a purpose-built structure.  If, part-way through the growth process, the structure needed to change to be another 6 inches higher or hold a second story of some kind, the necessary changes could be interwoven into the growth already complete.  If we were constructing a new art museum and decided, halfway through, that we wanted a library instead, we'd have to make some major changes or compromises to account for the fact that the foundation was only designed to hold the weight of portraits, not stacks of books.

This conceptual discussion about growing something organically rather than building it from the ground up is directly related to the sashimi discussion from yesterday.  A legacy build approach says data model, build ETL, build semantic layer, build reports.  There aren't any opportunities in that model to create meaningful yet consumable vertical slices. 

I hear some agile BI conversations that go only halfway toward the mind shift I think is necessary.  These "think big, act small" solutions sound like a model where the only change is that you pour some of the concrete foundation at a time.  Building a house using this semi-agile approach:

Iteration One:
  1. Pour foundation for the kitchen only.
  2. Build kitchen walls.
  3. Wire up kitchen outlets.
  4. Install kitchen plumbing.
Iteration Two:
  1. Pour foundation for the family room only.
  2. Build family room walls.
  3. Realize you need to tear out a kitchen wall to open to family room.
  4. Reroute electricity in that wall.
  5. Rerun wiring to kitchen.
  6. Run new wiring to family room.
In this approach to agile BI, you might well deliver value to customers more quickly than if you took a monolithic waterfall approach.  Since you aren't requiring yourself to plan everything up front, you run a high risk of having to do rework later, though.  In a physical construction mindset, rework is very expensive (rip out wall, rewire, etc).

An organic build approach says plant a seed, watch it grow.  First, a sprout appears, with roots, a stem, and leaves.  The stem gets thicker, the roots grow deeper, and more leaves sprout.  Branches grow.  Flowers bud and fruit appears.  When requirements change, some pruning and grafting is required, but you don't have to tear down the tree and plant a new one from scratch or start a new tree on the side.  The tree will grow around power lines and rocks and other trees as needed.

There's the mindset.  I don't think it's easy to shift from a constructionist perspective to an organic one.  Success in agile BI requires this change in thinking, though.  If you're still laying foundations and screwing sheetrock onto studs, your attempt at agile BI will not be optimal.

Good luck with that.

Wednesday, August 25, 2010

Sashimi (An Agile BI Lesson for Floundering Teams)

The most recent TDWI conference generated a lot of conversation around what Agile BI means and how agile principles and practices from traditional software development can and can't be applied to business intelligence projects.  I wasn't able to be at the TDWI conference and attend the presentations, but there's been a lot of chatter.
I can't speak broadly from an industry perspective on agile BI, but I can speak from my own personal experiences.  The organization I work for has been undergoing a move over the past year to apply an existing agile methodology used in application development to data warehouse and business intelligence solutions.  It's an ongoing study that I believe has a lot of promise and many yet unknown challenges.  So far, there are three parts to this unfinished Agile BI story: sashimi, development culture, and developer roles.  Tonight's post is on sashimi.

For those of you not familiar with the use of the term sashimi in this context, the gist is that sashimi is the art of slicing up a problem space into pieces that are at the same time independently valuable as well as quickly achievable.  In an app dev project, what this means is creating a so-called walking skeleton that exercises only as many pieces of the overall solution as necessary to deliver something that is actually usable by a user.  For example, if I'm building an application that's going to manage medical claim payments, maybe all the first slice does is retrieve one claim from the database and display it on the screen.  Then as work progresses toward the first 90-day release, more and more meat is built up on top of that skeleton, refactoring various pieces of the stack as necessary along the way.  Good sashimi results in ever increasing value to end users with as little bulk on the skeleton as necessary to achieve it.

What does good sashimi for a BI project look like?

I think that it looks the same, but feels much harder to accomplish, especially when you have an enterprise-scale strategy for data warehousing and business intelligence.  Imagine that you need to deliver a new reporting dashboard for department managers to do predictive labor modeling.  The minimal vertical slice for that solution could include:
  • New tables in a staging area from a new source system, with
  • New ETL jobs to load data into...
  • New tables in an enterprise data warehouse, and
  • New tables in a brand new data mart, and
  • New objects in a semantic reporting tool (e.g. universe or model), and
  • (Finally) your new dashboard.
That's a lot of layers to slice through.

In traditional BI projects that I've been involved in, the project plan would call for building the solution mostly in the order shown above: bring the data in, understand the data, build a data mart, wrap it with a semantic layer, and deliver the dashboard.  Along the way, you'd probably have a subteam prototyping and testing out the dashboard UI and maybe someone doing some data profiling to speed data analysis along; but the back-end pieces of development, especially, are likely to happen in stacked order.

Building a walking skeleton in software requires you to be able to refactor the bones along the way.  As the analogy goes, the first version of the walking skeleton might have just one leg and one toe that attaches directly to the spine and up to the head.  As the product evolves, the leg bone gets refactored into femur, patella, tibia, and fibula; more toes get added for stability; and a new set of hip bones is created.  All of those changes to the base skeleton happen in order to add muscles, skin, and clothing.

As we layer things in a traditional BI project, we often try to keep a more detailed big picture in mind up front.  I know the final product is going to have two legs that bend at the knee, need to support independent rotational motion, and maintain upright stability of a 200 pound body.  That all leads to five toes, several leg bones, and hips from the very beginning.  That traditional approach results in a lot of wasted assumptions and potentially wasted work.  An agile approach would ensure that we notice early on that the business doesn't really need a bipedal mammal, but a fish, and it allows for the easy reuse of what can be kept from the skeleton (spine) and a refactoring of the other pieces (leg becomes fin, toe becomes tail).

That's a lot of metaphor there, all to say that one of the requirements of agile development is the ability to picture work in those thin vertical slices of functionality that deliver as much value to users as possible with as little commitment under the covers as necessary.  That requires both a mindset and an architecture that will allow developers to quickly refactor components in the stack without having to deal with exorbitant dependencies.  In an enterprise BI environment where source systems are feeding many systems, data warehouses have lots of application and direct user dependencies, and semantic reporting tools are tightly coupled to database objects, this ability to refactor requires a flexible architecture with clear boundaries between components.  Examples that may be useful:
  • Nothing but the job that loads a table should ever reference it directly.  Always have a layer between physical database objects and the users or user applications, even if it's a layer of "select *" views.
  • Only one job (or an integrated set of jobs) should load a given table.  That job should have a versioned interface so that source systems don't all have to be enhanced when the target table changes.
  • Each independent application should have an independent interface into the data (read: data mart, views, etc)
  • Refactoring involves moving logic between layers of the solution stack: promote something from a data mart down to an enterprise data warehouse when an opportunity for reuse is identified; demote something from enterprise data warehouse to data mart when it's clearly application specific.  Make sure that however you build your solution, you can move things between layers easily.
  • Have each layer interface with only the next layer above/below it.  Don't allow the design to cross over abstraction boundaries (e.g. having a report directly access staging tables instead of pulling the data into the data warehouse and on up the chain to the report).
  • Build as little code as necessary to get something from one abstraction layer to the next, even if that means a simple "select *" view rather than building a full ETL job with surrogate key management, SCD Type-2 logic, and data cleansing rules.  But also make sure you've built an abstraction between the data warehouse and the report so that when you add all of those features to the data warehouse, you don't necessarily have to go update all of the reports that have been built.
Those are just a few thoughts on what might be one way of laying out an architecture that will allow your BI behavior to be agile.
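The first principle above, putting a view layer between physical tables and their consumers, can be sketched in a few lines.  This uses SQLite via Python purely for illustration; the table and column names are made up, and the same idea applies to any warehouse platform:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Physical table: only its ETL load job should touch this directly.
    CREATE TABLE dw_claim (claim_id INTEGER, amount REAL);

    -- Abstraction layer: reports and downstream apps read the view,
    -- never the table, so the physical layer stays free to change.
    CREATE VIEW claim AS SELECT * FROM dw_claim;
""")
con.execute("INSERT INTO dw_claim VALUES (1, 125.50)")

# A "report" queries the view.
rows = con.execute("SELECT claim_id, amount FROM claim").fetchall()

# Refactor the physical layer: rename the table, re-point the view.
# The report's query is untouched and keeps working.
con.executescript("""
    ALTER TABLE dw_claim RENAME TO dw_claim_v2;
    DROP VIEW claim;
    CREATE VIEW claim AS SELECT claim_id, amount FROM dw_claim_v2;
""")
rows2 = con.execute("SELECT claim_id, amount FROM claim").fetchall()
```

The point of the sketch is that the second query returns exactly what the first did, even though the table underneath was renamed in between: the view is the versioned interface, and the refactoring cost is contained to one layer.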

There are probably other good architectures to support this kind of agile sashimi for BI solutions.  Remember to focus on the goal of being able to deliver as much value as possible to end users with as little effort as possible, in every release.  That's what this agile lesson is about.  You have to change how you think to get there, though.  That will be the next post.

Monday, August 23, 2010

Business Keys

I'm engaged in a project that's actively using key concepts from Dan Linstedt's Data Vault methodology.  There are lots of very powerful benefits that we seem to be realizing with this methodology, but this series of blog posts won't be particularly about use of the Data Vault.  Still, I felt it was appropriate to credit the Data Vault for helping provide a structure for our own discovery.

This first post is about the struggle to identify keys for business entities.  We set forth some fundamental principles when we started out on our latest large scale project.  First and foremost, we would "throw nothing away".  What that's meant is that we want to design the foundation of our reporting database to be a reflection not just of one department's truth, but all of the truths that might exist across the enterprise.

As a result, the design of every major entity has run into the same challenge: "what is the business key for this entity?"  Well, from System A, it's the abc code.  From System D it's the efg identifier.  But if someone put in the xyz ID that the government assigns, then you can use System D to get to this industry file that we get updated every 3 months and link that back to an MS Access database that Ms. So-and-so maintains.  Ack!  Clearly we can't just say an apple is an apple.  And clearly there's a data governance issue at play in this scenario also.

In one case, some of these are legacy systems that simply are what they are and aren't worth investing additional time and energy into.

Our data modeling challenge is to determine what the one business key would be for all the different instances of this one logical entity.  When confronted with the challenge of having no clear business key, the project wanted to "just add a source system code and be done with it."  I pushed hard against this approach for a couple of weeks, insisting that the team keep going back and working harder to dig up what could be a true business key.  Eventually, I realized that I was both working contrary to one of the original goals I'd set forth and becoming the primary roadblock to progress.

Interesting side note:  One of the better tricks of good software design is to defer decisions to the last possible minute.  If you get away without writing some piece of code, then best to put it off until you have to write it.  There's obviously some nuance and art to understanding how to leverage that.  The Strategy Pattern is a good example, though.

What I realized I was doing was trying to put a huge and potentially very dynamic piece of logic out in front of our need to simply capture the information that was being created by source systems.  So, we instituted what felt like a completely counter-intuitive design standard:  every business key would include source system code as part of the compound key; and we would defer the need to consolidate and deduplicate instances of an entity until after all the underlying data had first been captured.

Deduplicating is the immediate next process, but this allows us to be sure that we've captured all the raw information from the source system first, before throwing away the fact that there are different versions of the truth in different source systems.
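The capture-first, consolidate-later pattern can be sketched simply.  All the system names, keys, and the naive matching rule below are hypothetical, purely to illustrate the shape of the design:

```python
# Raw records as they arrive; each keeps (source_system, source_key)
# as its compound business key, exactly as the source produced it.
raw = [
    {"source": "SystemA", "key": "abc-1", "name": "Acme Corp"},
    {"source": "SystemD", "key": "efg-9", "name": "ACME Corporation"},
]

# Step 1: capture everything, keyed by the compound business key.
# Nothing is thrown away, even if two records describe the same entity.
captured = {(r["source"], r["key"]): r for r in raw}

# Step 2 (deferred): consolidate with whatever matching logic emerges
# later.  This normalized-name match is a deliberately naive stand-in.
def match_key(rec):
    return rec["name"].lower().replace("corporation", "corp").strip()

consolidated = {}
for rec in captured.values():
    consolidated.setdefault(match_key(rec), []).append(rec)
```

Because the matching logic lives in a separate, later step, it can change as often as the business's understanding changes, while the captured layer, with its source-system-qualified keys, never has to be reloaded.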

It was a very powerful lesson for us: one that felt very counter-intuitive, that we started considering for the wrong reasons, and that we finally decided to follow through on for all the right reasons!