Wednesday, November 18, 2009

Analysis of Algorithms for Data Warehousing

This topic was covered briefly in a previous post, but understanding algorithms in SQL can be very powerful.  I hope to shed some additional light on this concept with this post.

Contrary to what many programmers will tell you, SQL can be a very powerful language.  Like any language, it has it's limitations and idiosyncrasies, and despite what database vendors might tell you, the database doesn't always know the best way to execute a particular query.  Database query optimizers are limited by their imperfect knowledge of what a user wants via SQL and the imperfect statistics that it has about the underlying data.

One option is to use a hinting feature that some database management systems provide.  Hinting gives the SQL developer a means to instruct the database what order to perform joins or which indexes to use or not use.  It can be powerful, but it also depends greatly on the demographics of the underlying data.  If the data changes dramatically, then the same hints may no longer provide the optimal execution plan.

Another way to improve SQL performance is to reconsider the algorithm you write as a developer.  Like any programming language, SQL typically provides you more than one way to solve any particular problem.  A developer's ability to choose the most optimal implementation depends on knowledge of the language and experience with mathematical techniques like Big-O notation.  Data warehouses can be particularly challenging environments to do this type of analysis of algorithms because the queries being posed can be very complex.  Unraveling the business logic and reimplementing it with another SQL algorithm can become very complicated.

Consider the following typical example, though.  You have a transaction table that records all of he individual states that a particular patient may go through as they stay their course in a hospital:

PATIENT_ID
VISIT_ID
EVENT_ID
EVENT_TYPE
EVENT_TS

A typical question would be "What is the duration, start, and end time for each patient's visit?"  You might choose to write this by joining the event table to itself, on the one side filtering for "admit" event types and on the other side filtering for "discharge" event types.

SELECT
  a.patient_id, a.visit_id,
  a.event_ts, d.event_ts, d.event_ts - a.event_ts
FROM
  event a,
  event b
WHERE
  a.patient_id = b.patient_id AND
  a.visit_id = b.visit_id AND
  a.event_type = 'admit' AND
  d.event_type = 'discharge'

This SQL is a straight forward algorithm that says I want to match up all the admit records with their appropriate discharge records and return the matched records.  Assuming that most of the records are either admit or discharge, the database is likely to do a full table scan and filter for the admits, and use an index lookup to retreive all the corresponding discharge rows using the patient_id / visit_id primary key.  The Big-O notation for that would be something like O( n log n ).  That's O( n ) for the table scan times O( log n ) for the index tree search to find the matching record.

A completely different SQL approach to solving the same problem might be to aggregate the admin and discharge rows together in a single pass through the table and then compute the results from that.  That algorithm might be written something like this:

SELECT
  patient_id, visit_id,
  MIN(CASE WHEN (event_type='admin') THEN event_ts END) as admit_ts,
  MIN(CASE WHEN (event_type='discharge') THEN event_ts END) as discharge_ts,
  discharge_ts - admit_ts as duration
FROM
 event
GROUP BY
  patient_id, visit_id

This query will scan through the table once and build the patient_id/visit_id groups in stream along the way.  Computing these results is an O( n ) operation -- significantly more efficient than the common O( n log n ) algorithm.

While SQL is not a particular robust or flexible language, there are still many inherent opportunities to write either very efficient or very inefficient code.

Some powerful tools of the trade to learn in SQL optimization include:
  • Single-pass alternatives (as in the example)
  • Recursive SQL (e.g. with recursive or connect by)
  • Analytical functions and options (e.g. aggregates with their own partition by / order by)
  • Multi-grouping extensions (e.g. cube and rollup)
  • How these are affected by different database technologies (MPP, columnar, associative)

#pragprowrimo word count: 30,137

No comments:

Post a Comment