Showing posts with label Fun With SQL. Show all posts
Showing posts with label Fun With SQL. Show all posts

Tuesday, December 15, 2009

Fun with Recursive SQL (Part 3)


This blog post, Part 1, and Part 2 are derived from an article that I wrote as an article for Teradata Magazine and for Teradata Partners conference about fun uses for recursive SQL.

Finally, to the really interesting example of how to use recursive SQL productively.  In this example, we're dealing with segments of time that are defined with a start and end time.  Imagine staggered work shifts or a work schedule that specifies when a worker is scheduled to take breaks.

Imagine a series of time segments, like those shown in the picture below.  Each segment has a start and end time as well as a priority level.




The desired output from this data is a new, single, continuous time line that denotes only the highest priority segment for any particular clock time.



Here's a simple summary of the recursive algorithm that we're going to apply to solve this challenge:
  • For each segment at a given priority level
  • Join to the first segment that appears from a higher priority level that overlaps in any way
  • Calculate the results of the overlap.  This may result in:
    • A head segment
    • A tail segment
    • Or two new segments

We'll start off with a table that has the following definition and sample data:
create table tl (
  seg_num integer not null,
  p integer not null,
  st_tm integer not null,
  end_tm integer not null
);

insert into tl values (1,1,100,400);
insert into tl values (2,1,445,600);
insert into tl values (3,3,130,200);
insert into tl values (4,3,315,415);
insert into tl values (5,2,345,500);


You can find the first layer of overlaps that will cause splicing of any given segment using a self-join that matches every time segment with any time segments that:
  • have a higher priority; and
  • either:
    • have a start time between the current segment's start/end time; or
    • have an end time between the current segment's start/end times.
You can see the three conditions in the WHERE clause below:

select a.seg_num, 
       row_number() over(order by b.st_tm) rnk,
       b.seg_num, b.p, b.st_tm, b.end_tm
from   tl a,
       tl b
where  a.p < b.p
  and  ((a.st_tm  < b.st_tm  and a.end_tm > b.st_tm ) or
        (a.st_tm  > b.st_tm  and a.st_tm  < b.end_tm))

The trick is that we need to do the same kind of operation on the output of this first statement to determine if there are other, higher priority segments that need to splice these further. So, we recursively call the same kind of SQL on the output. In the first attempt at the recursive version, we also add some CASE logic to output exactly what the newly created slices are.

with recursive tl_rec (
  seq_num, seg_num, p, st_tm, end_tm, b_seg_num
) as (
  select 1, seg_num, p, st_tm, end_tm, -1
  from   tl 
  union all
  select a.seq_num + 1, a.seg_num, a.p,
         case when a.st_tm  < b.st_tm and 
                   a.end_tm < b.end_tm then a.st_tm  
              when a.st_tm  < b.st_tm and 
                   a.end_tm > b.end_tm then a.st_tm
              when a.st_tm  > b.st_tm and 
                   a.end_tm > b.end_tm then b.end_tm end,
         case when a.st_tm  < b.st_tm and 
                   a.end_tm < b.end_tm then b.st_tm  
              when a.st_tm  < b.st_tm and 
                   a.end_tm > b.end_tm then b.st_tm 
              when a.st_tm  > b.st_tm and 
                   a.end_tm > b.end_tm then a.end_tm end,
         b.seg_num
  from   tl_rec a,
         tl b
  where  a.p < b.p
    and  a.b_seg_num <> b.seg_num
    and  ((a.st_tm  < b.st_tm  and a.end_tm > b.st_tm ) or
          (a.st_tm  > b.st_tm  and a.st_tm  < b.end_tm))
)
select * from (
  select 
    row_number() over(partition by seg_num 
                      order by seq_num desc) rnk,
    seg_num, p, st_tm, end_tm
  from tl_rec ) z
where rnk = 1
;

And you get the following output records:



But wait!  There's something missing.  There should be a left-over LOW segment there between the two HIGH's.  We lost something along the way... Hmm...  The output of the join we've defined can only be one record.  So, when a higher priority segment is contained within the start/end of a lower priority segment, we're only returning the leading piece of the LOW segment.  There's no way for that join to output two records with different logic, though.  Luckily, with recursive SQL, we can have more than one recursive statement.  So, we just need to add another UNION ALL to output the trailing LOW segment.

with recursive tl_rec (
  seq_num, seg_num, seg_split_num, p, 
  st_tm, end_tm, b_seg_num
) as (

  ... repeat from previous SQL ...
  
  union all
  select a.seq_num + 1, a.seg_num, a.seg_split_num+1, 
         a.p, b.end_tm, a.end_tm, b.seg_num
  from   tl_rec a,
         tl b
  where  a.p < b.p
    and  a.b_seg_num <> b.seg_num
    and  (a.st_tm  < b.st_tm  and a.end_tm > b.end_tm)
)
select * from
(
  select 
    row_number() over(partition by seg_num, seg_split_num 
                      order by seq_num desc) rnk,
    seg_num, seg_split_num p, st_tm, end_tm
  from tl_rec
) z
where rnk = 1
;

Phew! With that combined SQL, now we finally get the expected output!




Personally, I recommend studying through that a few times so that you've really internalized the strategy. It certainly took me a while to actually get it to work correctly. Then look around at your existing projects or challenges you've faced in the past and see where else you can apply this same kind of recursive technique.

Good luck!

Saturday, December 12, 2009

Fun with Recursive SQL (Part 2)


This blog post and Part 1 are derived from an article that I wrote as an article for Teradata Magazine about fun uses for recursive SQL.

Part 1 of this series described a way to use recursive SQL to build up a single comma-separated field from a set of rows that were all part of the same grouping. Part 2 describes how to do just the inverse: take a field that has a comma-separated list of values and produce one row for each of the items in that list. What I refer to as "string normalization."

In the code below, the seed statement is simply a select * from our source table. The recursive statement takes the field to be pivoted into multiple rows and uses the POSITION() function to repeatedly pull off just the first item in the list. The recursion continues until our POSITION() calculation in the "dpos" field tells us the field is blank.

The context for this example is a set of survey responses in a table that contains a RESPONSE_KEY (just some primary key) and RESPONSE_TXT (a comma-separated list of responses). The SQL below parses that RESPONSE_TXT field into a single elements and outputs one row per element along with the associated RESPONSE_KEY.

WITH RECURSIVE parse_list
       (response_key, delim_pos, item_num, element, remainder) AS 
(
    SELECT response_key, 0, 0, CAST('' AS VARCHAR(100)), response_txt
    FROM   response
    UNION ALL
    SELECT response_key,
           CASE WHEN POSITION(',' IN remainder) > 0
             THEN POSITION(',' IN remainder)
             ELSE CHARACTER_LENGTH(remainder) END dpos,
           item_num + 1,
           TRIM(BOTH ‘,’ FROM SUBSTR(remainder, 0, dpos+1)),
           TRIM(SUBSTR(remainder, dpos+1))
    FROM   parse_list
    WHERE  dpos > 0 
) 
SELECT response_key, element 
FROM   parse_list p
WHERE  item_num > 0
ORDER BY response_key, item_num;

If the input RESPONSE table had the following data:
RESPONSE_KEY   RESPONSE_TXT
643            eggs, milk, toast
872            cereal, juice, bagel, cream cheese

Then the output would look like this:
RESPONSE_KEY  ELEMENT
643           eggs
643           milk
643           toast
872           cereal
872           juice
872           bagel
872           cream cheese

If you're new to recursive SQL, take the time to read, try, and experiment with these first two uses. Part 3 of this series will describe a way to consolidate multiple overlapping time segments, using some specified order or precidence, to create a single consolidated time line. That sounds confusing and potentially worthless until you see the real world example.

Friday, December 11, 2009

Fun with Recursive SQL (Part 1)


This blog post is derived from an article that I wrote as an article for Teradata Magazine about fun uses for recursive SQL.

Show of hands: how many of you have used recursive SQL in the past six months? For something useful?  Using ANSI syntax rather than Oracle's CONNECT BY?  If you said "yes" to all three, then congratulations!  You've won a prize: 15 minutes of your day back.  You can jump down to the bottom of this post and just review my fun example.  This first part is just an introduction to recursive SQL.

Introduction to Recursive SQL
First let's just review the definition for "recursion."  Handy-dandy dictionary here.... recourse... rectify.... recursion:
See: recursion
Not very helpful.  How about this instead.  A recursion is a method of solving problems in which the algorithm or function applies itself in the execution of the solution.  For instance, if I were to define a recursive algorithm for summarizing a list of numbers, I could say something like "the sum of a list is determined by adding the first number in the list to the sum of all the remaining numbers in the list; if the list is empty, then the sum is zero."  You notice how the definition of "sum of a list" includes the use of the "sum" operation, as in "sum of all the remaining numbers."  You'll also note that there's an explicit terminal condition to define what happens when you have nothing in the list.

So, recursive SQL works the same way.  A recursive SQL statement is one in which the determination of the result set requires the execution of SQL on the result set itself.  In recursive SQL, there has to be a seed statement, as well as the recursive call, and a termination condition.

WITH RECURSIVE [temp table] [column list] AS
(
  [seed statement]
  UNION ALL
  [recursive statement]
)

A Simple Example
For the sake of comfort, a simple example might involve an employee hierarchy table where each row had the employee ID and associated manager ID.  A standard query could retrieve a list of direct reports for a given employee ID:  SELECT * FROM employee WHERE mgr_id = 123.  A recursive query, however, could return the list of all employees anywhere under the hierarchy below that same manager.  An example that might be:
WITH RECURSIVE emp_hier (emp_id, mgr_id, level) AS
(
SELECT a.emp_id, a.mgr_id, 0 
FROM   employee a
WHERE  a.emp_id = 123
UNION ALL
SELECT b.emp_id, b.mgr_id, c.level+1 
FROM   employee b,
       emp_hier c
WHERE  b.mgr_id = c.emp_id
)
SELECT e.emp_title, e.emp_id, e.mgr_id, h.level
FROM   employee e,
       emp_hier h
WHERE  e.emp_id = h.emp_id
  AND  e.emp_id <> 123;

In this query, the is a simple select that pulls the manager record for which we want all of the descendants.  The selects all records from the employee table where the employee is managed by someone already in the emp_hier temporary table – hence the join on employee.mgr_id and emp_hier.emp_id.

On to some more exciting examples...


Example 1: String Concatenation
Ever notice that there are no string aggregation functions in SQL?  Something like a SUM() for character fields that would take two parameters: the field or expression from each row and a delimiter.  Usually, programmers end up having to write code outside of the database to do that kind of string manipulation. Recursive SQL can achieve that kind of character aggregation quite nicely if used appropriately.

Consider a list of email addresses for several members of the same family.  The table has a FAMILY_ID, PERSON_ID, SEQuence number, and EMAIL_ADDR.  The desired output is a single row that contains a comma-separated list of all the email addresses for the entire family (for a mailing list perhaps).

WITH RECURSIVE email_list (coverage_id, emails, seq) AS
(
  SELECT m.family_id, m.email_addr, m.seq
  FROM   members m
  WHERE  m.seq = 1
  UNION ALL
  SELECT e.emails ||’, ’|| m.email_addr, m.seq
  FROM   email_list e, members m
  WHERE  m.seq = e.seq + 1
    AND  m.family_id = e.family_id
)
SELECT r.family_id, r.emails
FROM   email_list r
WHERE  r.seq = (SELECT MAX(h.seq) FROM email_list h 
                WHERE h.family_id = r.family_id 
                GROUP BY h.family_id)
;
Let's walk through the SQL piece by piece.  First, the seed statement starts the result set off by picking just the first person from each family (that is, the people with a SEQ=1).  That sequence number is admittedly a bit artifical and contrived to make the recursive SQL easier to write.  If you needed to, you could generate the SEQ on the fly using a RANK() of people within each family and then leverage the RANK() as a sequence number.  Hopefully the example isn't lost because of that bit of confusion.

The recursive portion of the SQL simply takes what was seeded in through the first query (i.e. people with SEQ=1) and joins that to all the records in the EMAIL_ADDR table with a SEQ = SEQ + 1 (from result set).  The EMAILS output field is created by concatenating the email address from the seed to record with the next highest sequence number.

Then that result set is taken and the same thing is done.  And so on.  And so forth.

The query terminates when the result set contains no records.  That is, when we exhaust all the SEQ numbers in the table.

The outer query refines the results down to exactly what we're looking for by keeping the final record and throwing away all the intermediate records that were used to built up the results.

Next Time
Look ahead for additional posts on how to use recursive SQLto parse a field with comma separated values and how to split and reconcile overlapping time segments into a single, flat time line.