Friday, November 13, 2009

Avoiding Self-Joins on Large Tables

With data warehousing solutions, I've run into the case of needing to do self-joins on large tables, a surprisingly large number of times.  A couple of examples might include something like:

  • Customer Addresses Table
    • One row in the table per address
    • Addresses are categorized by a type of home, work, or other
    • Question is to know how many of your customers work and live in different zip codes
    • If either address is not present, then include the record
The first solution that probably comes to mind is something like the following join between all of the home addresses and all of the work addresses.


SELECT 
  COALESCE(h.customer_id, w.customer_id) AS customer_id, 
  h.zip AS home_zip,
  w.zip AS work_zip
FROM
  customer_addr h LEFT OUTER JOIN
  customer_addr w ON 
    h.customer_id = w.customer_id AND
    h.addr_type = 'home' AND
    w.addr_type = 'work'

WHERE
  (h.zip <> w.zip) OR h.zip IS NULL OR w.zip IS NULL.

Another, more efficient way to execute the same query in a single pass of the database.


SELECT
  customer_id,
  MIN( CASE( WHEN ADDR_TYPE_CD = 'h' THEN zip END)) AS home_zip,
  MIN( CASE( WHEN ADDR_TYPE_CD = 'w' THEN zip END)) AS work_zip
FROM cust_addr
GROUP BY customer_id;


In the new code, the query will perform a single set of operations against the table once.



#pragprowrimo word count: 18,002

No comments:

Post a Comment