Year-to-Date Query Improvement Through Transformation Tables

A quick comparison here between two methods of deriving year-to-date metric values from a data warehouse fact table.

Suppose we have a simple fact table of the following form:

SQL> desc fct_sales 
 Name                                      Null?    Type 
 ----------------------------------------- -------- ---------------------------- 

PRODUCT_CD                                NOT NULL NUMBER 
 LOCATION_CD                               NOT NULL NUMBER 
 DATE_OF_SALE                              NOT NULL DATE 
 SALES_AMT                                 NOT NULL NUMBER

We have a business requirement to report sales by year-to-date. This is a resource intensive query because we have to report every day’s sales against every subsequent day in the calendar (or fiscal) year.

Given a dimension table “dim_sales_dates” of the structure :

SQL> desc dim_sales_dates 
 Name                                      Null?    Type 
 ----------------------------------------- -------- ---------------------------- 

DATE_OF_SALE                              NOT NULL DATE 
 MTH_OF_SALE                               NOT NULL VARCHAR2(8) 
 QTR_OF_SALE                               NOT NULL VARCHAR2(7) 
 YR_OF_SALE                                NOT NULL VARCHAR2(4)

Method 1 relies on modifying the join between the date dimension table and the fact table to the following form:

select   d.mth_of_sale, 
         sum(sales_amt) 
from     fct_sales f, 
         dim_sales_dates d 
where    f.date_of_sale between trunc(d.date_of_sale,'yyyy') 
                            and d.date_of_sale and 
         d.mth_of_sale in ('2005 M05','2005 M06') 
group by d.mth_of_sale 
/

The potential problem with this is that the high number of joins required between the two tables and the relatively small data volume for the dimension table suggests that a hash join would be the optimum method. However as is documented in the Database Performance Tuning Guide the hash join can only be used for equijoins, for reasons that are obvious if you know how hashes work.

An alternative is to materialize the result of the BETWEEN join in a transformation table that relates DATE_OF_SALE to those dates prior to it in the year. Here is a sample structure:

SQL> desc trn_sales_dates_ytd 
 Name                                      Null?    Type 
 ----------------------------------------- -------- ---------------------------- 

DATE_IN_YTD                               NOT NULL DATE 
 DATE_OF_SALE                              NOT NULL DATE 
 MTH_OF_SALE                               NOT NULL VARCHAR2(8) 
 QTR_OF_SALE                               NOT NULL VARCHAR2(7) 
 YR_OF_SALE                                NOT NULL VARCHAR2(4)

The column DATE_IN_YTD contains all values of date from 1st January for the year of DATE_OF_SALE up to and including DATE_OF_SALE — the primary key here is of course (DATE_OF_SALE, DATE_IN_YTD).

We can now rewrite the YTD query as follows:

select   t.mth_of_sale, 
         sum(sales_amt), 
         count(*) 
from     fct_sales           f, 
         trn_sales_dates_ytd t 
where    f.date_of_sale = t.date_in_ytd and 
         t.mth_of_sale in ('2005 M05','2005 M06') 
group by t.mth_of_sale 
/

The hash join is now permitted and theoretically the query performance can improve.

Here is a complete test script, and here is the result of an execution on Oracle 10.2.0.2.0.

The demonstration shows the following:

  1. Use of a nested loop in method 1 anda hash join in method 2
  2. A drop in the consistent gets from 109,971 to 2,550
  3. An improvement in wall clock time from 18.51 seconds to 4.98 seconds.

Conclusion

Implementing transformation tables to allow the use of equijoins in data warehouse queries can provide performance benefits that make them well-worth considering.

Advertisements

19 thoughts on “Year-to-Date Query Improvement Through Transformation Tables

  1. Giving away my secrets? ;-)
    The transform table gets a little more interesting if you also put in YTD (last year) and various moving totals.

    Have you also tried functions to return a table of dates? Worked nicely in SQL but a bit awkward to use in BO.

  2. Heh, I’m sure you have plenty more Pete.

    “… functions to return a table of dates …” — sounds like an ideal topic for a blog entry. I look forward to it :D

  3. So, the transformation table is created and loaded once during warehouse load time. Hence, no need to include the LIO for the transformation table creation/load?

    Mat.

  4. The transform table gets a little more interesting if you also put in YTD (last year) and various moving totals.

    The very first report I wrote as a junior programmer was that (in Basic+/DMS500). 20+ years and several generations of technology later, I had a brief contract to make a very similar report written in a db-independent 4GL finish in a useable time. The feelings of Sisyphusian deja vu were almost overwhelming. The funny thing is, that latter customer just refuses to use DW technologies or modern SQL features (they have EE). The result was, and remains, awk, sort and SQL in korn shell scripts to load some “special tables.”

  5. Mat, yes the transformation table could either be populated in advance, or my own preference would be to maintain it by adding in new dates only as required. Either way the resources required to maintain it are insignificant in relation to the resources saved.

    Joel, it’s a funny thing, when Luddites are put in charge of data warehouses. Strange things happen.

  6. Pingback: Pete-s random notes » Blog Archive » Year-to-date alternatives

  7. Pingback: The Oracle Sponge » Blog Archive » Whoops!

  8. I don’t quite get it, take the example below for data that has a yearmonth grain. if i would like the yearmonth totals for 2006Q1 and 2006Q2 i would get wrong results because I would get duplicates of the yearmonth in ytd column that will be joined to the facts…

    regards,
    Rogier Werschkull
    ————
    yearmonth in ytd yearmonth quarter
    200601 200601 2006Q1
    200601 200602 2006Q1
    200602 200602 2006Q1
    200601 200603 2006Q1
    200602 200603 2006Q1
    200603 200603 2006Q1
    200601 200604 2006Q2
    200602 200604 2006Q2
    200603 200604 2006Q2
    200604 200604 2006Q2
    200601 200605 2006Q2
    200602 200605 2006Q2
    200603 200605 2006Q2
    200604 200605 2006Q2
    200605 200605 2006Q2
    200601 200606 2006Q2
    200602 200606 2006Q2
    200603 200606 2006Q2
    200604 200606 2006Q2
    200605 200606 2006Q2
    200606 200606 2006Q2

  9. Rogier,

    The month always has to be included in the select clause, and if you have any other higher dimensional columns in the YTD table they would be purely for predicating — for example, “give me the year-to-month totals for all months in quarter 2006 Q3”.

    If you wanted to apply the same method to year-to-quarter queries you’d use a different table ….

    2006Q1 2006Q1
    2006Q2 2006Q1
    2006Q2 2006Q2
    2006Q3 2006Q1
    2006Q3 2006Q2
    2006Q3 2006Q3
    2006Q4 2006Q1
    2006Q4 2006Q2
    2006Q4 2006Q3
    2006Q4 2006Q4

  10. Pingback: Time series problems « Pete-s random notes

  11. I can’t get to the complete test script, and the result of an execution on Oracle 10.2.0.2.0. Please correct link. thanks.

  12. Pingback: Rittman Mead Consulting » Blog Archive » Year-to-date alternatives

  13. Transformation tables?? I hope i am not misunderstanding this, but you are just refering to having smart time/date dimensions which make these kinds of conditional summing time/date wise possible

    date year period month first_day_of_year first_day_of_month first_day_of_week previous_workday
    20080609 2008 200806 06 20080101 20080301 20080609 20080606
    20080610 2008 200806 06 20080101 20080301 20080609 20080609
    20080611 2008 200806 06 20080101 20080301 20080609 20080610

    This can be extended to all the usefull conditional summing you might need, as in for example the normally hard to solve business day summing: give me the total of the last 5 business days…

    Having pre-generated time/date dimensions like this makes conditional summing so much easier.

    Just don’t understand why people call them ‘transformation’ tables, that just doesn’t make sense. What is there to ‘transform’?

  14. The issue here is that although it is expedient to code a non-equijoin to implement a metric such as a year-to-date sales sum, it is generally better to create a table to implement the requirement as equi-joins because that allows the optimizer to implement the query using the extremely low cost hash join.

    As for the name, I’m not particularly attached to “transformation” … a rose by any other name would smell as sweet, right? I seem to recall that the last time I used Microstrategy they refered to these types of metrics as “transcendental”.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s