Year-to-Date Query Improvement Through Transformation Tables
Posted by David Aldridge on 2006-08-15
A quick comparison here between two methods of deriving year-to-date metric values from a data warehouse fact table.
Suppose we have a simple fact table of the following form:
SQL> desc fct_sales Name Null? Type ----------------------------------------- -------- ---------------------------- PRODUCT_CD NOT NULL NUMBER LOCATION_CD NOT NULL NUMBER DATE_OF_SALE NOT NULL DATE SALES_AMT NOT NULL NUMBER
We have a business requirement to report sales by year-to-date. This is a resource intensive query because we have to report every day’s sales against every subsequent day in the calendar (or fiscal) year.
Given a dimension table “dim_sales_dates” of the structure :
SQL> desc dim_sales_dates Name Null? Type ----------------------------------------- -------- ---------------------------- DATE_OF_SALE NOT NULL DATE MTH_OF_SALE NOT NULL VARCHAR2(8) QTR_OF_SALE NOT NULL VARCHAR2(7) YR_OF_SALE NOT NULL VARCHAR2(4)
Method 1 relies on modifying the join between the date dimension table and the fact table to the following form:
select d.mth_of_sale,
sum(sales_amt)
from fct_sales f,
dim_sales_dates d
where f.date_of_sale between trunc(d.date_of_sale,'yyyy')
and d.date_of_sale and
d.mth_of_sale in ('2005 M05','2005 M06')
group by d.mth_of_sale
/
The potential problem with this is that the high number of joins required between the two tables and the relatively small data volume for the dimension table suggests that a hash join would be the optimum method. However as is documented in the Database Performance Tuning Guide the hash join can only be used for equijoins, for reasons that are obvious if you know how hashes work.
An alternative is to materialize the result of the BETWEEN join in a transformation table that relates DATE_OF_SALE to those dates prior to it in the year. Here is a sample structure:
SQL> desc trn_sales_dates_ytd Name Null? Type ----------------------------------------- -------- ---------------------------- DATE_IN_YTD NOT NULL DATE DATE_OF_SALE NOT NULL DATE MTH_OF_SALE NOT NULL VARCHAR2(8) QTR_OF_SALE NOT NULL VARCHAR2(7) YR_OF_SALE NOT NULL VARCHAR2(4)
The column DATE_IN_YTD contains all values of date from 1st January for the year of DATE_OF_SALE up to and including DATE_OF_SALE — the primary key here is of course (DATE_OF_SALE, DATE_IN_YTD).
We can now rewrite the YTD query as follows:
select t.mth_of_sale,
sum(sales_amt),
count(*)
from fct_sales f,
trn_sales_dates_ytd t
where f.date_of_sale = t.date_in_ytd and
t.mth_of_sale in ('2005 M05','2005 M06')
group by t.mth_of_sale
/
The hash join is now permitted and theoretically the query performance can improve.
Here is a complete test script, and here is the result of an execution on Oracle 10.2.0.2.0.
The demonstration shows the following:
- Use of a nested loop in method 1 anda hash join in method 2
- A drop in the consistent gets from 109,971 to 2,550
- An improvement in wall clock time from 18.51 seconds to 4.98 seconds.
Conclusion
Implementing transformation tables to allow the use of equijoins in data warehouse queries can provide performance benefits that make them well-worth considering.
2006-08-15 at 2:52 pm
Giving away my secrets? ;-)
The transform table gets a little more interesting if you also put in YTD (last year) and various moving totals.
Have you also tried functions to return a table of dates? Worked nicely in SQL but a bit awkward to use in BO.
2006-08-15 at 3:12 pm
Heh, I’m sure you have plenty more Pete.
“… functions to return a table of dates …” — sounds like an ideal topic for a blog entry. I look forward to it :D
2006-08-16 at 6:25 am
So, the transformation table is created and loaded once during warehouse load time. Hence, no need to include the LIO for the transformation table creation/load?
Mat.
2006-08-16 at 8:04 am
The transform table gets a little more interesting if you also put in YTD (last year) and various moving totals.
The very first report I wrote as a junior programmer was that (in Basic+/DMS500). 20+ years and several generations of technology later, I had a brief contract to make a very similar report written in a db-independent 4GL finish in a useable time. The feelings of Sisyphusian deja vu were almost overwhelming. The funny thing is, that latter customer just refuses to use DW technologies or modern SQL features (they have EE). The result was, and remains, awk, sort and SQL in korn shell scripts to load some “special tables.”
2006-08-16 at 10:06 am
Mat, yes the transformation table could either be populated in advance, or my own preference would be to maintain it by adding in new dates only as required. Either way the resources required to maintain it are insignificant in relation to the resources saved.
Joel, it’s a funny thing, when Luddites are put in charge of data warehouses. Strange things happen.
2006-08-16 at 3:01 pm
[...] David Aldridge challenged me to write about functions that return lists of dates. So here goes [...]
2006-08-17 at 12:05 pm
[...] Year-to-Date Query Improvement Through Transformation Tables [...]
2006-08-18 at 1:38 pm
Thanks, good post. Mike Kruckenberg mentions it in Log Buffer #6.
Cheers,
Dave Edwards
Log Buffer
2006-10-11 at 3:34 am
I don’t quite get it, take the example below for data that has a yearmonth grain. if i would like the yearmonth totals for 2006Q1 and 2006Q2 i would get wrong results because I would get duplicates of the yearmonth in ytd column that will be joined to the facts…
regards,
Rogier Werschkull
————
yearmonth in ytd yearmonth quarter
200601 200601 2006Q1
200601 200602 2006Q1
200602 200602 2006Q1
200601 200603 2006Q1
200602 200603 2006Q1
200603 200603 2006Q1
200601 200604 2006Q2
200602 200604 2006Q2
200603 200604 2006Q2
200604 200604 2006Q2
200601 200605 2006Q2
200602 200605 2006Q2
200603 200605 2006Q2
200604 200605 2006Q2
200605 200605 2006Q2
200601 200606 2006Q2
200602 200606 2006Q2
200603 200606 2006Q2
200604 200606 2006Q2
200605 200606 2006Q2
200606 200606 2006Q2
2006-10-11 at 8:34 am
Rogier,
The month always has to be included in the select clause, and if you have any other higher dimensional columns in the YTD table they would be purely for predicating — for example, “give me the year-to-month totals for all months in quarter 2006 Q3″.
If you wanted to apply the same method to year-to-quarter queries you’d use a different table ….
2006Q1 2006Q1
2006Q2 2006Q1
2006Q2 2006Q2
2006Q3 2006Q1
2006Q3 2006Q2
2006Q3 2006Q3
2006Q4 2006Q1
2006Q4 2006Q2
2006Q4 2006Q3
2006Q4 2006Q4
2006-10-11 at 9:31 am
Thank you David, it is clear to me now!
2007-03-14 at 2:29 am
it is clear to me too!!!
thanks
2007-06-14 at 12:19 pm
[...] series problems A while back David Aldridge and I both blogged about the joys of time series analysis on sparse data; that is, the sort of data [...]
2007-07-17 at 1:57 pm
I can’t get to the complete test script, and the result of an execution on Oracle 10.2.0.2.0. Please correct link. thanks.
2008-03-17 at 9:25 am
[...] Aldridge challenged me to write about functions that return lists of dates. So here [...]
2008-06-03 at 1:12 am
hai all i want pervious quarter start date and end date
2008-06-10 at 9:26 am
Transformation tables?? I hope i am not misunderstanding this, but you are just refering to having smart time/date dimensions which make these kinds of conditional summing time/date wise possible
date year period month first_day_of_year first_day_of_month first_day_of_week previous_workday
20080609 2008 200806 06 20080101 20080301 20080609 20080606
20080610 2008 200806 06 20080101 20080301 20080609 20080609
20080611 2008 200806 06 20080101 20080301 20080609 20080610
…
This can be extended to all the usefull conditional summing you might need, as in for example the normally hard to solve business day summing: give me the total of the last 5 business days…
Having pre-generated time/date dimensions like this makes conditional summing so much easier.
Just don’t understand why people call them ‘transformation’ tables, that just doesn’t make sense. What is there to ‘transform’?
2008-06-10 at 10:33 am
The issue here is that although it is expedient to code a non-equijoin to implement a metric such as a year-to-date sales sum, it is generally better to create a table to implement the requirement as equi-joins because that allows the optimizer to implement the query using the extremely low cost hash join.
As for the name, I’m not particularly attached to “transformation” … a rose by any other name would smell as sweet, right? I seem to recall that the last time I used Microstrategy they refered to these types of metrics as “transcendental”.