Higher Dimensional Values In Fact Tables

update: 24-Jul-2005

Jonathan Lewis writes to educate me on function-based indexes and statistics thereon … it is indeed possible to gather histogram data on them through the mechanism of the DBMS_Stats method_opt => 'for all hidden columns' — who'd a thunk it? It's documented here. I hope it doesn't sound like sour grapes if I wish that this were also documented here.

In this article I dismissed the use of function-based indexes (for the purposes outlined) based on the inability of Oracle to generate histograms on them — whether this new-to-me information changes that judgement will take some further thought and analysis. I may move it to another blog entry once I've reassessed the issue, as an analysis of the costs and benefits of function-based indexes vs. "materializing" the function result in the table rather appeals to me.

Thanks again, Jonathan.

introduction

In getting performance out of an Oracle database, and in data warehousing (DWh) in particular, we live or die by making effective use of the cost-based optimizer (CBO). As I wrote before, "The CBO is based on statistics — statistics about tables, statistics about indexes, statistics about the distribution of values within a column". So we owe it to ourselves to go the extra mile in our search for statistical accuracy, and sometimes we might have to bend a few rules to do so.

This thought was prompted by the rebuilding of a major fact table, primarily in order to change its partitioning scheme, but the opportunity for some performance tweaking is not to be sniffed at. Accordingly, I have a plan … a cunning plan if I may say so. A plan that may make database purists squirt coffee from their noses, and I consider that to be fair warning so let's press on.

the scenario

The fact table in question has a number of dimensional values for which the parent values are deterministically based on the lowest value itself. For example, DATE_OF_DAY has parent values of DATE_OF_MONTH, FISCAL_MONTH_CD, FISCAL_QUARTER_CD, FISCAL_YEAR_CD. Given a particular value of DATE_OF_DAY the calculation of the other four values is a simple matter, for example:

FISCAL_MONTH_CD =
To_Char(Add_Months(DATE_OF_DAY, 3), 'YYYY" M"MM')

So for 04-Dec-2004 the fiscal month is "2005 M03".
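For anyone wanting to check the derivation, it can be run directly against dual (the three-month offset implies a fiscal year beginning in October):

```sql
-- Derive the fiscal month code for a sample day-level date.
-- Shifting the date forward by three months maps an
-- October-to-September fiscal year onto calendar year/month.
select to_char(
         add_months(to_date('04-12-2004', 'DD-MM-YYYY'), 3),
         'YYYY" M"MM'
       ) as fiscal_month_cd
from dual
/
```

… which returns '2005 M03', as above.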

(Did you know that you can insert arbitrary strings of characters in a date-to-char conversion using double-quotes? A lot of people are surprised by that. Surprisingly.)

These parent values are stored in a dimension table, possibly snowflaked where required by the presence of summary tables at the DATE_OF_MONTH level, for example.

Now here's the problem that I'm trying to solve … when a report query is submitted to the database and includes a predicate such as FISCAL_MONTH_CD = '2005 M04', and the query is directed to the fact table at the DATE_OF_DAY level, Oracle can only take the most generic estimation of the cardinality of the result set — histograms on FISCAL_MONTH_CD in the dimension table and on DATE_OF_DAY in the fact table are not effective in this situation.

a demonstration script

set feedback off heading off

drop table dave_fct;

create table dave_fct
pctfree 0 nologging
as
select * from dba_objects;

begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_FCT',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254'
);
end;
/

drop table dave_dim
/
create table dave_dim -- lovely name!
(owner primary key,
owner_2char not null)
as
select distinct
owner,
substr(owner,1,2)
from dave_fct
/

begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_DIM',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254,
OWNER_2CHAR SIZE 254'
);
end;
/

set autotrace on

-- QUERY 1
select count(*)
from dave_fct
where owner = 'SYS'
/

-- QUERY 2
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner = 'SYS'
/

-- QUERY 3
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner_2char = 'SY'
/

results, and their implications

My results show that the statistics for queries 1 and 2 are pretty much spot on … the actual number of rows and the estimated cardinality were within about 3% — for query 3 the optimizer overestimated the cardinality by 324%. So what does this mean? Potentially it could prompt the optimizer to perform a full table scan instead of an index scan (standard disclaimer: not that I'm saying that full scans are necessarily a bad thing).

some possible solutions

So how do you get the optimizer better statistics?

One method would be to use a query such as:
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner like 'SY%'
/
… which gives a very acceptable estimation in this case.

But this is a special case with its own advantages and disadvantages:

  • The "like" predicate is compatible with the use of histograms
  • This is a difficult technique to use for an In List query

So maybe …
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and substr(dd.owner,1,2) = 'SY'
/
Nope … the estimated cardinality is still over 300% too high.

How about creating a function-based index on the table?
create bitmap index dave_fct_ind on dave_fct (substr(owner,1,2))
compute statistics
/
Fast access on the query:
select count(owner)
from dave_fct
where substr(owner,1,2) = 'SY'
/
… but again no accurate statistics, because there is no histogram on the index values, only on the table values. (Maybe an enhancement request … hmmm …) Note: see the update above identifying this as incorrect information.

a proposed solution

One method that I'm going to try is to push the higher level dimensional values down to the fact table. In our example, by placing the owner_2char attribute in the fact table and linking it to a snowflaked dimension table with owner_2char as the primary key we can use a histogram directly on that value.

So the demonstration script becomes:

drop table dave_fct;

create table dave_fct pctfree 0 compress nologging
as
select substr(do.owner,1,2) OWNER_2CHAR,
do.* from dba_objects do;

begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_FCT',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254,
OWNER_2CHAR SIZE 254'
);
end;
/

drop table dave_dim
/

create table dave_dim
(owner primary key,
owner_2char not null)
as
select distinct
owner,
owner_2char
from dave_fct
/

create table dave_dim_p
(owner_2char primary key)
as
select distinct
owner_2char
from dave_fct
/

begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_DIM',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254,
OWNER_2CHAR SIZE 254'
);
end;
/

begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_DIM_P',
METHOD_OPT => 'FOR COLUMNS OWNER_2CHAR SIZE 254'
);
end;
/

set autotrace on

select count(*) from dave_fct df, dave_dim_p dd
where df.owner_2char = dd.owner_2char
and dd.owner_2char = 'SY'
/

results of the proposal

Now, maybe because there are fewer distinct values of owner_2char than owner (17 instead of 25), or maybe because of blind luck, the estimation is actually better with predicates based on the former than it is on the latter … in my case an error of 0.54% instead of 3%.

Another advantage is the ability to miss out the join to the dimension table, of course. In my example case I got a result in 0.12 seconds instead of 0.23 by eliminating the dimension table from the query altogether — it would bear testing on a larger example though, as wall clock times that low are not enormously significant.
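For example, with owner_2char materialized in the fact table the snowflake join can simply be dropped:

```sql
-- The histogram on DAVE_FCT.OWNER_2CHAR supplies the cardinality
-- estimate directly; no join to DAVE_DIM_P is needed.
select count(*)
from dave_fct df
where df.owner_2char = 'SY'
/
```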

So, downsides to this approach. Well, as the higher values are deterministic upon the lower values there is little ETL overhead – no lookups to be performed, for example – so it doesn't seem to make ETL more complex. It does increase the row length by some amount, although that could be offset with data segment compression.

a common case for improvement

Now I particularly like the idea of this approach for date dimensions. As the dates in a table progress forwards in time there is varying skew on the higher levels of the dimension. For example if you have loaded data from 2001 up to the middle of January 2005 then two queries that filter on "Year = 2004" and "Year = 2005" can expect to retrieve significantly different numbers of rows from the fact table, and without this technique (or without placing the predicate on the level represented in the fact table) the optimizer is going to come up with something very funky — probably the same incorrect estimation for both of those predicates.
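As a sketch of what that looks like for a date dimension (the table and column names here are hypothetical, not from my real system), the year code gets materialized in the fact table and a histogram gathered on it:

```sql
-- Hypothetical fact table: materialize the fiscal year code
-- alongside the day-level date ...
alter table sales_fct add (fiscal_year_cd varchar2(8));

update sales_fct
set fiscal_year_cd = to_char(add_months(date_of_day, 3), 'YYYY');

-- ... then gather a histogram on it, so that "Year = 2004" (a full
-- year of rows) and "Year = 2005" (a few weeks of rows) get
-- different cardinality estimates.
begin
  dbms_stats.gather_table_stats
  (
    OWNNAME    => user,
    TABNAME    => 'SALES_FCT',
    METHOD_OPT => 'FOR COLUMNS FISCAL_YEAR_CD SIZE 254'
  );
end;
/
```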

some final words

I am convinced enough by this method to give it a go. The implementation seems simple, the downside seems to be manageable, and the benefits seem to be sufficient.

I have about three dimensions that I'm going to try it on. Date is the obvious one. Another is related to the identifier for the fact table's source system and type of record, which is enormously skewed at higher levels. In fact the records are broken down into regular transactions and journal vouchers, in a ratio of something like 500:1, so a report on journal vouchers is looking for an index-based access method but a report on transactions may do better with table-based access.

Now, if you can see any reason why this is A Really Bad Idea, then please let me know by the end of the week when I potentially start implementing this plan. Words of reassurance also acceptable ;)


9 thoughts on “Higher Dimensional Values In Fact Tables”

  1. I do hope it is not really a bad idea – we do this for one of our customers where they needed to do “market share” type queries under Business Objects. In this case we used the product dimension and included product category in the fact table.

    The thing to watch out for is the need to rebuild the table if the hierarchy changes – time is probably OK (unless daylight saving changes ;-) ). But a product can be re-classified and may require a table rebuild or update.

  2. Yes, slowly changing dimensions would be a contra-indication for this technique for me, unless the benefit was very clear — have you got involved in bitmap-join indexes at all Pete? They sound like they’d be more suitable for that case.

  3. Not tried bitmap joins for a while – I found them a bit slow to build – but I think I’ll give them a go again now I have some decent hardware

  4. Just some addenda on the collection of stats on FBI’s.

    From 9.2 onwards, you can do all sorts of nice things with dbms_stats that are not apparent from the doco.

    eg’s (syntax might be wrong, but you’ll get the idea)

    a) for column SYS_C123$ size 10

    to collect a histogram on a particular column in an FBI (as seen from xxx_tab_cols)

    b) for clause 1, for clause 2, for clause 3

    to collect multiple histograms etc etc all in the same call

    hth
    Connor

  5. Thanks Connor,

    Didn’t know about that … it seems that user_tab_cols shows the hidden columns and user_tab_columns doesn’t … hmmm.

    A quick test in 10.1 shows that you can rename those system-generated column names also …

    SQL> drop table t;

    Table dropped.

    SQL> create table t as select * from dba_objects;

    Table created.

    SQL> create index tt on t (substr(owner,1,1));

    Index created.

    SQL> select column_name from user_tab_cols
    2 where table_name = 'T';

    COLUMN_NAME
    ------------------------------
    SYS_NC00014$
    SECONDARY
    GENERATED
    TEMPORARY
    STATUS
    TIMESTAMP
    LAST_DDL_TIME
    CREATED
    OBJECT_TYPE
    DATA_OBJECT_ID
    OBJECT_ID

    COLUMN_NAME
    ------------------------------
    SUBOBJECT_NAME
    OBJECT_NAME
    OWNER

    14 rows selected.

    SQL> alter table T rename column SYS_NC00014$ to SUBSTR_OWNER_1_1;

    Table altered.

    SQL> select column_name from user_tab_cols
    2 where table_name = 'T';

    COLUMN_NAME
    ------------------------------
    SUBSTR_OWNER_1_1
    SECONDARY
    GENERATED
    TEMPORARY
    STATUS
    TIMESTAMP
    LAST_DDL_TIME
    CREATED
    OBJECT_TYPE
    DATA_OBJECT_ID
    OBJECT_ID

    COLUMN_NAME
    ------------------------------
    SUBOBJECT_NAME
    OBJECT_NAME
    OWNER

    14 rows selected.

    So, here’s a nice interview question … “What is the meaning of the following query:”

    select table_name, column_name
    from user_tab_cols
    minus
    select table_name, column_name
    from user_tab_columns;

    :(

  6. It would be nice to rename hidden columns to something useful, but we tried this a while back….

    SQL> create table T ( x number);

    Table created.

    SQL> create index TX on T ( x+123);

    Index created.

    SQL> create index TX1 on T ( x+100+23);

    Index created.

    and now we export…

    Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.4.0 - Production
    With the Partitioning, OLAP and Data Mining options
    Export done in WE8MSWIN1252 character set and AL16UTF16 NCHAR character set

    About to export specified users …
    . exporting pre-schema procedural objects and actions
    . exporting foreign function library names for user MCDONAC
    . exporting PUBLIC type synonyms
    . exporting private type synonyms
    . exporting object type definitions for user MCDONAC
    About to export MCDONAC’s objects …
    . exporting database links
    . exporting sequence numbers
    . exporting cluster definitions
    . about to export MCDONAC’s tables via Conventional Path …
    . . exporting table T 0 rows exported
    . exporting synonyms
    . exporting views
    . exporting stored procedures
    . exporting operators
    . exporting referential integrity constraints
    . exporting triggers
    . exporting indextypes
    . exporting bitmap, functional and extensible indexes
    . exporting posttables actions
    . exporting materialized views
    . exporting snapshot logs
    . exporting job queues
    . exporting refresh groups and children
    . exporting dimensions
    . exporting post-schema procedural objects and actions
    . exporting statistics
    Export terminated successfully without warnings.

    So far so good… but

    SQL> alter table T rename column SYS_NC00002$ to fbi1;

    Table altered.

    SQL> alter table T rename column SYS_NC00003$ to fbi2;

    Table altered.

    and now we export again

    About to export specified users …
    . exporting pre-schema procedural objects and actions
    . exporting foreign function library names for user MCDONAC
    . exporting PUBLIC type synonyms
    . exporting private type synonyms
    . exporting object type definitions for user MCDONAC
    About to export MCDONAC’s objects …
    . exporting database links
    . exporting sequence numbers
    . exporting cluster definitions
    . about to export MCDONAC’s tables via Conventional Path …
    . . exporting table T
    EXP-00056: ORACLE error 3113 encountered
    ORA-03113: end-of-file on communication channel
    EXP-00056: ORACLE error 24324 encountered
    ORA-24324: service handle not initialized
    EXP-00056: ORACLE error 24324 encountered
    ORA-24324: service handle not initialized
    EXP-00056: ORACLE error 24324 encountered
    ORA-24324: service handle not initialized
    EXP-00056: ORACLE error 24324 encountered
    ORA-24324: service handle not initialized
    EXP-00000: Export terminated unsuccessfully

  7. Oh ho, so not an ideal situation at all then.

    For what it’s worth, Data Pump doesn’t like it either …

    Export: Release 10.1.0.2.0 – Production on Wednesday, 27 July, 2005 7:44

    Copyright (c) 2003, Oracle. All rights reserved.

    Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
    With the Partitioning, OLAP and Data Mining options
    Starting "DAVE"."SYS_EXPORT_TABLE_01": dave/******** tables=t DUMPFILE=table2.dmp
    Estimate in progress using BLOCKS method…
    Processing object type TABLE_EXPORT/TABLE/TBL_TABLE_DATA/TABLE/TABLE_DATA
    ORA-39014: One or more workers have prematurely exited.
    ORA-39029: worker 1 with process name “DW01” prematurely terminated
    ORA-31672: Worker process DW01 died unexpectedly.

    Job "DAVE"."SYS_EXPORT_TABLE_01" stopped due to fatal error at 07:47

  8. My question is regarding DATE_DIM. In our environment designers insist on having surrogate key for date and storing that surrogate key in fact table. Here are some of their reasons.

    I personally disagree with every point below well… except maybe #1 however that is true reason. What would be your opinion?

    Thank you.

    1. Surrogate key values are smaller size than actual date values. Surrogate key values start with 1 and sequence up. By storing these in the fact table vs. dates you get better query performance on joins. DATE – 8 bytes, SURROGATE KEY – 1 to 5 bytes if # of dates/days is between 1 and 99,999 – 64k records in Shared Date Dim
    2. Surrogate keys provide the ability to represent dates that haven’t happened yet, are not applicable – i.e. the source does not provide the data for the field, or that are invalid. This relates to the “Null handling” topic we touched on briefly which is a broad term that refers to the above scenarios. Essentially, you always want referential integrity between dimension and facts. Sure, you could default your fact record and associated dimension record to a fake date like 12-31-9999 with the notion that this date refers to NOT APPLICABLE and so on but I’d rather reflect this in the dimension table once.
    3. Best practice to buffer your dimensional model from using smart keys. This is especially true of other dimensions but as a best practice you should do the same with your date dimension.
    4. Surrogate key values preserves the key order where 1 is the earliest date and on up which can be useful for partitioning facts and more performance-friendly I suspect too – smaller indices.
    5. By storing dates in the fact table directly, even with join to Date Dimension in place, it might encourage people to bypass the dimension table completely and we’d rather have them issue their query-filter on the dimension itself.

  9. Sorry for the delay in answering, Bakunian. Your comment got identified as spam and I was slow in checking the spam queue.

    To take the points in turn …

    i) With compression of the fact table the difference there is likely to be extremely small.

    For the dimension table the thing to do is to exactly quantify the difference. You can get an exact measurement of the difference with the following query …


    select
    sum(vsize(sysdate))/1024,
    sum(vsize(rownum))/1024,
    sum(vsize(sysdate))/sum(vsize(rownum))*100
    from
    dual
    connect by 1=1 and level <= 100000
    /

    ... from which I get 684kb for the date and 380kb for the number, so the date is going to require about 80% more space (uncompressed). The difference itself though is absolutely trivial.

    2. Dates that haven't happened yet -- I don't see why this is a problem.

    If you have extensive issues with invalid dates (30-feb-2006 for example) or corrupt dates ('111-d$%66' for example) then I wouldn't try to represent those in the dimension table -- they are degenerate values that ought to go in a separate "Bad Date" column of the fact table, leaving the real column preferentially null or, less preferably, populated with a marker value.

    3. Ugh, rule of thumb. A date value, as I explained here https://oraclesponge.wordpress.com/2005/06/28/dates-as-key-values-natural-or-synthetic/ , is really a synthetic value for which Oracle systems world-wide use the same internal lookup table. If you are given a value in your source data of "21-dec-2006" then why create your own synthetic lookup when Oracle does it for you anyway? It is really, really unlikely that a source system is ever going to come to you and say "all those '21-dec-2006' codes we sent you actually mean '30-feb-2004'" so that you can go away and change your lookup accordingly (i.e. the real benefit of synthetics).

    4. This argument says "synthetics are no worse than real values for dates". Unfortunately the value of synthetics is that they don't have to represent the same order as the real value -- if you hold the designers to this then they have just neatly disposed of one of the prime reasons for using synthetic values. However they are at least recognising that using synthetics has a major weakness when it comes to range partitioning -- that using them "as nature intended" means that you can't partition by range. This alone is a major reason for not using synthetic values for dates.

    5. Why? There is no problem with filtering directly on the fact table -- in fact it sometimes is advantageous to do so. Oracle will even transform the query in many cases using predicate closure to do exactly that.
