Histograms For Join Predicates (or “Hooray for technical content!”)

Someone sent me a scenario the other day similar to this:

A small dimension table, for example of US State names, has a synthetic primary key (say STATE#) and a unique key on the real-world value (STATE_NAME), and a much larger fact-type table has an indexed foreign key to this lookup table. The distribution of values in the fact table is very skewed. Is it possible for the optimizer to take into account the skew of STATE# in the fact table when it is joined to the dimension table and a predicate is placed on STATE_NAME?

Well, a tricky issue. Oracle has to resolve the predicate on dimension.STATE_NAME into a predicate on dimension.STATE# and thence into a predicate on fact.STATE#. The optimizer will do a very similar, and very effective, operation that I’ve written about a couple of times before through dynamic partition pruning based on joins, so my first reaction was to suggest implementing a list partitoning, or list subpartitioning, scheme on the table. For me that would be the gold standard of approaches, but it might not be possible, or might be not completely effective, if the organisation doesn’t want to pay for the partitioning option, or the logical limits of composite partitioning have already been reached, or if a table rebuild is too large a hammer to take to this nut.

A related thought was that if the partitioning option was in place then a summary table using a partitioning scheme more sympathetic to this aim could be implemented. For sure it’s a big step to take, especially if it means duplicating the entire data set of the current fact table, but it would extend the range of queries that can benefit from partition pruning very handily. The summary table could of course include the join to the dimension table and/or aggregate the result set, giving further benefits.

Another thought that occured was to create a bitmap join index, effectively allowing predicates to be placed on a hidden column of fact.STATE_NAME. I ran a couple of tests on 10.2.0.2 and it appeared to be ineffective in respect of histogram recognition. The estimated cardinality was always 1% of the total number of rows in the (non-partitioned) sample fact table even with hidden columns analyzed. Maybe I missed something there, and if I did then please enlighten me as it’s a pretty easy implementation.

Here’s a funky technique that actually did work — create a function to accept STATE# and return STATE_NAME, either through querying the dimension table or through embedding the logic in a PL/SQL CASE statement. Once indexed columns had been analyzed a predicate placed on this function (instead of the dimension table) did result in correct estimation of cardinality. The list of objections to this technique is pretty broad, including the possibility of returning the wrong result once the dimension table changes and a few other issues, but whether the risk is worth adopting or not is up to the user. One might choose to use DBMS_Advanced_Rewrite in 10g+ to have predicates against dimension.STATE_NAME rewritten to the hidden column, but the feasibility of that was not part of the test I ran/

There is one last suggestion of course — rebuild the table to use natural keys instead of synthetic keys. This would yield the same problem if the lookup table contained a commonly predicated additional level in the hierarchy, (eg. CONUS_Y_N, TOO_HUMID_IN_SUMMER or HAS_HELMET_LAW) and we’d be back to “square uno” once more.

To sum up, here are the solutions I thought of:

  1. Partition or subpartition the fact table on STATE#. (preferred option)
  2. Create a summary table with partitioning or subpartitoning on STATE#. (uses most space and slows data load, but very flexible and powerful)
  3. Create a function-based index on fact to perform the lookup, and query that value instead. (a bit flaky, but it works without major system impact)
  4. Rebuild the fact table based on the STATE_NAME instead. (still limited in multi-level hierarchies)

All other suggestions gratefully accepted. I’m really puzzled by not be able to achieve this with a bitmap join index, I must say. I’m sure I must have missed something simple

Here is my test script for the function-based index method, and here is the script output on 10.2.0.2.

EDIT: Following on from Pete’s suggestion in the comments, here is a script showing the attempt at a bitmap join index method, and again the output on 10.2.0.2.

7 thoughts on “Histograms For Join Predicates (or “Hooray for technical content!”)

  1. I’m a bit surprised about the bitmap join index – creating an index on the fact_table(dimension_table.state_name) should be a runner – but sometimes things get a bit twitchy and things are not as good as hoped

    How about joinback? – Do you have Oracle dimension objects set up if so the by having state_name an attribute of state# you may get a result…

    If only I was connected to a DB I’d test it ;-)

  2. Hmmm, good idea. I ought to have set up a dimension, even if it’s just one level.

    Unfortunately, no change there. I’ll post the script and output — for those without the benefit of an available DB :) … moment …

  3. And with the dimension but no bitmap join index?
    And if you create the BMJI witout the stats?

    Questions, questions…

    The bitmap join is not a ‘hidden’ column it’s even more abstract than that; so I am not sure collecting histograms on all columns would help. Perhaps a second bitmap on d.col1 but then that’s getting a bit silly.

  4. With dimension and no BJI, no recognition of histogram.

    I think that BJI and no stats is a mid-script result of the output I already published.

    Indeed, not a hidden column but they are somewhat similar to them (logically anyway), and as Connor McDonald says here https://oraclesponge.wordpress.com/2005/07/24/higher-dimensional-values-in-fact-tables/#comments there are a lot of things that can be done that are not always apparant from the docs. It’s really the only legitimate way of “attatching” those values to the table (summaries aside).

    Hmmm.

  5. Hard to resist that last comment.
    I think the answer is the index with virtual column (akia function-based index) which can be a bitmap index if you want it to.

    I think the reason why the bitmap join index does work is that the query still displays the table join, so Oracle uses the column selectivities to work out the join cardinality, then uses the join index because that gives the cheapest cost.

    To avoid the ‘flaky’ comment, you could create a view of the table which exposed the function(state#) as a column called state_name.

    Regards
    Jonathan Lewis

Leave a comment