The Importance Of Partition Key Statistics

A partitioning-related performance problem came my way last week that had me scratching my head, until a 10053 trace illuminated the path to a solution.

The problem was poor performance in a query that joined two range partitioned tables (having different partitioning keys; in fact one is multicolumn range/list composite partitioned, the other single column range partitioned) via two dimension tables. The immediate sign of trouble was the extensive use of nested loop joins, and that led to the observation that the estimated cardinality of the result set from one of the partitioned tables was very low. Initially I thought that this was due to the use of global statistics, as the query required dynamic partition pruning (albeit to a single partition). When the predicates were manually rewritten against the fact table itself the estimated cardinality dropped even further; in fact it was now 10,000 times lower than the real cardinality, being 388 against a real value of nearly four million.

Applying dynamic sampling hints at level 5 for the fact table corrected the estimation very well, but as the stored partition statistics were accurate there was more investigation to be done.
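Such a hint looks something like this; the table name and alias come from the trace below, and the predicate value is purely illustrative.

select /*+ dynamic_sampling(fsa 5) */
       count(*)
from   fct_fsa_1 fsa
where  fsa.fiscal_mon = 200601;   -- illustrative predicate value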

A 10053 trace was taken, and the following entry was noted.

SINGLE TABLE ACCESS PATH
Column: FISCAL_MON  Col#: 3      Part#: 84     Table: FCT_FSA_1   Alias: FSA
    NO STATISTICS  (using defaults)
    NDV: 121206    NULLS: 0         DENS: 8.2504e-06
    NO HISTOGRAM: #BKT: 0 #VAL: 0

As is often the case, the penny dropped immediately: no statistics were stored at the partition level for the partition key column, and this was confirmed by querying USER_PART_COL_STATISTICS. The reason was our extensive use of the DBMS_STATS method option clause “FOR ALL INDEXED COLUMNS SIZE 254”. Because the column has a single value per partition we do not index it, so it was never subject to statistical analysis, and the optimizer was left to make an incorrect assumption about its cardinality.
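Confirming that is a simple query against the dictionary; the table and column names here come from the trace above.

select partition_name, num_distinct, density, last_analyzed
from   user_part_col_statistics
where  table_name  = 'FCT_FSA_1'
and    column_name = 'FISCAL_MON'
order  by partition_name;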

In fact this column had been indexed in the past but a rash of performance problems involving inappropriate use of the index had caused us to drop it. From that time onwards the partitions contained no statistics on that partition key column.

I put together a script to identify all partition key columns that are missing statistics at the partition level — I believe I’ll also have to do the same for subpartitions, although in that case I’ll be looking for subpartition and partition key columns that have no statistics stored at the subpartition level.
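The general shape of it is a join of USER_PART_KEY_COLUMNS to USER_TAB_PARTITIONS with an anti-join against USER_PART_COL_STATISTICS; the following is a sketch rather than the exact script.

select kc.name          as table_name,
       kc.column_name,
       tp.partition_name
from   user_part_key_columns kc,
       user_tab_partitions   tp
where  kc.object_type = 'TABLE'
and    tp.table_name  = kc.name
and    not exists
       ( select null
         from   user_part_col_statistics cs
         where  cs.table_name     = kc.name
         and    cs.partition_name = tp.partition_name
         and    cs.column_name    = kc.column_name
         and    cs.num_distinct   is not null )
order  by 1, 2, 3;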

A similar query allows the identification of those columns that require statistics to be gathered for particular partitions. I could either reanalyze the partitions or use DBMS_Stats to set the statistics manually, which ought to be pretty straightforward as the appropriate values can be derived directly from the partition names. However life is too short for the learning curve on that second option at the moment, so re-analyze it will be.
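For the record, the re-analysis of an affected partition would be along these lines, and the manual alternative would be a DBMS_Stats.Set_Column_Stats call; the partition name and values below are illustrative.

begin
  -- Re-gather statistics for the partition key column in one partition
  dbms_stats.gather_table_stats(
    ownname     => user,
    tabname     => 'FCT_FSA_1',
    partname    => 'P_200601',
    granularity => 'PARTITION',
    method_opt  => 'FOR COLUMNS FISCAL_MON SIZE 1');
end;
/

begin
  -- Or set the column statistics manually: one distinct value per partition
  dbms_stats.set_column_stats(
    ownname  => user,
    tabname  => 'FCT_FSA_1',
    colname  => 'FISCAL_MON',
    partname => 'P_200601',
    distcnt  => 1,
    density  => 1,
    nullcnt  => 0);
end;
/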

And so on to how we fix our processes to avoid this in future.

  • Idea 1: Start specifically naming the columns for which we want to gather statistics
  • Idea 2: Use the REPEAT clause to collect histograms only on the columns that already have them

I’m not sure that the REPEAT is what we want though. I think I’m more comfortable with just listing the columns and the appropriate size, which I’ll probably specify either as 1 (no histogram) or 254 (histogram). This number would obviously change at the partition and the global levels for columns such as FISCAL_MONTH for which there is a single value at the partition level but multiple values at the global level.
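As a sketch of what that would look like for the FISCAL_MON column of the fact table (the partition name is illustrative):

begin
  -- Partition level: a single value per partition, so no histogram
  dbms_stats.gather_table_stats(
    ownname     => user,
    tabname     => 'FCT_FSA_1',
    partname    => 'P_200601',
    granularity => 'PARTITION',
    method_opt  => 'FOR COLUMNS FISCAL_MON SIZE 1');

  -- Global level: many values across partitions, so allow a histogram
  dbms_stats.gather_table_stats(
    ownname     => user,
    tabname     => 'FCT_FSA_1',
    granularity => 'GLOBAL',
    method_opt  => 'FOR COLUMNS FISCAL_MON SIZE 254');
end;
/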

  • Idea 3: I wonder whether there is value in creating a new framework for specifying statistics gathering parameters, involving a series of metadata tables that can be used to specify global, partition and subpartition-level options for building DBMS_Stats.Gather_Table_Stats calls dynamically? A table of TABLE_NAME, COLUMN_NAME, GRANULARITY, SIZE_CLAUSE, something like that (see the sketch after this list). Sounds like a reasonably large undertaking that’s not going to happen this afternoon, though.
  • Idea 4: Dynamic statistics gathering produced such a good estimate that I wonder whether there is value in just dropping all statistics from the table, global and local, and handing the job over to query-time estimation? It worked well for the query in question, but how would more complex queries fare? How would it impact query rewrite?
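Here is a very rough sketch of what Idea 3 might look like, with entirely invented object names, just to give a flavour of the approach.

create table stats_gather_options
( table_name   varchar2(30)  not null,
  column_name  varchar2(30)  not null,
  granularity  varchar2(30)  not null,   -- GLOBAL, PARTITION or SUBPARTITION
  size_clause  varchar2(100) not null ); -- e.g. 'SIZE 1' or 'SIZE 254'

begin
  for t in ( select distinct table_name, granularity
             from   stats_gather_options )
  loop
    declare
      l_method_opt varchar2(4000) := 'FOR COLUMNS';
    begin
      -- Assemble the method_opt clause from the metadata rows
      for c in ( select column_name, size_clause
                 from   stats_gather_options
                 where  table_name  = t.table_name
                 and    granularity = t.granularity
                 order  by column_name )
      loop
        l_method_opt := l_method_opt || ' ' || c.column_name || ' ' ||
                        c.size_clause || ',';
      end loop;

      dbms_stats.gather_table_stats(
        ownname     => user,
        tabname     => t.table_name,
        granularity => t.granularity,
        method_opt  => rtrim(l_method_opt, ','));
    end;
  end loop;
end;
/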

Hmmm.


12 thoughts on “The Importance Of Partition Key Statistics”

  1. I hate statistics! Most of our performance problems come from a combination of one or more of: missing statistics, inappropriate statistics, or just plain stale ones.
    Recently we had a problem where the method of collection was changed, leaving part of a partitioned table with column stats and part without, and (of course) very suspect global column statistics.

    I think there is mileage in idea 3. I doubt there is a one-size-fits-all option in a DW; we already do something like this, as reference and fact tables have different needs, as do global and partitioned fact statistics. I must admit that we cheat a bit too: six days a week we ‘make up’ global stats as the sum of the partitions, but then we are just ignoring the column stats until the weekend.
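    In rough outline the ‘making up’ amounts to something like this (a sketch only, not our actual code, and the table name is illustrative):

    declare
      l_rows   number;
      l_blocks number;
    begin
      -- Sum the partition-level figures and write them as global table stats
      select sum(num_rows), sum(blocks)
      into   l_rows, l_blocks
      from   user_tab_partitions
      where  table_name = 'FCT_FSA_1';   -- illustrative table name

      dbms_stats.set_table_stats(
        ownname => user,
        tabname => 'FCT_FSA_1',
        numrows => l_rows,
        numblks => l_blocks);
    end;
    /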

    Oh, and we always include a bitmap index on the partition key column – it tends to be quite a small local index!
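    For the fact table in the post that would be something like (index name invented):

    create bitmap index fct_fsa_1_fiscal_mon_bx
      on fct_fsa_1 (fiscal_mon)
      local;

    With a single value per partition, each local bitmap segment stays very small.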

  2. Amen on the statistics-related issues. I feel that we really do need a robust and flexible framework for guaranteeing statistical accuracy and efficiency in gathering them.

    I keep going back in my mind to the idea of relying on dynamic sampling, though. I’ve felt in the past that there might be some benefit in investigating more thoroughly the possibility of supporting dynamic sampling through appropriate indexing and summary tables, but as I recall there are hints on the recursive SQL that would make that something of a challenge, and as soon as a query uses the SAMPLE clause (as the recursive dynamic sampling SQLs do) it rules out the possibility of using such shortcuts.

    I partly regret not spending more time on finding out why the optimizer was suddenly so enamoured of the local indexes on the partition keys before I dropped them. At first thought I’d imagine that it was exactly because of a lack of partition level column statistics, but since we were collecting statistics “for all indexed columns” that seems less likely — not impossible, though.

  3. Pingback: Slowing queries « Pete-s random notes

  4. Pingback: More on Partition Key Statistics, and an Oddly Harmful Constraint « The Oracle Sponge

  5. “Idea 3: I wonder whether there is value in creating a new framework for …”

    It sounds great!
    Do you need a hand? :)

  6. IMO method_opt=>'FOR ALL INDEXED COLUMNS SIZE 254' is the worst possible statistics gathering strategy. With one step you create two problems:

    a) you do not gather statistics at all on some important columns – as you found out. Just because a column is not indexed does not mean it doesn’t need statistics. And that is not limited to partition keys.

    b) you create histograms for columns that do not need it, or worse, should definitely not have any. Combine that with gathering statistics by sampling and you get wrong histograms. Histograms can be useful if the data distribution of a column is skewed. By sampling you at the very least miss some of the infrequently occurring values. On top of that sampled column histograms are not properly prorated (see my Histogram – Myths and Facts paper). And to add further insult to injury, the histogram gathering procedure is broken in 10.1.0.5 (and 10.2.0.1).

    The simple strategy of method_opt=>'for all columns size 1', augmented by selective gathering of histograms, avoids at least some of the problems people are having with statistics. And it is also faster, since the size 1 statistics for all columns of a table can be gathered in one step together with the table statistics, whereas 'for all columns size 254' requires a separate scan for every indexed column.
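    In sketch form that strategy is two calls, the second naming only the columns with known skew (table and column names here are purely illustrative):

    begin
      -- Base statistics for every column, no histograms
      dbms_stats.gather_table_stats(
        ownname    => user,
        tabname    => 'FCT_FSA_1',
        method_opt => 'FOR ALL COLUMNS SIZE 1');

      -- Histograms only on the columns known to be skewed
      dbms_stats.gather_table_stats(
        ownname    => user,
        tabname    => 'FCT_FSA_1',
        method_opt => 'FOR COLUMNS SOME_SKEWED_COLUMN SIZE 254');
    end;
    /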

  7. Good comments, Wolfgang, and in general terms I agree entirely. Here are some self-justifying comments and other observations :D

    a) Yes, not all columns on which statistics are beneficial are necessarily indexed. However in our environment we used extensive indexing, including every column of the fact table that joined to another, for a number of reasons …

    i) Because they were all potentially predicated at some time or other.
    ii) Because doing so presented us with no load performance or storage problems.
    iii) Because we also used it as a shorthand to indicate what columns to gather statistics on.

    Of course, this failed us when some of those indexes became problematic. Mea culpa. (That’s Latin for “my bad”, I believe).

    Naturally the parse time is potentially higher, and bind-variable peeking is a potential issue, but I work in a world of high execution times and no bind variables at all.

    b) Yes, sampling is an issue for histograms — it’s less of a practical issue at the partition/subpartition level because we can afford to compute statistics for that.

    I did some rather empirical time-based tests quite some time ago (before I began blogging, I believe) which showed COMPUTE to be faster than estimation when lots of indexes were present until the estimation % was pretty low. I took that to be due to the SQL SAMPLE clause causing the indexes on the columns to be avoided in favour of table-based gathering, which was not as fast until the % estimate was very low. A COMPUTE on the other hand appeared to use available indexes. I guess I should address that issue again (I see that you mention it in regard to index statistics in your presentation http://www.centrexcc.com/Histograms%20-%20Myths%20and%20Facts.ppt.pdf, and I’m guessing that that is the reason).

    I wrote some thoughts on the nature of skew that probably chime with yours a while ago: http://oraclesponge.wordpress.com/2005/04/22/update-on-histograms/

    Final thought: I recall thinking a while ago that histograms on large tables were a potential minefield because in order to get good performance on the collection of histograms for indexed columns you need to COMPUTE, yet this generates at least one full table scan that I’d rather avoid — in other words, good statistics on the table can be collected through sampling, but good histograms on the indexed columns require COMPUTE. I’m not sure that I ever resolved that dilemma to my own satisfaction.

  8. Re your “justification”: “However in our environment we used extensive indexing, including every column of the fact table that joined to another, for a number of reasons …”

    I strongly believe that every column should have basic statistics (num_distinct, min, max, avgclen), whether indexed or not. Even if it is never used in a predicate, it may be used in a select (else why would it be in the table) and the avgclen helps the CBO estimate the size of a potential sort.

  9. David,

    Thanks for this insightful post. It is helpful. I had issues recently with histograms on my DSS database. After discussing on Oracle-L and also interacting closely with Wolfgang, I am more than convinced that 'For all columns size 1' is the safest and most efficient.

    Hi

    I ran that query on two tables and this was the result.

    PARTITIONED_TABLE    KEY_COLUMN           TOTAL_PARTITIONS NO_STATS NO_STAT%
    -------------------- -------------------- ---------------- -------- --------
    Tableb               SK_LDOM_NO                         50        2        4
    tablea               SK_LDOM_NO                         50        2        4

    This is how we analyze our table: (Obviously for all 50 partitions)

    exec dbms_stats.gather_table_stats(ownname => 'MIS', tabname => 'Tableb', partname => 'Tableb_20120331', estimate_percent => 20, block_sample => TRUE, granularity => 'PARTITION', cascade => FALSE, force => TRUE);
    exec dbms_stats.gather_index_stats(ownname => 'MIS', indname => 'BIDX_SKLDOMNO_Tableb', partname => 'Tableb_44', estimate_percent => 20, granularity => 'PARTITION', force => TRUE);

    After analyzing the table with the above-mentioned code, the NO_STATS did not change. We moved over to 11g recently and are experiencing bad performance.

    Am I analyzing incorrectly?

    thanks
