Thoughts on Range-Range Composite Partitioning

As of 10gR2 there is No Such Feature as Range/Range composite partitioning, of course. Composite partitioning is limited to Range/Hash or Range/List, only the latter of which have I made much use of — which is not to say that Range/Hash does not have its place, just that its place and my place have not yet intersected.

However, the Word On The Street is that 11g comes with Range/Range included, and that prompted some thoughts, because I have been making much use of multicolumn range partitioning. My thought is this: if we already have the ability to use multicolumn range partitioning, then what is the benefit of Range/Range composite partitioning?

Well, first let us have a quick look at multicolumn range partitioning as it stands today.

Multicolumn Range Partitioning in 9i

In 9i multicolumn range partitioning is a tricky issue because of some restrictions on partition pruning, which is an awkward feature to have restrictions on because pruning is the major benefit of the partitioning option in a data warehouse. The major restriction in 9i is that partition pruning can only be carried out when predicates are placed, either directly or indirectly through joins to predicated dimension tables, on contiguous leading columns of the partitioning key. By this I mean that if you partition on three columns A, B and C then pruning can take place when predicates are placed on the following combinations of columns: A, A & B, or A & B & C. Pruning cannot take place when predicates are placed on the following combinations of columns: B, C, or B & C. When you place predicates on A & C then pruning is achieved on A only. (I’ll demonstrate that in a follow-up posting.)
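To make that concrete, here is a minimal sketch of the behaviour described above. The table and column names are invented for illustration:

  -- Hypothetical table partitioned on three columns A, B and C
  CREATE TABLE t
  ( a NUMBER, b NUMBER, c NUMBER, data VARCHAR2(30) )
  PARTITION BY RANGE (a, b, c)
  ( PARTITION p1   VALUES LESS THAN (1, 10, 100)
  , PARTITION p2   VALUES LESS THAN (1, 20, 100)
  , PARTITION p3   VALUES LESS THAN (2, 10, 100)
  , PARTITION pmax VALUES LESS THAN (MAXVALUE, MAXVALUE, MAXVALUE) );

  -- Prunes under 9i: predicates on contiguous leading columns
  SELECT * FROM t WHERE a = 1;                        -- prunes on A
  SELECT * FROM t WHERE a = 1 AND b = 15;             -- prunes on A & B
  SELECT * FROM t WHERE a = 1 AND b = 15 AND c = 50;  -- prunes on A, B & C

  -- Does not prune under 9i: no predicate on the leading column
  SELECT * FROM t WHERE b = 15;
  SELECT * FROM t WHERE b = 15 AND c = 50;

  -- Prunes on A only: a gap in the predicated columns
  SELECT * FROM t WHERE a = 1 AND c = 50;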

This limitation is analogous to that on the use of multicolumn indexes prior to the introduction of the index skip scan and fast full index scan access methods.

Side note: placing a predicate on a column indirectly through a join to a predicated dimension is not always sufficient to provoke the optimizer to prune. See this note for more information.

One way of providing more flexibility in partition pruning options under 9i is to identify a column having few distinct values, and to implement that column as a list subpartitioning key. Subpartition pruning is handled very effectively and is independent of partition pruning. 

Side note: well, we should really say “having few distinct values on which partition pruning is commonly desirable”, since we can isolate those few values in one or more subpartitions and let the rest fall to a “default” subpartition. Thus the number of subpartitions remains manageably small, not to mention the benefits we gain from the use of subpartition templates.

Another implementation note for 9i is therefore that the order of columns in the partitioning key becomes very important. To take a Real World Example, I “upgraded” a general ledger fact table in a financial data warehouse to include a new source of funds, represented by a CHAR(2) column “FUND_CD”. Previously the fact table had included but a single value of fund_cd, and it was now to include additional records having a new value. I implemented fund_cd as the leading column of the multicolumn-range/list composite partitioning because all of the standard reports against the fact table required the selection of a fund code. Therefore pruning on the first column was a given, and if a filter was placed on the second column of the partitioning key (Fiscal Month) then pruning was achieved on both columns.

Side note: of course a requirement later emerged for some reports to include both fund codes. Partition pruning was achieved on Fiscal Month by (redundantly) selecting both fund codes. This was implemented in Business Objects by adding a self-join on the fact table with the condition “fund_cd in (‘Value1’, ‘Value2’)” so that it was included in all reports. No adverse effects on the CBO were noted.

In that example list subpartitioning was also implemented on general ledger account codes to group them into sets representing sales, returns etc. with the majority falling into “default”. Thus a report on general ledger codes solely in the category of “sales” for a set of fiscal months and a fund code led to very precise partition and subpartition pruning.
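A simplified sketch of that scheme, with invented fund codes, account codes and partition bounds (and noting that multicolumn range bounds compare lexicographically, so in practice each fund code carries its full run of fiscal months):

  CREATE TABLE gl_fact
  ( fund_cd      CHAR(2)
  , fiscal_month DATE
  , gl_acct_cd   VARCHAR2(6)
  , amount       NUMBER )
  PARTITION BY RANGE (fund_cd, fiscal_month)
  SUBPARTITION BY LIST (gl_acct_cd)
  SUBPARTITION TEMPLATE
  ( SUBPARTITION sp_sales   VALUES ('400100', '400200')
  , SUBPARTITION sp_returns VALUES ('400300')
  , SUBPARTITION sp_other   VALUES (DEFAULT) )
  ( PARTITION p_f1_200701 VALUES LESS THAN ('F1', DATE '2007-02-01')
  , PARTITION p_f1_200702 VALUES LESS THAN ('F1', DATE '2007-03-01')
  , PARTITION p_f2_200701 VALUES LESS THAN ('F2', DATE '2007-02-01')
  , PARTITION p_f2_200702 VALUES LESS THAN ('F2', DATE '2007-03-01')
  , PARTITION p_max       VALUES LESS THAN (MAXVALUE, MAXVALUE) );

  -- A report on "sales" accounts for one fund code and one fiscal month
  -- prunes to a single subpartition
  SELECT SUM(amount)
  FROM   gl_fact
  WHERE  fund_cd = 'F1'
  AND    fiscal_month = DATE '2007-01-01'
  AND    gl_acct_cd IN ('400100', '400200');

  -- The side-note trick: redundantly selecting both fund codes still
  -- allows pruning on fiscal month
  SELECT SUM(amount)
  FROM   gl_fact
  WHERE  fund_cd IN ('F1', 'F2')
  AND    fiscal_month = DATE '2007-01-01';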

Multicolumn Range Partitioning Enhancements in 10g

In 10gR2 we have more flexibility in pruning on multicolumn range partitioned tables: the optimizer will prune on any combination of predicated partition-key columns. This leads to the possibility of implementing extreme examples of multicolumn range partitioning to achieve very flexible pruning on any combination of a number of different columns. Theoretically this could mean partitioning on every commonly predicated column of a fact table, leading to a reduction in the number of columns to be indexed.
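Returning to the hypothetical three-column table from the earlier sketch, the following predicates, useless for pruning under 9i, can now prune:

  -- 10gR2: pruning no longer requires the leading column(s)
  SELECT * FROM t WHERE b = 15;             -- prunes on B alone
  SELECT * FROM t WHERE c = 50;             -- prunes on C alone
  SELECT * FROM t WHERE b = 15 AND c = 50;  -- prunes on B & C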

Of course in practice there may be some objections to this. Partitioning by day and ten location codes and three sale type codes and six payment type codes leads to 65,700 partitions per year of data (365 × 10 × 3 × 6 = 65,700, plus possible subpartitions), and the load of a day of data by partition exchange requires 180 DDL operations (one exchange per 10 × 3 × 6 combination of codes).

Side note: an issue closely related to this is that this technique leads to very high data object ids, which was recently raised by Jonathan Lewis as a potential cause for concern.

On the other hand the reduction in required indexing reduces the amount of redo and indexing overhead that would be associated with using direct path loads into the fact table as an alternative to partition exchanges (and slightly mitigates the rate of increase for the data object id).

Another consideration is that statistics for the table data are held at the subpartition, partition and global levels. When the optimizer cannot statically prune to the single subpartition or partition level it resorts to the use of global statistics as the basis of optimization, and for skewed data sets this can be deceptive to say the least. (The same issue applies to dynamic pruning as well, of course.) The estimation of the number of rows to be retrieved from the fact table is critical in determining the access paths back to the dimension tables. Most frequently this number of rows is at least an order of magnitude greater than the number of rows in the majority of dimension tables, and the optimizer tends towards hash joining from the fact table to the dimension tables.
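For reference, the level at which those statistics are gathered is controlled through the granularity parameter of DBMS_STATS (the table name here is the invented one from the sketch above):

  BEGIN
    -- Gather statistics at the global, partition and subpartition levels;
    -- 'GLOBAL', 'PARTITION' or 'SUBPARTITION' would restrict it to one level
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname     => USER,
      tabname     => 'GL_FACT',
      granularity => 'ALL');
  END;
  /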

However, when the estimated cardinality of the fact table data set is too low then the optimizer can tend towards a number of other join methods, in particular a series of nested loop joins. The optimizer will only favour these for smaller sets of data, as it believes that the increased cost of each row joined, in comparison to the hash join, is offset by not incurring the initial cost of hashing the dimension table. When the real data set is much larger than estimated this can be extremely inefficient.

Potential Benefits of Range/Range Composite Partitioning

The difference between multicolumn range and composite range/range partitioning schemes reduces to the ability to isolate a subset of the partition key columns into their own logical layer, i.e. the subpartition level.

The value of this is threefold.

Firstly, we gain the ability to apply fewer DDL operations to the manipulation of range-based subsets of data. In the example that we looked at before, where 180 DDL operations would be required to load a single day of data, we might push the location, sale type and payment type ranges to the subpartition level. Thus we would stage the data in a table that is range-partitioned by those three columns and perform a partition exchange of the multicolumn range subpartitioned fact table with the staging table. This assumes that any composite range/range subpartitioning would include the ability to specify multicolumn range subpartitioning, of course.

Side note: this is not to say that the internal operations of the partition exchange would necessarily be more efficient, only that fewer lines of code and DDL operations would have to be issued by the ETL process, and that the exchange would become a single transaction instead of 180, with the consequent improvement in robustness allowed by simpler recovery from any type of failure in mid-operation.
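A sketch of how that single-exchange load might look, assuming both the speculated range/range syntax and that exchanging a partition with a table whose partitioning matches the subpartitioning scheme works as it does for range/hash and range/list today (all names invented):

  -- Stage one day of data in a table partitioned to match the fact
  -- table's subpartitioning scheme
  CREATE TABLE stage_20070601
  ( sale_date    DATE
  , location_cd  NUMBER
  , sale_type_cd NUMBER
  , pay_type_cd  NUMBER
  , amount       NUMBER )
  PARTITION BY RANGE (location_cd, sale_type_cd, pay_type_cd)
  ( PARTITION sp_1   VALUES LESS THAN (1, MAXVALUE, MAXVALUE)
  , PARTITION sp_2   VALUES LESS THAN (2, MAXVALUE, MAXVALUE)
  , PARTITION sp_max VALUES LESS THAN (MAXVALUE, MAXVALUE, MAXVALUE) );

  -- ... load and index the staging table ...

  -- One DDL operation instead of 180
  ALTER TABLE sales_fact
    EXCHANGE PARTITION p_20070601
    WITH TABLE stage_20070601
    INCLUDING INDEXES
    WITHOUT VALIDATION;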

Presumably it will also become possible to specify a subpartition template to simplify the addition of new partitions, avoiding the need to specify and perform a high number of DDL operations to add new partitions.
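If the existing template syntax carries over to range subpartitions, that might look something like this (purely speculative syntax, shown against a fresh, simplified table):

  CREATE TABLE sales_fact
  ( sale_date   DATE
  , location_cd NUMBER
  , amount      NUMBER )
  PARTITION BY RANGE (sale_date)
  SUBPARTITION BY RANGE (location_cd)
  SUBPARTITION TEMPLATE
  ( SUBPARTITION sp_east  VALUES LESS THAN (10)
  , SUBPARTITION sp_west  VALUES LESS THAN (20)
  , SUBPARTITION sp_other VALUES LESS THAN (MAXVALUE) )
  ( PARTITION p_20070601 VALUES LESS THAN (DATE '2007-06-02') );

  -- Adding a day then becomes a single DDL operation, with the
  -- subpartitions supplied by the template
  ALTER TABLE sales_fact
    ADD PARTITION p_20070602 VALUES LESS THAN (DATE '2007-06-03');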

Secondly, and staying with that same example, we would be able to consider in our design the level at which we want to store statistics. In either case, the multicolumn range partitioning or the range/range subpartitioning, we would be storing statistics at the most detailed level when a single partition/subpartition is isolated through static partition pruning by the specification of appropriate day/sale type/location/payment type. However in the case of multicolumn range partitioning we only have statistics at that level and at the global (table) level. In the range/range subpartitioning case we might have statistics at those same two levels and also at the day partition level, which might be very convenient for a particular class of query.

This applies only to static single partition pruning of course, and not to multi-partition or dynamic pruning.

Thirdly, it is possible that subpartition pruning might be more efficiently specified by the cost-based optimizer, in that it could specify “prune to the first, second and third subpartition of every partition” instead of “prune to the following (long) list of specific partitions”. This is just a theoretical notion though, with no basis in experience.

Summary

There are evidently some theoretical benefits to be gained from a new ability to perform range/range composite subpartitioning. In comparison to the multicolumn pruning abilities of 10g, though, the benefits would not appear to be of such magnitude that there is a clear case for adopting the feature. Of the potential benefits considered above I would rate the ability to manipulate the table with fewer DDL operations as the primary one, and the ability to include an additional level of statistics as the secondary one.

These benefits would have to be weighed against the potential for bugs in a new code path, and on balance I’d stay away from the new feature for a little while.

3 thoughts on “Thoughts on Range-Range Composite Partitioning”

  1. At first I thought a gap was to be plugged, but in reality, other than time, what sort of data suits range partitioning? Ranging on a surrogate key seems a little contrived and is no better than hash sub-partitioning (unless the surrogate key has meaning ;-) )
    I suppose if your source system was an ERP system that used hierarchical keys, so that (for example) the first two digits of a six-digit product code identified its type (10 = frozen food, 34 = textiles), you could use ranging, but to my mind that would be better served by being able to use a join to a dimension table and partition on a parent key.
    Now that would be a useful feature, especially if we did not have to materialize the join.

  2. It seems conceivable that one might manipulate synthetic keys to make range partitioning simulate the ability to partition on a parent key.

    To take the retail example, one might effectively assign a prefix of 10 to all frozen food items and of 34 to textiles by multiplying the prefix by 1,000,000 and adding it to an Oracle sequence number to define the product key, enabling range partitioning on that new key value (a sketch of this follows at the end of this comment).

    Dynamic partition pruning based on “Frozen Foods” would therefore resolve to a smaller set of partitions/subpartitions.

    On the other hand one might define a multicolumn join to the fact table based on the product key and on a key value for the category (10, 34 etc), and partition on category only. I wonder how that would work out? I’m tempted to think that you might as well break the category out into a new dimension but I can see drill-down issues with that — effectively drill-down would become drill-across. I can see that being a much more valid approach if the relationship between product and category was subject to change.
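    Here is that key-manipulation sketch, with the sequence, tables and partition bounds all invented for illustration:

    CREATE SEQUENCE product_seq;

    CREATE TABLE product_dim
    ( product_key  NUMBER
    , product_name VARCHAR2(30) );

    -- Category prefix (10 = frozen food) folded into the surrogate key
    INSERT INTO product_dim (product_key, product_name)
    VALUES (10 * 1000000 + product_seq.NEXTVAL, 'Garden Peas');

    -- Range partitioning on the synthetic key then isolates each category
    CREATE TABLE product_sales
    ( product_key NUMBER
    , amount      NUMBER )
    PARTITION BY RANGE (product_key)
    ( PARTITION p_pre      VALUES LESS THAN (10000000)
    , PARTITION p_frozen   VALUES LESS THAN (11000000)  -- prefix 10
    , PARTITION p_mid      VALUES LESS THAN (34000000)
    , PARTITION p_textiles VALUES LESS THAN (35000000)  -- prefix 34
    , PARTITION p_rest     VALUES LESS THAN (MAXVALUE) );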
