It seems to me that there is sometimes a very difficult balance to strike, between optimizing the structure of a database for fast and convenient ETL loading and fast and convenient report querying.
Normally the choice of how to partition a fact table is pretty straightforward, because the data being loaded arrives in an order compatible with querying — ie. all of Tuesday’s transactions are available to the data warehouse ETL process early on Wednesday morning, and during the day the users will be running reports with such filters as “Give me Tuesday’s data”. Thus you can partition on a daily basis and accommodate both a daily ETL process of loading via partition exchange, and a daily reporting process for which the optimizer can invoke partition pruning.
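As a concrete sketch of that happy daily pattern (table, column, and partition names here are all invented for illustration), the load-and-swap might look like this:

```sql
-- Hypothetical daily-partitioned fact table; all names are illustrative.
CREATE TABLE sales_fact (
  transaction_date DATE NOT NULL,
  store_id         NUMBER,
  amount           NUMBER(10,2)
)
PARTITION BY RANGE (transaction_date) (
  PARTITION p20050607 VALUES LESS THAN (DATE '2005-06-08'),
  PARTITION p20050608 VALUES LESS THAN (DATE '2005-06-09')
);

-- Tuesday's feed is loaded into a staging table of identical structure
-- (indexed and compressed to match), then swapped into the empty
-- partition as a near-instantaneous data dictionary operation.
ALTER TABLE sales_fact
  EXCHANGE PARTITION p20050608 WITH TABLE sales_stage
  INCLUDING INDEXES WITHOUT VALIDATION;
```

The exchange touches no data blocks — it just renames segments in the dictionary — which is what makes the method so attractive when the partitioning key and the arrival order of the data agree.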
So far so good — if the unit in which the data becomes available is also a unit by which the end users want to report, then the decision on partitioning is simple.
Now consider a case where all the data relating to a particular time period is not available immediately. An example here might help. Suppose you run an auto parts store, and as part of your business you sell replacements for broken-but-repairable items, such as a starter motor. When Joe Lunchpail comes into the store on Saturday afternoon to buy a new starter motor you tell him, “That part costs $100, but if you can bring in the old broken part then we’ll give you a $20 credit for it” — ie. there is a core charge of $20 on starter motors. Being a trusting organization you give him that $20 credit on the spot, as long as he uses a credit card for the purchase — if he doesn’t bring in the part within a week then you’ll charge him that extra $20.
By the following Saturday evening Joe hasn’t come back, so you charge him that $20. On Sunday morning he comes in, hands over the part, and the system refunds him that $20. Three transactions in all, each one on a different date, but all of them relating to a customer interaction that occurred on the first Saturday on which the starter motor was bought, and all of them linked back to it. In fact on the majority of the financial reports users are interested in “Original Transaction Date” (the first Saturday), not “Actual Transaction Date” (the three different dates on which the transactions really took place).
So imagine this being loaded into the company’s data warehouse. For query reporting purposes it would be desirable to partition by “Original Transaction Date”, but for ETL loading purposes it would be desirable to partition by “Actual Transaction Date”. Which is the more appropriate method to choose?
Well, here are some pros and cons.
Partition by Original Transaction Date
- User queries run fast due to partition pruning, and partition-level statistics give the optimizer better histogram information*.
- ETL process cannot use a simple exchange of a populated table with an empty partition; a more complex method is required.
Partition by Actual Transaction Date
- ETL process uses simple partition exchange of populated table with empty partition.
- User query performance is compromised by the lack of partition pruning and the use of global statistics.
* Just a quick note on this: For a partitioned table Oracle can store multiple sets of statistics. One set at the global (ie. all data in table) level, then one set for each partition, and one set for each subpartition. If a query accesses multiple partitions then global statistics are used by the optimizer in lieu of partition-level statistics, and this can be a real problem where histograms vary between different partitions. For example the partitions holding retail data for the period immediately prior to Father’s Day will probably have very different histograms from those immediately prior to Mother’s Day, and will also have a different variation in the number of transactions per day. (I’m basing this on my own ability to leave shopping to the last minute in comparison to my wife’s.)
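For what it’s worth, collecting statistics at the partition level is just a matter of the granularity argument to DBMS_STATS — a sketch, with invented owner, table, and partition names:

```sql
-- Gather statistics (including histograms) for just the newly loaded
-- partition, leaving global statistics to a separate, less frequent pass.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => 'DW',
    tabname     => 'SALES_FACT',
    partname    => 'P20050608',
    granularity => 'PARTITION',
    method_opt  => 'FOR ALL COLUMNS SIZE AUTO');
END;
/
```

Whether the optimizer ever gets to use those carefully gathered partition statistics is, of course, the point at issue.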
Moving along …
There is a very clear implication in all this — if your partitioning scheme is not compatible with the report queries then you are probably wasting your time collecting partition-level statistics (unless you then use those to build global statistics, such as “Number of Rows”). And that probably means that you are seriously compromising the ability of the optimizer to define efficient execution plans.
You see where I’m leading with this … the most important element of the system is report performance, and the partitioning scheme ought to be complementary to that rather than to the ETL process. Hey, nobody said that ETL is easy — conventionally it is supposed to represent 70-80% of the effort of a data warehousing project, I believe, which sounds about right.
So, what are the alternatives to partition exchanging as a load mechanism for fact tables? Bearing in mind that the fact table may be using compression, and is almost certainly using bitmap indexes, the most feasible approach seems to be a direct path insert of data into the fact table itself. Now if your business model guaranteed that you would never receive records that are more than a few days old, then it might be feasible to either perform multiple partition exchanges to refresh old partitions, or to use a hybrid solution in which a partition exchange is performed for the newest day of data (presumably the bulk of the new feed) and direct path inserts for the remaining few older records.
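A minimal sketch of that direct path load — a hinted insert from a staging table, with invented names:

```sql
-- The APPEND hint requests a direct path insert: rows are written above
-- the high-water mark, table compression can be applied, and bitmap
-- index maintenance is deferred to the end of the statement rather than
-- performed row by row.
INSERT /*+ APPEND */ INTO sales_fact
SELECT original_transaction_date, store_id, amount
FROM   sales_stage;

COMMIT;  -- direct path loaded rows are not visible until commit
```

The deferred index maintenance is the part that makes this tolerable against a bitmap-indexed fact table, though the segment is locked against other DML for the duration.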
However I have a feeling that being liberated from partition exchanges may sometimes be beneficial. What it implies to me is that you are then much freer to define a more complex partitioning scheme to further complement the report queries, because the ETL process is not made any more complex by increased complexity of partitioning. For example, where previously you might have used a partitioning scheme based on Range(“Actual Transaction Date”) for ETL compatibility, you can now potentially improve report performance by using a composite range-list partitioning scheme such as Range(“Actual Transaction Date”,”Retail or Wholesale”)-List(“Region Code”). As long as the internal workings of Oracle don’t hiccup on some rarely used partitioning scheme (ie. ORA-00600, admittedly a big factor) the details are immaterial to the ETL process.
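Such a scheme might be sketched like this — column names, region codes, and boundary values are all invented, and the range key here combines the date with a single-character retail/wholesale flag:

```sql
-- Report-oriented composite range-list partitioning (illustrative only).
CREATE TABLE sales_fact (
  actual_transaction_date DATE NOT NULL,
  channel_code            VARCHAR2(1) NOT NULL,  -- 'R'etail or 'W'holesale
  region_code             VARCHAR2(2) NOT NULL,
  amount                  NUMBER(10,2)
)
PARTITION BY RANGE (actual_transaction_date, channel_code)
SUBPARTITION BY LIST (region_code)
SUBPARTITION TEMPLATE (
  SUBPARTITION east  VALUES ('NE','SE'),
  SUBPARTITION west  VALUES ('NW','SW'),
  SUBPARTITION other VALUES (DEFAULT)
)
(
  -- One pair of partitions per day: 'R' sorts below 'S', 'W' below MAXVALUE.
  PARTITION p20050607_r VALUES LESS THAN (DATE '2005-06-07', 'S'),
  PARTITION p20050607_w VALUES LESS THAN (DATE '2005-06-07', MAXVALUE),
  PARTITION p20050608_r VALUES LESS THAN (DATE '2005-06-08', 'S'),
  PARTITION p20050608_w VALUES LESS THAN (DATE '2005-06-08', MAXVALUE)
);
```

With direct path inserts as the load mechanism, none of this structure is visible to the ETL code — the rows simply land in the right subpartitions.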
Now another side-effect may be in the application of Materialized Views (MV’s) – in particular, refreshing them. I’ll be honest here — I’ve thrown my heart and soul into MV’s, but I’ve nearly always received a cruel rejection for my efforts. Getting the correct astrological alignment that will permit me to use fast refresh (on commit or on demand), partition change tracking, query rewrite, and partition exchanges in a production environment has so far eluded me, but maybe by eliminating partition exchanges (which were the source of my last MV-related debacle) I will finally find happiness. Who knows?