Choosing Partitioning Keys: For ETL or Reporting?

This is one of the recursive thought processes that has been spun off from considering ways of optimizing the fast refresh of multiple materialized views: I must now be at a recursive depth well into double figures because I cannot recall how I got here at all. Anyway …

I was just browsing the 10gR2 documentation, and noticed the following comments here:

“The partitioning scheme of the data warehouse is often crucial in determining the efficiency of refresh operations in the data warehouse load process. In fact, the load process is often the primary consideration in choosing the partitioning scheme of data warehouse tables and indexes.

The partitioning scheme of the largest data warehouse tables (for example, the fact table in a star schema) should be based upon the loading paradigm of the data warehouse.”

I have no argument with “The partitioning scheme … is often crucial in determining the efficiency of refresh operations …”, nor with “… the load process is often the primary consideration in choosing the partitioning scheme …”: the former statement is self-evidently true and I’m sure that the latter is also true. However I do have a problem with, “The partitioning scheme … should be based upon the loading paradigm …”. This strikes me as being not a very well thought-out piece of advice that depends more on an appeal to historical precedent or coincidence than it does to a rational design process. And it neglects the critical point that the partitioning scheme is always crucial in determining the efficiency of reporting operations.

I recently wrote something that touched on this subject in a roundabout way by mentioning that I was migrating a fact table from a partitioning scheme that aligned with the ETL process to a new scheme that aligns with the reporting requirements. Throughout the migration I keep thinking of events Quite Some Time Ago in which the administrators of an OLTP database objected to the addition of new indexes on a couple of columns. Their rationale for this was superficially sound: that it would have slowed down both the insert of new rows and the subsequent update of those columns. While they were not discounting the benefits of the indexes for select operations they were treating the pros and the cons as if they were on a level footing. In fact at the time that the row was inserted the pain was already half over – a few days after the rows were inserted the newly indexed column’s value would change once and from there on there were only benefits to be gained from the new index. Substantial benefits they were too: two-orders of magnitude improvement in the performance of some key reporting queries.

Now that isn’t the be all and end all of the considerations. To be completely fair we should also consider the sensitivity of the system (or the business) to the effects of the changes at the time that they are experienced. Taking the example above, if the loading of new data or the subsequent change to the indexed value was a critical process on which there were particular time restrictions, while the select statements that benefited were part of a batch operation with no current performance problems (in terms of whether the business was impacted by it running for longer than it could have with the index in place) then the technical benefits of the additional indexes might have been outweighed by operational issues.

So to appreciate the impact of a design feature on a system you not only have to consider whether those features represent potential benefits and detriments, but you have to weight them by a couple of other factors:

  1. How often the benefit or detriment occurs
  2. How sensitive the system’s operations are to the benefits and detriments occuring when they do.

So back to the matter of the partitioning.

Hopefully the parallels between the situations are pretty clear. We have design choices (to index or not to index … to partition according to the convenience of either the ETL procees or the reporting queries) that impose costs and benefits on different operational stages in the data handling.

Now let us look at the costs and benefits of taking different decisions in the partitioning, and we will start here by stating the obvious: that there is only a difference between the two partitioning schemes when the ETL and the reporting processes use different logical bases for identifying record sets of interest. If your data arrives in sets that correspond to the reporting requirements then this is not an issue.

I’ll not rehash on the situation that caused the migration from an ETL-based partitioning scheme to a report-based partitioning scheme, but in brief the users are interested in reports based on the date of a financial transaction, but financial transactions often arrive very “late”. For example, transactions for March 2005 could be transmitted in September 2005.

Let us suppose then that we follow the advice given in the documentation. We partition our fact table by some natural or “synthetic” date (synthetic in the sense that it does not represent a business object) that represents the arrival period of the data, say “DAY_OF_FILE_ARRIVAL”. This is a great convenience to the ETL programmer, who can transform the new data set into a table having the same structure as the fact table (without partitioning, or partitioned if the fact table is composite partitioned), create or rebuild indexes on it, analyze it, handle new or changed dimensional values, then perform a partition exchange with the fact table to make the new data available to the user community.

However the users are then querying by “DATE_OF_TRANSACTION”. For example they specify a range-based predicate of DATE_OF_TRANSACTION between ’01-jan-2005′ and ’31-Jan-2005′. As this represents a small fraction of the data in the fact table, and as the rows are fairly well clustered, an index-based access method is used to retrieve the rows. In fact until the number of fact table blocks to be touched starts to exceed some threshold like 10% of the total, an index-based access method can be expected, with the undesirably high number of logical reads and small physical read size that you’d expect.

So in this scenario partition pruning for the reporting queries has been lost and with it has gone arguably the single most effective means of enhancing query performance against a large fact table, short of slapping materialized views all over it or using parallel query (which I take rather for granted in a data warehousing environment).

If we adopted the alternative partitioning scheme based on DATE_OF_TRANSACTION then partition pruning sanity would return to user reports, but at what cost to the ETL process?

First, let’s consider what alternatives there might be to a simple partition exchange-based loading method. In practice there is just one — the direct path insert, either serial or parallel. While a full quantitative comparison of the partition exchange and direct path inserts is not not going to be presented right here and now (although maybe I’ll add it to my list), I feel that direct path inserts are not actually such a bad method for fact table loading. I’d consider the major disadvantage to be that they generate more redo because you are modifying enabled indexes for which the nologging attribute is not respected. In terms of manageability the pros and cons are mixed. On the one hand, you no longer need to worry about identifying which partition the data has to be exchanged with (if you ever did worry about that).

As far as statistics are concerned you do need to either allow monitoring and DBMS_Stats to handle the identification and collection of modified table statistics, or you need to start worrying about partitions again and implementing your own mechnaism for refreshing partition-level statistics. I have a feeling that while the latter approach may be more trouble by several orders of magnitude, it might still be preferable to the monitoring option. As the documentation here states, “If a monitored table has been modified more than 10%, then these statistics are considered stale and gathered again.” While I am no fanatic about making sure that statistics are right up to date you would certainly want to think about whether this threshold is low enough, particularly in regard to histograms – it doesn’t take a 10% change in the number of rows for a histogram to be significantly altered, particularly if business conditions and data distributions change with time.

I can’t believe that I just set myself up for another blog topic. * sigh * Another recursive layer — I need a vacation.

Another consideration is how well the batches of new records correlate with the partitioning scheme. Maybe your batch of new records is scattered all over a bunch of partitions, or maybe nearly all of the data is going to a single partition with a few rows scattered over a handful of others. In fact I prefer to think about this the other way around — instead of thinking about this from the view point of how the new data is going to be split I’d rather consider what is ging to happen to the fact table partitions. Here are a few different scenarios:

  • One empty partition to be populated, one other partition to gain around 1,000 rows (0.01% of current total)
  • One empty partition to be populated, six others to gain between one and three million rows (1-3% of total), five others to gain between 10 and 1000 (0.001-0.1% of current total)
  • All partitions to gain around one million rows (1% of total)

How you treat these different scenarios, and how confortable you are with them, depends on where the breakeven point lies for you in your tolerance of ETL complexity and performance, and probably therefore on your ETL tool also.

When a partition starts out with data we have a couple of options.

  1. Copy the data into a new/empty table in nologging mode, insert the new data, rebuild indexes, then partition exchange with the original.
  2. Direct path load into the original partition, again with the option of disabling indexes where possible.

Which one of these approaches works best for you depends on a few factors. The former approach is certainly more challenging to program, while the latter will incur higher overhead on logging (although the overall i/o subsystem load may well be smaller).

So there you have it. I don’t think that there is a clear cut case for partitioning according to ETL requirements, and the benefits of partitioning by reporting requirements are certainly enough to compel a close examination of the alternative to loading exclusively by partition exchange. Does this seem like an issue worthy of being rasied as a documentation bug? I’m not sure, but if people think it important then I’d be willing to raise one.

About these ads

7 thoughts on “Choosing Partitioning Keys: For ETL or Reporting?

  1. I would guess that most partition schemes in Oracle DWs are time based (a lot of other DB vendors restrict partitioning to a hash on some key column) I have not seen too many (OK, any!) DWs that use product or store as the main partition key (sub-partitions, maybe) To me it seems natural to use date (time) as the partition key. However I do agree with you that transaction date is often of more use than dataload date. Queries are often by transaction date and data is aged out by transaction date.
    I have at least one DW where we use transaction date instead of load date for partitioning and it works well for us. We also track the distinct transaction dates in the load and use this to drive our stats collection routines

  2. Hoorah for people like you who *consider* as much as a human can consider before making a change. At the end of the day, all that matters is how the client is impacted. The question at the end of the day is:

    Will the lengthier ETL operations of a reporting optimized system be better or worse than the lengtheir reporting queries of an ETL optimized system?

  3. Pete:

    regarding the stats collection, it sounds like you “roll your own” form of Oracle’s monitoring functionality, is that right? What drove that decision, rather than using the monitoring functionality?

    Anon:

    Indeed, and I’d suggest that it’s not beyond reasonable expectation that these changes be roughly quantified. That’s a little trickier for the ETL impact than for the reporting impact I think, but I’d go so far as to suggest that quantifying just the reporting impact of the different partitioning schemes would be very persuasive on its own.

  4. Why roll your own? – It allows me to gather stats for each table how I want – I set the sample size and histograms how I think best (eg compute for the dimensions, estimate 1% for the fact) gather_schema stats does it all the same way.

    Anon – ETL need not be much longer if you partition for reporting rather thn load – it’s a bit more complex but you still can use a lot of go-faster stuff such as direct path inserts.

    SAXPWN – a great word

  5. Did you consider using the monitoring tables to base this on Pete? (ALL_TAB_MODIFICATIONS I guess)

    I like the zero-hassle automatic breakout of the changes by partition and subpartition. Did you choose to log values specifically because of the potential for histogram changes?

  6. I have some DWs where we use gather_stale schema stats. I wrote about ALL_TAB_MODIFICATIONS on Jeff Hunter’s blog a while back – it can be used, it’s just that we knew the partitions that we touched becuase another part of our load functionality recorded it so we might as well use it!

    One point to think about if you partition by transaction date and receive ‘late’ data it is (IMHO) likely to be a whole group of records that have been delayed – say one store fails to send data on time. Wouldn’t that warrant a recollection of stats as the store had been omitted in your geography dimension?

  7. We use transaction date to benefit reporting rather than loading…as has been stated, the only issue is that your loading is a little more complex – one aspect of that being that if you are adopting a read only approach (to assist backups/reduce file checkpointing) to the older historic partitions then what do you do with late arriving data for partitions which are now read only ? We push all records like that to a quarantine area on the load process and then such records will be assessed on a daily basis as to whether they are minor volumes which can be ignored or whether, as Pete says, there might be tons of updates for an old partition because of a big datafix in the operational source system…which you actually want to propagate into your warehouse rather than bin it. (You then have issues as to how you propagate that further downstream in your warehouse – i.e. your normalised/atomic layer and your performance/analytical layer.)

    Life is never dull with warehouses in my experience!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s