Data Warehouse Architecture Decisions for Ad hoc and Predefined Reports

First, a disclaimer: the techniques herein represent “blue-sky” thought on an abstract problem. I have not implemented these ideas, although I am very willing to do so given the chance and if nobody can postulate a fatal objection to them. Or a near-fatal one. If you disagree with any of these ideas then you are not contradicting my experience and I’m not going to get all defensive over the matter.

And so to the substance …

A Requirement for Ad Hoc Reports 

A requirement crossed my path the other day that provoked some thoughts on the significance of ad hoc and predefined reports. The central issue is that the requirement is purposefully vague in supplying samples of predefined reports, and this can pose particular problems in the design of an Oracle data warehouse architecture in which ad hoc reports have to cohabit with predefined reports.

The reason for this is that I regard stability of performance for predefined reports as being a highly desirable design goal. For a great many users the apparent health and performance of the system is going to be defined by the runtime of those reports, and if a normal runtime of around four seconds extends to greater than thirty seconds on occasion then that is a significant matter. Among those users I’d count some of the most important ones in the organisation, the executives and senior staff who log in to check high-level trends and operational data and who may be *ahem* associated with your operating budget for the next year.

Therefore samples of required reports and an estimate of their popularity (executions per day or whatever) are an important component of the design requirement, one that I insist on knowing, as far as it is within my power to do so, early in the development cycle. This information gives insight into a number of key areas:

  1. Filter predicates to tune the physical storage for: in particular what partitioning/subpartitioning scheme to use to optimise the use of partition pruning.
  2. Granularity for predefined result sets: what materialized views/summary tables/transformation tables to define during development and manage during the ETL process. Providing a precalculated result set that is an exact match on the granularity of a report is an enormous performance boost (a sketch follows this list).
  3. Opportunities for providing metric combinations: often it seems that complex metrics such as (metricA+metricB)/metricC do not make it into the requirement because the users intend to calculate them dynamically. This can be a real problem with Business Objects, as you cannot currently define in a universe a metric from multiple contexts, in other words across chasm traps (multiple star schemas) or fan traps (multiple tables in 3NF). The moral is “know your query tool”.
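
To make item 2 concrete, here is the sort of thing I have in mind: a minimal sketch of a materialized view pinned at exactly the grain of a known monthly report. The SALES_FACT table, the column names and the refresh policy are all hypothetical, not anything from the requirement.

    -- A sketch only: names are invented and the refresh policy would depend on
    -- the ETL cycle. The point is that the GROUP BY matches the report grain
    -- exactly, so the report never has to touch the detail fact table.
    CREATE MATERIALIZED VIEW mv_sales_month_product
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
      ENABLE QUERY REWRITE
    AS
    SELECT TRUNC(f.txn_date, 'MM') AS txn_month,
           f.product_id,
           SUM(f.sale_amount)      AS sale_amount,
           COUNT(*)                AS txn_count
    FROM   sales_fact f
    GROUP  BY TRUNC(f.txn_date, 'MM'), f.product_id;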

The above issues can be very difficult to backport into an in-progress development, item 3 in particular as it may mean new tables, new ETL processes and extensive universe redesign. Item 2 can be a large chunk of work to add in to a development for which you have already budgeted time and personnel. Item 1 is not too bad unless it means a fundamental shift in the data loading methodology (e.g. from partition exchange to direct path insert because your ETL granularity no longer matches your reporting granularity). Actually, that’s quite a likely occurrence because in the absence of any other guidance it generally makes sense to partition by load cycle.

The reason for the absence of predefined reports is understandable in this case, however. The development is intended to allow users to explore a couple of related data sets in as raw a form as possible. The data is already merged and cleansed into a standard fact table (multicolumn range/list subpartitioned as it happens) but not every one of the seemingly hundreds of thousands of columns is used. I exaggerate there, slightly.

Without guidance on reporting requirements we can make some guesses in this case, educated by our current experience with the use of the clean data. This gives us a commonly predicated date attribute (not matching the granularity of the load, unfortunately) and maybe one or two other attributes on which we can multicolumn-partition and/or subpartition.
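
As a sketch of what that might look like, and with table and column names invented purely for illustration, a composite range/list scheme along these lines would let the optimiser prune on the date predicate and on one other commonly filtered attribute:

    -- Hypothetical DDL: range partition on the commonly predicated date,
    -- list subpartition on one other frequently filtered attribute.
    CREATE TABLE sales_fact (
      txn_id       NUMBER        NOT NULL,
      txn_date     DATE          NOT NULL,
      region_code  VARCHAR2(10)  NOT NULL,
      product_id   NUMBER        NOT NULL,
      quantity     NUMBER,
      unit_cost    NUMBER(12,2),
      sale_amount  NUMBER(12,2)
    )
    PARTITION BY RANGE (txn_date)
    SUBPARTITION BY LIST (region_code)
    SUBPARTITION TEMPLATE (
      SUBPARTITION sp_north VALUES ('N'),
      SUBPARTITION sp_south VALUES ('S'),
      SUBPARTITION sp_other VALUES (DEFAULT)
    )
    (
      PARTITION p_2006_01 VALUES LESS THAN (DATE '2006-02-01'),
      PARTITION p_2006_02 VALUES LESS THAN (DATE '2006-03-01')
    );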

However the exploration aspect of the requirement means that we could be facing, for example, a scan of every row to find the transactions with an extended cost of more than $30 million. Do we (b-tree) index for that, or risk the possibility of full table scans eating our i/o bandwidth lunch? Maybe we should create a summary table holding the extremely large extended amount transactions, as there would only be a handful, and let query rewrite redirect to it? If an aggregation is required for calculating extended cost then a precalculated result set is our only hope.
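
By way of illustration, and with column names and the $30 million threshold invented for the purpose, a filtered materialized view would keep that handful of rows to hand. Whether query rewrite actually redirects a given report to it depends on the report’s predicate falling inside the MV’s, so treat this as a sketch rather than a guarantee.

    -- A sketch of a filtered precalculated set holding only the very large
    -- transactions (names and threshold are invented).
    CREATE MATERIALIZED VIEW mv_big_transactions
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
      ENABLE QUERY REWRITE
    AS
    SELECT f.txn_id,
           f.txn_date,
           f.quantity * f.unit_cost AS extended_cost
    FROM   sales_fact f
    WHERE  f.quantity * f.unit_cost > 30000000;

    -- A report like this could then be satisfied from the tiny MV rather than
    -- a full scan of the fact table, provided rewrite recognises the predicate.
    SELECT f.txn_id,
           f.quantity * f.unit_cost AS extended_cost
    FROM   sales_fact f
    WHERE  f.quantity * f.unit_cost > 50000000;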

But what about the other hundred columns? If we index the wrong column then we can be facing index merges and range scans that would spike the logical i/o rate and eat up the CPUs. Quite the dilemma, and the jury is still out on the design details for the case in hand, so let me make some general observations.

Let us suppose that the development does turn out to be truly ad hoc, with a high proportion of queries that we cannot predict. What are the implications for the system architecture?

The Implications of Ad Hoc Report Requirements

As I implied near the beginning, when you have sample reports to base your work on then you can tune the design to exactly match those reports. As long as you have a flexible framework for maintaining predefined result sets then you can expand them to match further reporting requirements that might be revealed late in the design stage or in early user testing or post-deployment. Performance guarantees ought therefore to be very achievable for almost all predefined reports, and stability of performance then becomes a priority.

However as soon as there is an extensive proportion of ad hoc queries for which you have not tuned, and maybe cannot tune, then it is entirely possible that one or two of them can overwhelm your system. In comparison to the partition scan and join (often without aggregation) achievable with predefined result sets, the resource consumption of a full scan plus joins plus metric aggregation on an entire fact table is enormous. Without some ameliorating measures the CPU and i/o requirements alone can bring the system to a crawl and spoil the day of those Cooperative Citizens who are expecting their regular reports to return inside of the usual four seconds that they need to maintain their attention span.

Mitigating the Impact

So there are three aspects to managing a balance between good ad-hoc performance and good predefined report performance:

  • Memory resources
  • I/O resources
  • CPU resources

Let us now discuss them a little further.

Memory Resources

In terms of the competition for limited memory resources we have on the one hand the requirement that we allow sufficient memory for caching precalculated result tables in the SGA. Normally the SGA is modestly proportioned in comparison to the PGA because we use parallel query to scan the fact tables, and parallel query uses direct path reads that bypass the buffer cache. We therefore allocate memory to the PGA to allow for large sort areas for aggregations, joins to large dimension tables, and index build activity (the latter being associated with ETL operations rather than reporting, of course). However the heavily used precalculated result tables are often small enough to be accessed serially and consequently can benefit from caching in the SGA. Therefore a system that relies heavily on such tables is likely to benefit from a larger allocation of SGA at the expense of PGA. How much larger obviously depends on the exact circumstances, but the trend is there nevertheless.
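
For what it’s worth, on a 10g system with automatic SGA tuning the shift is just a rebalancing of two parameters. The figures below are invented, and the right split obviously depends on the workload.

    -- Illustrative figures only: favour the SGA on a system that leans heavily
    -- on small, serially accessed precalculated result tables.
    ALTER SYSTEM SET sga_target           = 6G SCOPE = SPFILE;
    ALTER SYSTEM SET pga_aggregate_target = 2G SCOPE = SPFILE;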

In an ad hoc environment we may be making extreme demands on the PGA. Although the memory demand for an aggregation is mostly dependent upon the size of the final result set (so a report based on “day” is likely to take at least an order of magnitude more memory than a similar report on “month” for the same period), the size is generally limited by how many rows the user wishes to see. A sensible (ahem) user is not going to produce an ad hoc report based on all five million transactions in the past year because the display of such a report would never finish (of course, this isn’t a Golden Rule or anything). However a user might choose to display the transactions with an extended amount in a given range, knowing that the result set will be but a few dozen or hundred rows. If the calculation of that report involves an aggregation and the application of a HAVING clause then you may truly be in trouble, with the need to display only a few rows but with an enormous memory requirement for calculating the aggregation upon which a filter will then be applied. The only way to reduce this load is to supply a precalculated result set that exactly matches the granularity of the required report, and which maybe represents only a subset of interest. If you get close to but not exactly on the granularity then you will incur the same potentially devastating sort requirement, but will have saved yourself some i/o in reading the base data and some CPU in joining a reduced number of rows prior to the aggregation.
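
To illustrate the shape of the problem, here are the two forms side by side. The table, column names, threshold and grain are all invented, and the precalculated table is assumed to have been built at exactly the customer grain during ETL.

    -- The expensive form: aggregate millions of detail rows, then keep a few.
    SELECT f.customer_id,
           SUM(f.quantity * f.unit_cost) AS extended_cost
    FROM   sales_fact f
    WHERE  f.txn_date >= DATE '2006-01-01'
    GROUP  BY f.customer_id
    HAVING SUM(f.quantity * f.unit_cost) > 30000000;

    -- The cheap form: the same grain precalculated once during ETL, so the
    -- report merely filters a small result table.
    SELECT s.customer_id,
           s.extended_cost
    FROM   cust_extended_cost s
    WHERE  s.extended_cost > 30000000;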

This of course emphasises the need for gathering sample reports, and in particular emphasises the need to probe for this kind of reporting requirement even if the full report specification is not known.

To summarise, here are some available techniques for reducing the memory impact of ad-hoc queries:

  1. Provide a huge amount of memory.
  2. Provide very fast devices for temporary disk space.
  3. Anticipate huge sort requirements with exact-match precalculated result (sub)sets.
  4. Prevent execution of high memory usage queries.

I/O Resources

If we can adequately cache precalculated result sets for commonly executed reports then we have already gone a long way towards decoupling the i/o resources of ad hoc and predefined reports. But what if we cannot cache more than a relatively small percentage of the predefined sets, and the stability of predefined report execution times is likely to be dependent upon i/o load? Well, I’m no Resource Manager expert by a long chalk, but it seems to me (and it certainly used to be the case) that there is no mechanism in RM for controlling i/o allocation between resource consumer groups. How then to prevent a high-bandwidth ad hoc query from interfering with predefined reports?

Aside from standard techniques such as the use of data segment compression, there are a number of possibilities.

One technique is to control (i.e. limit) the degree of parallelism on the tables subject to ad hoc query, through limitations on the default degree for the tables and through parallel_adaptive_multi_user. This would not prevent ad hoc queries affecting predefined queries, but would limit the degree to which they could be affected. However it would also be likely to act as an artificial brake on ad hoc query performance without providing the high degree of performance stability that is desirable for the predefined reports.
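
For the record, the brake itself is simple enough to apply; the degree below is plucked out of the air and the right value depends entirely on the machine.

    -- A sketch of limiting ad hoc parallelism: a modest default degree on the
    -- fact table plus adaptive throttling under concurrent load.
    ALTER TABLE sales_fact PARALLEL (DEGREE 4);
    ALTER SYSTEM SET parallel_adaptive_multi_user = TRUE;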

Another technique, one that I’m rather leaning towards at the moment, is to isolate result sets for predefined reports onto their own set of devices and controllers — or whatever the SAN equivalent of that would be. Again this imposes an upper limit on the performance of the ad hoc queries and something of a limit on the predefined queries; however, the decoupling in respect of i/o usage is as near to perfect as we could wish for, and there is the additional bonus that the majority of ETL-related activity can also be pushed to the ad-hoc side of the i/o divide.

If we were to go with such a method then how would we choose the breakpoint for which segments are placed where? It seems that it would be useful to categorise the predefined result sets into three categories. Taking the example of a system with a 2Gb data buffer we might choose the following:

  1. Small, frequently accessed sets that can be cached in the KEEP pool. Because we expect almost no i/o on these segments they can be placed on either side of the divide. Say 1Gb worth, which can be scanned and cached into the data buffers at startup (see the sketch after this list).
  2. Larger, frequently accessed sets that incur moderate i/o. Isolated from the ad-hoc query side of the i/o divide to render them less prone to ad hoc query-related interference. A low degree of parallelism might be beneficial here.
  3. The largest, least frequently accessed segments, which may incur high i/o requirements. These sit on the ad-hoc side of the divide, where they are prone to interference from other queries but can still provide performance benefits through precalculation.
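
As a sketch of how categories 1 and 2 might be implemented, with segment names, sizes and file paths invented for the purpose:

    -- Category 1: pin small, hot summary tables in a 1Gb KEEP pool.
    ALTER SYSTEM SET db_keep_cache_size = 1G SCOPE = SPFILE;
    ALTER TABLE monthly_sales_summary STORAGE (BUFFER_POOL KEEP);

    -- Category 2: place larger summary tables in a tablespace carved from the
    -- "predefined" side of the i/o divide (path and size are illustrative).
    CREATE TABLESPACE predef_results
      DATAFILE '/u10/oradata/dw/predef_results01.dbf' SIZE 20G
      EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;
    ALTER TABLE daily_sales_summary MOVE TABLESPACE predef_results;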

Whether it is necessary to make the distinction between the first and second categories is probably arguable. One might want to provide multiple levels of stability for the predefined queries, for example the highest level of stability for an executive dashboard data set and a lesser level for the ordinary Joe User executing predefined reports on more detailed data. That would sharpen the distinction between the categories and make separate treatment more worthwhile.

At the moment the i/o divide approach is the one I favour for situations requiring a high degree of performance stability for predefined reports, which I suggest is just about all situations. I expect that the Automatic Storage Management (ASM) component of 10g could play an important role in allowing the i/o divide to be moved about should long-term trends make it desirable, but I haven’t thought much about that, not being an ASM expert or anything.

CPU Resources

Of the three categories of resource on which we want to decouple ad hoc and predefined report performance, CPU usage is in some ways the easiest, as there is an Oracle-supplied feature specifically intended for the purpose: the aforementioned Resource Manager. And a good job it is too, because I cannot think of a method for allocating CPU usage without it. I shall content myself by directing you to the documentation to learn more.

The interesting part of using Resource Manager in this context, it seems to me, is interfacing between the reporting environment and the database to enable it to be used. In the case of an executive dashboard that supplies no ad hoc reporting capability, it can be sufficient to connect as a particular database user which is assigned to a particular resource consumer group having a high priority for resources. Similarly, for a 100% ad hoc environment (say a Business Objects universe dedicated to that purpose) this method allows the associated queries to be executed with the lowest resource priority.
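
A minimal sketch of that mapping in 10g follows; the user and group names are invented, and I would check the Resource Manager documentation before taking the details on trust.

    -- A hedged sketch: map the dashboard login to a high-priority consumer
    -- group (names are invented, API as I understand the 10g documentation).
    BEGIN
      DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
      DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
        consumer_group => 'DASHBOARD_GROUP',
        comment        => 'Executive dashboard predefined reports');
      DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING(
        attribute      => DBMS_RESOURCE_MANAGER.ORACLE_USER,
        value          => 'DASHBOARD_APP',
        consumer_group => 'DASHBOARD_GROUP');
      DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();

      DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SWITCH_CONSUMER_GROUP(
        grantee_name   => 'DASHBOARD_APP',
        consumer_group => 'DASHBOARD_GROUP',
        grant_option   => FALSE);
    END;
    /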

The tricky part is in the middle ground: reporting environments that allow drill-down to ad hoc levels from predefined reports. The predefined reports have high priority, while the ad hoc reports are to have low priority. Again, I’m no Resource Manager expert, but it has documented functionality to “Allow automatic switching of users from one group to another group based on administrator defined criteria. If a member of a particular group of users creates a session that executes for longer than a specified amount of time, that session can be automatically switched to another group of users with different resource requirements”. So through the use of Resource Plan Directives we can specify a low value for switch_time_in_call, after which the session’s priority for resources can be downgraded. Presumably this can be cascaded to multiple levels of priority so that the query’s resource usage can be cut back after five seconds, then again after twenty, then again after one minute, and so on. I have no firm knowledge of that, though.
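
Here is roughly how I read that in terms of plan directives, with plan and group names invented and the consumer groups assumed to exist already (as in the previous sketch). The switch_time_in_call parameter is as I understand the 10gR2 documentation, so verify before relying on it.

    -- A hedged sketch of downgrading long-running calls from the predefined
    -- group into a lower-priority ad hoc group after ~5 seconds.
    BEGIN
      DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
      DBMS_RESOURCE_MANAGER.CREATE_PLAN(
        plan    => 'REPORTING_PLAN',
        comment => 'Predefined reports first, ad hoc second');
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan                => 'REPORTING_PLAN',
        group_or_subplan    => 'DASHBOARD_GROUP',
        comment             => 'High priority; downgrade long calls',
        cpu_p1              => 75,
        switch_group        => 'ADHOC_GROUP',
        switch_time_in_call => 5);
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan             => 'REPORTING_PLAN',
        group_or_subplan => 'ADHOC_GROUP',
        comment          => 'Low priority ad hoc and downgraded sessions',
        cpu_p2           => 100);
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan             => 'REPORTING_PLAN',
        group_or_subplan => 'OTHER_GROUPS',
        comment          => 'Everything else',
        cpu_p3           => 100);
      DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
    END;
    /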

Using Resource Manager in this way has the further advantage of allowing ETL operations to be brought within the remit of resource allocation.

Other Methods

Time Windowing

If there are particular periods during which the rapid execution of predefined reports is a priority (for example on Monday mornings before 10am, or at the end of months or fiscal quarters) then the execution of ad hoc reports could be severely curtailed during those times. This could be achieved through functionality in the BI tool, through Resource Manager, through the use of profiles, or by revoking permissions on user IDs, off the top of my head. Resource Manager would probably be the most graceful way, by activating a different resource plan during the time window.
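
A hedged sketch of the Resource Manager route, using a 10g Scheduler window to activate a stricter plan on Monday mornings; the plan name, window name and times are invented.

    -- Activate a stricter resource plan automatically during the window.
    BEGIN
      DBMS_SCHEDULER.CREATE_WINDOW(
        window_name     => 'MONDAY_AM_WINDOW',
        resource_plan   => 'PREDEFINED_FIRST_PLAN',
        repeat_interval => 'FREQ=WEEKLY;BYDAY=MON;BYHOUR=6;BYMINUTE=0;BYSECOND=0',
        duration        => NUMTODSINTERVAL(4, 'HOUR'),
        comments        => 'Protect predefined reports before 10am Monday');
    END;
    /

    -- Or, more crudely, switch plans by hand for the duration:
    ALTER SYSTEM SET resource_manager_plan = 'PREDEFINED_FIRST_PLAN';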

BI Tool Caching

It is quite possible to isolate many predefined reports at the BI layer by triggering a stored result set to be cached as part of a report that can be accessed without reference to the database. Cognos, of course, can implement this, and so can Business Objects. I’m not a huge fan of this method, requiring as it does an expansion in the hardware requirements of the application layer. I also have doubts over its advisability for predefined reports representing a small fraction of a total data set, e.g. a report on product sales for a single month for one product out of 1,000 and one month out of 60. Whether the BI layer can efficiently isolate this data set is very tool dependent, and Oracle may provide much better performance at a lower resource cost.

RAC Techniques

Again, I’m no expert blah blah blah, but perhaps a single RAC instance out of three or four could be dedicated to predefined reports and the rest to ad hoc queries and ETL processing. Maybe. I really have no idea.

Summary

An environment that mixes ad hoc and predefined reports faces a particular challenge: providing the best performance possible for ad hoc reports while respecting the need for stability in the performance of predefined reports.

Achieving this aim may be best served by a combination of the following techniques:

  • Isolation of precalculated result sets onto a separate i/o path.
  • Use of Resource Manager to allocate CPU resources between query types.
  • Implementation of targeted precalculated result sets to mitigate high memory usage on filtered aggregation reports.

2 thoughts on “Data Warehouse Architecture Decisions for Ad hoc and Predefined Reports”

  1. Why stop at I/O? Why not make ad-hoc users buy a super-micro and snap their dataset on a SAN? Give ‘em their own [SP]GA and cpu’s and let ‘em ride.

  2. I, sir, stop at nothing.

    But the tricky part of isolating the ad hoc data is managing drilldown to that data set from predefined reports against precalculated result sets. It sounds a bit pricey hardware-wise, as well.
