Compensating for Poor Data Warehouse Hardware

Here is a thought prompted by this posting at the Dizwell forum.

The question seems to me to boil down to this essential issue: “Can a large memory allocation compensate for poor I/O performance in a data warehouse?” It also seems to me that the answer to this question is a partly-qualified “No”, for the following reasons.

Firstly, parallel query reads blocks directly from disk and does not look for them in the buffer cache. Serial reads would therefore be required to populate the buffer cache, and a serial process with wait times of between 180 and 250ms and multiblock reads of 1MB (to take a generous number) is going to read around 5-6MB/sec; at that rate it would take in the order of two and a half hours just to populate a 50GB buffer cache.

Secondly, there is little hope of caching a respectable portion of the fact tables of a 1.4 TB data warehouse: a 50GB buffer cache represents only around 3.5% of the total data set.

So if a disk upgrade is not feasible, what can be done to provide acceptable performance? My advice in this situation would be to throw all of the team’s efforts into the creation of summary tables. In fact I’d suggest that this is a critical issue — there is no way that the system as described is going to be usable without the enormous reduction in required disk bandwidth that summary tables represent.

Now I also believe that a system in which summary tables represent a critical performance enhancement requires an entire structure for their creation and maintenance, comprising the following elements:

  1. Identify opportunities for summarisation
  2. Define summary queries
  3. Create the summary and its supporting maintenance code
  4. Monitor the summary usage

There are some automated tools in Oracle for handling the first two stages, for example the DBMS_OLAP.Recommend_MView_Strategy procedure. However I haven’t used them, and have preferred instead to use reporting metrics to find reports that generally take a long time to execute or are executed frequently (the metric of report executions multiplied by report duration is a handy one), and then to look for opportunities to provide summary definitions that improve their performance.
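
For illustration, a query along the following lines would surface the candidates. The REPORT_EXECUTION_LOG table and its columns are purely hypothetical, standing in for whatever reporting metrics your tool actually captures:

    -- Rank reports by total time consumed over the last month
    -- (executions x duration, the metric mentioned above).
    select   report_name,
             count(*)                  as executions,
             round(avg(duration_secs)) as avg_duration_secs,
             round(sum(duration_secs)) as total_duration_secs
    from     report_execution_log
    where    run_date >= add_months(sysdate, -1)
    group by report_name
    order by total_duration_secs desc;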

For efficient summary maintenance I default to hand-crafted approaches, with SQL statements stored in CLOB columns of an MV metadata table for handling the loading of new data, rather than relying on the standard materialized view fast refresh mechanisms. There are a few articles on the blog about this already.
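
As a minimal sketch of the idea, with the table and column names invented for illustration:

    -- One row per summary; the statement that loads new data is held in a CLOB.
    create table summary_refresh_sql
    ( summary_name  varchar2(30) primary key,
      refresh_sql   clob         not null,
      enabled_flag  varchar2(1)  default 'Y' not null );

    -- A simple driver that executes each enabled refresh statement in turn.
    declare
      l_sql varchar2(32767);
    begin
      for r in ( select summary_name, refresh_sql
                 from   summary_refresh_sql
                 where  enabled_flag = 'Y'
                 order  by summary_name )
      loop
        -- Implicit CLOB-to-VARCHAR2 conversion; statements longer than 32K
        -- would need DBMS_SQL (or EXECUTE IMMEDIATE on the CLOB itself in 11g).
        l_sql := r.refresh_sql;
        execute immediate l_sql;
        commit;
      end loop;
    end;
    /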

In terms of monitoring the summaries I like to use segment statistics. There is generally a rough correlation between the logical I/Os required against a summary table and those that would have been required against the master table(s), so an “effectiveness ratio” can be estimated by comparing the two: just use a hint to prevent rewrite for a sample of rewritable queries and see how much work the summary is saving. Where the summary data is cached and the fact table data is not, and the disk subsystem is known to be performing very poorly, the absolute number of physical reads can be the more significant measure. Another consideration is that a given summary table may provide extremely high savings in comparison to a non-rewritten execution but only marginal savings in comparison to a rewrite against a different summary; in that case the value of maintaining the summary might be questionable.
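
To make the comparison concrete: run a rewritable query with rewrite disabled, run it again with rewrite allowed, and compare the work recorded against the two segments. The NOREWRITE hint and V$SEGMENT_STATISTICS are standard; the object names below are made up for the example:

    -- Force the query to run against the detail data...
    select /*+ NOREWRITE */
           region, trunc(sale_date, 'MM') as sale_month, sum(sale_amount)
    from   sales_fact
    group  by region, trunc(sale_date, 'MM');

    -- ...then compare the reads accumulated against the fact table and the summary.
    select object_name, statistic_name, value
    from   v$segment_statistics
    where  owner = 'DWH'
    and    object_name in ('SALES_FACT', 'SALES_MONTH_SUMMARY')
    and    statistic_name in ('logical reads', 'physical reads')
    order  by object_name, statistic_name;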

So, the qualification to the original answer of “No” is due to the much smaller size of the summary tables, which makes it more feasible to avoid using parallel query against them, and thus there are benefits to be gained from a larger buffer cache. However, as a guide to right-sizing the cache I would rely much more on Oracle’s built-in buffer cache advice functionality.
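
That advice is exposed through V$DB_CACHE_ADVICE, and a query along these lines (assuming db_cache_advice is at its default of ON) shows the estimated physical reads at each candidate cache size:

    select size_for_estimate as cache_mb,
           size_factor,
           estd_physical_read_factor,
           estd_physical_reads
    from   v$db_cache_advice
    where  name = 'DEFAULT'
    order  by size_for_estimate;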


13 thoughts on “Compensating for Poor Data Warehouse Hardware”

  1. I’d also like to suggest giving as much memory to the PGA as possible – the more of the sorts, hashes and merges that can stay in memory rather than being written to temp (especially on slow disk), the better.

  2. I am the original poster of the question on the Dizwell forum. I have a couple more questions. Most queries against this database run for hours. The AWR snapshots are captured every 15 minutes, and the AWR reports include the buffer cache advisory. Can I use the advisories from the AWR reports to size the SGA/PGA?

    In this kind of scenario, would increasing the SGA/PGA help speed up the Informatica ETL process?

    Thanks

  3. I don’t see why not, but I’d be interested in knowing what the reports are telling you, and also whether you are using parallelism or not. Also, do you make much use of partitioning, and if so are you fully leveraging partition pruning?

    Whether the ETL would be sped up or not is very difficult to say. Much depends on the way that Informatica uses the database: whether you tend to store temporary result sets in database tables or not, for example. One of the issues that Informatica users tend to suffer from is the belief that frequent commits will improve performance, whereas you are in fact more likely to suffer from log file sync waits and poor performance.
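
    A quick, rough way to check whether that is already happening is to look at the system-wide waits for the standard ‘log file sync’ event:

      -- Cumulative waits since instance startup; a time_waited that is large
      -- relative to the other top events suggests over-frequent commits.
      select event, total_waits, time_waited, average_wait
      from   v$system_event
      where  event = 'log file sync';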

  4. … and let me reaffirm my previous advice: once you have witnessed one of those multi-hour reports executing in less than a second because the query was rewritten against a summary table, you will never look back. It’s like crack cocaine … I expect.

  5. I looked at the reports for a typically busy day. The buffer pool and SGA advisories did not indicate any need to increase the allocation, the reduction in physical I/O being less than 5% even on doubling the buffer pool/SGA. The PGA advisory, however, did indicate several multi-pass executions in every report, and estimated that the cache hit percentage would be 77% if I increased the PGA to 8GB. Currently the PGA is less than 1GB, so the report did not estimate beyond 8GB. Maybe I will recommend increasing the PGA rather than the SGA.

    This database has tons of parallel queries. My opinion is that, given the I/O bottleneck, parallelism might actually be hurting us since we are overloading the already bandwidth-starved disks.

    The really big tables are partitioned and the queries mostly go against the partition for the current month. Queries rarely ever go across months, so I guess that partition pruning would be used.

    I am not familiar with summary tables, and neither is anyone else here. I will study them, maybe create a test case, and demonstrate.

    Informatica is using local disk for storing temporary result sets, and that is also hurting performance. Before moving to Windows, Informatica and the DW database were on the same server: Informatica had the full power of a Sun partition with 8 (64-bit) CPUs, 16GB of memory, and the fastest possible SAN-attached disks. Now it is constrained to a 2-CPU (32-bit) box with 2GB of memory, going to the database across a network link and using local disk. I could be wrong, but there is no way the performance of these two architectures can compare.

    Thanks for your help.

  6. I think that your findings on the buffer cache advice and the PGA advice are entirely in line with expectations, and increasing the PGA ought to give you some decent benefits.

    With regard to the partitioning, don’t be afraid to experiment with multicolumn range and composite range-list partitioning to get even further benefits. If you have queries that regularly require some fraction of an entire partition then look for ways to define smaller partitions or break the partitions into subpartitions — regularly accessing between 10% and 50% of a partition would be a good sign that more could be done there.

    For information on the dangers of over-parallelising, hop over to Doug Burns’ blog (see the link in my sidebar) and look at some of his work on analysing the limits on the benefits of parallelism.

    Summary tables, in brief, are pre-calculated aggregations of detailed data, which the optimizer can exploit through a feature called “query rewrite” to divert a query to the summary instead of the detailed data. The performance benefits can be extraordinary, as a high-level query (e.g. “show me the total sales by region and month”) might have to scan only a few kilobytes of data instead of many megabytes or gigabytes. Click on the sidebar category “Oracle Materialized Views” for some of my own articles, and review the information on materialized views in the Oracle Data Warehousing Guide. If you are using 10g then strongly consider using the functionality of the DBMS_ADVANCED_REWRITE package to define query rewrite against regular tables, instead of using materialized views. You will have to write your own ETL code to maintain the summaries, but that’s fairly straightforward, nothing magical about it (there is a small sketch at the end of this comment).

    I don’t suppose you can leverage some of those reduced Informatica licensing fees to get new hardware? :D
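
    For what it’s worth, here is a minimal sketch of that DBMS_ADVANCED_REWRITE approach; the table, column and equivalence names are invented for the example, and you would substitute your own:

      -- A summary table maintained by your own ETL code.
      create table sales_month_summary as
      select region,
             trunc(sale_date, 'MM') as sale_month,
             sum(sale_amount)       as sale_amount
      from   sales_fact
      group  by region, trunc(sale_date, 'MM');

      -- Declare the equivalence so that matching queries can be rewritten against
      -- the summary. Rewrite against declared equivalences only happens when
      -- query_rewrite_integrity is TRUSTED or STALE_TOLERATED, and you need
      -- execute privilege on DBMS_ADVANCED_REWRITE (granted by SYS).
      begin
        sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
          name             => 'SALES_MONTH_RW',
          source_stmt      => 'select region, trunc(sale_date, ''MM''), sum(sale_amount)
                               from sales_fact
                               group by region, trunc(sale_date, ''MM'')',
          destination_stmt => 'select region, sale_month, sale_amount
                               from sales_month_summary',
          validate         => false,
          rewrite_mode     => 'TEXT_MATCH');
      end;
      /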

  7. Hi,
    Recently I have hit a problem with Informatica and Oracle.
    The error prompt is CMN_022: ORA-12500, the listener failed to start a dedicated server process.

    I tried searching on Google but I can’t find any good clue about how it works. Basically the advice is to reduce the Oracle memory to leave more for the OS, because Informatica needs all of its connections to be established as soon as it initialises.

    It is an intermittent problem which only happens around 1pm.

    Did you guys have this problem?

  8. ORA-12500: I don’t recall getting that error myself, but it does appear to be related to overloading the server. Do you have Informatica and Oracle on the same box?
