Applying Predicates and Partition Pruning to MERGE Target Tables

Just a quick pointer to an answer on the Oracle Forum:  http://forums.oracle.com/forums/thread.jspa?messageID=3912341#3912341

If you are merging into a data warehouse fact table (not so likely) or a large aggregation of a fact table (more likely) then you can improve efficiency by adding a transitive predicate into the USING clause to promote partition pruning on the target table.
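As a rough sketch of the idea (all object names and dates below are invented), the restriction already applied to the source rows is simply repeated against the target's partition key in the ON clause, so that the optimizer has a predicate to prune the target with:

    MERGE INTO sales_agg t                       -- target partitioned by sale_month
    USING (
        SELECT prod_id, sale_month, SUM(amount) AS amount
        FROM   sales_delta
        WHERE  sale_month >= DATE '2006-10-01'   -- restriction on the source ...
        GROUP  BY prod_id, sale_month
    ) s
    ON (    t.prod_id    = s.prod_id
        AND t.sale_month = s.sale_month
        AND t.sale_month >= DATE '2006-10-01' )  -- ... repeated transitively against
                                                 -- the target's partition key
    WHEN MATCHED THEN
        UPDATE SET t.amount = t.amount + s.amount
    WHEN NOT MATCHED THEN
        INSERT (prod_id, sale_month, amount)
        VALUES (s.prod_id, s.sale_month, s.amount);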

Unless you are using Partition Change Tracking I think it is very unlikely that the MERGE inherent to a materialized view refresh is going to partition prune on the target table.

Interviewing: The difference between Delete and Truncate in Oracle

I interviewed a couple of senior ETL developer candidates last week, so I shook the dust off some of my tried-and-true probes for technical knowledge.

I’ve always had a liking for asking about the differences between “truncate” and “delete”, ever since I was asked it myself some years ago. I think it allows candidates to demonstrate a nice range of knowledge.

Here are some answers in no particular order.

  1. “Truncate is fast, delete is slow”. Hmmm, well that’s mostly true I guess. The most basic of all answers.
  2. “Truncate is DDL, delete is DML”. Is this just so obvious that people don’t mention it?
  3. “You can’t rollback a truncate”. Indeed. See 2 above, of course. And 1.
  4. “Truncate is implicitly a commit”. A better answer than 3 I think. I seem to recall that someone once mentioned that there are actually two commits in a truncate, one before and one after … but maybe I dreamed that. I should test it really.
  5. “You can’t grant permission to truncate a table”. Ah, practical experience shines through. If you don’t want your ETL process to connect as the owner of the schema then this is a challenge that has to be overcome with stored procedures or something similarly sophisticated (see the sketch after this list). You really don’t want to grant “DROP ANY TABLE” to your ETL user.
  6. “You can delete any subset of rows, but you can only truncate the complete table, or a partition or subpartition of it”. Is this also so obvious that nobody mentions it?
  7. “Truncate makes unusable indexes usable again”. A real gotcha for the unwary. If you attempt to optimise a data load by rendering indexes unusable and truncating a table (possibly followed by an index rebuild and a partition exchange) then be careful of the order.
  8. “Truncate can’t maintain foreign keys”. It’s “cascading delete”, not “cascading truncate”. That would be an interesting feature though — point 4 above would make it a little trickier. Truncating an index cluster is pretty close to a “cascading truncate” to a limited extent though. In any case no truncate is permitted on a table referenced by foreign keys.
  9. “You can’t flashback a truncate”. This is an oddity to me. We can flashback a “drop table”, rollback uncommitted deletes, or use flashback to recover pre-commit deleted data, but a truncate is a barrier across which we cannot flashback.
  10. “Truncate deallocates space, delete doesn’t”. Unless you want it not to, using the “reuse storage” clause. However the high water mark is reset in either case so maybe that’s a better answer …
  11. “Truncate resets the high water mark, delete doesn’t”. And on the indexes, also. See 7 above also.
  12. “Truncate allows special treatment of materialized view logs”. More of that practical experience and/or documentation-reading skill shows through.
  13. “DML triggers do not fire on a truncate”. Because … um … it’s DDL not DML.
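On point 5, one workaround is a definer-rights procedure owned by the schema owner; the sketch below is illustrative only (the procedure and grantee names are invented, and DBMS_ASSERT needs 10gR2 or later):

    -- Owned by the schema owner; AUTHID DEFINER is the default, spelled out
    -- here for clarity. DBMS_ASSERT guards against injection through the
    -- table name parameter.
    CREATE OR REPLACE PROCEDURE truncate_table (p_table_name IN VARCHAR2)
    AUTHID DEFINER
    AS
    BEGIN
        EXECUTE IMMEDIATE 'TRUNCATE TABLE '
            || DBMS_ASSERT.SIMPLE_SQL_NAME(p_table_name);
    END truncate_table;
    /

    -- The ETL account gets execute on the procedure rather than any
    -- "DROP ANY TABLE" style of privilege.
    GRANT EXECUTE ON truncate_table TO etl_user;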

I suppose there are other issues as well, but this shows that it’s a complex business.  I really wish we could grant “Truncate table” to a user or role … enhancement request time I think.

Isn’t it interesting that in some circumstances only one of the two methods can remove all rows from a table in a single step, while in others both can; and that sometimes only delete can remove a subset of rows, yet when that subset happens to be a whole partition either method will do.

Oh well, back to work.

Back to a More Simple Time

If we browse the Oracle Data Warehousing Guide we find a number of interesting database structures and features that can make our warehouses easier to load, faster to query, simpler to manage, etc. Partitioning, parallelism, bitmap indexes, data segment compression, materialized views and query rewrite, for example. It has been a long, long time since I implemented anything of any size without at least two or three of these.

So let us imagine, gentle reader, a return to the days of simplicity when we had none of these — how would that look? How would it perform? Well it turns out that it would look a lot like the data warehouse that comes out of the box with Siebel Marketing Analytics. No partitioning? Check! Only the most ordinary, uncompressed b-tree indexes? Check! Etc.. You get the idea. And how will it perform? Our lucky band of adventurers is about to find out, as we implement it with the bare minimum of modifications to add some custom data elements from our Siebel CRM system.

Some current areas of concern:

  • A fact table with 50 indexes that, in development, consume 1.5GB of space for every 1GB of fact table data. How is the drop-all-indexes-and-rebuild-after-every-load strategy going to turn out with production-sized data?
  • The use of updates against the largest fact table, so no data segment compression!
  • No foreign keys and no bitmap indexes. Do star transformations even work without those? (See the sketch after this list.)
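On the star transformation question, declared foreign keys are not actually a prerequisite, but bitmap indexes on the fact table's dimension-key columns and the star_transformation_enabled parameter are. A rough sketch, with all object names invented:

    -- Enable the feature for the session (it can also be set at instance level).
    ALTER SESSION SET star_transformation_enabled = TRUE;

    -- Bitmap indexes on the fact table's dimension-key columns are what the
    -- transformation builds its bitmap AND/OR access paths from.
    CREATE BITMAP INDEX w_revn_f_cust_bix ON w_revn_f (customer_wid) NOLOGGING;
    CREATE BITMAP INDEX w_revn_f_prod_bix ON w_revn_f (product_wid)  NOLOGGING;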

On the plus side we do have a performance testing environment sized at around 80% of production data volumes, so we will have an opportunity to tweak the worst offenders. We do have parallelism and the licensing for everything that we might want to throw at the system, and we do have Quest Performance Analysis for Oracle already prepped on the test and production instances.

In preparation for performance testing I have a little mental checklist of what can be implemented quickly and quietly to tackle certain problems, such as replacing b-tree indexes with bitmaps, or a little partitioning, but it’s complexified by the presence of the “Data Warehouse Administration Console” (DAC), an apparently essential component of the architecture that is responsible for executing the 400+ Informatica workflows that load the staging and warehouse tables. The philosophy around it seems to be that anything platform specific, such as dropping and creating indexes, is defined and controlled through DAC rather than through Informatica (where it is generally a little kludgy, even if it is serviceable). So we have some exploration to do on that product to see how it handles customisation of indexes (maybe it recreates them the way it found them, but I suspect some customisation is required to make it deal with creating “exotics” such as local bitmap indexes).

We go into integration testing this coming week, with performance testing to follow. Should be an interesting ride.

Data Warehouse Architecture Decisions for Ad hoc and Predefined Reports

First, a disclaimer: the techniques herein represent “blue-sky” thought on an abstract problem. I have not implemented these ideas, although I am very willing to do so given the chance and if nobody can postulate a fatal objection to them. Or a near-fatal one. If you disagree with any of these ideas then you are not contradicting my experience and I’m not going to get all defensive over the matter.

And so to the substance … Continue reading

Presentation of Summaries and Materialized Views

I submitted an abstract for a presentation at the Rocky Mountain OUG Training Days 2007, on the subject of materialized views, summaries, fast refresh and whatnot. Obviously a lot of the substance will come from the material I’ve been posting here for the past year-and-a-bit, emphasising conceptual understanding, analysing the internals of the refresh process, a caveat here and there, the ongoing search for optimisation techniques, and with other material to tie it together as a coherent topic.
Since this is aimed at all levels of experience, from “what is a materialized view” through to “Ah, but in the case of partitioning of a HOLAP-based MV …”, what issues would you like to see covered in the talk?

Here is what I have so far, in no particular order …

  • When and what to summarise.
  • Components of a materialized view (table, metadata etc).
  • Query rewrite.
  • Complete and fast refresh.
  • Fast refresh following conventional path, direct path, and partition DDL operations.
  • Propagating changes to the MV data.
  • Performance issues.
  • Alternatives to MV use and tuning manual refresh of summaries.
  • DBMS_Advanced_Rewrite
  • Caveats (32-column limitation on GROUP BY, ORA-12034, PCT and prebuilt tables etc)

A lot of ground to cover, but I’ve optimistically applied for a 90 minute slot — I think I could manage double that at a pinch!

What would you emphasise out of that, and what would you add?

Implied Restriction on Number of GROUP BY Columns in Materialized View

pranavgovind hit this problem, as described on the DBA Support.com forum, wherein Oracle’s attempt to index all the columns of an aggregating materialized view fails when more than 32 columns are included in the GROUP BY, and the error “ORA-01793: maximum number of index columns is 32” is raised.

I added my own thoughts there, but this is symptomatic of a more general failing with materialized views: the creation and refresh code does not take as much advantage of opportunities to optimise as a human would.

For example, consider a materialized view built against a fact table joined to a date hierarchy, so as to define the MV query along the lines of the sketch below. In such a case only the date_of_month column needs to be indexed, since it determines the other columns of the hierarchy; the inclusion of any other columns represents a waste of resources. In fact, in this case the index ought to be declared to be UNIQUE, whether or not the Sys_Op_Map_NonNull() function is used on the column in order to be compatible with Oracle’s recursive SQL for refreshing materialized views. However Oracle would wastefully index all four of the GROUP BY columns.
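A hedged reconstruction of the sort of MV I have in mind (every name below is invented, and materialized view logs on both tables are assumed to exist):

    CREATE MATERIALIZED VIEW sales_by_day_mv
    BUILD IMMEDIATE
    REFRESH FAST ON DEMAND
    ENABLE QUERY REWRITE
    AS
    SELECT d.date_of_month,
           d.week_of_year,
           d.month_of_year,
           d.cal_year,
           SUM(f.amount)   AS sum_amount,
           COUNT(f.amount) AS cnt_amount,
           COUNT(*)        AS cnt_star
    FROM   sales_fact f,
           date_dim   d
    WHERE  f.date_of_month = d.date_of_month
    GROUP  BY d.date_of_month,
              d.week_of_year,
              d.month_of_year,
              d.cal_year;

    -- date_of_month determines the other three GROUP BY columns, so this
    -- single-column unique index is all the MV really needs (Oracle's own
    -- automatic index can be suppressed with the USING NO INDEX clause):
    CREATE UNIQUE INDEX sales_by_day_mv_ux
        ON sales_by_day_mv (SYS_OP_MAP_NONNULL(date_of_month));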

Compensating for Poor Data Warehouse Hardware

Here is a thought that is brought on by this posting at the Dizwell forum.

The question seems to me to boil down to this essential issue: “Can a large memory allocation compensate for poor I/O performance in a data warehouse?”. It also seems to me that the answer to this question is a partly-qualified “No”, for the following reasons. Continue reading

The Three Pillars of Oracle Data Warehousing

Introduction

This is a basic topic for Oracle data warehousing beginners, based on some ideas that I’m hoping will stop buzzing round in my head if I commit them to virtual paper.

There are three Oracle features that provide a foundation for successful data warehousing:

  • Partitioning
  • Parallelism
  • Psummary Tables (the “p” is silent)

Here are the benefits that they bring to the system. Continue reading

A Quick Materialized View Performance Note

I was asked in an email this morning about how to improve materialized view refresh performance, and it’s something that appears in Google hits for the blog quite frequently. So I thought I’d write a quick-and-dirty posting to summarise some of the research (and the resulting practical applications) that I’ve been fiddling with recently.

Here are some bullet points:

  1. The materialized view fast refresh mechanism is a one-size-fits-all solution, and is probably not efficient for 99% of summary table maintenance operations.
  2. The join of the aggregated change data to the MV is function-based, as the columns of both relations are wrapped in the Sys_Op_Map_NonNull() function that allows “null = null” joins (see the sketch after this list). I think it is extremely unlikely that anyone has nullable attribute columns in their fact or summary tables, so this, and the composite function-based index required to support it, are a waste of resources.
  3. Because of the nature of the join it seems to be extremely unlikely that partition pruning of the summary table could take place.
  4. The join mechanism promotes nested loop joins, where a hash join is probably more efficient (that’s technically an outer join in the merge, of course).
  5. The refresh mechanism assumes that a merge will be required, when sometimes an insert is not only possible but is very much more efficient.
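To illustrate point 2 above, here is a much simplified picture of the join that the refresh MERGE performs. This is not the actual recursive SQL, and every object name below is invented:

    -- "chg" stands for the aggregated change data drawn from the MV log.
    MERGE INTO sales_mv mv
    USING (SELECT prod_id, sale_date, SUM(amount) AS sum_amount
           FROM   mlog$_sales
           GROUP  BY prod_id, sale_date) chg
    ON (    SYS_OP_MAP_NONNULL(mv.prod_id)   = SYS_OP_MAP_NONNULL(chg.prod_id)
        AND SYS_OP_MAP_NONNULL(mv.sale_date) = SYS_OP_MAP_NONNULL(chg.sale_date))
    WHEN MATCHED THEN
        UPDATE SET mv.sum_amount = mv.sum_amount + chg.sum_amount
    WHEN NOT MATCHED THEN
        INSERT (prod_id, sale_date, sum_amount)
        VALUES (chg.prod_id, chg.sale_date, chg.sum_amount);

    -- ... supported on the MV side by a composite function-based index of the form:
    CREATE INDEX sales_mv_fnc_ix ON sales_mv
        (SYS_OP_MAP_NONNULL(prod_id), SYS_OP_MAP_NONNULL(sale_date));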

If performance (and robustness, IMHO) are an issue for you then I would advise that you do the following:

  1. Use materialized views only for enabling query rewrite (which means creating them on a prebuilt summary table, and unless you are loading to the summary by partition exchange then you have to drop the MV, refresh the summary table, and recreate the MV). In 10g it is much easier to use the DBMS_Advanced_Rewrite package instead of MV’s (see the sketch after this list).
  2. Write your own refresh code, based on the usual principles of writing good SQL. If you don’t need a merge then don’t use it. If you don’t need to join to dimension tables to get higher attributes then don’t do it.
  3. Leverage different levels of aggregation to help produce higher levels. For a series of hierarchical summaries, multi-level aggregations can be extremely beneficial.
  4. Consider storing the refresh and MV definition SQL in CLOB columns of a summary management table, so they can be tuned and edited without needing to open up package code to do so.
  5. Consider using a complete refresh, either through MV’s or manually, for higher aggregation levels, particularly when you can reference another summary table to do so.
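To illustrate point 1 of that list (all names below are invented, and the summary table sales_by_month is assumed to already exist and to be loaded by your own code):

    -- Option A: register an MV on the prebuilt summary table purely to enable
    -- query rewrite; the MV name must match the name of the existing table.
    -- Rewrite against a prebuilt table generally needs TRUSTED or
    -- STALE_TOLERATED query_rewrite_integrity.
    CREATE MATERIALIZED VIEW sales_by_month
    ON PREBUILT TABLE
    ENABLE QUERY REWRITE
    AS
    SELECT prod_id, sale_month, SUM(amount) AS amount
    FROM   sales_fact
    GROUP  BY prod_id, sale_month;

    -- Option B (10g onwards): no MV at all, just declare the equivalence.
    BEGIN
        SYS.DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE(
            name             => 'SALES_BY_MONTH_RW',
            source_stmt      => 'SELECT prod_id, sale_month, SUM(amount) AS amount
                                 FROM sales_fact GROUP BY prod_id, sale_month',
            destination_stmt => 'SELECT prod_id, sale_month, amount
                                 FROM sales_by_month',
            validate         => FALSE,
            rewrite_mode     => 'GENERAL');
    END;
    /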

My practical experience of taking this path delivered a reduction in the refresh time of a set of seven materialized views based on a single fact table, from a couple of hours down to six minutes. The publication of this is an example of what my client’s technical lead calls “polishing the crown” :D

Anyway, this is all a sort-of abstract from a whitepaper that I’m working on right now, which will include all sorts of proofs and demonstrations of the above, plus some process and risk management advice, but it doesn’t look like it’ll be ready very soon. Feel free to comment or ask questions on the points raised above though — I don’t want anyone to be kept in suspenders ’till the paper is finished.

Bad Advice On Materialized View Deferrable Constraints

An article by Jonathan Lewis on constraint problems with materialized views set off a little bell in my head about a metalink document on the subject, Note 284101.1.

The note protests the use of unique, foreign key and primary key constraints on materialized views, and includes the following interesting phrases …

“It is not a good idea to create unique (unique indexes ) constraints , primary key or Foreign key constraints on the snapshot base table … There should not be any Unique/PK constraints at Mview base table … Drop the Unique/PK constriants on Mview Base Tables.”

Then one of the workarounds is “You can have Deferred Constraints”. The document doesn’t identify why having deferred constraints might be a bad idea, or why not having integrity constraints on the table at all is preferable to having them deferred. I can’t think of a reason why that might be so, and I can think of several reasons why it might not be so (query optimization being the prime one). The deferred-constraint alternative the note waves at is sketched below.
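For what it is worth, that alternative would look something like this (names invented), applied to the container table of the materialized view:

    -- Still enforced, but not until commit, so the intermediate states of a
    -- refresh cannot violate it; and unlike dropping the constraint outright,
    -- the optimizer still knows it holds.
    ALTER TABLE sales_by_month_mv
        ADD CONSTRAINT sales_by_month_mv_pk
        PRIMARY KEY (prod_id, sale_month)
        DEFERRABLE INITIALLY DEFERRED;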

All this misdirection and ambiguity, combined with the abbreviation of an error code from “ORA-00001” to “ORA-1” in the title of a help document (does that make sense to anyone?), makes this a pretty poor piece of work. It’s not the Most Useless Metalink Article Ever, but it’s close.

By the way, if you read the title and the first few words of this article, and thought “Dave’s going to say that Jonathan Lewis gave bad advice!” before realising I wasn’t … the effect was intentional :D If you didn’t think that then my plan failed.