Applying Predicates and Partition Pruning to MERGE Target Tables

Just a quick pointer to an answer on the Oracle Forum:

If you are merging into a data warehouse fact table (not so likely) or a large aggregation of a fact table (more likely) then you can improve efficiency by adding a transitive predicate into the USING clause to promote partition pruning on the target table.
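The idea can be sketched as follows. This is a hypothetical example — the table and column names (sales_monthly, sales_delta, month_id, etc.) are illustrative, and it assumes the target is range-partitioned on the pruning column. The predicate inside the USING subquery restricts the source, and repeating it transitively against the target in the ON clause is what allows the optimizer to prune the target's partitions:

```sql
-- Assumes SALES_MONTHLY is range-partitioned on MONTH_ID and we are
-- merging in one month of deltas. Names are illustrative.
MERGE INTO sales_monthly tgt
USING (SELECT month_id, product_id, SUM(amount) AS amount
       FROM   sales_delta
       WHERE  month_id = 200701          -- restricts the source rows
       GROUP  BY month_id, product_id) src
ON (    tgt.month_id   = src.month_id
    AND tgt.product_id = src.product_id
    AND tgt.month_id   = 200701)         -- transitive predicate: prunes target
WHEN MATCHED THEN UPDATE
   SET tgt.amount = tgt.amount + src.amount
WHEN NOT MATCHED THEN INSERT (month_id, product_id, amount)
   VALUES (src.month_id, src.product_id, src.amount);
```

Without the repeated predicate the join condition alone may not be enough for the optimizer to prove that only one target partition can be touched.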

Unless you are using Partition Change Tracking I think it is very unlikely that the MERGE inherent to a materialized view refresh is going to partition prune on the target table.


Interviewing: The difference between Delete and Truncate in Oracle

I interviewed a couple of senior ETL developer candidates last week, so I shook the dust off of some of my tried-and-true probes for technical knowledge.

I’ve always had a liking for asking about the differences between “truncate” and “delete”, ever since I was asked it myself some years ago. I think it allows candidates to demonstrate a nice range of knowledge.

Here are some answers in no particular order.

  1. “Truncate is fast, delete is slow”. Hmmm, well that’s mostly true I guess. The most basic of all answers.
  2. “Truncate is DDL, delete is DML”. Is this just so obvious that people don’t mention it?
  3. “You can’t rollback a truncate”. Indeed. See 2 above, of course. And 1.
  4. “Truncate is implicitly a commit”. A better answer than 3 I think. I seem to recall that someone once mentioned that there are actually two commits in a truncate, one before and one after … but maybe I dreamed that. I should test it really.
  5. “You can’t grant permission to truncate a table”. Ah, practical experience shines through. If you don’t like your ETL process to connect as the owner of the schema then this is a challenge that has to be overcome with stored procedures or something sophisticated like that. You really don’t want to grant “DROP ANY TABLE” to your ETL user.
  6. “You can delete any subset of rows, but you can only truncate the complete table, or a partition or subpartition of it”. Is this also so obvious that nobody mentions it?
  7. “Truncate makes unusable indexes usable again”. A real gotcha for the unwary. If you attempt to optimise a data load by rendering indexes unusable and truncating a table (possibly followed by an index rebuild and a partition exchange) then be careful of the order.
  8. “Truncate can’t maintain foreign keys”. It’s “cascading delete”, not “cascading truncate”. That would be an interesting feature though — point 4 above would make it a little trickier. Truncating an index cluster is pretty close to a “cascading truncate” to a limited extent though. In any case no truncate is permitted on a table referenced by foreign keys.
  9. “You can’t flashback a truncate”. This is an oddity to me. We can flashback a “drop table”, rollback uncommitted deletes, or use flashback to recover pre-commit deleted data, but a truncate is a barrier across which we cannot flashback.
  10. “Truncate deallocates space, delete doesn’t”. Unless you want it not to, using the “reuse storage” clause. However the high water mark is reset in either case so maybe that’s a better answer …
  11. “Truncate resets the high water mark, delete doesn’t”. And on the indexes, also. See 7 above also.
  12. “Truncate allows special treatment of materialized view logs”. More of that practical experience and/or documentation-reading skill shows through.
  13. “DML triggers do not fire on a truncate”. Because … um … it’s DDL not DML.

I suppose there are other issues as well, but this shows that it’s a complex business.  I really wish we could grant “Truncate table” to a user or role … enhancement request time I think.
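Point 5 above can be worked around along these lines. This is a hypothetical sketch — the schema, procedure, and table names (stage_owner, etl_user, sales_stage) are illustrative — of a definer's-rights procedure that lets an ETL user truncate one specific table without holding DROP ANY TABLE:

```sql
-- Created in the schema owner's account. Because TRUNCATE is DDL it
-- has to be issued via dynamic SQL inside PL/SQL.
CREATE OR REPLACE PROCEDURE stage_owner.truncate_sales_stage
AUTHID DEFINER
AS
BEGIN
   EXECUTE IMMEDIATE 'TRUNCATE TABLE stage_owner.sales_stage';
END;
/

-- The ETL user gets execute on the procedure, nothing more.
GRANT EXECUTE ON stage_owner.truncate_sales_stage TO etl_user;
```

The procedure runs with the owner's privileges, so the grantee can truncate exactly that one table and nothing else.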

Isn’t it interesting that in some circumstances only one method can remove all rows from a table in a single step, while in others both can; and in some circumstances only one method can remove a subset of rows, while in others both can.

Oh well, back to work.

Back to a More Simple Time

If we browse the Oracle Data Warehousing Guide we find a number of interesting database structures and features that can make our warehouses easier to load, faster to query, simpler to manage etc. Partitioning, parallelism, bitmap indexes, data segment compression, materialized views and query rewrite, for example. It has been a long, long time since I implemented anything of any size without at least two or three of these.

So let us imagine, gentle reader, a return to the days of simplicity when we had none of these — how would that look? How would it perform? Well it turns out that it would look a lot like the data warehouse that comes out of the box with Siebel Marketing Analytics. No partitioning? Check! Only the most ordinary, uncompressed b-tree indexes? Check! Etc.. You get the idea. And how will it perform? Our lucky band of adventurers is about to find out, as we implement it with the bare minimum of modifications to add some custom data elements from our Siebel CRM system.

Some current areas of concern:

  • A fact table with 50 indexes that in development consume 1.5Gb of space for every 1Gb of fact table data. How is the drop-all-indexes-and-rebuild-after-every-load strategy going to turn out with production sized data?
  • The use of updates against the largest fact table. No data segment compression!
  • No foreign keys and no bitmap indexes. Do star transformations even work without those?
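On that last concern, the usual prerequisites for star transformation are bitmap indexes on the fact table's dimension-key columns plus the relevant initialization parameter. A hypothetical sketch, with illustrative table and column names:

```sql
-- Bitmap indexes on each fact-table join column to the dimensions.
-- Names are illustrative.
CREATE BITMAP INDEX sales_fact_prod_bx ON sales_fact (product_key);
CREATE BITMAP INDEX sales_fact_date_bx ON sales_fact (date_key);

-- Star transformation is off by default and must be enabled.
ALTER SESSION SET star_transformation_enabled = TRUE;
```

Declared foreign keys are not strictly required for the transformation, but the bitmap indexes (or bitmap join indexes) on the fact table effectively are.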

On the plus side we do have a performance testing environment sized at around 80% of production data volumes, so we will have an opportunity to tweak the worst offenders. We do have parallelism and the licensing for everything that we might want to throw at the system, and we do have Quest Performance Analysis for Oracle already prepped on the test and production instances.

In preparation for performance testing I have a little mental checklist of what can be implemented quickly and quietly to tackle certain problems, such as replacing b-tree indexes with bitmaps, or a little partitioning, but it’s complexified by the presence of the “Data Warehouse Administration Console” (DAC), an apparently essential component of the architecture that is responsible for executing the 400+ Informatica workflows that load the staging and warehouse tables. The philosophy around it seems to be that anything platform specific, such as dropping and creating indexes, is defined and controlled through DAC rather than through Informatica (where it is generally a little kludgy, even if it is serviceable). So we have some exploration to do on that product to see how it handles customisation of indexes (maybe it recreates them the way it found them, but I suspect some customisation is required to make it deal with creating “exotics” such as local bitmap indexes).

We go into integration testing this coming week, with performance testing to follow. Should be an interesting ride.

Data Warehouse Architecture Decisions for Ad hoc and Predefined Reports

First, a disclaimer: the techniques herein represent “blue-sky” thought on an abstract problem. I have not implemented these ideas, although I am very willing to do so given the chance and if nobody can postulate a fatal objection to them. Or a near-fatal one. If you disagree with any of these ideas then you are not contradicting my experience and I’m not going to get all defensive over the matter.

And so to the substance … Continue reading

Presentation of Summaries and Materialized Views

I submitted an abstract for a presentation at the Rocky Mountain OUG Training Days 2007, on the subject of materialized views, summaries, fast refresh and whatnot. Obviously a lot of the substance will come from the material I’ve been posting here for the past year-and-a-bit, emphasising conceptual understanding, analysing the internals of the refresh process, a caveat here and there, the ongoing search for optimisation techniques, and with other material to tie it together as a coherent topic.
Since this is aimed at all levels of experience, from “what is a materialized view” through to “Ah, but in the case of partitioning of a HOLAP-based MV …” what issues would you like to see covered in the talk?

Here is what I have so far, in no particular order …

  • When and what to summarise.
  • Components of a materialized view (table, metadata etc).
  • Query rewrite.
  • Complete and fast refresh.
  • Fast refresh following conventional path, direct path, and partition DDL operations.
  • Propagating changes to the MV data.
  • Performance issues.
  • Alternatives to MV use and tuning manual refresh of summaries.
  • DBMS_Advanced_Rewrite
  • Caveats (32-column limitation on GROUP BY, ORA-12034, PCT and prebuilt tables etc)

A lot of ground to cover, but I’ve optimistically applied for a 90 minute slot — I think I could manage double that at a pinch!

What would you emphasise out of that, and what would you add?

Implied Restriction on Number of GROUP BY Columns in Materialized View

pranavgovind hit this problem, as described on the DBA forum, wherein Oracle’s attempt to index all the columns of an aggregating materialized view fails when more than 32 columns are included in the GROUP BY, and the error “ORA-01793: maximum number of index columns is 32” is raised.

I added my own thoughts there, but this is symptomatic of some more general failings in the area of materialized views in which the creation and refresh codes do not take as much advantage of chances to optimise as a human could.

For example, consider a materialized view built against a fact table joined to a date hierarchy, grouping by date_of_month along with the higher-level date columns that it functionally determines. In such a case only the date_of_month column needs to be indexed — the inclusion of any other columns represents a waste of resources. In fact in this case the index ought to be declared to be UNIQUE, whether or not the Sys_Op_Map_NonNull() function is used on the column in order to be compatible with Oracle’s recursive SQL for refreshing materialized views. However Oracle would wastefully index all four of the GROUP BY columns.
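A hypothetical sketch of the kind of materialized view described above — the table and column names (sales, date_dim, date_of_month and so on) are illustrative, and fast refresh additionally assumes suitable materialized view logs on the base tables:

```sql
-- date_of_month functionally determines month_cd, quarter_cd and year_cd,
-- so an index on date_of_month alone would suffice for the refresh --
-- and it could even be declared UNIQUE. Oracle nevertheless indexes
-- all four GROUP BY columns. Names are illustrative.
CREATE MATERIALIZED VIEW sales_by_day
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT d.date_of_month,
       d.month_cd,
       d.quarter_cd,
       d.year_cd,
       SUM(s.amount)   AS amount,
       COUNT(s.amount) AS amount_count,
       COUNT(*)        AS row_count
FROM   sales    s,
       date_dim d
WHERE  s.date_of_month = d.date_of_month
GROUP  BY d.date_of_month, d.month_cd, d.quarter_cd, d.year_cd;
```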

Compensating for Poor Data Warehouse Hardware

Here is a thought that is brought on by this posting at the Dizwell forum.

The question seems to me to boil down to this essential issue: “Can a large memory allocation compensate for poor I/O performance in a data warehouse?”. It also seems to me that the answer to this question is a partly-qualified “No”, for the following reasons. Continue reading