The Oracle Sponge

Oracle Data Warehouse Design and Architecture

Archive for March, 2008

Multitable Insert and Distinct Values

Posted by David Aldridge on 2008-03-27

I puzzled a little on the way to work about efficiently populating a hierarchy table based on a flat dimension. (“Why?” … seriously, you just don’t want to know that).

So let us suppose you have a table like this:

create table dim
  (city varchar2(30),
  state varchar2(30),
  region varchar2(30),
  country varchar2(30))
/

… and data like this … 

insert into dim values ('Fairfax'         ,'Virginia','US-East'   ,'USA'  );
insert into dim values ('Merrifield'      ,'Virginia','US-East'   ,'USA'  );
insert into dim values ('Alexandria'      ,'Virginia','US-East'   ,'USA'  );
insert into dim values ('Portland'        ,'Maine'   ,'US-East'   ,'USA'  );
insert into dim values ('Colorado Springs','Colorado','US-Central','USA'  );
insert into dim values ('Denver'          ,'Colorado','US-Central','USA'  );
insert into dim values ('Longmont'        ,'Colorado','US-Central','USA'  );
insert into dim values ('Salt Lake City'  ,'Utah'    ,'US-Central','USA'  );
insert into dim values ('Rome'            ,'Lazio'   ,'IT-Central','Italy');

… and you want to turn it into a hierarchy in the following table …

create table hier
(attribute varchar2(30),
parent varchar2(30),
lvl varchar2(30))
/

You could do it as follows:

insert into hier
select          city   , state  , 'City'    from dim
union all
select distinct state  , region , 'State'   from dim
union all
select distinct region , country, 'Region'  from dim
union all
select distinct country, null   , 'Country' from dim
/

No distinct on the first select clause because we assume that City is unique (it would be if it were the PK on a real dimension, of course). In the absence of indexes to support fast full index scans we have four full table scans and three operations to support the distinct operator …

----------------------------------------------------------------------------                                                                                                                                                                                                                                
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |                                                                                                                                                                                                                                
----------------------------------------------------------------------------                                                                                                                                                                                                                                
|   0 | INSERT STATEMENT    |      |    25 |   418 |    11  (82)| 00:00:01 |                                                                                                                                                                                                                                
|   1 |  UNION-ALL          |      |       |       |            |          |                                                                                                                                                                                                                                
|   2 |   TABLE ACCESS FULL | DIM  |     9 |   171 |     2   (0)| 00:00:01 |                                                                                                                                                                                                                                
|   3 |   HASH UNIQUE       |      |     9 |   162 |     3  (34)| 00:00:01 |                                                                                                                                                                                                                                
|   4 |    TABLE ACCESS FULL| DIM  |     9 |   162 |     2   (0)| 00:00:01 |                                                                                                                                                                                                                                
|   5 |   HASH UNIQUE       |      |     5 |    75 |     3  (34)| 00:00:01 |                                                                                                                                                                                                                                
|   6 |    TABLE ACCESS FULL| DIM  |     9 |   135 |     2   (0)| 00:00:01 |                                                                                                                                                                                                                                
|   7 |   HASH UNIQUE       |      |     2 |    10 |     3  (34)| 00:00:01 |                                                                                                                                                                                                                                
|   8 |    TABLE ACCESS FULL| DIM  |     9 |    45 |     2   (0)| 00:00:01 |                                                                                                                                                                                                                                
----------------------------------------------------------------------------                                                                                                                                                                                                                                

A pretty clean looking piece of SQL, but multiple scans of tables make my hackles rise. If this were a very large table and indexes couldn’t be leveraged then it could be a performance problem.

An alternative occured to me:

insert all
when 1               = 1 then into hier values (city   ,state  ,'City'   )
when state_row_ind   = 1 then into hier values (state  ,region ,'State'  )
when region_row_ind  = 1 then into hier values (region ,country,'Region' )
when country_row_ind = 1 then into hier values (country,null   ,'Country')
select city,
       state,
       region,
       country,
       row_number() over (partition by region  order by 1) region_row_ind,
       row_number() over (partition by state   order by 1) state_row_ind,
       row_number() over (partition by country order by 1) country_row_ind
from dim
/

A single scan of the dim table is performed, with three window sorts:

--------------------------------------------------------------------------------                                                                                                                                                                                                                            
| Id  | Operation               | Name | Rows  | Bytes | Cost (%CPU)| Time     |                                                                                                                                                                                                                            
--------------------------------------------------------------------------------                                                                                                                                                                                                                            
|   0 | INSERT STATEMENT        |      |     9 |   963 |     5  (60)| 00:00:01 |                                                                                                                                                                                                                            
|   1 |  MULTI-TABLE INSERT     |      |       |       |            |          |                                                                                                                                                                                                                            
|   2 |   INTO                  | HIER |       |       |            |          |                                                                                                                                                                                                                            
|   3 |   INTO                  | HIER |       |       |            |          |                                                                                                                                                                                                                            
|   4 |   INTO                  | HIER |       |       |            |          |                                                                                                                                                                                                                            
|   5 |   INTO                  | HIER |       |       |            |          |                                                                                                                                                                                                                            
|   6 |    VIEW                 |      |     9 |   963 |     5  (60)| 00:00:01 |                                                                                                                                                                                                                            
|   7 |     WINDOW SORT         |      |     9 |   288 |     5  (60)| 00:00:01 |                                                                                                                                                                                                                            
|   8 |      WINDOW SORT        |      |     9 |   288 |     5  (60)| 00:00:01 |                                                                                                                                                                                                                            
|   9 |       WINDOW SORT       |      |     9 |   288 |     5  (60)| 00:00:01 |                                                                                                                                                                                                                            
|  10 |        TABLE ACCESS FULL| DIM  |     9 |   288 |     2   (0)| 00:00:01 |                                                                                                                                                                                                                            
--------------------------------------------------------------------------------

The cost of the two methods is exactly the same, but the cardinalities are both out — the correct answer is “19″. Of the two methods I’d expect the first to be more accurate and it did do pretty well. With such a small amount of data it’s difficult to make an accurate comparison, and the proper approach would be to benchmark each method on the production data.

One interesting difference is that with the single table insert method you can control the order of insert into the hierarchy table, which might be important if you wanted to physically cluster particular rows together. I don’t think that there is a deterministic way of controlling the insert order in the multitable insert.

Anyway, that was it. No rocket science, just an idle thought.

Posted in Data Warehousing, Oracle, Performance | 1 Comment »

How to Create Multiple Indexes in a Single DDL Statement

Posted by David Aldridge on 2008-03-26

First, raise an enhancement request with your database vendor …

Yes, don’t get excited — I don’t know of a way to do it. I wish there was though, and the ability to rebuild multiple indexes (or partitions of different indexes) at the same time would be useful because …

Well the problem is in the domain of data warehousing — frankly, all the interesting problems are there ;) — and it arises in ETL operations when we want to load to an unindexed tables and then create or rebuild multiple indexes or index partitions on that data. With each of those index operations we need a full table or partition scan (you know, let’s just assume the “or partition” bit from now on) and for a large data set that may mean very high physical read rates are required.  Although we can avoid or mitigate the physical reads by building multiple indexes at the same time in different sessions using dbms_jobs or dbms_scheduler and relying on caching in the SGA or at a hardware level, we still incur the logical reads.

So I’d like to be able to create multiple indexes based on a single scan of the data segment. A “Create Indexes” statement in which some slave process spits out the required columns and rowids to a set of processes which handle the sorts etc.. Even if it meant writing a temporary (compresed?) data set of the required columns that then gets scanned multiple times then as long as we’re not indexing the majority of the columns on the table we could still see a net benefit.

I really can’t think of a downside to this. Of course, you could try to build fifty indexes on a fact table in a single operation and end up cooking your CPU’s and/or blowing out your temporary disk space and/or memory and incurring horrendous levels of the dreaded multipass sort, but so much of what we do is finding that sweet spot between two extremes anyway. Maybe building those fifty indexes in ten groups of five would reduce the number of reads of the data segment by a factor of five without spilling the sorts to disk, and would also put some decent load on those pricey CPU’s.

Enhancement request time again, then.

Actually, I suppose you can create mulitple indexes in a single DDL statement …

Create table my_table
  (col1 constraint xpk_my_table primary key,
  col2 constraint unq_my_table)
as
select blah
from whatdjamacallit;

Cheating really though, and I don’t suppose that it avoids rescanning the data segment. If I had a little time I’d test that with a bit of sql tracing.

“How do you create multiple indexes with a single statement” … now that’s quite the interesting interview question.

Posted in Data Warehousing, Oracle, Performance | 5 Comments »

Big News … round here, anyway

Posted by David Aldridge on 2008-03-24

So it turns out that the proposed merger between XM Satellite Radio (my employer) and Sirius Satellite Radio was just approved by the Department of Justice. Just the FCC to go now.

Interesting times ahead … and a bunch of stuff I was working on is probably going to get cancelled, with a bit of luck. On the other hand, a different bunch of really complex stuff is certainly going to go ahead. You win some, you lose some.

Oh well, at least it’s good to have an idea where you’re going.

Posted in Personal | 4 Comments »

Interviewing: The difference between Delete and Truncate in Oracle

Posted by David Aldridge on 2008-03-17

I interviewed a couple of senior ETL developer candidates last week, so I shook the dust off of some of my tried-and-true probes for technical knowledge.

I’ve always had a liking for asking for the differences between “truncate” and “delete” ever since I was asked it myself some years ago. I think it allows candidates to demonstrate a nice range of knowledge.

Here are some answers in no particular order.

  1. “Truncate is fast, delete is slow”. Hmmm, well that’s mostly true I guess. The most basic or all answers.
  2. “Truncate is DDL, delete is DML”. Is this just so obvious that people don’t mention it?
  3. “You can’t rollback a truncate”. Indeed. See 2 above, of course. And 1.
  4. “Truncate is implicitly a commit”. A better answer than 3 I think. I seem to recall that someone once mentioned that there are actually two commits in a truncate, one before and one after … but maybe I dreamed that. I should test it really.
  5. “You can’t grant permission to truncate a table”. Ah, practical experience shines through. If you don’t like your ETL process to connect as the owner of the schema then this is a challenge that has to be overcome with stored procedures or something sophisticated like that. You really don’t want to grant “DROP ANY TABLE” to your ETL user.
  6. “You can delete any subset of rows, but you can only truncate the complete table, or a partition or subpartition of it”. Is this also so obvious that nobody mentions it?
  7. “Truncate makes unusable indexes usable again”. A real gotcha for the unwary. If you attempt to optimise a data load by rendering indexes unusable and truncating a table (possibly followed by an index rebuild and a partition exchange) then be careful of the order.
  8. “Truncate can’t maintain foreign keys”. It’s “cascading delete”, not “cascading truncate”. That would be an interesting feature though — point 4 above would make it a little trickier. Truncating an index cluster is pretty close to a “cascading truncate” to a limited extent though. In any case no truncate is permitted on a table referenced by foreign keys.
  9. “You can’t flashback a truncate”. This is an oddity to me. We can flashback a “drop table”, rollback uncommited deletes, or use flashback to recover pre-commit deleted data, but a truncate is a barrier across which we cannot flashback.
  10. “Truncate deallocates space, delete doesn’t”. Unless you want it not to, using the “reuse storage” clause. However the high water mark is reset in either case so maybe that’s a better answer …
  11. “Truncate resets the high water mark, delete doesn’t”. And on the indexes, also. See 7 above also.
  12. “Truncate allows special treatment of materialized view logs”. More of that practical experience and/or documentation-reading skill shows through.
  13. “DML triggers do not fire on a truncate”. Because … um … it’s DDL not DML.

I suppose there are other issues as well, but this shows that it’s a complex business.  I really wish we could grant “Truncate table” to a user or role … enhancement request time I think.

Isn’t it interesting that in some circumstances only one method can be used to remove all rows from a table in a single step, sometimes both methods can. In other circumstances both methods can remove a subset of rows from a table, and sometimes both can.

Oh well, back to work.

Posted in Data Warehousing, Materialized Views, Oracle, Partitioning, Performance | 22 Comments »

TIMESTAMPs, Old Fogeys, and Data Warehouses

Posted by David Aldridge on 2008-03-07

I scanned the structure of a data warehouse staging area yesterday, and developed a mild nervous tic. The cause: finding TIMESTAMP(6) data types scattered around willy-nilly.

Firstly, my natural caution and fear of the unknown was aroused by seeing that microsecond timings were being stored, even if it was to log times of inserts and updates of rows and not related to actual business data. My preference is usually to log all changes associated with a single load process with a single timestamp — in Informatica I derive that from the session timestamp so that it is a little easier to lookup the mapping execution that performed the load. If I want to know about performance of the order in which rows are inserted or updated then I’d probably go for turing on verbose logging in the workflow — I’ve rarely found the need to do so.

Secondly, and this is a more long-lasting and disturbing effect, have I officially turnded into an old fogey now? “I’ll hold no truck with these new-fangled data types … DATE was good enough for me, and it’ll be good enough for these young whippersnaper … if the good lord had intended us to measure in microseconds …” etc etc. Maybe I’m still having trouble letting go of the SQL*Plus COPY command, which I used just yesterday to move data from one test database to another, and which probably saved me about a week of messing around otherwise. TIMESTAMP is not supported by the sadly-deprecated COPY command.

Posted in Data Warehousing, Oracle | 8 Comments »

ApEx 3.1 New Feature Causes Mind to Boggle

Posted by David Aldridge on 2008-03-06

The specific cause of the bogglation is the “Interactive Reporting” feature, which you can take for a test drive here. Click on the “Interactive Reports” tab and then poke the “View Customized Interactive Reports” button.  Integrating Flashback Query (click on the little cog wheel) is a little stroke of genius, and a testament to the integration of ApEx with Oracle’s core features, and I think the ApEx developers must be having far too much fun with this project.

We’re standing up ApEx v3 here for managing reference data as soon as we can get the development environment available (shouldn’t be many more months now!) and the new features of 3.1 seem pretty compelling. Interactive Reporting, PDF’s and Emailing Attachments are a powerful combination.

I am also amused that Oracle not only has added another reporting tool to its stable, but that it shares a common name with Oracle’s Hyperion Interactive Reporting — is anyone Googling these names before they assign them?

Posted in Oracle | 7 Comments »

Buffer Busy Waits

Posted by David Aldridge on 2008-03-05

Here’s a curiosity. Or a banal observation, depending on your perspective.

There are, broadly speaking, two different causes of buffer busy waits. They are nicely described in the documentation here.

  1. Waiting for a block to be read into the buffer cache by a different session.
  2. Waiting for a block in memory to become available in some way (maybe it’s being modified by another session, for example)

The second of these is likely to be observed in an OLTP system, the first of them is more likely in a reporting system with multiple nonparallel full table/partition scans.

The curiosity is that the cure for these two causes are exactly the opposite of each other. To reduce the number of buffer busy waits you either have to increase the number of rows per block (to reduce the number of blocks needed to fulfill a full table scan) or reduce the number of rows per block (to reduce the chance of multiple sessions requiring rows that are in the same block).

So what to make of a consultant who sees a high number of buffer busy waits on a read-only reporting system, and advises that you increase pctfree or partition the table (sic) to reduce the number of rows per block?

Incidentally, I have a headache. Possibly it is caused by spending so much of the day frowning.

Posted in Data Warehousing, Oracle, Performance | 10 Comments »