ETL Powered by Rice and Beans

I’m told that it’s very important to keep one’s brain in shape as one grows older. Or “drifts gently towards the twilight” as I prefer to think of it. Trying to look on the optimistic side, obviously. Fortunately I end each day with my brain wrung out like a well-used sponge, courtesy of some ETL development work on a largeish data warehouse project.

Although the regular ETL development is outsourced, we are responsible for the coding that processes historical data, so on deployment we will run our historical load code and then the “other team’s” regular load will take over. Yes, it is a strange arrangement. And by “we” I really mean my colleagues somewhere in the pestilent jungles of Brazil, surrounded by poisonous frogs and who-knows-what other kind of fearsome thing, and living on a diet of rice and beans in their houses on stilts.

My own role is not so bad. I don’t have to get too involved in the detailed business logic that the code has to implement, and thank goodness for that because frankly I’ve not seen anything so complex for … well, forever. So I sit here on a metaphorically lofty perch, making myself popular by criticising the hard work of others and encouraging them to greater efforts by calling them lazy if they stop work before midnight (the traditional management role). Yes, it’s a good life alright. I constantly remind them that the real aim of the project is to produce a process for loading five years of historical data that is faster than the “B-Team’s” load of a single day.

I think our methodology works quite well though. When you are down in the weeds of the logic (did I mention that it is particularly complex logic?) it is always difficult to see the slightly larger picture of where you’re taking the data next, how you can optimise specific ETL stages, and how you can adjust the overall process to make everything more efficient. It’s a little like pair programming, I guess.

And we do get to do the fun bit — 10 million records joining 0-M:0-N through eight OR’d join conditions to 10 million other records, with a couple of billion joins resulting that we then artfully winnow down to a few million. The optimization process is a little mind bending, with all sorts of issues that we wouldn’t normally worry about too much becoming vitally important in an effort to avoid blowing out the host server’s memory with a degree of parallelism of 40. I have to say that there was initially some FUD over the historical process, with estimates of the load duration being bandied about that were in the order of months — we never thought it would take more than five days, and it’s now looking like it’ll run in under one day, which means that we’d be able to start the daily load process directly afterwards without altering it. And I think we still have a few more tricks up our sleeves as well in case we need to buy an extra hour somehow.

In fact it’s turning into something of a stroll through the SQL Reference book. The list of features we’re touching on includes analytic functions (which have become a second language for everyone now, along with the “PCTFREE 0 NOLOGGING COMPRESS” which I nearly appended to my sandwich order the other day), connect by clauses, and even some rather exotic stuff with finding the lowest set bit in a something-or-other that I don’t understand yet but which Heitor thought of while lying in bed one night and which he used to help reduce an eight hour process to fifteen minutes. They say that they don’t have siestas in Brazil, but maybe they should start.

 

Oh well, another morning … Brazil is online … back to work!

Indexing Options for Change Data Capture

I just posted this question to the Oracle list, but thought I might post it here as well in the hope that it reaches a wider audience. It may also be a situation that others in the DW arena have faced before.

 

I have a large and busy OLTP table, 100GB or so, against which there is a need to capture changes. Until an asynchronous CDC solution is in place we have to rely on two columns, create_dt and update_dt (null until the first update), both of DATE type of course.

These are currently unindexed, but there is a desire to index them to improve change capture performance for queries such as:

 
select …
from   …
where     (create_dt >= trunc(sysdate)-1 and create_dt < trunc(sysdate))
       or (update_dt >= trunc(sysdate)-1 and update_dt < trunc(sysdate))
 
The desire is obviously to provide the maximum performance benefit while reducing the impact on the OLTP system.
 
I thought of four different indexing options:
 
i) Indexing the columns separately, leading in the best case to an unusual execution plan where the indexes are range scanned and the results merged before accessing the table.
ii) A single composite index (create_dt, update_dt), leading to a fast full index scan.
iii) A single composite index (update_dt, create_dt), rewriting the query predicate as …
   (create_dt >= trunc(sysdate)-1 and create_dt < trunc(sysdate) and update_dt is null)
or (update_dt >= trunc(sysdate)-1 and update_dt < trunc(sysdate))
… and leading to two index range scans (I’m not sure about this one).
iv) A single-column function-based index on (nvl(update_dt, create_dt)), rewriting the predicate appropriately.
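
As a rough illustration of option iv (table and index names are hypothetical), the function-based index and the rewritten predicate would look something like this — noting that, like option iii, it assumes a row that has been updated should be captured on its update date rather than its creation date:

-- hypothetical names: the FBI collapses the two dates into a single indexed expression
create index big_oltp_table_chg_fbi
   on big_oltp_table (nvl(update_dt, create_dt))
   online;

select *
from   big_oltp_table
where  nvl(update_dt, create_dt) >= trunc(sysdate)-1
and    nvl(update_dt, create_dt) <  trunc(sysdate);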
Whichever of these is chosen, the growth pattern for the data naturally tends towards index contention, as all the new values cluster around sysdate.
 
So the nub of my problem comes down to these questions:
 
Is it possible to mitigate the index contention problem with a suitably high initrans value? Does it seem likely that implementing a reverse key index would be worthwhile in reducing contention? (I could probably take the pain of the fast full scan if it were, although it would lead to a larger index due to the 50/50 block splits.)
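
For reference, the reverse key variant of (say) option ii might be declared something like this, with a higher INITRANS at the same time — names again purely illustrative:

-- REVERSE scatters the right-hand-edge inserts across the leaf blocks,
-- and a higher INITRANS gives each block more ITL slots for concurrent sessions
create index big_oltp_table_chg_ix
   on big_oltp_table (create_dt, update_dt)
   reverse
   initrans 16;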
 
Would you regard implementing a function-based index as risky in any significant way on major tables of a busy OLTP system? How about a reverse function based index?
 
Basically, “what would you do”?
 
Maybe I’m overthinking it, but I’d like to go to the system developers with a fully thought out set of options (even if they do tell me to take a running jump anyway).

 

 

250,000 Tests for Uniqueness Per Second — Ain’t No Biggie

When you’re designing ETL processes the mantra of “do it in SQL” really gets ingrained in your way of thinking. PL/SQL-based processing as a method of handling business logic is anathema, as the principle of “row-by-row equals slow-by-slow” is very well highlighted by the processing of millions of rows.

Very recently though I found a case where this principle had to be violated, and with performance results that were not at all unpleasant. I was stumped by a SQL problem that I thought would be amenable to analytic functions, but I couldn’t think of a way of doing it. Here’s the situation.

I have two sets of data representing, let us say, sources and targets. Sources and targets only have one attribute each, which is a date in each case, and sources and targets can be joined in a many-to-many relationship. However only one join is valid per source and only one is valid per target, and we have to form connections between sources and targets on a one-to-one basis. We work out which one of the possibly-many joins is the valid connection according to the following logic.

We start with the most recent source. We work out the earliest target to which it is joined, and discard all other targets to which it is joined (if any). The target to which it connected is now ineligible for any other source to connect to it. Then we take the next most recent source and we work through the targets in ascending target-date order and we connect to the first (ie. earliest) target available, discarding all others. The only reason why a target might not be available is that we already connected a source to it.

One way of looking at this is to imagine that the set of joined sources and targets are ordered by source-date descending and then target-date ascending. Examining each pair in turn we ask two questions:

  1. “Has this source already connected to a target?”
  2. “Has this target already connected to a source?”

If the answer to both of those is “No” then we take that join as the valid connection and we move on to the next pair.

It’s very easy to construct a sample data set for this problem as well, by simply joining a list of integers to itself to form pairs such as (1,1), (1,2), (1,3), … , (1000,1000) and ordering the output with a query such as this:

select s.num s_num,t.num t_num from 
(select rownum num from dual connect by level <= 1000) s, 
(select rownum num from dual connect by level <= 1000) t 
order by 1 desc, 2 asc 
/ 
 

Given that data set the first pair is (1000,1), which becomes the first connection. The next connection will be (999,2), the next will be (998,3) etc.. It’s important to note that the real data set is not like that though, where every source connects to every target, so special conditions in this data set do not apply to the real one.

By the way, there are some special conditions that are easy even in the real data — if a row represents the first occurrence for both the source and the target then that is always going to be the valid connection. Um … that’s actually the only special condition I could think of. There may be some very clever mathematical way of working out others, but I don’t know what it is.

So, the first solution I came up with was to use PL/SQL to read each row in order and attempt to insert into a table having unique constraints on the source_id and target_id columns, handling any uniqueness error. The handling of the error is the only virtue that PL/SQL brings to this case. It was functional, but it was slow because of course it requires a lot of logical reads and writes in the index maintenance. I considered an insert with an error-logging clause by the way, but that doesn’t handle unique or primary key constraint errors and if it did it would still have the same problem of high logical reads and writes.

The next solution was to maintain a pair of PL/SQL associative arrays to handle the detection of uniqueness, and that worked well except that I was using conventional path inserts for the result set, and I didn’t fancy coding up bulk inserts. I know enough PL/SQL to be dangerous but it’s not my native language.

My (currently) final solution was to embed the PL/SQL into a package so that a function to detect uniqueness could be embedded in a SQL statement. I was pretty concerned about the context switch to PL/SQL, and the logic means that this process can’t be parallelised, but the result surprised me. When I streamed 100,000,000 rows through the process to detect 10,000 valid connections the process completed in 770 seconds, so that was at least 250,000 uniqueness tests per second, counting the source and target checks for each row separately. I can’t imagine that being achieved based on detecting violations of a unique constraint.

Here’s the package I used:

CREATE OR REPLACE PACKAGE ss
is
  TYPE NumTyp IS TABLE OF NUMBER INDEX BY pls_integer;
  src_tab   NumTyp;
  tgt_tab   NumTyp;
  empty_tab NumTyp;
  PROCEDURE reset;
  FUNCTION is_new (s_num IN NUMBER, t_num IN NUMBER) RETURN NUMBER;
end;
/

CREATE OR REPLACE PACKAGE BODY ss
is
  PROCEDURE reset
  is
  BEGIN
    -- discard the remembered source and target values between runs
    src_tab := empty_tab;
    tgt_tab := empty_tab;
  end;

  FUNCTION is_new (s_num IN NUMBER, t_num IN NUMBER)
  RETURN NUMBER
  is
  BEGIN
    -- reject the pair if either side has already been connected;
    -- otherwise record both sides and accept it as the valid connection
    IF src_tab.exists(s_num) OR tgt_tab.exists(t_num)
    then
      RETURN 0;
    else
      src_tab(s_num) := 1;
      tgt_tab(t_num) := 1;
      RETURN 1;
    end if;
  end;
end;
/
 

The complete implementation can be tested as follows:

 exec ss.reset; 
        
select * from        
(select s.num s_num,t.num t_num 
from   (select rownum num from dual connect by level <= 1000) s, 
       (select rownum num from dual connect by level <= 1000) t 
where rownum > 0 
order by 1 desc, 2 asc) 
where ss.is_new(s_num,t_num) = 1 
/

The call to ss.reset purges the arrays, and must be executed between runs. The rownum > 0 predicate is just the poor man’s way of ensuring that the call to ss.is_new is not pushed into the inline view — Oracle obviously doesn’t know that the row ordering is a critical part of the logic.

Obviously an INSERT (with append hint to invoke direct path) can be prepended to that to perform the actual ETL task.
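
For example, with a hypothetical target table the complete load might look something like this (remembering to call ss.reset first):

insert /*+ append */ into source_target_map (source_id, target_id)
select s_num, t_num
from  (select s.num s_num, t.num t_num
       from   (select rownum num from dual connect by level <= 1000) s,
              (select rownum num from dual connect by level <= 1000) t
       where  rownum > 0
       order by 1 desc, 2 asc)
where ss.is_new(s_num, t_num) = 1
/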

Multitable Insert and Distinct Values

I puzzled a little on the way to work about efficiently populating a hierarchy table based on a flat dimension. (“Why?” … seriously, you just don’t want to know that).

So let us suppose you have a table like this:

create table dim 
  (city varchar2(30), 
  state varchar2(30), 
  region varchar2(30), 
  country varchar2(30)) 
/

… and data like this … 

insert into dim values ('Fairfax'         ,'Virginia','US-East'   ,'USA'  ); 
insert into dim values ('Merrifield'      ,'Virginia','US-East'   ,'USA'  ); 
insert into dim values ('Alexandria'      ,'Virginia','US-East'   ,'USA'  ); 
insert into dim values ('Portland'        ,'Maine'   ,'US-East'   ,'USA'  ); 
insert into dim values ('Colorado Springs','Colorado','US-Central','USA'  ); 
insert into dim values ('Denver'          ,'Colorado','US-Central','USA'  ); 
insert into dim values ('Longmont'        ,'Colorado','US-Central','USA'  ); 
insert into dim values ('Salt Lake City'  ,'Utah'    ,'US-Central','USA'  ); 
insert into dim values ('Rome'            ,'Lazio'   ,'IT-Central','Italy');

… and you want to turn it into a hierarchy in the following table …

create table hier 
(attribute varchar2(30), 
parent varchar2(30), 
lvl varchar2(30)) 
/

You could do it as follows:

insert into hier 
select          city   , state  , 'City'    from dim 
union all 
select distinct state  , region , 'State'   from dim 
union all 
select distinct region , country, 'Region'  from dim 
union all 
select distinct country, null   , 'Country' from dim 
/

No distinct on the first select clause because we assume that City is unique (it would be if it were the PK on a real dimension, of course). In the absence of indexes to support fast full index scans we have four full table scans and three operations to support the distinct operator …

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | INSERT STATEMENT    |      |    25 |   418 |    11  (82)| 00:00:01 |
|   1 |  UNION-ALL          |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL | DIM  |     9 |   171 |     2   (0)| 00:00:01 |
|   3 |   HASH UNIQUE       |      |     9 |   162 |     3  (34)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| DIM  |     9 |   162 |     2   (0)| 00:00:01 |
|   5 |   HASH UNIQUE       |      |     5 |    75 |     3  (34)| 00:00:01 |
|   6 |    TABLE ACCESS FULL| DIM  |     9 |   135 |     2   (0)| 00:00:01 |
|   7 |   HASH UNIQUE       |      |     2 |    10 |     3  (34)| 00:00:01 |
|   8 |    TABLE ACCESS FULL| DIM  |     9 |    45 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

A pretty clean looking piece of SQL, but multiple scans of tables make my hackles rise. If this were a very large table and indexes couldn’t be leveraged then it could be a performance problem.

An alternative occurred to me:

insert all 
when 1               = 1 then into hier values (city   ,state  ,'City'   ) 
when state_row_ind   = 1 then into hier values (state  ,region ,'State'  ) 
when region_row_ind  = 1 then into hier values (region ,country,'Region' ) 
when country_row_ind = 1 then into hier values (country,null   ,'Country') 
select city, 
       state, 
       region, 
       country, 
       row_number() over (partition by region  order by 1) region_row_ind, 
       row_number() over (partition by state   order by 1) state_row_ind, 
       row_number() over (partition by country order by 1) country_row_ind 
from dim 
/

A single scan of the dim table is performed, with three window sorts:

--------------------------------------------------------------------------------
| Id  | Operation               | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | INSERT STATEMENT        |      |     9 |   963 |     5  (60)| 00:00:01 |
|   1 |  MULTI-TABLE INSERT     |      |       |       |            |          |
|   2 |   INTO                  | HIER |       |       |            |          |
|   3 |   INTO                  | HIER |       |       |            |          |
|   4 |   INTO                  | HIER |       |       |            |          |
|   5 |   INTO                  | HIER |       |       |            |          |
|   6 |    VIEW                 |      |     9 |   963 |     5  (60)| 00:00:01 |
|   7 |     WINDOW SORT         |      |     9 |   288 |     5  (60)| 00:00:01 |
|   8 |      WINDOW SORT        |      |     9 |   288 |     5  (60)| 00:00:01 |
|   9 |       WINDOW SORT       |      |     9 |   288 |     5  (60)| 00:00:01 |
|  10 |        TABLE ACCESS FULL| DIM  |     9 |   288 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------

The cost of the two methods is exactly the same, but the cardinalities are both out — the correct answer is “19”. Of the two methods I’d expect the first to be more accurate and it did do pretty well. With such a small amount of data it’s difficult to make an accurate comparison, and the proper approach would be to benchmark each method on the production data.

One interesting difference is that with the single table insert method you can control the order of insert into the hierarchy table, which might be important if you wanted to physically cluster particular rows together. I don’t think that there is a deterministic way of controlling the insert order in the multitable insert.

Anyway, that was it. No rocket science, just an idle thought.

How to Create Multiple Indexes in a Single DDL Statement

First, raise an enhancement request with your database vendor …

Yes, don’t get excited — I don’t know of a way to do it. I wish there was though, and the ability to rebuild multiple indexes (or partitions of different indexes) at the same time would be useful because …

Well the problem is in the domain of data warehousing — frankly, all the interesting problems are there ;) — and it arises in ETL operations when we want to load to an unindexed table and then create or rebuild multiple indexes or index partitions on that data. With each of those index operations we need a full table or partition scan (you know, let’s just assume the “or partition” bit from now on) and for a large data set that may mean very high physical read rates are required.  Although we can avoid or mitigate the physical reads by building multiple indexes at the same time in different sessions using dbms_jobs or dbms_scheduler and relying on caching in the SGA or at a hardware level, we still incur the logical reads.
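
In the meantime, that concurrent-sessions workaround can be sketched with dbms_scheduler along these lines (index names and DDL are purely illustrative):

begin
  -- one run-once job per index; the jobs execute concurrently and each
  -- full-scans the (hopefully cached) fact segment
  dbms_scheduler.create_job(
    job_name   => 'build_fact_ix1',
    job_type   => 'PLSQL_BLOCK',
    job_action => 'begin execute immediate ''create index fact_ix1 on big_fact (col1) nologging''; end;',
    enabled    => true);
  dbms_scheduler.create_job(
    job_name   => 'build_fact_ix2',
    job_type   => 'PLSQL_BLOCK',
    job_action => 'begin execute immediate ''create index fact_ix2 on big_fact (col2) nologging''; end;',
    enabled    => true);
end;
/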

So I’d like to be able to create multiple indexes based on a single scan of the data segment. A “Create Indexes” statement in which some slave process spits out the required columns and rowids to a set of processes which handle the sorts etc.. Even if it meant writing a temporary (compressed?) data set of the required columns that then gets scanned multiple times, as long as we’re not indexing the majority of the columns on the table we could still see a net benefit.

I really can’t think of a downside to this. Of course, you could try to build fifty indexes on a fact table in a single operation and end up cooking your CPUs and/or blowing out your temporary disk space and/or memory and incurring horrendous levels of the dreaded multipass sort, but so much of what we do is finding that sweet spot between two extremes anyway. Maybe building those fifty indexes in ten groups of five would reduce the number of reads of the data segment by a factor of five without spilling the sorts to disk, and would also put some decent load on those pricey CPUs.

Enhancement request time again, then.

Actually, I suppose you can create multiple indexes in a single DDL statement …

create table my_table
  (col1 constraint xpk_my_table primary key,
   col2 constraint unq_my_table unique)
as
select blah
from whatdjamacallit;

Cheating really though, and I don’t suppose that it avoids rescanning the data segment. If I had a little time I’d test that with a bit of SQL tracing.
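
For anyone who beats me to it, a minimal tracing harness would be something like this:

alter session set tracefile_identifier = 'ctas_multi_index';
exec dbms_monitor.session_trace_enable(waits => true, binds => false)

-- ... run the CTAS-with-constraints here ...

exec dbms_monitor.session_trace_disable
-- then run tkprof on the trace file and count the scans of the source segment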

“How do you create multiple indexes with a single statement” … now that’s quite the interesting interview question.

Interviewing: The difference between Delete and Truncate in Oracle

I interviewed a couple of senior ETL developer candidates last week, so I shook the dust off of some of my tried-and-true probes for technical knowledge.

I’ve always had a liking for asking for the differences between “truncate” and “delete” ever since I was asked it myself some years ago. I think it allows candidates to demonstrate a nice range of knowledge.

Here are some answers in no particular order.

  1. “Truncate is fast, delete is slow”. Hmmm, well that’s mostly true I guess. The most basic of all answers.
  2. “Truncate is DDL, delete is DML”. Is this just so obvious that people don’t mention it?
  3. “You can’t rollback a truncate”. Indeed. See 2 above, of course. And 1.
  4. “Truncate is implicitly a commit”. A better answer than 3 I think. I seem to recall that someone once mentioned that there are actually two commits in a truncate, one before and one after … but maybe I dreamed that. I should test it really.
  5. “You can’t grant permission to truncate a table”. Ah, practical experience shines through. If you don’t like your ETL process to connect as the owner of the schema then this is a challenge that has to be overcome with stored procedures or something sophisticated like that. You really don’t want to grant “DROP ANY TABLE” to your ETL user.
  6. “You can delete any subset of rows, but you can only truncate the complete table, or a partition or subpartition of it”. Is this also so obvious that nobody mentions it?
  7. “Truncate makes unusable indexes usable again”. A real gotcha for the unwary. If you attempt to optimise a data load by rendering indexes unusable and truncating a table (possibly followed by an index rebuild and a partition exchange) then be careful of the order.
  8. “Truncate can’t maintain foreign keys”. It’s “cascading delete”, not “cascading truncate”. That would be an interesting feature though — point 4 above would make it a little trickier. Truncating an index cluster is pretty close to a “cascading truncate” to a limited extent though. In any case no truncate is permitted on a table referenced by foreign keys.
  9. “You can’t flashback a truncate”. This is an oddity to me. We can flashback a “drop table”, rollback uncommitted deletes, or use flashback to recover pre-commit deleted data, but a truncate is a barrier across which we cannot flashback.
  10. “Truncate deallocates space, delete doesn’t”. Unless you want it not to, using the “reuse storage” clause. However the high water mark is reset in either case so maybe that’s a better answer …
  11. “Truncate resets the high water mark, delete doesn’t”. And on the indexes, too. See 7 above.
  12. “Truncate allows special treatment of materialized view logs”. More of that practical experience and/or documentation-reading skill shows through.
  13. “DML triggers do not fire on a truncate”. Because … um … it’s DDL not DML.

I suppose there are other issues as well, but this shows that it’s a complex business.  I really wish we could grant “Truncate table” to a user or role … enhancement request time I think.
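
In the meantime the usual workaround for point 5 is a definer-rights procedure owned by the schema owner, something like this sketch (names are illustrative):

-- owned by the schema owner; the ETL user is granted EXECUTE on it only
create or replace procedure truncate_stage_table (p_table_name in varchar2)
authid definer
is
begin
  -- dbms_assert guards against anything other than a genuine object name
  execute immediate 'truncate table ' || dbms_assert.sql_object_name(p_table_name);
end;
/

grant execute on truncate_stage_table to etl_user;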

Isn’t it interesting that in some circumstances only one method can be used to remove all the rows from a table in a single step, and sometimes both methods can; and that in some circumstances only one method can remove a subset of the rows from a table, and sometimes both can.

Oh well, back to work.

Buffer Busy Waits

Here’s a curiosity. Or a banal observation, depending on your perspective.

There are, broadly speaking, two different causes of buffer busy waits. They are nicely described in the documentation here.

  1. Waiting for a block to be read into the buffer cache by a different session.
  2. Waiting for a block in memory to become available in some way (maybe it’s being modified by another session, for example).

The second of these is likely to be observed in an OLTP system; the first is more likely in a reporting system with multiple non-parallel full table/partition scans.

The curiosity is that the cures for these two causes are exactly the opposite of each other. To reduce the number of buffer busy waits you either have to increase the number of rows per block (to reduce the number of blocks needed to satisfy a full table scan) or reduce the number of rows per block (to reduce the chance of multiple sessions requiring rows that are in the same block).
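
Concretely, and with illustrative table names, the two opposing cures amount to something like:

-- read-mostly reporting segment: pack the rows tightly so full scans touch fewer blocks
alter table sales_fact move pctfree 0;

-- contended OLTP segment: leave more free space so fewer sessions collide on each block
alter table orders move pctfree 40;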

So what to make of a consultant who sees a high number of buffer busy waits on a read-only reporting system, and advises that you increase pctfree or partition the table (sic) to reduce the number of rows per block?

Incidentally, I have a headache. Possibly it is caused by spending so much of the day frowning.

Oracle Siebel Analytics Data Warehouse: Part II

Following on from this, it occurred to me this morning that accurate system statistics may be an important performance factor for us, with all of these indexing options that the designers so thoughtfully provided. *ahem*

This is because there is much evidence that the SANs we connect to are tuned for OLTP-style high i/o rate performance, rather than high-bandwidth BI performance. My usual instinct to promote full table scans in preference to index-based access methods when accessing a high proportion of a table has led me astray a couple of times recently, and it is specifically the balance between index-based single block reads and the multiblock reads associated with table/partition scans (hypothetically supposing that we have partitions) that I want the optimizer to get right.
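
If we do go down that route, a minimal sketch of gathering workload system statistics (given suitable privileges and a representative window of activity) would be:

exec dbms_stats.gather_system_stats(gathering_mode => 'START')
-- ... let a representative mixed workload run for a while ...
exec dbms_stats.gather_system_stats(gathering_mode => 'STOP')

-- the captured sreadtim/mreadtim/mbrc values are what tilt the optimizer
-- between single-block index reads and multiblock full scans
select pname, pval1 from sys.aux_stats$ where sname = 'SYSSTATS_MAIN';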

This very regular reason for statistics aside, we also have another interesting issue — a Very Important System that has a Very Particular Problem. Although a full-sized performance environment (PRF) is available (and used) before every release, the performance characteristics of the storage on the performance system are exactly the opposite of those on the production (PRD) system. Specifically, PRF is fast to read and slow to write, and PRD is slow to read and fast to write (or is it the other way round? I forget). The mount points are on different SANs for security reasons (things are generally buttoned-down very tightly here, as you’d probably expect of a company with numerous external partners and a very public website), so that probably accounts for the effect in some way, but the implications are rather interesting — if we want to improve PRD performance, we have to tune the statements in PRF that are already fast.

So creating indexes in PRF is fast, because the read is large and the write is small. In production … well, let us just say it’s a bit of a time consumer. Creating a large summary table with a high volume of data performs about the same in both systems but the critical path for performance is different.

System statistics sadly do not distinguish between read and write performance in 10g (afaik), but wouldn’t it be interesting if they did? It’s a little thought experiment for later … “What database design and optimizer decisions might be affected by different balances between read and write performance?”

Hmmm …

Oh, I upped the optimizer_dynamic_sampling from 2 to 4 … or at least I raised a Remedy ticket to have a DBA do it. Another symptom of a buttoned-down environment :D

Back to a More Simple Time

If we browse the Oracle Data Warehousing Guide we find a number of interesting database structures and features that can make our warehouses easier to load, faster to query, simpler to manage etc.. Partitioning, parallelism, bitmap indexes, data segment compression, materialized views and query rewrite, for example. It has been a long, long time since I implemented anything of any size without at least two or three of these.

So let us imagine, gentle reader, a return to the days of simplicity when we had none of these — how would that look? How would it perform? Well it turns out that it would look a lot like the data warehouse that comes out of the box with Siebel Marketing Analytics. No partitioning? Check! Only the most ordinary, uncompressed b-tree indexes? Check! Etc.. You get the idea. And how will it perform? Our lucky band of adventurers is about to find out, as we implement it with the bare minimum of modifications to add some custom data elements from our Siebel CRM system.

Some current areas of concern:

  • A fact table with 50 indexes that in development consume 1.5GB of space for every 1GB of fact table data. How is the drop-all-indexes-and-rebuild-after-every-load strategy going to turn out with production-sized data?
  • The use of updates against the largest fact table. No data segment compression!
  • No foreign keys and no bitmap indexes. Do star transformations even work with that?
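
On that last question: star transformation relies on bitmap indexes on the fact table join columns rather than on declared foreign key constraints, so the kind of retrofit I have in mind would look something like this (table and column names are illustrative only):

create bitmap index w_revn_f_cust_bmx on w_revn_f (customer_wid) nologging;

alter session set star_transformation_enabled = true;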

On the plus side we do have a performance testing environment sized at around 80% of production data volumes, so we will have an opportunity to tweak the worst offenders. We do have parallelism and the licensing for everything that we might want to throw at the system, and we do have Quest Performance Analysis for Oracle already prepped on the test and production instances.

In preparation for performance testing I have a little mental checklist of what can be implemented quickly and quietly to tackle certain problems, such as replacing b-tree indexes with bitmaps, or a little partitioning, but it’s complexified by the presence of the “Datawarehouse Administration Console” (DAC), an apparently essential component of the architecture that is responsible for executing the 400+ Informatica workflows that load the staging and warehouse tables. The philosophy around it seems to be that anything platform-specific, such as dropping and creating indexes, is defined and controlled through DAC rather than through Informatica (where it is generally a little kludgy, even if it is serviceable). So we have some exploration to do on that product to see how it handles customisation of indexes (maybe it recreates them the way it found them, but I suspect some customisation is required to make it deal with creating “exotics” such as local bitmap indexes).

We go into integration testing this coming week, with performance testing to follow. Should be an interesting ride.