“See-qwel” or “S-Q-L”? Have your say on this highly important matter here, and help shape the future of our industry.
Archive for April, 2008
An Empassioned and Meaningless Debate
Posted by David Aldridge on 2008-04-25
Posted in Oracle | 11 Comments »
ETL Powered by Rice and Beans
Posted by David Aldridge on 2008-04-17
I’m told that it’s very important to keep one’s brain in shape as one grows older. Or “drifts gently towards the twilight” as I prefer to think of it. Trying to look on the optimistic side, obviously. Fortunately I end each day with my brain wrung out like a well-used sponge courtesy of some ETL development work on a largeish data warehouse development.
Although the regular ETL development is outsourced, we are responsible for the coding that processes historical data, so on deployment we will run our historical load code and then the “other team’s” regular load will take over. Yes, it is a strange arrangement. And by “we” I really mean my collegues somewhere in the pestilent jungles of Brazil, surrounded by poisonous frogs and who-knows-what other kind of other fearsome thing, and living on a diet of rice and beans in their houses on stilts.
My own role is not so bad. I don’t have to get too involved in the detailed busness logic that the code has to implement, and thank goodness for that because frankly I’ve not seen anything so complex for … well, forever. So I sit here in a metaphorically lofty perch, making myself popular by criticising the hard work of others and encouraging them to greater efforts by calling them lazy if they stop work before midnight (the traditional management role). Yes, it’s a good life alright. I constantly remind them that the real aim of the project is to produce a process for loading five years of historical data that is faster than the “B-Team’s” load of single day.
I think our methodology works quite well though. When you are down in the weeds of the logic (did I mention that it is particularly complex logic?) it is always difficult to see the slightly larger picture of where you’re taking the data next and how you optimise both specific ETL stages and how you can adjust the overall process to make everything more efficient. It’s a little like pair programming, I guess.
And we do get to do the fun bit — 10 million records joining 0-M:0-N through eight OR’d join conditions to 10 million other records, with a couple of billion joins resulting that we then artfully winnow down to a few million. The optimization process is a little mind bending, with all sorts of issues that we wouldn’t normally worry about too much becoming vitally important in an effort to avoid blowing out the host server’s memory with a degree of parallelism of 40. I have to say that there was initially some FUD over the historical process, with estimates of the load duration being bandied about that were in the order of months — we never thought it would take more than five days, and it’s now looking like it’ll run in under one day, which means that we’d be able to start the daily load process directly afterwards without altering it. And I think we still have a few more tricks up our sleeves as well in case we need to buy an extra hour somehow.
In fact it’s turning into something of a stroll through the SQL Reference book. The list of features we’re touching on include analytic functions (which have become a second language for everyone now, along with the “PCTFREE 0 NOLOGGING COMPRESS” which I nearly appended to my sandwich order the other day), connect by clauses, and even some rather exotic stuff with finding the lowest set bit in a something-or-other that I don’t understand yet but which Heitor thought of while lying in bed one night and which he used to help reduce an eight hour process to fifteen minutes. They say that they don’t have siestas in Brazil, but maybe they should start.
Oh well, another morning … Brazil is online … back to work!
Posted in Data Warehousing, Oracle, Performance | 10 Comments »
Choosing An Optimal Stats Gathering Strategy
Posted by David Aldridge on 2008-04-13
Greg Rahn has an excellent entry on statistics gathering and the use and abuse of dbms_stats and initialization parameters at the Structured Data blog, one of my favourites. I added a comment on dynamic sampling, of which I am a big proponent, and Greg has another recent post on a case where dynamic sampling saved the day. Hey, I commented on that too …
Posted in Oracle | 2 Comments »
Indexing Options for Change Data Capture
Posted by David Aldridge on 2008-04-08
I just posted this question to the Oracle list, but thought I might post it here in case that reaches a wider audience. It also may be a situation that others in the DW arena have faced before.
These are currently unindexed, but there is a desire to index them to improve change capture performance for queries such as:
Posted in Data Warehousing, Oracle, Performance | 26 Comments »
Stored Procedure are Wonderful/Evil
Posted by David Aldridge on 2008-04-07
The same old “stored procedures are wonderful/evil” debate is again being practiced at the Joel on Software forums.
Enjoy.
Posted in Forums, Oracle | Leave a Comment »
250,000 Tests for Uniqueness Per Second — Ain’t No Biggie
Posted by David Aldridge on 2008-04-02
When you’re designing ETL processes the mantra of “do it in SQL” realy gets engrained in your way of thinking. PL/SQL-based processing as a method of handling business logic is anathema, as the principle of “row-by-row equals slow-by-slow” is very well highlighted by the processing of millions of rows.
Very recently though I found a case where this principle had to be violated, and with performance results that were not at all unpleasant. I was stumped by a SQL problem that I thought would be amenable to analytic functions, but I couldn’t think of a way of doing it. Here’s the situation.
I have two sets of data representing, let us say, sources and targets. Sources and targets only have one attribute each, which is a date in each case, and sources and targets can be joined in a many-to-many relationship. However only one join is valid per source and only one is valid per target, and we have to form connections between sources and targets on a one-to-one basis. We work out which one of the possbily-many joins is the valid connection according to the following logic.
We start with the most recent source. We work out the earliest target to which it is joined, and discard all other targets to which it is joined (if any). The target to which it connected is now ineligible for any other source to connect to it. Then we take the next most recent source and we work through the targets in ascending target-date order and we connect to the first (ie. earliest) target available, discarding all others. The only reason why a target might not be available is that we already connected a source to it.
One way of looking at this is to imagine that the set of joined sources and targets are ordered by source-date descending and then target-date ascending. Examining each pair in turn we ask two questions:
-
“Has this source already connected to a target?”
-
“Has this target already connected to a source?”
If the answer to both of those is “No” then we take that join as the valid connection and we move on to the next pair.
It’s very easy to construct a sample data set for this problem as well, by simply joining a list of integers to itself to form pairs such as (1,1), (1,2), (1,3), … , (1000,1000) and ordering the output with a query such as this:
select s.num s_num,t.num t_num from (select rownum num from dual connect by level <= 1000) s, (select rownum num from dual connect by level <= 1000) t order by 1 desc, 2 asc /
Given that data set the first pair is (1000,1), which becomes the first connection. The next connection will be (999,2), the next will be (998,3) etc.. It’s important to note that the real data set is not like that though, where every source connects to every target, so special conditions in this data set do not apply to the real one.
By the way, there are some special conditions that are easy even in the real data – if a row represents the first occurance for both the source and the target then that is always going to be the valid connection. Um … that’s actually the only special condition I could think of. There may be some very clever mathematical way of working out others, but I don’t know what it is.
So, the first solution I came up with was to use PL/SQL to read each row in order and attempt to insert into a table having unique constraints on the source_id and target_id columns, handling any uniqueness error. The handling of the error is the only virtue that PL/SQL brings to this case. It was functional, but it was slow because of course it requires a lot of logical reads and writes in the index maintenance. I considered an insert with an error-logging clause by the way, but that doesn’t handle unique or primary key constraint errors and if it did it would still have the same problem of high logical reads and writes.
The next solution was to maintain a pair of PL/SQL associative arrays to handle the detection of uniqueness, and that worked well except that I was using conventional path inserts for the result set, and I didn’t fancy coding up bulk inserts. I know enough PL/SQL to be dangerous but it’s not my native language.
My (currently) final solution was to embed the PL/SQL into a package so that a function to detect uniqueness could be embedded in a sql statement. I was pretty concerned about the context switch to PL/SQL, and the logic means that this process can’t be parallelised, but the result suprised me. When I streamed 100,000,000 rows through the process to detect 10,000 valid connections the process completed in 770 seconds, so that was at least 250,000 uniqueness tests per second. I can’t imagine that being achieved based on detecting violations of a unique constraint.
Here’s the package I used. Sorry about the crappy formatting:
CREATE OR REPLACE PACKAGE ss is TYPE NumTyp IS TABLE OF NUMBER INDEX BY pls_integer; src_tab NumTyp; tgt_tab NumTyp; empty_tab numtyp; procedure reset; function is_new(s_num IN NUMBER, t_num NUMBER) return NUMBER; end; / CREATE OR REPLACE PACKAGE body ss is PROCEDURE reset is BEGIN src_tab := empty_tab; tgt_tab := empty_tab; end; function is_new (s_num NUMBER, t_num NUMBER) return number is BEGIN IF src_tab.exists(s_num) OR tgt_tab.exists(t_num) then RETURN 0; else src_tab(s_num) := 1; tgt_tab(t_num) := 1; RETURN 1; end if; end; end; /
The complete implementation can be tested as follows:
exec ss.reset; select * from (select s.num s_num,t.num t_num from (select rownum num from dual connect by level <= 1000) s, (select rownum num from dual connect by level <= 1000) t where rownum > 0 order by 1 desc, 2 asc) where ss.is_new(s_num,t_num) = 1 /
The call to ss.reset purges the arrays, and must be executed between runs. The rownum > 0 predicate is just the poor mans way of ensuring that the call to ss.is_new is not pushed into the inline view — Oracle obviously doesn’t know that the row ordering is a critical part of the logic.
Obviously an INSERT (with append hint to invoke direct path) can be prepended to that to perform the actual ETL task.
Posted in Data Warehousing, Oracle, Performance | 9 Comments »
Discarding the Output of a Select Statement
Posted by David Aldridge on 2008-04-01
I’ve wondered before about a convenient way to tackle a particular problem — you want to run a select for a big bunch of rows, maybe for measuring wait events for a tricky query, but you want to discard the output so you don’t incur the network traffic and waits to the client. You could use autotrace in sql*plus, but how about something more generic? I’ve seen queries pushed to an in-line view with a 1=0 predicate and a no_push_pred (?) hint, and even seen people erroneously replace the projected rows with a simple count(*) — the latter often affecting the execution plan as an unintended side-effect. But it’s not like you can just pipe it to /dev/null, and you want something pretty tool independent.
Then I thought of this:
insert all when 1=0 then into s_order select * from s_order where ... /
It simply discards all the rows projected from the select statement, and the optimizer doesn’t push the 1=0 condition into the select statement as a plan-modifying predicate either (10.2).
explain plan for insert into s_order select * from s_order / select * from table(dbms_xplan.display) / ----------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------- | 0 | INSERT STATEMENT | | 11M| 5052M| 446K (1)| 02:13:56 | | 1 | TABLE ACCESS FULL| S_ORDER | 11M| 5052M| 446K (1)| 02:13:56 | ----------------------------------------------------------------------------- explain plan for insert all when 1=0 then into s_order select * from s_order / select * from table(dbms_xplan.display) / ------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------- | 0 | INSERT STATEMENT | | 11M| 5052M| 446K (1)| 02:13:56 | | 1 | MULTI-TABLE INSERT | | | | | | | 2 | INTO | S_ORDER | | | | | | 3 | TABLE ACCESS FULL| S_ORDER | 11M| 5052M| 446K (1)| 02:13:56 | ------------------------------------------------------------------------------- explain plan for insert all when 1=0 then into s_order select * from s_order where 1=0 / select * from table(dbms_xplan.display) / -------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------- | 0 | INSERT STATEMENT | | 1 | 445 | 0 (0)| | | 1 | MULTI-TABLE INSERT | | | | | | | 2 | INTO | S_ORDER | | | | | |* 3 | FILTER | | | | | | | 4 | TABLE ACCESS FULL| S_ORDER | 11M| 5052M| 446K (1)| 02:13:56 | -------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - filter(NULL IS NOT NULL) explain plan for insert into s_order select * from s_order where 1=0 / select * from table(dbms_xplan.display) / ------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | INSERT STATEMENT | | 1 | 445 | 0 (0)| | |* 1 | FILTER | | | | | | | 2 | TABLE ACCESS FULL| S_ORDER | 11M| 5052M| 446K (1)| 02:13:56 | ------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(NULL IS NOT NULL)
So if you have a use for a filter on an insert statement that does not affect the optimizer’s execution plan (maybe for some esoteric optimizer problem?) then you could also use it for that.
Maybe there are more elegant solutions: after all this does require that you first make available a table with the correct columns and datatypes (with a CTAS and 1=0 condition, possibly). I bet there’s some other good ways though … ?
Posted in Oracle | 9 Comments »