When you’re designing ETL processes the mantra of “do it in SQL” really gets ingrained in your way of thinking. PL/SQL-based processing as a method of handling business logic is anathema, as the principle of “row-by-row equals slow-by-slow” is very well highlighted by the processing of millions of rows.
Very recently though I found a case where this principle had to be violated, and with performance results that were not at all unpleasant. I was stumped by a SQL problem that I thought would be amenable to analytic functions, but I couldn’t think of a way of doing it. Here’s the situation.
I have two sets of data representing, let us say, sources and targets. Sources and targets have only one attribute each, which is a date in each case, and sources and targets can be joined in a many-to-many relationship. However only one join is valid per source and only one is valid per target, and we have to form connections between sources and targets on a one-to-one basis. We work out which one of the possibly-many joins is the valid connection according to the following logic.
We start with the most recent source. We work out the earliest target to which it is joined, and discard all other targets to which it is joined (if any). The target to which it connected is now ineligible for any other source to connect to it. Then we take the next most recent source, work through its targets in ascending target-date order, and connect to the first (i.e. earliest) target available, discarding all others. The only reason why a target might not be available is that we already connected a source to it.
One way of looking at this is to imagine that the set of joined sources and targets are ordered by source-date descending and then target-date ascending. Examining each pair in turn we ask two questions:
“Has this source already connected to a target?”
“Has this target already connected to a source?”
If the answer to both of those is “No” then we take that join as the valid connection and we move on to the next pair.
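The logic above amounts to a single greedy pass over the ordered joins, keeping two “already connected” sets. Here is a minimal Python sketch of that pass (the function name `connect_pairs` and the use of Python sets are my own illustration, not part of the Oracle implementation):

```python
def connect_pairs(joins):
    """Greedy one-to-one matching over joins that are already ordered
    by source-date descending, then target-date ascending."""
    used_sources = set()  # sources that already have a connection
    used_targets = set()  # targets that already have a connection
    connections = []
    for source, target in joins:
        # The two questions from the text: has this source, or this
        # target, already connected?
        if source in used_sources or target in used_targets:
            continue  # "No" to either question fails; discard the join
        used_sources.add(source)
        used_targets.add(target)
        connections.append((source, target))
    return connections
```

The essential point, which carries over to the PL/SQL solution later on, is that each candidate pair costs only two set lookups and at most two set insertions.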
It’s very easy to construct a sample data set for this problem as well, by simply joining a list of integers to itself to form pairs such as (1,1), (1,2), (1,3), … , (1000,1000) and ordering the output with a query such as this:
select s.num s_num, t.num t_num
from   (select rownum num from dual connect by level <= 1000) s,
       (select rownum num from dual connect by level <= 1000) t
order  by 1 desc, 2 asc
/
Given that data set the first pair is (1000,1), which becomes the first connection. The next connection will be (999,2), the next will be (998,3), and so on. It’s important to note, though, that the real data set is not like this one, in which every source joins to every target, so special conditions that hold for this sample do not apply to the real data.
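As a quick sanity check (a hypothetical Python simulation, not the Oracle code), running the greedy pass over that all-pairs sample data does produce the (1000,1), (999,2), (998,3) diagonal:

```python
n = 1000
# All-pairs sample data, ordered source descending then target
# ascending, mirroring the SQL above.
joins = [(s, t) for s in range(n, 0, -1) for t in range(1, n + 1)]

used_s, used_t, connections = set(), set(), []
for s, t in joins:
    if s not in used_s and t not in used_t:
        used_s.add(s)
        used_t.add(t)
        connections.append((s, t))

print(connections[:3])  # → [(1000, 1), (999, 2), (998, 3)]
```

Every source finds a target here, so the simulation yields exactly 1,000 connections; that completeness is an artifact of the all-pairs sample and, as noted above, is not guaranteed for the real data.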
By the way, there are some special conditions that are easy even in the real data — if a row represents the first occurrence for both the source and the target then that is always going to be the valid connection. Um … that’s actually the only special condition I could think of. There may be some very clever mathematical way of working out others, but I don’t know what it is.
So, the first solution I came up with was to use PL/SQL to read each row in order and attempt to insert it into a table having unique constraints on the source_id and target_id columns, handling any uniqueness error. The handling of the error is the only virtue that PL/SQL brings to this case. It was functional, but it was slow, because of course it requires a lot of logical reads and writes for the index maintenance. I considered an insert with an error-logging clause, by the way, but that doesn’t handle unique or primary key constraint errors, and even if it did it would still have the same problem of high logical reads and writes.
The next solution was to maintain a pair of PL/SQL associative arrays to handle the detection of uniqueness, and that worked well except that I was using conventional path inserts for the result set, and I didn’t fancy coding up bulk inserts. I know enough PL/SQL to be dangerous but it’s not my native language.
My (currently) final solution was to embed the PL/SQL into a package so that a function to detect uniqueness could be embedded in a SQL statement. I was pretty concerned about the context switch to PL/SQL, and the logic means that this process can’t be parallelised, but the result surprised me. When I streamed 100,000,000 rows through the process to detect 10,000 valid connections the process completed in 770 seconds, which works out at roughly 130,000 uniqueness tests per second. I can’t imagine that being achieved based on detecting violations of a unique constraint.
Here’s the package I used:
CREATE OR REPLACE PACKAGE ss IS
   TYPE NumTyp IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
   src_tab   numtyp;
   tgt_tab   numtyp;
   empty_tab numtyp;
   PROCEDURE reset;
   FUNCTION is_new (s_num IN NUMBER, t_num IN NUMBER) RETURN NUMBER;
END;
/

CREATE OR REPLACE PACKAGE BODY ss IS
   PROCEDURE reset IS
   BEGIN
      src_tab := empty_tab;
      tgt_tab := empty_tab;
   END;

   FUNCTION is_new (s_num IN NUMBER, t_num IN NUMBER) RETURN NUMBER IS
   BEGIN
      IF src_tab.EXISTS(s_num) OR tgt_tab.EXISTS(t_num) THEN
         RETURN 0;
      ELSE
         src_tab(s_num) := 1;
         tgt_tab(t_num) := 1;
         RETURN 1;
      END IF;
   END;
END;
/
The complete implementation can be tested as follows:
exec ss.reset;

select *
from   (select s.num s_num, t.num t_num
        from   (select rownum num from dual connect by level <= 1000) s,
               (select rownum num from dual connect by level <= 1000) t
        where  rownum > 0
        order  by 1 desc, 2 asc)
where  ss.is_new(s_num, t_num) = 1
/
The call to ss.reset purges the arrays, and must be executed between runs. The rownum > 0 predicate is just the poor man’s way of ensuring that the call to ss.is_new is not pushed into the inline view — Oracle obviously doesn’t know that the row ordering is a critical part of the logic.
Obviously an INSERT (with an APPEND hint to invoke direct path) can be prepended to that query to perform the actual ETL task.