250,000 Tests for Uniqueness Per Second — Ain’t No Biggie

When you’re designing ETL processes the mantra of “do it in SQL” really gets ingrained in your way of thinking. PL/SQL-based processing as a method of handling business logic is anathema, because the principle of “row-by-row equals slow-by-slow” is thrown into sharp relief when you’re processing millions of rows.

Very recently, though, I found a case where this principle had to be violated, and with performance results that were not at all unpleasant. I was stumped by a SQL problem that I thought would be amenable to analytic functions, but I couldn’t think of a way of doing it. Here’s the situation.

I have two sets of data representing, let us say, sources and targets. Sources and targets have only one attribute each, which is a date in each case, and sources and targets can be joined in a many-to-many relationship. However, only one join is valid per source and only one is valid per target, and we have to form connections between sources and targets on a one-to-one basis. We work out which one of the possibly-many joins is the valid connection according to the following logic.

We start with the most recent source. We work out the earliest target to which it is joined and discard all other targets to which it is joined (if any). That target is now ineligible for any other source to connect to. Then we take the next most recent source, work through its targets in ascending target-date order, and connect to the first (i.e. earliest) target available, discarding all others. The only reason a target might not be available is that we have already connected a source to it.

One way of looking at this is to imagine that the set of joined sources and targets is ordered by source-date descending and then target-date ascending. Examining each pair in turn, we ask two questions:

  1. “Has this source already connected to a target?”
  2. “Has this target already connected to a source?”

If the answer to both of those is “No” then we take that join as the valid connection and we move on to the next pair.

It’s very easy to construct a sample data set for this problem as well, by simply joining a list of integers to itself to form pairs such as (1,1), (1,2), (1,3), … , (1000,1000) and ordering the output with a query such as this:

select s.num s_num,t.num t_num from 
(select rownum num from dual connect by level <= 1000) s, 
(select rownum num from dual connect by level <= 1000) t 
order by 1 desc, 2 asc 
/ 
 

Given that data set the first pair is (1000,1), which becomes the first connection. The next connection will be (999,2), the next (998,3), and so on. It’s important to note that the real data set is not like this one, in which every source joins to every target, so special conditions that hold in this data set do not apply to the real one.
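In procedural terms the rule is just a greedy scan over the ordered pairs. Here is a hypothetical Python sketch (not part of the actual solution, which follows below in PL/SQL) that reproduces the expected connections for the 1000×1000 sample set:

```python
def connect(pairs):
    """Greedily match pairs already ordered by source-date descending,
    then target-date ascending: take a pair only when neither the
    source nor the target has been connected yet."""
    used_sources, used_targets = set(), set()
    connections = []
    for source, target in pairs:
        if source not in used_sources and target not in used_targets:
            used_sources.add(source)
            used_targets.add(target)
            connections.append((source, target))
    return connections

# The 1000 x 1000 cross join, ordered source descending, target ascending.
pairs = [(s, t) for s in range(1000, 0, -1) for t in range(1, 1001)]

print(connect(pairs)[:3])  # [(1000, 1), (999, 2), (998, 3)]
```

Because every source joins to every target in this sample, all 1,000 sources find a connection; the real data set is sparser, so some sources go unmatched.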

By the way, there are some special conditions that are easy even in the real data: if a row represents the first occurrence for both the source and the target then it is always going to be the valid connection. Um … that’s actually the only special condition I could think of. There may be some very clever mathematical way of working out others, but I don’t know what it is.

So, the first solution I came up with was to use PL/SQL to read each row in order and attempt to insert it into a table having unique constraints on the source_id and target_id columns, handling any uniqueness error. The handling of the error is the only virtue that PL/SQL brings to this case. It was functional, but it was slow, because of course it requires a lot of logical reads and writes for the index maintenance. I did consider an insert with an error-logging clause, by the way, but that doesn’t handle unique or primary key constraint errors, and even if it did it would still have the same problem of high logical reads and writes.

The next solution was to maintain a pair of PL/SQL associative arrays to handle the detection of uniqueness. That worked well, except that I was using conventional-path inserts for the result set and didn’t fancy coding up bulk inserts. I know enough PL/SQL to be dangerous, but it’s not my native language.

My (currently) final solution was to embed the PL/SQL in a package so that a function to detect uniqueness could be called from a SQL statement. I was pretty concerned about the context switch to PL/SQL, and the logic means that this process can’t be parallelised, but the result surprised me. When I streamed 100,000,000 rows through the process to detect 10,000 valid connections it completed in 770 seconds, so that was at least 250,000 uniqueness tests per second. I can’t imagine that rate being achieved by detecting violations of a unique constraint.
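For what it’s worth, the arithmetic behind that figure works out if each row is charged with two associative-array probes, one per array. That is an assumption on my part, since the OR in the test can short-circuit, so treat it as a rough upper bound:

```python
rows = 100_000_000
elapsed_seconds = 770

rows_per_second = rows / elapsed_seconds   # roughly 130,000 rows/s
tests_per_second = 2 * rows_per_second     # roughly 260,000 probes/s, assuming two per row

print(round(rows_per_second), round(tests_per_second))
```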

Here’s the package I used:

CREATE OR REPLACE PACKAGE ss
is
  TYPE NumTyp IS TABLE OF NUMBER INDEX BY pls_integer;
  src_tab   NumTyp;
  tgt_tab   NumTyp;
  empty_tab NumTyp;

  PROCEDURE reset;

  FUNCTION is_new (s_num IN NUMBER, t_num IN NUMBER)
  RETURN NUMBER;
end;
/
CREATE OR REPLACE PACKAGE BODY ss
is

PROCEDURE reset
is
BEGIN
   src_tab := empty_tab;
   tgt_tab := empty_tab;
end;

FUNCTION is_new (s_num IN NUMBER, t_num IN NUMBER)
RETURN NUMBER
is
BEGIN
   IF src_tab.exists(s_num) OR tgt_tab.exists(t_num)
   then
      RETURN 0;
   else
      src_tab(s_num) := 1;
      tgt_tab(t_num) := 1;
      RETURN 1;
   end if;
end;

end;
/
 

The complete implementation can be tested as follows:

 exec ss.reset; 
        
select * from        
(select s.num s_num,t.num t_num 
from   (select rownum num from dual connect by level <= 1000) s, 
       (select rownum num from dual connect by level <= 1000) t 
where rownum > 0 
order by 1 desc, 2 asc) 
where ss.is_new(s_num,t_num) = 1 
/

The call to ss.reset purges the arrays and must be executed between runs. The rownum > 0 predicate is just the poor man’s way of ensuring that the call to ss.is_new is not pushed into the inline view, since Oracle obviously doesn’t know that the row ordering is a critical part of the logic.

Obviously an INSERT (with an APPEND hint to invoke direct path) can be prepended to that to perform the actual ETL task.


9 thoughts on “250,000 Tests for Uniqueness Per Second — Ain’t No Biggie”

  1. Could you give a more concrete example? I’m assuming that there is a defined relationship between source and target such that some source/target combinations are not valid. Otherwise you could use analytics to number the sources in descending order and the targets in ascending order and select out the rows where the source and target have the same number.

    WITH src_data AS (
      SELECT ROWNUM val
      FROM   dual
      CONNECT BY LEVEL <= 1000
    ), tgt_data AS (
      SELECT ROWNUM + 77 val
      FROM   dual
      CONNECT BY LEVEL <= 1000
    )
    SELECT *
    FROM (
      SELECT sd.val src_val,
             td.val tgt_val,
             ROW_NUMBER() OVER (ORDER BY sd.val DESC) src_num,
             ROW_NUMBER() OVER (ORDER BY td.val) tgt_num
      FROM   src_data sd, tgt_data td
    ) l1
    WHERE l1.src_num = l1.tgt_num


  2. a@B> exec ss.reset;

    PL/SQL procedure successfully completed.

    Elapsed: 00:00:00.00
    a@B >
    a@B > select * from
    2 (select s.num s_num,t.num t_num
    3 from (select rownum num from dual connect by level <= 10) s,
    4 (select rownum num from dual connect by level <= 5) t
    5 where rownum > 0
    6 order by 1 desc, 2 asc)
    7 where ss.is_new(s_num,t_num) = 1
    8 /

    S_NUM T_NUM
    ---------- ----------
    10 1
    9 2
    8 3
    7 4
    6 5

    Elapsed: 00:00:00.01
    a@B >
    a@B > with t as
    2 ( select s.num s_num,t.num t_num
    3 from (select rownum num from dual connect by level <= 10) s,
    4 (select rownum num from dual connect by level <= 5) t
    5 )
    6 ,j as
    7 (
    8 select s_num, dense_rank() over (order by s_num desc) rs
    9 ,t_num, dense_rank() over (order by t_num asc) rt
    10 from t
    11 --order by s_num desc, t_num asc
    12 )
    13 select s_num, t_num
    14 from j
    15 where rs = rt
    16 ;

    S_NUM T_NUM
    ---------- ----------
    10 1
    9 2
    8 3
    7 4
    6 5

    Elapsed: 00:00:00.01

  3. Yes, not all combinations are valid. So we have say 7 million sources and 8 million targets, and some subset of them join together through rather complex multivalue logic that is not relevant to this issue. We might average something like 3 potential targets for each source, so something like 21 million joins out of which we have to find the real connections — there might only be 2 million connections to be made because by the time you check the last of the sources all of their potential targets have already been connected-to by other sources.

    Here’s a crack at a better data set:

    drop table ss_set;

    create table ss_set (src_id number, src_date date, tgt_id number, tgt_date date);

    insert into ss_set values (1, date '2007-06-01', 1, date '2007-06-01');
    insert into ss_set values (1, date '2007-06-01', 2, date '2007-06-10');
    insert into ss_set values (1, date '2007-06-01', 3, date '2007-06-03');
    insert into ss_set values (2, date '2007-05-01', 1, date '2007-06-01');
    insert into ss_set values (2, date '2007-05-01', 2, date '2007-06-10');
    insert into ss_set values (2, date '2007-05-01', 4, date '2007-05-03');
    insert into ss_set values (3, date '2007-07-01', 1, date '2007-06-01');
    insert into ss_set values (4, date '2007-05-20', 2, date '2007-06-10');
    insert into ss_set values (4, date '2007-05-20', 4, date '2007-05-03');
    insert into ss_set values (5, date '2007-08-01', 1, date '2007-06-01');
    commit;

    So the ordered data set looks like this:

    select src_id,tgt_id
    from ss_set
    where rownum > 0
    order by src_date desc, tgt_date asc
    /

    SRC_ID TGT_ID
    ---------- ----------
    5 1
    3 1
    1 1
    1 3
    1 2
    4 4
    4 2
    2 4
    2 1
    2 2

    And the final result should look like this:

    exec ss.reset;

    select * from
    (select src_id,tgt_id
    from ss_set
    where rownum > 0
    order by src_date desc, tgt_date asc)
    where ss.is_new(src_id,tgt_id) = 1
    /

    SRC_ID TGT_ID
    ---------- ----------
    5 1
    1 3
    4 4
    2 2

    Note that source 3 has no valid connection, because its only candidate connection was to target 1, and source 5 had already connected to that by virtue of its later source date.

    Hopefully this isn’t too simple a data set …
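    The same greedy walk can be checked outside the database. This is a hypothetical Python sketch of the rule (not the package itself), run against the ordered rows above:

```python
# The ten joins ordered by src_date desc, tgt_date asc, as listed above.
ordered = [(5, 1), (3, 1), (1, 1), (1, 3), (1, 2),
           (4, 4), (4, 2), (2, 4), (2, 1), (2, 2)]

seen_src, seen_tgt = set(), set()
connections = []
for src, tgt in ordered:
    # Equivalent of ss.is_new: accept only if both sides are unseen.
    if src not in seen_src and tgt not in seen_tgt:
        seen_src.add(src)
        seen_tgt.add(tgt)
        connections.append((src, tgt))

print(connections)  # [(5, 1), (1, 3), (4, 4), (2, 2)]
```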

  4. Well, I only had my interpretation of what the real rules were.

    You were talking about two (hence independent) data sets with, seemingly, a many-to-many relationship which needed resolving. It wasn’t obvious there was also a pool of candidate connections … your sample data set seemed to be just that, sample data … and not sample data with a rule built into it.

    Also, with the old data set you were ordering and pushing into the memory stack the same pair of attributes. Now you’re ordering by a pair of attributes but push into the memory stack a different pair of attributes. Not exactly the same thing.

    Even with your new test case there seem to be built-in assumptions which may or may not be relevant. For example, update src_date and tgt_date, for all rows, to the same date. With that, the result of your query (and algorithm for that matter) becomes non-deterministic … but why that would not constitute a real data set, I don’t know.

    Having said that, I think your approach is quite valid and elegant. The query that I did was not substantially faster than your approach and, if anything, would get – given the extra rules required – only more complicated and slower.

    I cannot think of a linear SQL solution … a solution here appears to require a recursive approach. To do that without PL/SQL would need some sort of hierarchical query (whether a connect by query counts as straight SQL is questionable anyway). Given your volume, the hierarchical query would be slooooowww! Hence I would not even attempt that.

    Which brings me back to … someone may come up with a linear SQL solution and I would be curious to see it; but I know it won’t be faster than what you have. As long as you don’t blow/page the user memory with those PL/SQL arrays I think you’re good. Hopefully this matching is a one-time affair, since any insert/update/delete has the potential to destabilize the whole thing (change the result and/or make it non-deterministic).

    The only thing I would do differently is to pipe the results and do away with the need to reset those global arrays, etc. It was also faster in my tests for 1 million rows, though it may be different for your real data:


    create or replace type typ as object ( s number, t number );
    /
    show errors

    create or replace type arr as table of typ;
    /
    show errors

    create or replace function f return arr
    pipelined
    as
      type numArr is table of number index by binary_integer;
      s numArr;
      t numArr;
    begin
      for crs in ( select src_id, tgt_id
                   from   ss_set
                   order by src_date desc, tgt_date asc )
      loop
        if s.exists(crs.src_id) or t.exists(crs.tgt_id) then
          null;
        else
          s(crs.src_id) := 1;
          t(crs.tgt_id) := 1;
          pipe row ( typ(crs.src_id, crs.tgt_id) );
        end if;
      end loop;
      return;
    end;
    /
    show errors

    a@B> select * from table(f);

    S T
    ---------- ----------
    5 1
    1 3
    4 4
    2 2

  5. Huzzah, model clause to the rescue.

    SELECT SRC_ID, TGT_ID
    FROM (
      SELECT SRC_ID, TGT_ID, MATCH
      FROM   SS_SET
      MODEL
        DIMENSION BY (SRC_ID, TGT_ID)
        MEASURES (SRC_DATE, TGT_DATE, 0 MATCH)
        RULES (
          MATCH[ANY, ANY] ORDER BY SRC_DATE DESC, TGT_DATE ASC =
            CASE WHEN SUM(MATCH)[ANY, CV()] = 0
                  AND SUM(MATCH)[CV(), ANY] = 0
                 THEN 1
                 ELSE MATCH[CV(), CV()]
            END
        )
    )
    WHERE MATCH = 1

    I’m not sure how fast this will work with the type and volume of your data set, but it was fun trying to figure out how to do it in SQL.

  6. Wow … turns out that Fridays aren’t reserved for bad news all the time after all. I always thought I ought to learn more about the model clause, and now you’ve given me the perfect opportunity. I’m one of those people who can’t learn something without a practical problem to solve. It’s a debilitating handicap.

    I’ll look further into this and see how it scales against a real multimillion row data set.

    Thanks very much indeed Bob. I’ll post back with results.

  7. I ran a quick test, Bob, but it wasn’t encouraging I’m afraid. I didn’t want to post anything on it till I was sure, but it looked really, really slow. I think it was something like 100 seconds versus 1 second …
