Sequences — Not The Only Choice

I quick comment of mine on the Oracle Forums prompts me to add a further note of explanation here.

Many of us are pretty much wedded to the use of sequences to generate primary key values, either by using a trigger to populate the column or by referencing the sequence next value in the insert statement itself. We do this because they provide concurrent inserts with unique values, and we don’t care about missing the occasional value here and there due to caching, rollback etc.

Well sequences are fine things for solving concurrency issues on primary key generation, but I’m struggling to recall a time when I has multiple processes inserting new values into a data warehouse table concurrently. I can see that it might happen on a fact table, though it would rule out the presence of enabled bitmap indexes, but for a dimension we are generally inserting values from a data source that we have found do not exist as dimension values.

Without the need to support concurrent inserts the need to use a sequence largely disappears, and we are left instead with the need to use an object whose definition on creation often has to be inconveniently derived from the current maximum value in some column or other.

So, I’ve largely stopped using sequences to generate primary key values. Instead I have code of the following form:

 CREATE OR REPLACE VIEW NEW_DIM_NODE_VW AS
  with current_max as
         (select   Coalesce(max(node#),0) max_key_value
          from     dim_node),
       new_values as
         (select   node_txt
          from     source_data
          minus
          select   node_txt
          from     dim_node
          order by 1)
select node_txt,
       rownum + max_key_value node#
from   current_max c,
       new_values  n;

The “current_max” subquery factoring clause (SQFC) gives an extremely fast read of the index on the primary key column (node#) of the dimension table to find the current maximum value, using Coalesce() to ensure that an empty dimension table in a newly deployed schema returns zero instead of NULL.

The “new_values” SQFC provides a list of new values, ordered in order to reduce (very slightly) the overhead of index maintenance on the unique constraint that protects the dim_node.node_txt column.

The actual load is performed by a packaged procedure of this form:

  procedure publish_new_nodes
  is
  begin
     ETL_Support.Start_Action('publish_new_nodes');
     ETL_Support.set_start_time;

     insert into dim_node(node#,node_txt)
     select      node#,
                 node_txt
     from        new_dim_node_vw;

 

     ETL_Support.set_end_time;

     ETL_Support.Write_Log_DML(
         plsql_unit   => $$PLSQL_UNIT,
         plsql_line   => $$PLSQL_LINE,
         action       => 'Insert (Conventional Singletable)',
         message      => 'Loaded '||to_char(SQL%RowCount,'fm999,999,990')
                         ||' new nodes to DIM_NODE',
         dml_row_count=> SQL%RowCount);

     ETL_Support.End_Action;  
  End publish_new_nodes;

Witness also the lovely use of $$PLSQL_UNIT and $$PLSQL_LINE to pass to the DML logging procedure the approximate code location for the activity.

Sweet.

One side effect of this is that the primary key values are gap-free, unless one was to delete rows other than those with the highest primary key values of course. I wonder if the optimiser does have any special way of treating columns in tables of X rows for which the maximum value is equal to the minimum value plus (X-1) and the number of distinct values is X? Probably not, but I like to think that it would put the cat among the pigeons if it did.

Anyway, the only exception I have to the use of this method in the current system I’m working on is for tables populated by an Oracle Gateway process. Primary key values for those are still populated by trigger and sequence — what I have been describing as “Old School” to the imaginary critic in my head.

Aside from those, I’m done with sequences for now.

About these ads

11 thoughts on “Sequences — Not The Only Choice

  1. This can only work correctly when there is only one session active. If multiple sessions use this code at the same time and/or the table is being updated concurrently then the maxval will not show the correct value.
    Quote from OCA Oracle Database 11g: SQL Fundamentals I:
    The sequence mechanism is independent of tables, the row locking mechanism,
    and commit or rollback processing. This means that a sequence can issue thousands
    of unique values a minute—far faster than any method involving selecting a column
    from a table, updating it, and committing the change.
    Figure 12-4 shows two sessions selecting values from a sequence SEQ1.
    Note that in the figure each selection of SEQ1.NEXTVAL generates a unique
    number. The numbers are issued consecutively in order of the time the selection was
    made, and the number increments globally, not just within one session.

  2. Well, one huge disadvantage I see with this approach is that rownum cannot be used effectively with parallel queries. It constitutes a serialization point, because rows have to be numbered sequentially.

    This is the main reason I never use rownum for generating IDs.

    • Florian,

      Yes that’s probably the case for parallel insert, and a sequence would be appropriate there.

      With serial insert based on parallel select (for example when finding a relatively small number of new dimension values based on a large number of fact rows) I think that the rownum would be assigned by the (serial) process that is performing the inserts, hence no problem. Not tested mind, but I see no issue with it.

  3. Patrick,

    Indeed, but as I mentioned “I’m struggling to recall a time when I has [sic] multiple processes inserting new values into a data warehouse table concurrently”.

    For a dimension table the logic is usually “insert new values that we haven’t seen before”. This logic is generally not amenable to multiple processes maintaining it concurrently as each process could not tell if another process was already inserting “Billy-Bob Joe”. That’s not to say that there couldn’t be specific examples where values are guaranteed not to be shared among multiple insert processes, but in my experience they’d be a very small minority.

  4. Just curious; why have you picked COALESCE rather than NVL to do the null check? To me, COALESCE is the more obscure option and I would have used NVL instead, to save later developers scratching their heads and going “Why use COALESCE here?!”

    I’m also curious as to whether you have done any performance testing of the two methods and if so, did you find the non-concurrent, non-sequence method faster?

    I don’t think I’ll stop using sequences any time soon; I’d rather not run the risk of load failures of my dimension tables, however small that risk might be! *{;-)

    • I have developed a fondness for Coalesce(), I’m not sure why.

      I do like the way that it extends to more than two input expressions and really takes on the sense of “Pick the first of these that is not null”, eg. Coalesce(first_value, second_value, third_value, ‘default’). If I’m going to use it for that then I might as well use it for the more simple case of two expressions where people commonly use Nvl().

      I think that Nvl() is also oracle specific, and Coalesce() is more widely used by other RDBMS’s.

      I also seem to remember reading that Nvl() always evaluates the second expression, which is a bit wasteful if it is a PL/SQL function call or a complex RegExp or something similar. I may be wrong about that though.

      So, I could be using Nvl() by default and only adopting Coalesce() when one of those issues is valid, or use Coalesce() everywhere. At the moment I’m doing the latter.

      As far as risk on the PK generation method is concerned, I’ve been bitten at least twice by incorrectly synched sequences following a system deployment. One of those was in a case where a sequence was shared across multiple tables and the query used to generate the correct starting value for the sequence missed one of them out. The same deployment code had been used for a long time without the problem manifesting itself. I think we changed the code to be based on user_dependencies after that because all keys were generated based on triggers.

      Anyway, the nice thing is that we all have our own choice on what to do.

      • Good point on the COALESCE being more universal than the NVL; I hadn’t considered that side of things (I’ve only ever worked with Oracle *{;-) ).

        A few quick tests seem to indicate that using NVL vs COALESCE doesn’t give much performance difference; maybe a second or two difference over 100000 loops of v_num := nvl/coalesce(1 or NULL, 2);

        Nothing consistent though; sometimes NVL was quicker, sometimes COALESCE, and there was no noticable difference when I did (null, 2) vs (1, 2) either.

        Anyway, the nice thing is that we all have our own choice on what to do.

        True, and as long as it’s commented, then there shouldn’t be a maintenance headache either *{;-)

  5. Sequences are indeed the way to go with a parallel load process…. but are not without their quirks …. such as on Exadata RAC… but that is just tuning :-)

    I too use coalesce – one plus (I think, but may be wrong) is that it picks the first non-null reading from the left – if it finds a non-null it does not evaluate the remaining options; NVL always evaluates both options… which can be expensive if the second choice is a costly look-up…. but then this may just be an urban myth

    • aaaaha! I’ve just modified my test to:


      set timing on;

      declare
      v_num number;
      function sleep return number is
      begin
      dbms_lock.sleep(2);
      return 2;
      end;
      begin
      for i in 1..100
      loop
      v_num := nvl(1, sleep);
      end loop;
      end;
      /
      PL/SQL procedure successfully completed.
      Elapsed: 00:03:21.00

      declare
      v_num number;
      function sleep return number is
      begin
      dbms_lock.sleep(2);
      return 2;
      end;
      begin
      for i in 1..100
      loop
      v_num := coalesce(1, sleep);
      end loop;
      end;
      /
      PL/SQL procedure successfully completed.
      Elapsed: 00:00:00.04

      declare
      v_num number;
      function sleep return number is
      begin
      dbms_lock.sleep(2);
      return 2;
      end;
      begin
      for i in 1..100
      loop
      v_num := nvl(null, sleep);
      end loop;
      end;
      /
      PL/SQL procedure successfully completed.
      Elapsed: 00:03:21.00

      declare
      v_num number;
      function sleep return number is
      begin
      dbms_lock.sleep(2);
      return 2;
      end;
      begin
      for i in 1..100
      loop
      v_num := coalesce(null, sleep);
      end loop;
      end;
      /
      PL/SQL procedure successfully completed.
      Elapsed: 00:03:21.00

      So I’ve learnt something today – thanks *{:-D

      • @Boniest
        Thanks for running that test – I am getting lazy in old age…. no, I don’t have database here :-)

        It’s nice to know that I actually remembered something…

        @Dave ;-)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s