Sequences — Not The Only Choice

A quick comment of mine on the Oracle Forums prompts me to add a further note of explanation here.

Many of us are pretty much wedded to the use of sequences to generate primary key values, either by using a trigger to populate the column or by referencing the sequence next value in the insert statement itself. We do this because they provide concurrent inserts with unique values, and we don’t care about missing the occasional value here and there due to caching, rollback etc.

Well, sequences are fine things for solving concurrency issues in primary key generation, but I’m struggling to recall a time when I have had multiple processes inserting new values into a data warehouse table concurrently. I can see that it might happen on a fact table, though it would rule out the presence of enabled bitmap indexes, but for a dimension we are generally inserting values from a data source that we have found do not already exist as dimension values.

Without the need to support concurrent inserts the need to use a sequence largely disappears, and we are left instead with the need to use an object whose definition on creation often has to be inconveniently derived from the current maximum value in some column or other.

So, I’ve largely stopped using sequences to generate primary key values. Instead I have code of the following form:

 CREATE OR REPLACE VIEW NEW_DIM_NODE_VW AS
  with current_max as
         (select   Coalesce(max(node#),0) max_key_value
          from     dim_node),
       new_values as
         (select   node_txt
          from     source_data
          minus
          select   node_txt
          from     dim_node
          order by 1)
select node_txt,
       rownum + max_key_value node#
from   current_max c,
       new_values  n;

The “current_max” subquery factoring clause (SQFC) gives an extremely fast read of the index on the primary key column (node#) of the dimension table to find the current maximum value, using Coalesce() to ensure that an empty dimension table in a newly deployed schema returns zero instead of NULL.
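If you want to confirm that the MAX() read really is cheap, the execution plan will show it. A minimal sketch, assuming a primary key index exists on dim_node.node# (the query just repeats the body of the “current_max” clause):

```sql
-- Hypothetical check of the access path used for the current_max read.
explain plan for
select coalesce(max(node#), 0) max_key_value
from   dim_node;

select * from table(dbms_xplan.display);
```

On an indexed column you would expect an INDEX FULL SCAN (MIN/MAX) step, which descends a single branch of the index to reach the highest key rather than scanning the table.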

The “new_values” SQFC provides a list of new values, sorted so as to reduce (very slightly) the overhead of index maintenance on the unique constraint that protects the dim_node.node_txt column.

The actual load is performed by a packaged procedure of this form:

  procedure publish_new_nodes
  is
     l_row_count pls_integer;
  begin
     ETL_Support.Start_Action('publish_new_nodes');
     ETL_Support.set_start_time;

     insert into dim_node(node#,node_txt)
     select      node#,
                 node_txt
     from        new_dim_node_vw;

     -- capture the count immediately: any SQL executed inside
     -- ETL_Support.set_end_time would reset SQL%RowCount
     l_row_count := SQL%RowCount;

     ETL_Support.set_end_time;

     ETL_Support.Write_Log_DML(
         plsql_unit   => $$PLSQL_UNIT,
         plsql_line   => $$PLSQL_LINE,
         action       => 'Insert (Conventional Singletable)',
         message      => 'Loaded '||to_char(l_row_count,'fm999,999,990')
                         ||' new nodes to DIM_NODE',
         dml_row_count=> l_row_count);

     ETL_Support.End_Action;
  end publish_new_nodes;

Witness also the lovely use of $$PLSQL_UNIT and $$PLSQL_LINE to pass to the DML logging procedure the approximate code location for the activity.
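For anyone who hasn’t met those before: they are inquiry directives resolved by PL/SQL’s conditional compilation mechanism — $$PLSQL_UNIT gives the name of the compilation unit and $$PLSQL_LINE the line number at which the directive appears. A minimal sketch (the procedure name here is purely illustrative):

```sql
create or replace procedure directive_demo
as
begin
   -- $$PLSQL_UNIT resolves to 'DIRECTIVE_DEMO'; $$PLSQL_LINE to the
   -- line number of the statement in which it appears.
   dbms_output.put_line($$PLSQL_UNIT||' at line '||$$PLSQL_LINE);
end directive_demo;
/
```

Note that $$PLSQL_UNIT is empty in an anonymous block, which is one reason to keep logging calls inside named packaged procedures.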

Sweet.

One side effect of this is that the primary key values are gap-free, unless one were to delete rows other than those with the highest primary key values, of course. I wonder if the optimiser has any special way of treating columns in tables of X rows for which the maximum value is equal to the minimum value plus (X-1) and the number of distinct values is X? Probably not, but I like to think that it would put the cat among the pigeons if it did.
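The gap-free property is easy to assert if you ever need to: for a gap-free key the maximum equals the minimum plus the row count minus one (the primary key already guarantees distinctness). A hypothetical sanity-check query, not part of the load code itself:

```sql
-- For a gap-free key: max(node#) = min(node#) + count(*) - 1
select case
          when max(node#) = min(node#) + count(*) - 1
          then 'gap-free'
          else 'gaps present'
       end key_state
from   dim_node;
```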

Anyway, the only exception I have to the use of this method in the current system I’m working on is for tables populated by an Oracle Gateway process. Primary key values for those are still populated by trigger and sequence — what I have been describing as “Old School” to the imaginary critic in my head.

Aside from those, I’m done with sequences for now.