I've been fiddling around with creating some simulations of data warehouse summary data, and one issue I have to deal with is to get sufficient rows in the summary for the set to be useful while still avoiding duplicate key values.

Why do duplicates matter so much? One critical scenario is when you rely on a merge to maintain the summary data. If there are duplicate key values in the summary data that match the key for a new row being merged in then the increment in the metric value of the new data is applied to multiple summary table rows. Uh oh.

To generate the data set I generally use a query of the form …

create table data_summary ( date_of_month not null, location_cd not null, product_cd not null, s_product_qty not null, s_product_cost not null ) as select Add_Months(date '2000-01-01',mod(rownum-1,77)), mod(rownum-1,100)+1, mod(rownum-1,1000)+1, floor(dbms_random.value(1, 30)), floor(dbms_random.value(1,100)) from dual connect by level <= 7700000 /

Note here the difference between mod(rownum,10) and mod(rownum-1,10)+1.

The problem is knowing how many rows I can generate for this data set, in which the three key columns have 77, 100 and 1000 unique values generated cyclically, before I hit a duplicate and the data is no longer a valid simulation. In the above example I am generating 7,700,000 rows, which is 77*100*1000 — too much? In this case it all revolves around the lowest common multiple of the cardinalities of the key columns.

Well … through Wikipedia I found that the Lowest Common Multiple can be calculated based upon the Greatest Common Divisor, and I also found the Euclidean Algorithm for GCD.

Thus …

create or replace function gcd(a number, b number)

return number is begin if b = 0 then return a; else return gcd(b,mod(a,b)); end if; end; / create or replace function lcm(a number, b number) return number is begin return (a*b)/gcd(a,b); end; /

And in action this gives us …

SQL> create or replace function gcd(a number, b number) 2 return number 3 is 4 begin 5 if b = 0 then 6 return a; 7 else 8 return gcd(b,mod(a,b)); 9 end if; 10 end; 11 / Function created. SQL> SQL> create or replace function lcm(a number, b number) 2 return number 3 is 4 begin 5 return (a*b)/gcd(a,b); 6 end; 7 / Function created. SQL> select gcd(8,12) from dual 2 / GCD(8,12) ---------- 4 SQL> select lcm(15,12) from dual 2 / LCM(15,12) ---------- 60

So, where is this leading?

Well the answer here is that 7,700,000 rows is indeed too many because it exceeds the lowest common multiple of 77, 100 and 1,000, which is 77,000.

SQL> select lcm(77,lcm(100,1000)) from dual; LCM(77,LCM(100,1000)) --------------------- 77000

So now, if I want to increase the number of rows I am using without violating a unique constraint then I can just experiment with a few values …

SQL> select lcm(79,lcm(101,78001)) lcm from dual; LCM ---------- 622369979 SQL> select lcm(77,lcm(100,1000)) lcm from dual; LCM ---------- 77000 SQL> select lcm(77,lcm(100,1001)) lcm from dual; LCM ---------- 100100 SQL> select lcm(77,lcm(101,1001)) lcm from dual; LCM ---------- 101101 SQL> select lcm(77,lcm(101,77001)) lcm from dual; LCM ---------- 598836777 SQL> select lcm(77,lcm(99,77001)) lcm from dual; LCM ---------- 17787231

If seventeen million is about what I'm looking for, then here is my final query;

create table data_summary ( date_of_month not null, location_cd not null, product_cd not null, s_product_qty not null, s_product_cost not null ) as select Add_Months(date '2000-01-01',mod(rownum-1,77)), mod(rownum-1,99)+1, mod(rownum-1,77001)+1, floor(dbms_random.value(1, 30)), floor(dbms_random.value(1,100)) from dual connect by level <= 17787231 /

You can see a validation of this result here. in which using 17,787,23**1** results in no duplicate key combination, whereas 17,787,23**2** shows two duplicate rows.

This is not entirely representative of a real data set of course. In practice not all key combinations are usually represented in the fact table and therefore they are not represented in the summary. To simulate this we can filter using a DBMS_Random result, such as in the following example:

create table data_summary ( date_of_month not null, location_cd not null, product_cd not null, s_product_qty not null, s_product_cost not null ) as select * from ( Select Add_Months(date '2000-01-01',mod(rownum-1,77)), mod(rownum-1,99)+1, mod(rownum-1,77001)+1, floor(dbms_random.value(1, 30)), floor(dbms_random.value(1,100)) from dual connect by level <= 17787231 ) where dbms_random.value(0,100) < 75 /

Here we've reduced the row count by 25%, but I feel that it improves the realism of the simulation.

I tend to go for prime numbers to do this sort of thing. The closest I could get to your original set (79,100,1000) would be (77, 101, 997) for a product of 7955063.

A quick hack at Sieve of Eratosthenes to get them (I hope the tags work, or this may be a mess – where’s the ‘preview’ button?):

create or replace procedure sieve(i_limit in number)

as

type t_type is table of boolean index by binary_integer;

t t_type;

begin

for i in 1..i_limit loop

t(i) := true;

end loop;

for i in 2..trunc(sqrt(i_limit)) loop

for j in 2 .. trunc(i_limit/i) loop

if t(i*j) then

t(i * j) := false;

end if;

end loop;

end loop;

for i in 1..i_limit loop

if t(i) then

dbms_output.put_line(i);

end if;

end loop;

end;

/

execute sieve(1000)

*ahem*

Far be it from be to correct an ex-math teacher, but you’re not suggesting that 77 is a prime number are you Mr Lewis? *transfixes-with-beady-eye* I expect that was a typo … the 79 and 77 in the wrong places.

Yes, prime numbers are generally the thing to aim for — makes the arithmetic for the LCM rather trivial also. I sometimes find them to be inconvenient, for example where I want an integer number of years and hence need 12, 24, or 36 months for example.

So when can we expect to see your website tarted up a little, Jonathan? Do we need to make it a community effort to migrate all those articles?

Good observation, and correct deduction.

For 12, 24, 36 months, I can’t help noticing that you can still stick with prime numbers for the rest provided you avoid 2 and 3 (and 5 if you want 60 months … extrapolation for other numbers of years left as an exercise).

1. Buatlah suatu FUNCTION yang akan me-return nilai boolean (TRUE atau FALSE) berdasarkan parameter ID pegawai. Function akan bernilai TRUE jika ID pegawai ada di tabel EMPLOYEES. Sebaliknya akan bernilai FALSE jika ID yang diberikan tidak ada di tabel.

2. Buatlah suatu blok program PL/SQL yang akan menginputkan ID pegawai (employee_id) dan selanjutnya akan mengambil seluruh data (record) sesuai dengan ID yang diinputkan, lalu mencetak semuanya di layar monitor !