The Three Pillars of Oracle Data Warehousing

Introduction

This is a basic topic for Oracle data warehousing beginners, based on some ideas that I’m hoping will stop buzzing round in my head if I commit them to virtual paper.

There are three Oracle features that provide a foundation for successful data warehousing:

  • Partitioning
  • Parallelism
  • Psummary Tables (the “p” is silent)

The benefits that they bring to the system are explored in the posts that follow.

My Dishwasher is List Partitioned

My dishwasher has two levels in it. The bottom level is for plates and Large Things, the top level is for cups and Small Things, and there is a removable cutlery partition attached to the door.

If I had a spare cutlery partition then I could keep one of them in use by the dishwashing system while the other one is treated as a regular cutlery holder, and it could be filled with dirty cutlery as the items become available. When it was full I could perform a cutlery partition exchange between the one with clean items and the one now full of dirty items.

If I had spare levels then they could be treated in a similar manner. I believe that commercial dishwashers have exactly that configuration; the exchange mechanism lets them operate with lower downtime, although the overall setup requires more space.
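
In Oracle terms the spare-basket trick is a partition exchange load. Here is a minimal sketch, with invented table and partition names, purely to anchor the analogy:

-- Fill a standalone "spare basket" table with the incoming dirty cutlery...
create table cutlery_stage as
select * from cutlery where 1 = 0;

insert /*+ append */ into cutlery_stage
select * from incoming_cutlery;
commit;

-- ...then swap it with the target partition. The exchange is a data
-- dictionary operation, so the "dishwasher" is barely interrupted.
alter table cutlery
  exchange partition p_current
  with table cutlery_stage
  including indexes
  without validation;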

Within the cutlery partition there are six subpartitions. I like to fill each one with a single type of cutlery — one for knives, two for spoons (they don't stack as well), a couple for forks, and one for other items. Although it is more work to separate the items into these subpartitions it has the advantage of physically clustering all the spoons together and I can access all items of one type without having to scan the complete cutlery partition.

For the upper and lower levels similar principles apply, although they are not really subpartitioned in the same way. Instead the large plates are clustered in a single contiguous range — the small plates, the glasses and the mugs each have their own place. Again it is more work to insert the items like this, but the advantage of faster retrieval is similar because I don't have to scan the complete level to pick similar items out from dissimilar ones.
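
For the record, the cutlery basket with its per-type slots corresponds roughly to composite partitioning, which in the 9i/10g era means something like range-list. A sketch with made-up names, just to show the shape of it:

create table dishwasher_items
   (
   load_date  date,
   item_type  varchar2(10),
   item_id    number
   )
partition by range (load_date)
subpartition by list (item_type)
subpartition template
   (
   subpartition sp_knives values ('KNIFE'),
   subpartition sp_spoons values ('SPOON'),
   subpartition sp_forks  values ('FORK'),
   subpartition sp_other  values (default)
   )
   (
   partition p_today values less than (to_date('2005-06-04','YYYY-MM-DD'))
   );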

That is all.

Predicate Pushing And Analytic Functions

I'm sure that there must be a fair number of Oracle professionals who carry around in their heads a little score card of some of their best tuning results ever … hopefully, at least, 'cos otherwise I'm a weirdo. Anyway, today I improved the performance of a set of decision support queries and achieved my best result ever – improving query speed by a factor of 180,000, from an original run time of one hour down to a new time of 0.02 seconds.

The wise monkeys in the audience will immediately be thinking "partition pruning!" or "materialized view!", and in fact if you thought the former then you'd be right. Here's how it worked.

I had defined a view against a large fact table so that it included several analytic functions. The details really don't matter, but the intention was to allow a column to contribute to a metric value only once per transaction_id, so the metric definition was something like:

Decode(Row_Number() Over (Partition By Transaction_ID Order By 1),1,Qty,0)

Queries that access this view had predicates on a number of columns, including some very selective indexed columns (Item_ID) and a partition key column (Fiscal_Month). Unfortunately there is nothing in the database to tell the optimizer that each unique value of Transaction_ID has only a single value of Item_ID and Fiscal_Month, so logically the predicates could not be applied until after the analytic function had been calculated. Hence there was no predicate pushing on Fiscal_Month and Item_ID, and neither partition pruning nor index access was considered. The query was actually scanning about one quarter of the table (it looks like a combination of subpartition pruning and partition pruning was taking place, but this table is multicolumn range + list composite partitioned, and pruning at the partition level was only taking place on the leading column of the partition key).

However, we included the two predicate columns in the analytic functions' partition clause like so:

Decode(Row_Number() Over (Partition By Transaction_ID, Item_Id, Fiscal_Month Order By 1),1,Qty,0)

Now this doesn't change the result because the Item_Id and Fiscal_Month entries are actually logically redundant, but it did allow both 9i and 10g to push the predicates and give full partition and subpartition pruning and index-based access.

Quite a nice result.

Here's a script that I used to demonstrate that it would work.

drop table t1;
create table t1
   (
   txid   ,
   month  ,
   item_cd,
   qty
   )
as
select
   floor(rownum/5),
   floor(rownum/20),
   floor(rownum/10),
   floor(rownum/5)
from
   dual
connect by
   -- the row-count limit here was truncated in the original post;
   -- 10,000 rows is an assumed value, plenty for the demonstration
   level <= 10000;
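
The view definition and test query were lost from this point of the original post; the following is my reconstruction (the view name and exact column list are guesses), which is at least consistent with the plan below, since item_cd = 0 matches nine rows and that is the cardinality the plan reports:

create or replace view v1
as
select txid,
       month,
       item_cd,
       decode(row_number() over
                 (partition by txid, item_cd, month order by 1), 1, qty, 0) qty
from   t1;

explain plan for
select *
from   v1
where  item_cd = 0;

select * from table(dbms_xplan.display);
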
In 10g the explain plan with predicate pushing was very straightforward:
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Plan hash value: 2273146475

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     9 |   252 |     3  (34)| 00:00:01 |
|   1 |  VIEW               |      |     9 |   252 |     3  (34)| 00:00:01 |
|   2 |   WINDOW SORT       |      |     9 |   351 |     3  (34)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| T1   |     9 |   351 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - filter("ITEM_CD"=0)

Note the filter being applied at operation Id 3, the full scan of T1. Where the predicate was not pushed it was applied at Id 1, the view.
The 9i explain plan was rather ... funky. But it worked.

ORA-12034: materialized view log younger than last refresh

Dabblers in the art of the materialisation of views may be used to seeing this error message, and it's often a portent of doom. Metalink Note 204127.1 tackles it head on with a list of possible causes, but it is remiss in failing to mention that a complete refresh is not always required.

And thank goodness for that. A multi-gigabyte materialized view based on a multi-multi-gigabyte master table is no joke to perform a complete refresh on, and anything we can do to avoid it is welcome.

As far as data warehousing is concerned we can trim the fat out of the Metalink note to shortlist our potential causes:

  • Previous refresh failed
  • Modifying the master table so that changes don't go to the materialized view log (truncate, alter partitions … drop / truncate … etc)
  • Master table reorganization

And here is another prime cause that the note doesn't identify:

  • Issuing an MV log purge.

All of these potential causes appear to be easily fixable through the simple mechanism of dropping and recreating the materialized view — and I'm not talking about dropping the data here. One of the requirements for this method to be successful is that the materialized view was created on a prebuilt table. If it was (the BUILD_MODE column of USER_MVIEWS will tell you; see the query after the steps below), then in the event of a refresh failure we can:

  1. Drop the materialized view.
  2. Do whatever is required (if anything) to bring the master table and the MV back into synchronization (delete data from master, or add data to MV).
  3. Recreate the MV, again using the ON PREBUILT TABLE clause obviously.
  4. Erm …
  5. That's it.
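
As an aside, the prebuilt-table check mentioned above is just a dictionary query, something like:

select mview_name,
       build_mode        -- 'PREBUILT' is the value we want to see
from   user_mviews
where  mview_name = 'M_MV_CSTAR';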

Doubtless you script hounds will be wanting to see a test of this, huh?

create table m (col1 number);

create materialized view log on m
with rowid (col1)
including new values;

create table
     m_mv_cstar
as
select count(*) c_star
from   m;

create materialized view m_mv_cstar
 on prebuilt table
 refresh fast
 on commit
as
select count(*) c_star
from   m;

insert into m values (1);

commit;
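
A quick check at this point (not part of the original post) should confirm that the on-commit fast refresh is keeping up:

select * from m_mv_cstar;     -- expect C_STAR = 1 after the single insert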

That gets us a successfully fast refreshing materialized view. Now we break it.

drop materialized view log on m;

create materialized view log on m
with rowid (col1)
including new values;

insert into m values (1);

commit;

That ought to get you the ORA-12034 error, and this is the time to go and check whether any other materialized views have successfully refreshed, and whether the change to the master table is still intact or has been rolled back. In this case the change to m ought to have been rolled back.

So now we fix the problem of the ORA-12034 without a complete refresh of the materialized view.

drop materialized view m_mv_cstar;

create materialized view m_mv_cstar
  on prebuilt table
  refresh fast
  on commit
as
select count(*) c_star
from   m;

If by some chance you had mislaid the definition of your materialized view then use the following to retrieve it.

Set Long 100000 Pagesize 0

Select  DBMS_Metadata.Get_DDL('MATERIALIZED_VIEW','M_MV_CSTAR') From Dual;

Note the underscore in the MATERIALIZED_VIEW there. Forget that and you'll receive a baffling:

1* Select  DBMS_Metadata.Get_DDL('MATERIALIZED VIEW','M_MV_CSTAR') From Dual
SQL> /
ERROR:
ORA-31600: invalid input value MATERIALIZED VIEW for parameter OBJECT_TYPE in
function GET_DDL
ORA-06512: at "SYS.DBMS_METADATA", line 2576
ORA-06512: at "SYS.DBMS_METADATA", line 2627
ORA-06512: at "SYS.DBMS_METADATA", line 4220
ORA-06512: at line 1

So have we successfully recovered from the error?

insert into m values (1);

commit;

select * from m;

select * from m_mv_cstar;

Hopefully that demonstrates that we have.

So, is there a downside to this? Well, if you have never run a complete refresh on your materialized view, or created it with a "build immediate" clause, then you will not get the comfort of a STALENESS of "FRESH" and will instead be stuck with the disturbing "UNKNOWN". Relax — it's just that although you know that the materialized view is fresh, Oracle does not trust you. But then there are so many events that can break the "FRESH" staleness of your materialized view that you should treat FRESH MVs as a rare luxury and not get used to them.
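
The staleness in question is visible in the dictionary; for the demonstration MV above, something like this will show it:

select mview_name,
       staleness,          -- FRESH, STALE, UNKNOWN, ...
       last_refresh_type,
       last_refresh_date
from   user_mviews
where  mview_name = 'M_MV_CSTAR';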

Designs That Balance ETL and Report Performance

It seems to me that there is sometimes a very difficult balance to strike, between optimizing the structure of a database for fast and convenient ETL loading and fast and convenient report querying.

Normally the choice of how to partition a fact table is pretty straightforward, because the data being loaded arrives in an order compatible with querying — ie. all of Tuesday’s transactions are available to the data warehouse ETL process early on Wednesday morning, and during the day the users will be running reports with such filters as “Give me Tuesday’s data”. Thus you can partition on a daily basis and accommodate both a daily ETL process of loading via partition exchange, and a daily reporting process for which the optimizer can invoke partition pruning.

So far so good — if the unit in which the data becomes available is also a unit by which the end users want to report, then the decision on partitioning is simple.
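
To make the comfortable case concrete, here is a sketch with invented names; the partitioning unit, the ETL unit and the reporting unit all line up on the day:

create table sales_fact
   (
   transaction_date date,
   item_id          number,
   qty              number,
   amount           number
   )
partition by range (transaction_date)
   (
   partition p_2005_06_07 values less than (to_date('2005-06-08','YYYY-MM-DD')),
   partition p_2005_06_08 values less than (to_date('2005-06-09','YYYY-MM-DD'))
   );

-- Wednesday's ETL exchanges Tuesday's staged data into its own partition,
-- and Tuesday's reports prune down to that same partition:
select sum(amount)
from   sales_fact
where  transaction_date >= to_date('2005-06-07','YYYY-MM-DD')
and    transaction_date <  to_date('2005-06-08','YYYY-MM-DD');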

Dates as Key Values: Natural or Synthetic?

Here's a question: when deciding whether to use a date datatype as a key value, why does there seem to be some consensus that it is OK to do so, even in situations where the same person would not consider using a country code, for example? Now I'm not talking about whether people should use a date datatype or should instead represent a date as a character string or number, because for me that is not an issue. If you want to store a date then you use a date datatype.

But how about a situation where you are using a date value as a key, for example in a data warehouse fact table? Is it "OK" to use the date, or should you be using a synthetic key? This has always been a bit of a philosophical conundrum for me, because I always think, "Yeah, dates are OK as key values — no need for a synthetic there" but have not put my finger on the reason why.

But recent discussions have firmed up some thoughts on the "true nature" of a synthetic key, and I had a little epiphany on the subject in response to a question by Pando at DBASupport.com.

I started with two points:

  1. The difference between a natural key and a synthetic key is that a natural key has some meaning to the naked eye in isolation, and therefore can be exposed directly to the user and "generated" by users. However a synthetic key's meaning is only revealed through translation to a natural value.
  2. If you change the value of a natural key, you are generally not changing what that key value represents, eg. "Finland", you are just changing its means of representation — ie. some visible code value.

Now what is a date value in Oracle? It is merely some internal representation that correlates one-to-one with that particular date. Let me float the idea that a date column is actually a synthetic value, but appears to be natural because Oracle uses one of a great many possible conversions to represent it as such. So if your boss came in one morning and said, "Bad news lads, the business is changing the standard date format to be Julian", or you wanted to suddenly start representing month names in … um … Urdu, then just as in the case where a country changes its code we're not saying that the date itself has changed, only that its "human-friendly" representation has. You can do this pretty simply by changing NLS settings — and if you have quality code that always uses an explicit date format mask to say such things as:

Where date_of_transaction = To_Date('01-02-2005','DD-MM-YYYY') *

… then you generally don't even have to change that.
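
A quick illustration of that point, using nothing more than session-level NLS settings:

-- Change only the "human-friendly" representation for this session...
alter session set nls_date_format = 'J';        -- Julian day numbers

-- ...the internal value is untouched, so a predicate that supplies its
-- own format mask still identifies exactly the same date:
select case
          when to_date('01-02-2005','DD-MM-YYYY') = to_date('2005-02-01','YYYY-MM-DD')
          then 'same date'
          else 'different date'
       end as check_result
from   dual;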

So here's my proposition: date-based keys are really synthetic, and that is why representing them with a regular sequence-generated synthetic value makes no sense — it's the representation of them, with year-month-day or whatever, that is natural, and Oracle is effectively maintaining an internal and very flexible two-way translation table for you, accessed through the to_date() and to_char() functions.

Now one difference between dates as synthetic values and other synthetics is that every Oracle database is using the same internal representation for the same date — in a sense the translation table is global within the Oracle world. So when moving these raw internal values between different Oracle databases (through tablespace transport, for example) there is not the usual trouble with synthetics of needing a translation to the local system of representation. On the other hand, I'd be willing to lay a couple of $'s on the line that converting from Oracle to other RDBMSs does indeed require a translation of the internal value, possibly by converting to and from a natural representation.

* There was a formatting error in the original post that dropped the RHS of this predicate.

Statistics By Block Sampling

If you're just joining us, I'd suggest that you scroll down and read the original article then the follow-ups from the bottom upwards.

Update 03 Jun 2005

Another consideration raised, on the uniformity of the number of rows per block.

One reason why the test data showed good resilience towards estimation of statistics, for block sampling in particular, might be a lack of variation in the number of rows per block. It occurs to me that for a table with a pretty regular row length that is not subject to deletes, a 1% block sample is going to sample around 1% of the rows, even if the sampled rows are clustered in a small number of sampled blocks.

So I ran a quick test to see what the distribution of rows-per-block is like in the table, and came up with the following query:

select rows_per_block,
       count(*)            blocks,
       sum(rows_per_block) sum_rows
from
       (
       select dbms_rowid.rowid_block_number(rowid),
              count(*) rows_per_block
       from   t1
       group by dbms_rowid.rowid_block_number(rowid)
       )
group by rows_per_block
order by 1 desc
/

ROWS_PER_BLOCK     BLOCKS   SUM_ROWS
-------------- ---------- ----------
           345          1        345
           337          2        674
           336          3       1008
           324         11       3564
           323         42      13566
           322         58      18676
           321          5       1605
           320          1        320
           319          1        319
           315          1        315
           314          1        314
           313          3        939
           312          3        936
           311          6       1866
           310        115      35650
           309        503     155427
           308        363     111804
           307         38      11666
           306          3        918
           305          1        305
           304          2        608
           302          3        906
           301          2        602
           300          5       1500
           299          3        897
           298         26       7748
           297        475     141075
           296       1260     372960
           295        378     111510
           294          6       1764
           213          1        213
               ---------- ----------
sum                  3322    1000000

Or to summarise:

select to_char(floor(rows_per_block/10)*10,'fm990')||'''s' rows_per_block,
       count(*)            blocks,
       sum(rows_per_block) sum_rows
from
       (
       select dbms_rowid.rowid_block_number(rowid),
              count(*) rows_per_block
       from   t1
       group by dbms_rowid.rowid_block_number(rowid)
       )
group by to_char(floor(rows_per_block/10)*10,'fm990')||'''s'
order by 1 desc
/

ROWS_P     BLOCKS   SUM_ROWS
------ ---------- ----------
340's           1        345
330's           5       1682
320's         117      37731
310's         130      40339
300's         920     283736
290's        2148     635954
210's           1        213
       ---------- ----------
sum          3322    1000000

Now bear in mind that it's implicit in this query that we're not including empty blocks in there. According to the table statistics based on a compute, the table has 0 empty blocks.

So the worst-case scenarios for a 1% block sample are that it reads the blocks with either the smallest or the greatest number of rows per block. In this case that's not going to be a huge disparity. If the 33 blocks sampled were those with the smallest number of rows then Oracle might be sampling 9,647 rows, and if they were the blocks with the largest number of rows then it would be sampling 10,759 rows. Bear in mind also that both the number of rows for the table and the average row length suffer similar errors, although they would be inversely related (over-estimation of the number of rows would correlate with under-estimation of the average row length).

It's not too bad though. This means that for this particular table the maximum over estimation of row count based on a 1% block sample is 4.5%.
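
For the record, those worst-case figures drop straight out of the rows-per-block query. Something along these lines gives the low end (flip the sort direction for the high end):

select sum(rows_per_block) worst_case_sample_rows
from   (
       select rows_per_block
       from   (
              select dbms_rowid.rowid_block_number(rowid) blk,
                     count(*) rows_per_block
              from   t1
              group by dbms_rowid.rowid_block_number(rowid)
              )
       order by rows_per_block          -- ascending: the 33 emptiest blocks first
       )
where  rownum <= 33;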

There's one other consideration, however. A 1% block sample does not guarantee that 1% of the blocks will be read. As the documentation states, "This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."

We can illustrate this easily by a query:

select avg(est_rows) avg_rows,
stddev(est_rows) sd_rows,
avg(blks) avg_blks,
stddev(blks) sd_blks,
min(blks) min_blks,
max(blks) max_blks
from
(
select count(*) est_rows,
count(distinct dbms_rowid.rowid_block_number(rowid)) blks
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
)
/

  AVG_ROWS    SD_ROWS   AVG_BLKS    SD_BLKS   MIN_BLKS   MAX_BLKS
---------- ---------- ---------- ---------- ---------- ----------
  9578.125 1699.43488     31.875  5.6317552         20         43

So while the previous minimum and maximum don't really apply — the estimate could actually be rather higher or lower depending on how many blocks were truly sampled — I'm inclined to think that this still supports the original idea: the relatively stable number of rows per block in this table makes it more amenable to block-based sampling than might be the case in the real world.

However, just as I stated before, neither JL's original article nor this blog is really telling you to use a particular method. They are merely exposing the tools and the thought processes that you can follow in order to make your own judgement on the matter.

Update 02 Jun 2005

Prompted by Jan's comment, here are some thoughts on how we can use this information to make decisions on our own system's statistics gathering requirements.

This article is not necessarily intended to give direct guidance on what sampling percentage and methodology you ought to be using on your own data, although hopefully it has already given you ideas. What you might take away from it is exactly what I learned from JL's article — that we ought to be applying measurement techniques to the performance and accuracy of various methods, and making our own decisions based on experimentation. Much of the value of this technique, or indeed any that is based on measurement, comes from being able to document the rationale for your decisions, and this brings to mind two scenarios:

  • You measure and document potential changes to the statistics gathering methodology prior to making any decision to change — an effective proposal can then be put forward … "Based on this analysis, I recommend the following changes to our statistics gathering technique because although our statistical accuracy will degrade by 3%, our statistics gathering workload will decrease by 66%".
  • Having documented the method and the results you can then perform regular reviews (quarterly, for example) to check for any benefit in changing the methodology.
  • You can justify the decisions you made if they are challenged in the future, and show that you acted with due diligence and a rational procedure.

I'll also repeat what I mentioned in another comment — that one of the features of JL's article that made it so valuable to me was that it laid bare a problem that I had but had not recognised. I knew that I'd been wondering whether computing statistics was beneficial in comparison to estimation, but I had never taken the next step of measuring it.

Without wanting to sound like too much of a kiss-ass, thanks again Jonathan.

Original Article

Jonathan Lewis has posted an interesting article on the topic of how precisely, and how often, statistics should be collected, in which he demonstrates the influence that different distributions of values have on the reliability of a range of sampling percentages in the collection of statistics.

That was a long sentence. Moving along …

This prompted me to wonder what effect the option to sample "by block", rather than the default of "by row", would have on these numbers. Briefly, the DBMS_STATS package allows you to choose whether the required percentage can be based on a percentage of blocks in the segment or on the percentage of rows — the implication there being that sampling by rows will necessarily access more blocks than sampling the same percentage of blocks.
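
In DBMS_STATS terms the choice is just the BLOCK_SAMPLE parameter on the gather call (the full test harness appears below); for example:

begin
   dbms_stats.gather_table_stats(
      ownname          => user,
      tabname          => 'T1',
      estimate_percent => 1,
      block_sample     => true,        -- sample ~1% of blocks rather than ~1% of rows
      method_opt       => 'FOR ALL COLUMNS');
end;
/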

It seems reasonable to assume that when you choose to sample by block you will get reduced accuracy on clustered values compared with other, more random distributions, but the two questions to be answered were:

  • How much less accurate might the block sampling prove to be?
  • How much more performant might the block sampling be?

On to the test.

I used the same method to construct a sample data set as in JL's paper, but on 10.1.0.2.0.

I wrote a procedure to allow non-privileged users to flush the buffer cache, so that the wall-clock timings I used for performance comparison would not be influenced by block caching:

create procedure flush_buffer_cache
as
begin
execute immediate 'alter system flush buffer_cache';
end;
/
grant execute on flush_buffer_cache to dave
/

Then I created a table to store the percentages that were to be used in the statistics gathering, and a table to store the results of the tests:

set feedback off
drop table t;
create table t (pct number);
insert into t values (1);
insert into t values (2);
insert into t values (4);
insert into t values (6);
insert into t values (8);
insert into t values (10);
insert into t values (20);
insert into t values (40);
insert into t values (60);
insert into t values (80);
insert into t values (100);

create table result
   (
   pct            number,
   block_sampling char(1) check (block_sampling in ('Y','N')),
   duration_sec   number,
   column_name    varchar2(30),
   num_distinct   number(10,2)
   )
/

The anonymous PL/SQL block used to gather the results was as follows:

declare
   l_time     number;
   l_duration number;
begin
   for list in (select to_char(pct) pct_txt, pct from t order by pct)
   loop
      sys.flush_buffer_cache;
      l_time := dbms_utility.get_time;
      dbms_stats.gather_table_stats(
         ownname          => user,
         tabname          => 'T1',
         estimate_percent => list.pct_txt,
         block_sample     => FALSE,
         method_opt       => 'FOR ALL COLUMNS');
      l_duration := (dbms_utility.get_time - l_time)/100;
      insert into result
      select list.pct,
             'N',
             l_duration,
             column_name,
             num_distinct
      from   user_tab_columns
      where  table_name = 'T1';

      sys.flush_buffer_cache;
      l_time := dbms_utility.get_time;
      dbms_stats.gather_table_stats(
         ownname          => user,
         tabname          => 'T1',
         estimate_percent => list.pct_txt,
         block_sample     => TRUE,
         method_opt       => 'FOR ALL COLUMNS');
      l_duration := (dbms_utility.get_time - l_time)/100;
      insert into result
      select list.pct,
             'Y',
             l_duration,
             column_name,
             num_distinct
      from   user_tab_columns
      where  table_name = 'T1';

   end loop;
end;
/

Incidentally that code set a new record for me, for the longest code section to compile and work correctly first time. Deep joy.

So, the results. The Y/N indicates whether block sampling was used (Y) or not (N) and the percentage is indicated by the first column of numbers. The two columns under the Y and N indicate seconds duration and estimated number of distinct values.

CLUSTERED          N (row sample)         Y (block sample)
   Pct            secs      NDV          secs      NDV
     1             3.7    52,298          2.3       461
     2             5.3    48,809          4.4     1,046
     4             8.3    50,118          7.4     2,249
     6            10.9    49,885          9.7     3,234
     8            13.8    50,056         12.9     4,348
    10            17.4    50,053         16.2     5,337
    20            27.2    49,996         26.6    11,035
    40            47.2    50,001         45.9    20,845
    60            68.8    50,001         70.0    30,473
    80            93.8    50,001         90.6    40,814
   100            97.5    50,000         97.1    50,000

NORMAL             N (row sample)         Y (block sample)
   Pct            secs      NDV          secs      NDV
     1             3.7    24,551          2.3    25,248
     2             5.3    25,301          4.4    25,253
     4             8.3    26,624          7.4    26,939
     6            10.9    27,581          9.7    27,765
     8            13.8    28,753         12.9    28,993
    10            17.4    29,905         16.2    29,980
    20            27.2    33,711         26.6    33,833
    40            47.2    37,565         45.9    37,650
    60            68.8    39,652         70.0    39,644
    80            93.8    41,055         90.6    41,083
   100            97.5    42,114         97.1    42,114

SCATTERED          N (row sample)         Y (block sample)
   Pct            secs      NDV          secs      NDV
     1             3.7    50,918          2.3    47,236
     2             5.3    50,456          4.4    44,017
     4             8.3    49,839          7.4    47,745
     6            10.9    49,865          9.7    51,248
     8            13.8    49,888         12.9    51,105
    10            17.4    49,961         16.2    47,144
    20            27.2    50,024         26.6    50,249
    40            47.2    49,999         45.9    50,001
    60            68.8    50,001         70.0    50,001
    80            93.8    50,001         90.6    50,001
   100            97.5    50,000         97.1    50,000

UNIFORM            N (row sample)         Y (block sample)
   Pct            secs      NDV          secs      NDV
     1             3.7    49,104          2.3    46,389
     2             5.3    47,515          4.4    49,023
     4             8.3    48,078          7.4    47,797
     6            10.9    48,637          9.7    48,522
     8            13.8    48,742         12.9    48,947
    10            17.4    49,001         16.2    48,955
    20            27.2    49,636         26.6    49,750
    40            47.2    49,988         45.9    49,979
    60            68.8    50,000         70.0    49,999
    80            93.8    50,001         90.6    50,001
   100            97.5    50,000         97.1    50,000

Accuracy of Estimation

For the Normal, Scattered and Uniform columns the estimates are pretty good — at least comparable to those gathered through row sampling. The Scattered estimates fall off a little at the low end of percentages.

The estimates for the Clustered column are not unexpectedly rather poor, even in comparison to the row sampling.

Performance

Performance across the range of estimate percentages looks pretty close between the block and the row sampling, but if you discount the higher percentages, for which there is little practical difference, you can see that block sampling completes in 62% of the elapsed time for row sampling at 1%, and 83% of the elapsed time at 2%.

Summary

  • For clustered data, block sampling appears to be impractically inaccurate.
  • The performance advantages of block sampling are mostly realised at very low sampling percentages — this further rules out its use against clustered data, but as Jonathan Lewis has shown, very low sample percentages need not be detrimental to the collection of good statistics.