Comments (the type you get in code, that is)

“Supposed” to get, that is. What is it with SQL and comments? Why do we so rarely see explanatory comments in SQL code? Or even a decent structure?

One of the features of front-end tools like Business Objects or Microstrategy is their ability to let the designer create extraordinarily long and complex-looking SQL with ease. (If you don't use such tools then don't go away: I'll deal with you in a moment.) It starts with an object or two that are pretty simple, then you add them together … then you divide by another two objects added together, but then you need a Decode() on the divisor to change zeros to nulls so that the SQL doesn't error out, and pretty soon you've got an expression a thousand characters long that is unreadable to the human eye.

Of course it's not just extremely long expressions that are tricky … sometimes it's tricky just to look at a piece of SQL and work out which simple expression relates to which object in the universe/project/whatever. So do me a favour and put a comment block in the definitions … something like …

Sum(order_amt)/Decode(Sum(order_qty),0,Null,Sum(order_qty)) /* “Average Unit Cost” */

… and it’ll appear in the SQL itself. Not too tricky, eh?

You know, I do have a method for doing this automatically by hacking a Business Objects repository with an update statement, but if I told you how to do that and you fkd it up then you’d probably sue me or something.

By the way, the same thing applies to join definitions.
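
Something like this, for instance (table aliases invented, obviously):

ord.customer_id = cus.customer_id /* "Orders to Customers" */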

Moving along … if you're hand-coding SQL statements then you have even less excuse for not commenting your SQL. If you follow the principle that it is better to do things in SQL than in PL/SQL then you're probably packing a fair amount of functionality into each statement, and that means …

  • Write a header to say what the hell the SQL is supposed to do … it doesn't have to be in the SQL itself; just above it will do nicely. In fact, work out this statement even before you write the SQL itself and you'll keep a clearer mind when you write the code. Include in that header how it is going to do it; remember that complex SQL packs a huge amount of functionality into a small amount of text, making it much more "logically dense" than regular 3GL code. TK has an example of this here, and thanks for the suggestion to include this, Tom.
  • Maybe you might even like to put in a comment about performance, eh? “Runs in about 0.2 seconds” … “Uses a lot of hash joins” … etc
  • Put a comment inside the SQL to say where in the code it is located (package/procedure). You wouldn't inflict a twenty-page procedure on anyone, so it'll be easy to find.
  • You might even like to put in a unique identifier for the SQL as well as its location (for example, so that it's easy to reference the copy of a "good" explain plan that you've kept for that query, in case of problems down the road).
  • Format the code nicely (there's a sketch of all of this after the list).
  • NEW ITEM: if you are referencing any view more complex than “SELECT blah from whatever”, or if any of the tables are accessed remotely through a db link that is obscured by a synonym, or is actually the query name for a subquery factoring clause, then a comment would be appropriate. And when I say “appropriate” I mean just do it.
  • Clearly separate the joins from the filters.
  • For any tricky-looking expressions, state what they do. Especially analytic functions and subquery factors ("what they?").
  • Document anything that looks funky (if you are using "… and rownum >= 1" in an in-line view to stop the optimizer from pushing predicates then say so … not everyone is as smart as you are).
  • (Except Tom Kyte (“who he?“), who tells me that he’s perfect. My opinion: man, has he aged since his photo in “Expert One On One”. Am I right, ladies?).
  • Erm …
  • For the love of God, use table aliases. Friendly ones, not a, b, and c.
  • Just use your common sense. People have to support that code
  • That’s it
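
And since I've just told everyone else to lead by example, here's a rough sketch of the kind of thing I mean. The tables (ORDERS, CUSTOMERS), the package name, the query id, and the timings are all invented for illustration, so adjust to taste:

/*
   SALES_RPT_PKG.LOAD_MONTHLY_SUMMARY : query id SALES_SUM_01
   What : summarises completed orders by customer and month, and tags
          each customer's highest-value month.
   How  : subquery factor picks out completed orders, then a group-by
          with an analytic Rank() over the aggregated amounts.
   Performance : full scan of ORDERS, hash join to CUSTOMERS; runs in
                 around 30 seconds on a full month (well, it did once).
*/
With completed_orders As
(
   Select ord.customer_id,
          Trunc(ord.order_date,'MM')   order_month,
          ord.order_amt
   From   orders ord
   Where  ord.order_status = 'COMPLETE'   /* completed orders only */
)
Select cus.customer_name,
       co.order_month,
       Sum(co.order_amt)                                  month_amt,
       /* rank 1 = the customer's best month by value */
       Rank() Over (Partition By cus.customer_name
                    Order By Sum(co.order_amt) Desc)      best_month_rank
From   completed_orders co,
       customers        cus
Where  co.customer_id      = cus.customer_id   /* join   */
And    cus.customer_region = 'EMEA'            /* filter */
Group By
       cus.customer_name,
       co.order_month;

Twenty seconds of typing, and whoever inherits it doesn't have to reverse-engineer your intentions from a GROUP BY clause.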

Back to work, fellas.

BS ALERT: “Temporary Tablespaces Like Large Blocks”

The myth is still alive …

http://dba.ipbhost.com/index.php?showtopic=1512&st=15

"To answer your questions, in general, DSS and OLAP databases (those characterized with lot's of full scans) might see a reduction in consistent gets with larger blocksizes. I have some client's use a db_cache_size=32k so that TEMP gets a large blocksize, and then define smaller buffers to hold tables that experience random small-row fetches."

If the clients are using a default block size of 32KB to do that, Don, I hope it wasn't based on your advice, because the block size of a temporary tablespace is irrelevant to the size of the i/o it uses: that is governed by a hidden parameter. There is no logical i/o involved; it is 100% physical.
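
If you want to go looking for the relevant parameters yourself, the usual X$ query will list the hidden ones. Run it as SYS, and note that the filter on the description is just my guess at what's interesting here:

select a.ksppinm    parameter_name,
       b.ksppstvl   parameter_value,
       a.ksppdesc   description
from   x$ksppi  a,
       x$ksppcv b
where  a.indx = b.indx
and    a.ksppinm like '\_%' escape '\'
and    (lower(a.ksppdesc) like '%sort%' or lower(a.ksppdesc) like '%direct%')
order  by a.ksppinm;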

Better get your excuses lined up for when they find out that this is a Big Oracle Myth, Don. Not only are you giving bad advice to people for free on your forum, you're actually charging your clients for advice that will make their database more complex to create, more difficult to manage, probably more prone to bugs, and with no performance advantages whatsoever.

Mike Ault has excuses ready, by the way … he's just standing by a friend. http://www.blogger.com/comment.g?blogID=11462313&postID=111369118788103463

Comments welcome. As always.

Update on Histograms

OK, that was a bad example. Here's a new approach.

When the Oracle documentation talks about skewed values it is referring to particular values in a column occurring with significantly different frequency than other values. A classic example of this is 95% "M" and 5% "F".

However, let me propose that what this misses is the consideration of situations like the following: 95% A, 5% B, 0% C, 0% D … i.e. there are values that do not occur in the column but which the optimizer might believe do occur, or at least might occur.

Let's take a slightly more "real world" example. Say we have a wholesale transactions table that is range partitioned on "Date Of Transaction" at a monthly level, so that there are approximately 30 days in each partition. Now the key to this example is that no sales occur on Saturday or Sunday, so in fact there are around twenty distinct values in each partition. Each value in our example has the same number of rows, and the distinct values have a pretty even distribution.

Now even if the data does not possess the classical type of skew, there is still a kind of skew in there that is hidden from the optimizer, because the Saturday and Sunday dates have 0% representation in the table. The key is to let the optimizer know about these gaps, which it cannot infer from plain low value/high value/number-of-distinct-values column statistics.

Example script:

SQL> drop table demo_hist;

Table dropped.

SQL>
SQL> create table demo_hist
  2  as
  3  select my_date,
  4         num
  5  From   (
  6         Select To_Date('01-aug-2005') +
  7                Mod(rownum,
  8                    Add_Months(To_Date('01-Aug-2005'),1) - To_Date('01-Aug-2005'))
  9                my_date,
 10                rownum num
 11         From   dual
 12         Connect By 1=1 and level <= 100000
 13         )
 14  where  To_Char(my_date,'DY') not in ('SAT','SUN')
 15  /

Table created.

SQL> commit;

Commit complete.

SQL>
SQL> set autotrace on
SQL>
SQL> select count(*) from demo_hist
2 where my_date = to_date('01-aug-2005');

  COUNT(*)
----------
      3225

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=44 Card=1 Bytes=9)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=44 Card=2943 Bytes=26487)

Statistics
----------------------------------------------------------
         48  recursive calls
          0  db block gets
        251  consistent gets
        184  physical reads
          0  redo size
        394  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> select count(*) from demo_hist
2 where my_date between to_date('06-aug-2005') and to_date('09-aug-2005');

  COUNT(*)
----------
      6452

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=44 Card=1 Bytes=9)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=44 Card=5830 Bytes=52470)

Statistics
----------------------------------------------------------
          6  recursive calls
          0  db block gets
        247  consistent gets
          0  physical reads
          0  redo size
        394  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL>
SQL> begin
2 dbms_stats.gather_table_stats
3 (
4 ownname => user,
5 tabname => 'DEMO_HIST',
6 estimate_percent => 100,
7 method_opt => 'for columns NUM size 254'
8 );
9 end;
10 /

PL/SQL procedure successfully completed.

SQL>
SQL> select count(*) from demo_hist
2 where my_date = to_date('01-aug-2005');

  COUNT(*)
----------
      3225

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=9)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card=742 Bytes=6678)

Statistics
----------------------------------------------------------
         42  recursive calls
          0  db block gets
        192  consistent gets
          0  physical reads
          0  redo size
        394  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> select count(*) from demo_hist
2 where my_date between to_date('06-aug-2005') and to_date('09-aug-2005');

  COUNT(*)
----------
      6452

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=9)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card=185 Bytes=1665)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        188  consistent gets
          0  physical reads
          0  redo size
        394  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> begin
2 dbms_stats.gather_table_stats
3 (
4 ownname => user,
5 tabname => 'DEMO_HIST',
6 estimate_percent => 100,
7 method_opt => 'for columns MY_DATE size 254'
8 );
9 end;
10 /

PL/SQL procedure successfully completed.

SQL>
SQL> select count(*) from demo_hist
2 where my_date = to_date('01-aug-2005');

  COUNT(*)
----------
      3225

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=8)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card=3225 Bytes=25800)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        188  consistent gets
          0  physical reads
          0  redo size
        394  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> select count(*) from demo_hist
2 where my_date between to_date('06-aug-2005') and to_date('09-aug-2005');

  COUNT(*)
----------
      6452

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=8)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card=6452 Bytes=51616)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        188  consistent gets
          0  physical reads
          0  redo size
        394  bytes sent via SQL*Net to client
        508  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Summary …

Predicate                        =    Between
------------------------    ------   --------
Correct Answer                3,225      6,452
No Stats Estimate             2,943      5,830
Col Stats Estimate              742        185
Histogram Stats Estimate      3,225      6,452

Obviously there's more to this … the histogram-based estimate was so good because the number of distinct values was low, so every value got its own bucket. A histogram is capped at 254 buckets, so you'd expect the quality of the estimate to degrade as the number of distinct values climbs past that point and individual values stop being individually represented.

Thoughts?

Using Histograms on Evenly Distributed Values

A quick demo … here's a script …

drop table demo_hist
/

create table demo_hist
as
select floor(rownum/5) rn,do.* from dba_objects do
/

begin
dbms_stats.gather_table_stats
(
ownname => user,
tabname => 'DEMO_HIST',
estimate_percent => 100,
method_opt => 'for columns OWNER size 254'
);
end;
/

select distinct column_name from user_tab_histograms
where table_name = 'DEMO_HIST'
/

Select min(rn),
max(rn),
count(distinct rn),
count(*)
from demo_hist
/

set autotrace on

select count(*) from demo_hist
where rn = 5
/

select count(*) from demo_hist
where rn between 1 and 100
/

begin
dbms_stats.gather_table_stats
(
ownname => user,
tabname => 'DEMO_HIST',
estimate_percent => 100,
method_opt => 'for columns RN size 254'
);
end;
/

select count(*) from demo_hist
where rn = 5
/

select count(*) from demo_hist
where rn between 1 and 100
/

When I run this I get a demonstration that, with a histogram on the RN column, the optimizer gets a much better cardinality estimate for the two Count(*) queries, on RN = 5 and on RN between 1 and 100.

But I'm not going to show you by how much because you need to go and run this script yourself ;)

As for whether the results are significant in terms of getting a good execution plan, well, this test doesn't demonstrate that. But I believe that it does matter. More of that damned instinct.

OK, if someone begs me then I'll post my results. But it'll have to be a real grovelling.

A DBMS_STATS Oddity

As part of our load procedure we have various packaged Oracle procedures called by Informatica, in order to perform such tasks as:

  • Analyzing objects
  • Moving tables
  • Rebuilding indexes

When a patch from 9.2.0.5 to 9.2.0.6 was installed, something deep within the bowels of the package went *twang* and the jobs that run the analyze statements started to fail. Or in other words, stopped to work. Or … never mind. It broke.

The problem was a Method_Opt clause of "FOR TABLE". What that? Well, it's not in the documentation, but in 9.2.0.5 and 10.1.0.2 it appears to be equivalent to "FOR ALL COLUMNS SIZE AUTO". I'm not sure why we're using it, but I'm sure that it seemed like a good idea at the time. Anyway, it's outta here, and while I'm sniffing around the code I'm going to be running some 10046 traces to see what's really going on with DBMS_STATS and some of its options.
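
If you feel like playing along, wrapping a 10046 trace around a gather is about as simple as it gets. The table name below is made up, and the method_opt is just the documented setting that "FOR TABLE" appears to correspond to:

alter session set events '10046 trace name context forever, level 8';

begin
   dbms_stats.gather_table_stats
   (
      ownname          => user,
      tabname          => 'SOME_FACT_TABLE',           -- invented name
      estimate_percent => 100,
      method_opt       => 'FOR ALL COLUMNS SIZE AUTO'  -- instead of the undocumented FOR TABLE
   );
end;
/

alter session set events '10046 trace name context off';

The trace file lands in user_dump_dest, recursive SQL, wait events and all.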

For one thing I want to get rid of collecting histograms only for SKEWONLY columns, and just go with all of them. Small-scale tests suggest that the optimizer gets much better statistics to work with when all indexed columns on the fact table get histograms, and I suspect that the amount of work involved in detecting column value skew is not much different from that involved in collecting the histograms themselves.
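
Incidentally, a quick way of seeing which columns SKEWONLY actually deigned to build histograms on is to count endpoints per column. Columns with nothing but basic stats show up with just the two min/max endpoints, hence the HAVING clause; table name invented again:

select column_name,
       count(*)  endpoints
from   user_tab_histograms
where  table_name = 'SOME_FACT_TABLE'     -- invented name
group  by column_name
having count(*) > 2
order  by column_name;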

Another issue is parallelism. Since a parallelised full table scan (for reasonably large tables, anyway) uses direct i/o, there is no caching of the table's blocks. That's fine for tables large enough to flood the block buffers, because physical reads would just keep occurring anyway, but where a table is around three-quarters the size of the block buffers I'm going to experiment with non-parallel statistics gathering, in the hope that this will reduce physical i/o during "the gathering".
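
The experiment itself is nothing more exotic than forcing the degree, something like this (table name invented; the point is just that a serial full scan goes through the buffer cache instead of using direct i/o):

begin
   dbms_stats.gather_table_stats
   (
      ownname          => user,
      tabname          => 'MEDIUM_FACT_TABLE',   -- invented: roughly 3/4 the size of the cache
      estimate_percent => 100,
      degree           => 1                      -- serial scan, buffered reads
   );
end;
/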

Actually I've been doing a similar thing already, by setting parallelism on tables according to what proportion of the block buffers their total size (or in some cases their average partition size) represents. Tweaking the DBMS_STATS parallelism just extends that principle to reflect the multiple back-to-back reads of the analysed segment. Maybe that's another posting, though.
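
For what it's worth, the raw numbers behind that kind of rule are easy to get at. This is only a sketch: it assumes the cache is sized through db_cache_size, and the thresholds and degrees in the Case are plucked out of the air just to show the shape of the thing:

select s.segment_name,
       Round(s.bytes/1048576)        segment_mb,
       Round(p.value/1048576)        cache_mb,
       Case
          When s.bytes > 2 * p.value    Then 8   -- big enough to swamp the cache anyway
          When s.bytes > 0.75 * p.value Then 4
          Else 1                                 -- small enough to be worth caching
       End                           suggested_degree
from   user_segments s,
       v$parameter   p
where  p.name = 'db_cache_size'
and    s.segment_type like 'TABLE%'
order  by s.bytes desc;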

Anyhoo, more later.

Prime locations for sponging

AskTom is a great site that I dip into pretty regularly. Say what you like, but you’ll never find Tom Kyte without an opinion or an illustrative example on how to do things the right way. Except when he doesn’t have an opinion or experience, but he’s refreshingly open about that also. His examples are clear and concise (and your own had better be also), and it’s a rare day that you can’t learn something new.

Jonathan Lewis’ site is not so frequently updated, but hosts the excellent Oracle Users Cooperative FAQ, a great source of information.

Howard Rogers recently started a forum at his Dizwell Informatics site that is picking up pace nicely, and blogs his own thoughts and experiences also.

You’ll sometimes find me hanging around the forums at DBAsupport.com, answering more than asking, I guess. The Obfuscation Unlimited forum is usually lively, though less so now that the election is over ;)

The Oracle Technology Network has its own forums, which I have mixed feelings about. They're busy, but the interface is not my favourite, nor is the general style and content of the questions. However, if you're an Oracle professional and you don't have an account at OTN then you're missing out on the best source of information available: the documentation site. Get that on your bookmarks, because 90% of the questions that you see on almost any forum can be answered with a one-minute search of the relevant documents. The Concepts Guide, which is essential reading, runs to 732 pages in PDF format, so that gives you a feel for the level of detail available. At the very least it should be browsed so that you know that the information exists, even if you can't absorb it all.