Coincidentally while planning the downfall of the BETWEEN joins I had an email containing a relevant report SQL that had jumped from 10 minutes execution time to 31 minutes. A prime candidate for testing the theory …
And it works. Report time dropped to 40-50 seconds — depending on how you look at it, that’s an improvement of between 12 and 40 times. Not too shabby, though I say so myself.
One of the recent enhancements to my client’s system called for some rather unconventional financial reports to be available. They’re the sort of thing much beloved of accountants, where row one is the sum of rows two-through-eight minus rows nine, and row eight is the sum of rows 8a through 8g (except for locations beginning with the letter G) etc.. Some gentle probing showed that there were about five different attributes that had to be considered when deciding which rows of a fact table would contribute towards which rows in the report. Roughly speaking these were General Ledger Account (GLA), Customer Class, Source of Funds, Account Classification and Location.
Furthermore, this series of reports were not static but could be expected to change their definition month by month, so that added another new variable making six in all.
Now it’s possible that this could have been done through a Business Objects full client report — the interface is wonderfully maleable and all sorts of things are possible in it, but the Web Intelligence interface did not allow anything close to this flexibility, and we really needed to get this available in the WebI environment.
so that meant that something had to be done in the database.
A little further probing showed that each line in the report could generally be defined by a series of ranges of attributes. For example, line 8 might be defined as the sum for all general ledger accounts between X and Y at locations beginning with the letter G. When all these combinations had been identified the design became based around a hierarchy of tables that defined …
* The report: Name of report, subtitle of report etc
* The report line: Line number, Line display order, Line description etc.
* The report calculation: sets of ranges of attributes that contributed to the report line: Start date, end date, start GLA, end GLA, start location, end location etc.
This had the advantage that we could transparantly change the definition of a line calculation at any time, and have different calculations used in the same displayed report for each month if we (or the client) chose. The fact table was joined to the report calculation table with a series of joins such as …
fct.gla between calc.start_gla and calc.end_gla and
fct.month between calc.start_month and calc.end_month
.. and everyone was happy.
That is, they were happy until a third party put a rule on the firewall that disconnected http connections if they were idle for more than two minutes.
You see, although the method was pretty robust and flexible, it was a long way from being very fast. The reason for this is that when we started using BETWEEN joins against the fact table, we lost our ability to use hash joins. Now for those in the OLTP world the hash join is sometimes a bit, shall we say “maligned”. In fact it’s often disabled in such environments because it’s really only suited for joins involving at least one large data set and requiring many of those rows to be joined — typically this is between a fact table and a dimension table, and it’s a key part of a Oracle’s (patented, apparantly) star transformation. But hash joins only work where A = B, not where A > B, and certainly not where A between B and C.
And so the optimizer was using a nested lookup join between a multi-million row fact table and a few-hundred row sort-of dimension table, and this means S-L-O-W with a lot of single block logical i/o’s and a lot of CPU usage.
So a little while ago I thought of a reasonably elegant workaround for this problem. The solution is either …
1. To reduce the number of rows involved in the nested loop join, or
2. To make the join to the fact table an “A=B” type, thus allowing a hash join to be used.
The solution that I’ve come up with is that we will “materialize” the results of the join between the report line calculation table and the fact table. So we perform the same BETWEEN join as before, but between the report line calculation table and a “Select distinct gla, month … from fct” query (which I reckon to be around 10,000 rows instead of several millions). This will give us a result that we can equi-join to both the report line table and the fact table.
Does any of that make sense? It adds a new stage in to the ETL process in which we refresh the “materialized” join, but that’s not too bad because we can probably just consider the new data being added to the fact table with each load cycle.
My development system likes this a lot more, but I’ll have to send some scripts over to the production boys to get them to check up on it, just to CMA ;)