Linux 2.6 Kernel I/O Schedulers for Oracle Data Warehousing: Part II

Here are some preliminary test results for a single disk query.

The disk is a Seagate Cheetah ST318404LC: 18.37Gb, 10,000 rpm. The manufacturer's quoted read seek times are 5.2ms (average), 0.6ms (single track) and 10ms (maximum full-stroke seek).

A single tablespace was created on the disk, with a single table created and populated as in the following script.

create tablespace sandbox_one_disk
datafile '/opt/d1/sandbox__one_disk.dbf' size 4g
extent management local uniform size 256M
segment space management manual
/

create table t2
   (col1 varchar2(1000))
tablespace sandbox_one_disk
pctfree 99 pctused 0
noparallel
nologging
nocompress
nomonitoring
storage (minextents 8)
/
insert /*+ append */ into t2
select lpad(rownum,100)
from   dual
connect by level <= 250000
/

begin
   dbms_stats.gather_table_stats
      (ownname          => user,
       tabname          => 'T2',
       estimate_percent => 1,
       block_sample     => true,
       method_opt       => 'for all columns size 1');
end;
/
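
As a quick check, not part of the original script and assuming the query is run as the table owner, the space allocated to the segment can be confirmed with something like:

select segment_name, round(bytes/1024/1024) size_mb
from   user_segments
where  segment_name = 'T2'
/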

With pctfree 99 each 100-character row effectively gets a block to itself, so (assuming an 8Kb block size) this gave a table size of just less than 2Gb, which was queried as follows:

set timing on echo on heading off
select /*+ noparallel(t2) */ count(*) from t2;
select /*+ parallel(t2 2) */ count(*) from t2;
select /*+ parallel(t2 3) */ count(*) from t2;
select /*+ parallel(t2 4) */ count(*) from t2;
select /*+ parallel(t2 5) */ count(*) from t2;
select /*+ parallel(t2 6) */ count(*) from t2;
select /*+ parallel(t2 7) */ count(*) from t2;
select /*+ parallel(t2 8) */ count(*) from t2;
select /*+ parallel(t2 20) */ count(*) from t2;
select /*+ parallel(t2 40) */ count(*) from t2;
select /*+ parallel(t2 80) */ count(*) from t2;

The wall clock time for each query was noted and graphed.

Each test run was executed after rebooting the server and reconfiguring it to use a different i/o scheduler. The scheduler in use was verified using:

dmesg | grep scheduler
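
For reference, the scheduler can also be selected at boot time via the elevator= kernel parameter, and on 2.6 kernels that support per-device switching it can be inspected and changed through sysfs. This is only a sketch, with the device name (sdb) assumed:

# boot-time selection, appended to the kernel line, e.g. elevator=deadline
# (the anticipatory scheduler appears as "as" or "anticipatory" depending on the kernel)
cat /sys/block/sdb/queue/scheduler
# the scheduler currently in use is shown in square brackets
echo anticipatory > /sys/block/sdb/queue/scheduler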

All four available schedulers were tested with the following results:

In the above results a parallelism of "1" represents a non-parallel query, without direct serial reads being used.

I shall pause here for comments … :)


35 thoughts on "Linux 2.6 Kernel I/O Schedulers for Oracle Data Warehousing: Part II"

  1. Very interesting. With a simple benchmark you got some nice results.

    Obviously, the anticipatory I/O scheduler is the best for this kind of workload, as the theory would lead you to expect. It's just nice to see it all graphed and put together.

    Thanks.

  2. I should emphasise that this was just the simplest test I could come up with, not the be-all and end-all of the issue. It's an interesting result, but I also have some numbers for a PQ example spread over four (JBOD) disks, and I intend to try a couple of RAID examples (coarse- and fine-grained striping) to see how they pan out.

  3. 4 x Xeon 700s. L1 instruction cache 16K, L1 data cache 16K, L2 cache 1024K.

    Measurement by Mk.I Eyeball indicates that they were untroubled by the tests with any reasonable degree of parallelism. Up around silly DoPs like 30 they were showing in the order of usr 10%, sys 10%, io 60%, idle 20%.

  4. David, will this be Linux's own RAID code, or a "magic hardware" one? I'd suspect Linux's own will give better information to its own disk scheduler (but I didn't check the code to confirm it).

  5. The test system has an Adaptec aic7899 Ultra160 SCSI adapter, so I was going to configure RAID through that — unless there are compelling reasons to set up an alternative, of course. ;)

  6. I’m just wondering how likely it would be that you’d want to run lots of parallel slaves against a single disk and whether that’s really going to be better than just using striping? I may be missing something (I’m in the middle of a 150-page design document for a super-availability system so my brain is hurting) but aren’t you proving that attempting high-end tasks on low-end hardware can be improved using a different i/o scheduler, when there might be much better ways of approaching the problem?

    Maybe a better test would be to compare four of these tests in parallel against four different disks (i.e. manual physical placement) with four of these tests in parallel against a four-disk stripe set (i.e. scattered physical placement).

    I can see you’ve proved something interesting but wonder how useful it would prove on a real world DW.

  7. I’d say that it’s not at all likely — or very unlikely. This is just a demonstration of the most simple case I could think of as an aid to understanding later JBOD and RAID tests.

    I should point out that this is a pretty fast disk in terms of seek speed — I’d like to run this test with a more commodity-oriented disk as I think it would show a more clear advantage, but I don’t know that I have the means to do so.

  8. I’d like to run this test with a more commodity-oriented disk as I think it would show a more clear advantage

    I think you’ve already shown a pretty clear advantage for this type of activity for a single job on a single disk. I just think the gap will be smaller when you add users accessing different data concurrently and spread the data across the 4 disks, i.e. something more similar to what’s practically useful on a corporate DW.

    All interesting stuff, though, much more so than wading through a vendor’s design documentation ;-)

  9. Yes, reduced benefit for multiple disks is also my expectation. I’m pretty sure that there will still be a significant difference; after all, head movement is head movement, and I’ve read that one of the significant advantages of the appliance DW vendors’ technologies is the ability to reduce head movement and keep disks closer to their theoretical bandwidth. How big the difference is remains to be seen, though.

    I’m wondering whether to include a mixed large/small reads test as well — it seems to me that anticipatory scheduling could be one reason for system statistics yielding MREADTIM < SREADTIM.

  10. I’ve read that one of the significant advantages of the appliance DW vendors’ technologies is the ability to reduce head movement and keep disks closer to their theoretical bandwidth

    Maybe that’s the issue here (which I was trying to get at above): that a *combination* of anticipatory scheduling and careful physical location of data could potentially give significant improvements, but it would only suit certain workloads. (Opinion warning …) Default Oracle and o/s behaviour is almost bound to be generic: less optimal, but more flexible.

  11. I’m wondering whether to include a mixed large/small reads test as well — it seems to me that anticipatory scheduling could be one reason for system statistics yielding MREADTIM < SREADTIM

    It would certainly be a cool idea to gather system stats at the same time and see the results.

    As for different large/small read tests, that may not be as useful as lots of large read tests (just like a warehouse), but for more widely separated data/partitions (just like a warehouse). One of the flaws I was conscious of in my PX tests was that the workload wasn’t mixed up enough.

  12. Isn’t the anticipatory scheduling algorithm based on the assumption that the i/o device only has one seek head, as is true of single disks? (That’s even documented in the kernel source code.) I would guess that as soon as you put in a hardware RAID solution you would probably see some reversed results.

  13. Doug,

    I have a feeling that the separation of data onto different devices, which may be what you mean by separated data/partitions, may be just another way of reducing disk head movement by avoiding contention for the same devices between PQ slaves. My feeling is that if I allow adequate separation of data onto multiple devices, by having four data files on four disks with an even spread of data between them, then the non-anticipatory schedulers will show better scalability with increased DoP up to the point that they start to contend for resources.

    I don’t think that that is necessarily representative of a real-world workload though. What I’m hoping is that the anticipatory scheduler will help reduce the inevitable contention where the PQ slaves cannot be targeted by Oracle to independent devices, thus “levelling the playing field” between separation and non-separation.

    Kristian,

    I think that’s a good point, and reflects my intent to test RAID configurations having both fine and coarse striping — fine striping would simulate a single head configuration and coarse would reflect the true multiple head configuration. It gets a mention in the Oracle docs here: http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14211/iodesign.htm#sthref707

  14. I’d better stop commenting to leave you testing time, but to clarify, what I meant was …

    Imagine you have 4 disks and each disk contains 1 month’s worth of data. Now you have 4 different queries running in different sessions, each of which is interested in a different month’s data and using PX. That would be the situation when the improvement you’ve described might work best (although it’s unlikely to be that common). Once you have a normal mix of users doing a variety of things, some using PX or not, across different data and the same data then I think the improvement will just become less and less.

  15. If those four users were not using parallel query then they’d get better performance, I’d think — as soon as each disk suffers simultaneous read requests then performance will drop (illustrated by the sharp rise from “1” to “2” in the graph for the three other schedulers). Once each disk is receiving multiple requests, I’m not sure that it makes much difference which user is querying which disk (except that caching would more likely come into play if different users are requesting the same data from the same disk). The effect is just the same.

    My expectation is that tests will demonstrate that the AS will not show a dramatic fall-off in performance when moving from one query per disk to multiple queries per disk. Well, we’ll see. I’m thinking of upgrading the database on the test machine to 10gR2 to get better trace data before going too far. Sound sensible?

  16. If those four users were not using parallel query then they’d get better performance, I’d think — as soon as each disk suffers simultaneous read requests then performance will drop (illustrated by the sharp rise from “1” to “2” in the graph for the three other schedulers).

    Yes, I see what you mean more clearly now. In fact the small rise from 1 to 2 with anticipatory scheduling is probably even more relevant.

    However, I still don’t think it’s cut and dried, particularly if you start performing sorts or have multiple users all hitting the same month’s data – different types of queries kicking off at different times. The thing is that there’s usually CPU and memory to spare, so if you’ve made the disk i/o as efficient as possible, there could be other areas that benefit from the parallelism.

    The gist of most of my comments here is to make the test more applicable to a realistic end result. Your mention of DW appliances also made me think about the problem in a different way.

    I’ll be really interested in how it goes when you stripe. Personally I have a strong feeling that the results are going to be much less useful once you muddy the waters.

    It’s also a shame that, like me, you simply don’t have enough i/o resource to perform even more interesting tests. I just have to save up for a SAN. For a *very* long time ;-)

  17. Have a snooze and miss all this!
    I’ve read that one of the significant advantages of the appliance DW vendors’ technologies is the ability to reduce head movement and keep disks closer to their theoretical bandwidth
    For things such as Netezza the speed comes from two factors: hashing the data across as many volumes as possible, and pushing the predicate processing back to the disk, i.e. only the rows that match actually get onto the IO channel to the database CPU, which maximises the IO rate. Other vendors encrypt/tokenise and again make maximum use of the scarce IO resource of the disk controller (which I suspect in a multi-disk system is more of a limitation than the speed of a single disk).

    It would be interesting to see what happens on ASM managed file systems, maybe in an idle moment?

  18. Doug,

    With regard to sorting, I’m wondering whether AS would help with that or not because of the mix of reads and writes. I’ll have to look back to Jonathan Lewis’ sessions from Hotsos 2006 to see if he mentioned anything about the types of reads and writes — I’m pretty sure that the reads are large but AS might favour reads at the expense of writes.

    I posted a question on a Linux forum and found that the scheduler can be set at the device level very easily, and I know that there are AS tuning parameters also available for such things as the anticipation time, the expiry time and whatnot. I might fiddle with that as well (there’s a rough sketch of what I mean at the end of this comment). Maybe I’ll have a look on eBay for a second-hand SAN :D

    Pete,

    It’s interesting stuff, pushing the predicate down to the hardware level. I’d argue that with detailed and well-thought-out partitioning Oracle can identify a block of records that are all, or very nearly all, of interest (“Retail sales for location 27 in January”, for example), and if so that would negate the advantage of hardware-level row filtering, as all rows would be of interest. Not all columns though — I wonder if Netezza’s technology just sends the required columns from disk as well? But with a combination of detailed partitioning and read rates sustainable near the theoretical maximum for the devices, Oracle ought to be able to give the appliances a pretty good run for their money.

    I keep meaning to try and find out more about how Oracle Corp plans and runs its benchmarking for TPC-H and whatnot — that ought to be of interest. I wonder if they have any tricks up their sleeves (i.e. buried in the full disclosure report) that we don’t know about? They tend to use SANs, I suppose.
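
    To illustrate the device-level switching and AS tunables I mean: this is only a rough sketch, assuming the test disk is sdb and a kernel that exposes the per-device iosched directory, with the parameter names as I recall them from the as-iosched documentation and the value purely illustrative.

    echo anticipatory > /sys/block/sdb/queue/scheduler
    ls /sys/block/sdb/queue/iosched/
    # antic_expire  read_expire  write_expire  read_batch_expire  write_batch_expire
    echo 10 > /sys/block/sdb/queue/iosched/antic_expire   # lengthen the anticipation window (illustrative value)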

  19. Pingback: DW Appliances « Pete-s random notes

  20. Kristian raised a very good point. And David counter-pointed with some very interesting info in oracle’s own docs.

    I’d argue that given a sufficiently large and reasonably random sample of io requests, most OS i/o schedulers will essentially behave the same, i.e. they level out in relative performance once the request load increases past a certain level.

    And that stands to reason: no scheduler, no matter how sophisticated, is going to make a given disk run faster or pump data through at a faster rate. They are all ultimately limited by the underlying hardware.

    As an aside, due to what is commonly called “thrashing” it’s quite possible that in extreme load situations many schedulers will hit a wall and actually become a bottleneck. I’ve had that experience with the io-elevators in 2.4: in a certain busy system I actually got better results by reducing the io-elevator queue “inspection” times to 0 – thereby effectively turning them off! It then turned out I had also killed the performance for non-extreme situations where the io-elevators would have been useful. The sweet spot setting turned out to be slightly above 0, much less than the OS default. It delayed the onset of “thrashing” and still allowed some optimisation to take place at lesser load levels. (There’s a sketch of the tuning interface I mean at the end of this comment.)

    It’s how much load it takes to get to the limits and how well behaved schedulers are OUTSIDE of the peak load zone that becomes interesting, IME. At the moment, I’d say the cfq is not as good as the anticipatory scheduler in these boundary areas, with database workloads.

    It might well be that in a file server load for example – well defined, compartmented data – the situation would be reversed. Hence the choice that has been made by the kernel folks to go with the cfq as default.

    The cfq will get priority-based scheduling soon and that will introduce another big variable. My feeling at this stage is that the cfq shows excellent potential that might indeed be reached with the io priorities. Until then it might be a worthy alternative to use the anticipatory scheduler for most database loads. But like in anything: testing and accurate measuring is the only way to reach a good balance.

    At the limits, all io schedulers will behave essentially in a similar fashion. Particularly when they “thrash”: that’s real bad and it happens quite often with 2.4 kernels.
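
    For anyone repeating this on 2.4: the elevator latencies referred to above are the ones adjusted with elvtune. I believe the interface is along these lines, with the device name and values purely illustrative:

    elvtune /dev/sda                 # report the current read/write latency settings
    elvtune -r 128 -w 256 /dev/sda   # set the read and write latencies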

  21. Good thoughts, Noons, but I’ll take up the point of disk thrashing as that is specifically what the AS is designed to reduce under particular circumstances.

    When the disks are thrashing (what I’ve seen described rather picturesquely as a “seek storm”) with small random i/o’s there’s really no saving the situation. Maybe some reads can be queued and re-ordered to take advantage of shorter stroke times from A -> B -> C rather than going the long way round from A -> C -> B, but the total seek time can be reduced only through tackling the average seek time rather than the number of seeks.

    With sequential reads of contiguous disk space the actual seek count itself can be reduced — maybe also the average wait by re-ordering the reads, but the higher time that deprioritised reads (“C” in the example above) spend waiting would probably be seen as too disadvantageous to implement.

    So with that in mind the AS ought to be able to support a higher aggregate disk read rate in such circumstances — what must be borne in mind is that the sauce for this goose would definitely not be sauce for the random-read gander. That means that control files and OLTP data files and whatnot would probably suffer under AS and would do much better under one of the other three.

    I have to say that a DW simulation must be a lot easier to code than an OLTP one — I’ll leave the CFQ/NOOP/Deadline compo on OLTP to someone else thanks :)

  22. With sequential reads of contiguous disk space the actual seek count itself can be reduced — maybe also the average wait by re-ordering the reads, but the higher time that deprioritised reads (“C” in the example above) spend waiting would probably be seen as too disadvantageous to implement.

    Very much so. In fact, I think “C” in your case above is exactly what happens for example in DSS database workloads on SANs where it becomes the odd “read index block” while “A,B” is the usual range scan, hopefully clustered at physical level.

    Seen many cases of just that: all of a sudden indexed reads become the major source of read waits and yet the disk load factor is nowhere near 100%. It’s then that I zero in on the scheduler elevator tuning and reduce the re-ordering. “C”-type reads – index blocks mostly – usually pick up as a result with no major detriment to “A,B”.

    Dang, this blog almost made me want to go back to the old kernel 2.4 job! :-)

  23. >> Dang!

    Always happy to help!

    Here are some additional nuggets to digest, although I’d file them under “Important if true” rather than properly demonstrated …

    I enabled async and direct I/O, relinking Oracle and checking that I/O was asynchronous using …
    cat /proc/slabinfo | grep kio
    (look at me, just like a proper DBA). Performance went down the can on the anticipatory scheduler: the linearity with increasing DoP was lost. (The init.ora parameters involved are sketched at the end of this comment.)

    DOP Query Time
    “1” 65
    2 101
    3 110

    I disabled direct/async and turned off device read ahead with …
    blockdev --setra 0 /dev/sdb1
    (look at me, just like a proper sys admin)
    Previously it was 256, meaning a readahead size of 128kb (the value is in 512-byte sectors).

    DOP Query Time
    “1” 53
    2 60
    3 64
    4 88
    5 124

    Better performance with low contention, worse with higher contention.

    Then I bumped the readahead way up to 2Mb, just for kicks.
    blockdev --setra 4096 /dev/sdb1

    DOP Query Time
    “1” 61
    2 61
    3 64
    4 92
    5 90
    6 89
    7 97
    8 102

    This last set of numbers intrigues me. I wonder if the duration of the reads is long enough with DoP=4 that the AS won’t wait for another to be submitted, and will scamper off to perform a read on behalf of another process?
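
    For completeness, the usual init.ora parameters behind that sort of async/direct I/O change are the ones below; this is a sketch of the general approach rather than an exact record of what I set:

    # init.ora / spfile parameters commonly used for this on Linux
    disk_asynch_io       = true
    filesystemio_options = setall    # setall = direct and asynchronous I/O; other values: none, directIO, asynch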

  24. This last set of numbers intrigues me. I wonder if the duration of the reads is long enough with DoP=4 that the AS won’t wait for another to be submitted, and will scamper off to perform a read on behalf of another process?

    I’d suggest that it might be worth looking at some raw trace files too. They won’t tell you precisely what the o/s is doing, but they might reveal more information than just looking at absolute response times.

    That’s another thing I learnt while playing around with the PX stuff. Jeff Moss also ran the tests and there were times when his results didn’t make sense until I looked at some trace files.

  25. I had another thought about the variation of performance with readahead size — the ST318404LC disk has an internal buffer of 4Mb, so a large readahead that gets buffered on disk is going to be wasted if there are enough processes reading the disk and the scheduler doesn’t wait long enough for the same process to request that additional data. Not that it explains the 0 read ahead test though :(

  26. ah yes: if you want to play around with drive settings and cache, “man hdparm” in Looneeks will be your friend! ;-)
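
    For example, something along these lines (hdparm is mostly IDE/ATA-oriented, so not everything applies to a SCSI Cheetah, and the device name is illustrative):

    hdparm -a /dev/hda      # show the read-ahead setting
    hdparm -A0 /dev/hda     # disable the drive's own read-lookahead
    hdparm -W0 /dev/hda     # disable the drive's write cache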

  27. Pingback: Rittman Mead Consulting » Blog Archive » DW Appliances

  28. Pingback: Check or Change I/O Scheduler (elevator) - herbertm.ca
