Linux 2.6 Kernel I/O Schedulers for Oracle Data Warehousing: Part I

Quite some time ago I whined about the issue of getting excessive head movement on long sequential reads of the type that data warehouses commonly use. The problem, to paraphrase my earlier article, was that during a read of a large amount of contiguous data the operating system/RDBMS would be too amenable to satisfying other read requests from the same disk, hence incurring head movement too frequently.

This issue popped back into my head after being directed through Log Buffer #11 at Mark Rittman’s site to an article by Curt Monash titled “Is data warehousing all about sequential access?” and which matched my thoughts very well.

I wondered whether the miracles of open source software had anything to offer that might help so I asked the reasonably eclectic crowd at Joel On Software for help — after all, even Google doesn’t help when you don’t know what question you’re asking!

So now that I knew what question to ask it was relatively simple to look for information on Linux 2.6 i/o scheduling, and I found some very interesting resources (IMO).

The 2.6 kernel introduced four types of i/o scheduling.

  • Completely Fair Queueing
  • Deadline
  • NOOP
  • Anticipatory

Have a read about them here. Now also in that article it makes the following unsupported assertion: “.. the Deadline scheduler out-performed CFQ for large sequential read-mostly DSS queries”. That made me a little suspicious, and here’s why.

When Oracle is performing a full table scan using parallel query it is continually issuing read requests of around 1Mb (for example) for a large set of blocks that are contiguous. Hence there ought to be little or no latency due to disk head movement. When another parallel query slave, possibly for the very same query as the first, is also trying to retrieve a large set of contiguous data the danger is that the disk head will continually be flicking around between the two processes, incurring latency each time it does so. The most efficient scheduling method would therefore appear to me to be one that allows the second process to wait while satisfying more requests from the first process, thus reducing the disk head movement and increasing the rate of blocks read from disk.

With that in mind, consider this description of the deadline scheduler: “The deadline scheduler implements request merging … and in addition tries to prevent request starvation (which can be caused by the elevator algorithm) by introducing a deadline for each request. In addition, it prefers readers. Linux does not delay write processes until their data are really written to the disk, because it caches the data, while readers have to wait until their data are available.” Doesn’t sound to me like to would help.

Here is the description of the anticipatory scheduler from the same reference and with my own emphasis: “The anticipatory scheduler implements request merging … and in addition optimizes for physical disks by avoiding too many head movements. It tries to solve situations where many write requests are interrupted by a few read requests. After a read operation it waits for a certain time frame for another read and doesn’t switch back immediately to write requests. This scheduler is not intended to use for storage servers!“.

The anticipatory scheduler (OK, or “elevator”) introduces a very small delay, in the order of one millisecond, following the fulfilment of a read request to see if the same process is going to submit a request for data contiguous with the previous one. If it does then the scheduler will satisfy that request before considering others, thus saving head movement. And the best news is that the delay and other control parameters are configurable at the device level, giving Obsessive Compulsive Tuners some more factors to worry about.

So, how to test this?

I came up with the following plan. For each of the four scheduling options I would measure the performance of a single query that selects a lot of contiguous data via full table scan, and vary the degree of parallelism to see how the query time (and hence the rate of disk reads) varied. The query would be a simple SELECT COUNT(*) to minimise the possibility of the CPUs becoming a choke point for performance.

I wheeled out my handy Poweredge 6400 with four Xeon 700’s, 1Gb of RAM, and four Ultra160 disks of around 16Gb each on an Adaptec AIC7XXX card, and on which is conveniently running the 2.6.9-34.ELsmp kernel and Oracle EE with Partitioning.

More later ….


14 thoughts on “Linux 2.6 Kernel I/O Schedulers for Oracle Data Warehousing: Part I

  1. Ah! Let me pick on this:

    ” of a single query that selects a lot of contiguous data via full table scan”

    Yes. Defining “contiguous” in the oracle context, is the problem I’ve been finding. Yes, full table scans are supposed to start at one end of a table and finish at the other, smooth run all the while.

    Except when:

    1- there is parallel reading going on and Oracle picks a non-documented, little understood pattern to allocate blocks – pages – to each reader.

    2- there is more than one datafile in the tablespace: who knows in what pattern Oracle wrote the data in them? Like: if they were allocated at the same time and the same size, the blocks will be round-robin in the datafiles. Otherwise, they’ll be first in the first file allocated, then in the second. Then what happens if the dba sizes the first file again when the second file fills up? a few options and cominations right here for the reading process(es) to take care of.

    3- there is more than one freelist or freelist group in the tablespace or the tablespace is ASSM. What happens then to each parallel reader and how do they apportion the load?

    an so on. Without even touching into file-system IO vs raw IO and the immense range of possibilities and strategies for space allocation within each choice, before it even becomes populated by db data.

    You catch my drift: it’s only when we have absolute control of the allocation and its pattern that we can say for sure one method of IO scheduling will be superior to another. To the best of my knowledge, this is now incredibly difficult to do inside Oracle. Or DB2. Or any other relational database, for that matter.

    And it’ll get worse as Oracle bundles more and more “we don’t need no dbas” features into the kernel. To the point where it’s gonna be impossible to determine what pattern of IO is going to be better given any sample database. Other than by trial-and-error.

    This is where the CFQ and deadline schedulers will become relevant: they take a middle of the road approach and assume not much about prior organisation of data other than arrival of requests.

    Unfortunately, for general purpose databases I feel this will be the case in future.

    Far from optimal. And wide open for someone with a more lucid attitude to the needs of specials cases – like DW – to come up with a much better solution than Oracle will be capable of, with its one-size-fits-all mentality.

    but that’s just me…

  2. … and then there is ASM – loads of blocks all over the place. True for a single sequential read the heads do a merry dance, but in the “real DW world” of multiple queries and parallel whatnot does this slowing of one read matter when other process are being equally served at the same time?

    No, I don’t know the answer

  3. No offense Dave but how do you have time to wonder about where the diskheads are on every disk drive???? Where I work we are moving to a SAN, I won’t know how many disks make up a volume, much less where any given disk head is at any time. You should really get a hobby!!! ; )

  4. Noons …

    For general purpose/OLTP databases I think that your points are very valid, but don’t neglect the context — data warehousing, where jumping through hoops to achieve goals that make no sense in other worlds is practically a way of life. This is by no means a “one size gfits all” topic — everything here is qualified by “Data Warehousing only”

    The scan doesn’t have to be 100% contiguous through the entire segment, we just have to avoid the kind of excessive disk head movement that would harm performance. From the Oracle side that means allocating reasonably large extents — at least 32-64 Mb I’d think, although in practice we can easily go larger than that, into the Gb range for large data sets (the practical upper limit would be the least of the partition size and the ETL single load volume, divided by the number of devices over which they are spread, off the top of my head).

    We also want to avoid ASSM — in the data warehouse we almost certainly want to promote a physical row order to either promote efficient compression or to lower the clustering factor for some set of columns indexes. ASSM does not provide a solution to any problem we encounter in DW, so I would not use it. PCTFREE 0, PCTUSED 99, COMPRESS, and use direct path to load the data.

    I don’t think that parallel query is such a problem either — Oracle will attempt to allocate ranges to each PQ slave based on datafile if possible, and in any case will allocate ranges of rowid’s that make each slave’s target data set logically contiguous. Whatever mysterious algorithms govern the allocation for PQ slaves must surely allocate ranges with a view to logical contiguity (if that is a word) or the result would be terribly messy.

    Multiple data files ought also to be no problem. Large extents and allocating new extents in multiples of the number of data files appears to take care of that (on LMT), although I haven’t run a rigourous test on it. Maybe that’s next :D

    I don’t think that freelists, freelist groups and ASSM are relevant to PQ.

    Now it could well be that it’s difficult to just say, “Scheduler X is the best for databases” (in fact I’m sure it is), but if we characterise databases by access patterns (“small random access with reads and writes mixed”, “large sequential access with reads during the day and writes at night”) then there ought to be a theoretical best choice, and we ought to be able to run a test to demonstrate that. In the “real world” we also ought to be able to benchmark applications and look at mean response times over the course of a week ora month when running different schedulers — maybe our theory gets supported and shows us and average response time 30% lower with scheduler X than scheduler Y. The important thing I think is to be aware that there are these options and to understand what the choices imply in terms of how they interact with our applications’ access patterns.

    Here’s something else to consider — in 2.6.18 the default scheduler changes from anticipatory to CFQ What impact is that going to have on your application? Maybe it’ll be an improvement, maybe it’ll counteract benefits experienced elsewhere. Maybe when you upgrade through that boundary you want to start passing “elevator=as” to the kernel at boot time to prevent the scheduler change.

  5. Pete,

    If ASM scatters blocks around in a way that promotes excessive head movement, in comparison to the regular management, then I’d expect to see a performance issue. If there was a performance issue then that would be a black mark against its use. however, I don’t think it does (though I may be wrong).

    Does the issue of the slowing of one read matter when other process are being equally served at the same time? That’s a question at the very heart of i/o scheduling (and queueing theory, which is what alll this is about) — what is meant by “equally served”? If it means that the scheduler finishes with one read request and then immediately moves the disk heads to satisfy another, then moves the heads back to satisfy another request from the first process then that might seem equitable, but it’s exactly analagous to a single check-in handling multiple check-ins at the airport at the same time.

    Consider passenger A and passenger B, both waiting to be served. To check in each passenger takes five minutes, so passenger A is checked in in five minutes and passenger B waits for five minutes then gets checked in and is gone after a total wait of ten minutes. If, in an effort to be equitable to both parties, the check-in agent flits between the two then the total time to check them both in is now eleven minutes (taking into account a total latency of one minute due to walking between the desks), and they both wait the full eleven minutes to be finished. Not equitable at all!

  6. Gandolf,

    I don’t have to worry about it myself, I just have to worry about what algorithm is managing it. I know that disk head movement represents an inefficiency that can theoretically be reduced by the choice of the correct scheduler, so all I need to do is choose the right one and provide the conditions for it to perform optimally (ie. large sequential reads, reasonably contiguous data to read).

    With regard to a SAN, I suggest that you make it very clear to the administrators that they n0ow have an application with predicatable access patterns — say large sequential reads of logically contiguous data. If their answer to problems is that they have enough disks and cache to cope with anything that can be thrown at it then consider how much better performance could be if they would actually optimise the storage in line with the application’s use of it. Recall also how the attitude to RAID5 has changed over the years — no longer is the “enough disks and cache” thought to be an adequate response to the problems inherent in the technology.

    By the way, this is my hobby! :D Sad but true.

  7. Regarding Noon’s point 2 and conclusion, and the DW context response: I’m strongly skewed towards OLTP, but in reality many ERP or financial systems still have batch jobs, partcularly end-of-period processing. These jobs likely have at least some full table (or partition) scans, and are likely very visible to management in terms of total clock time run. So this sort of performance issue becomes important, perhaps (if one puts costed and ordered business requirements into the equation) more so than the OLTP operations, which are going to thrash anyways by their nature.

    I’ve been idly wondering if using RMAN as a benchmark for sequential read operations v. parallelism might be a reasonable idea. For example, to test whether 30 2G files v. 1 60G file make a difference because of round-robinning – imp into each, then see how long RMAN takes to backup each with various numbers of channels.

    Wish I had a hobby! :-)

  8. >> many ERP or financial systems still have batch jobs, partcularly end-of-period processing.

    One of the issues I’m also looking at is whether it is possible to change the scheduling algorithm for a device dynamically — ie. without unmounting/mounting. It would anable the devices to be put into OLTP and batch-sympathetic i/o scheduling modes at different times of the day/month.

    I have some preliminary results, by the way. The anticipatory scheduler is showing much better performance and, maybe as importantly, stability of performance with variations in degree of parallelism, even up to degrees that represent gross over-parallelism (DoP of 30 for data on 4 disks).

  9. Interesting blog, Dave.

    “I have some preliminary results, by the way. The anticipatory scheduler is showing much better performance and, maybe as importantly, stability of performance with variations in degree of parallelism, even up to degrees that represent gross over-parallelism (DoP of 30 for data on 4 disks).”

    I think the major possible failing in this type of test if it’s based on a single user because once those other tricky users start wanting to do stuff too, it’s going to undermine the purity of the tests. i.e. contiguous storage and the best scheduler might be very important to the importance of improving a single job but the blasted things tend to be multi-user ;-) (Oh, and I don’t see that as the same as PX being ‘multi-user’ because there’s co-ordination of the activity towards a single end-goal)

  10. PQ represents a very particular type of multiuser workload — requests of the same type, the same size, ultimately requesting a similar amount of data, and no two threads requesting the same data. As long as we acknowledge that then we know the extent of the test’s validity. That would be the manifestation of the goal, I guess — the disks and the i/o scheduler have no direct insight into the nature of the goal at all other than that.

    We can then look at what variations might occur and theorise what effect that might have. eg.

    * some reads requesting the same data as others, and hitting disk/controller cache.
    * some index range scans issuing a very large number of small read requests, and each of those requests being delayed in favour of multiblock reads and hence suffering from performance degradation.

    I’ll post a chart or two, Doug, which show good support for the way that the anticipatory scheduler keeps the disks reading at close to their theoretical capacity even in this type of multiuser environment. As soon as I’ve taken the darned kids to daycare –i can hear them arguing about a monkey right now.

  11. I’ll post a chart or two, Doug

    Not more graphs! ;-)

    As soon as I’ve taken the darned kids to daycare –i can hear them arguing about a monkey right now.

    I know, I know, I find myself in that situation all the time – arguing about a monkey.

  12. Pingback: Oracle Clinic

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s