Quite some time ago I whined about the issue of getting excessive head movement on long sequential reads of the type that data warehouses commonly use. The problem, to paraphrase my earlier article, was that during a read of a large amount of contiguous data the operating system/RDBMS would be too amenable to satisfying other read requests from the same disk, hence incurring head movement too frequently.
This issue popped back into my head after being directed, via Log Buffer #11 at Mark Rittman’s site, to an article by Curt Monash titled “Is data warehousing all about sequential access?”, which matched my thoughts very well.
I wondered whether the miracles of open source software had anything to offer that might help, so I asked the reasonably eclectic crowd at Joel On Software. After all, even Google doesn’t help when you don’t know what question you’re asking!
So now that I knew what question to ask, it was relatively simple to look for information on Linux 2.6 i/o scheduling, and I found some very interesting (IMO) resources.
The 2.6 kernel introduced four types of i/o scheduling.
- Completely Fair Queueing (CFQ)
- Deadline
- Anticipatory
- Noop
Have a read about them here. That article also makes the following unsupported assertion: “.. the Deadline scheduler out-performed CFQ for large sequential read-mostly DSS queries”. That made me a little suspicious, and here’s why.
When Oracle is performing a full table scan using parallel query it is continually issuing read requests of around 1MB (for example) for a large set of blocks that are contiguous. Hence there ought to be little or no latency due to disk head movement. When another parallel query slave, possibly for the very same query as the first, is also trying to retrieve a large set of contiguous data, the danger is that the disk head will continually be flicking around between the two processes, incurring latency each time it does so. The most efficient scheduling method would therefore appear to me to be one that makes the second process wait while more requests from the first are satisfied, thus reducing head movement and increasing the rate at which blocks are read from disk.
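As an aside, that roughly 1MB figure is just db_file_multiblock_read_count multiplied by db_block_size (128 blocks times 8KB, say). A quick sanity check on your own instance, assuming you can connect as SYSDBA:

```
# Sketch: confirm the multiblock read size for the instance
# (e.g. 128 blocks x 8KB block size = 1MB per read request)
sqlplus -s / as sysdba <<'EOF'
show parameter db_block_size
show parameter db_file_multiblock_read_count
EOF
```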
With that in mind, consider this description of the deadline scheduler: “The deadline scheduler implements request merging … and in addition tries to prevent request starvation (which can be caused by the elevator algorithm) by introducing a deadline for each request. In addition, it prefers readers. Linux does not delay write processes until their data are really written to the disk, because it caches the data, while readers have to wait until their data are available.” Doesn’t sound to me like it would help.
Here is the description of the anticipatory scheduler from the same reference, with my own emphasis: “The anticipatory scheduler implements request merging … and in addition optimizes for physical disks by avoiding too many head movements. It tries to solve situations where many write requests are interrupted by a few read requests. *After a read operation it waits for a certain time frame for another read and doesn’t switch back immediately to write requests.* This scheduler is not intended to use for storage servers!”
The anticipatory scheduler (OK, or “elevator”) introduces a very small delay, on the order of a millisecond, following the fulfilment of a read request to see whether the same process is going to submit a request for data contiguous with the previous one. If it does, the scheduler satisfies that request before considering others, thus saving head movement. And the best news is that the delay and other control parameters are configurable at the device level, giving Obsessive Compulsive Tuners some more factors to worry about.
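To illustrate, on kernels that expose the anticipatory scheduler’s knobs through sysfs they live under the device’s queue directory. A sketch (sda is just an example device, and names, units, and defaults vary by kernel version, so check Documentation/block/as-iosched.txt before trusting any numbers):

```
# The anticipatory scheduler's per-device tunables
ls /sys/block/sda/queue/iosched/
#   antic_expire  read_batch_expire  read_expire
#   write_batch_expire  write_expire

# antic_expire is the window the scheduler waits after a read for a
# contiguous follow-up from the same process; lengthening it should
# favour long sequential readers at the cost of latency for everyone else
cat /sys/block/sda/queue/iosched/antic_expire
echo 10 > /sys/block/sda/queue/iosched/antic_expire
```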
So, how to test this?
I came up with the following plan. For each of the four scheduling options I would measure the performance of a single query that selects a lot of contiguous data via full table scan, and vary the degree of parallelism to see how the query time (and hence the rate of disk reads) varied. The query would be a simple SELECT COUNT(*) to minimise the possibility of the CPUs becoming a choke point for performance.
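Something like this minimal harness is what I have in mind (big_table and the scott/tiger login are placeholders, and the parallel hint syntax is standard Oracle):

```
#!/bin/bash
# Sketch of the test harness: time the same full-scan COUNT(*)
# at a range of parallel degrees (big_table is a placeholder name)
for dop in 1 2 4 8 16; do
  echo "degree of parallelism: $dop"
  time sqlplus -s scott/tiger <<EOF
select /*+ full(t) parallel(t, $dop) */ count(*) from big_table t;
exit
EOF
done
```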
I wheeled out my handy Poweredge 6400 with four Xeon 700s, 1GB of RAM, and four Ultra160 disks of around 16GB each on an Adaptec AIC7XXX card, conveniently running the 2.6.9-34.ELsmp kernel and Oracle 10.1.0.3.0 EE with Partitioning.
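One wrinkle: on a 2.6.9-based kernel the scheduler is generally chosen for the whole system at boot, so covering all four schedulers means a reboot per run. A sketch of the mechanics (the grub.conf line and sda are examples; the sysfs switch only exists on kernels with runtime switching, roughly 2.6.10 onwards):

```
# Select the scheduler system-wide at boot by appending
# elevator=<noop|as|deadline|cfq> to the kernel line in grub.conf:
#   kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/sda1 elevator=deadline

# On kernels that support runtime switching, the per-device file
# shows the active scheduler in brackets and accepts a new one:
cat /sys/block/sda/queue/scheduler
#   noop anticipatory [deadline] cfq
echo cfq > /sys/block/sda/queue/scheduler
```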
More later ….