<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Linux 2.6 Kernel I/O Schedulers for Oracle Data Warehousing: Part I</title>
	<atom:link href="http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/feed/" rel="self" type="application/rss+xml" />
	<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/</link>
	<description>Oracle Data Warehouse Design and Architecture</description>
	<pubDate>Fri, 04 Jul 2008 18:50:36 +0000</pubDate>
	<generator>http://wordpress.org/?v=MU</generator>
		<item>
		<title>By: Doug Burns</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2501</link>
		<dc:creator>Doug Burns</dc:creator>
		<pubDate>Mon, 02 Oct 2006 17:36:14 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2501</guid>
		<description>It took me a while to 'get' that comment. I thought you were on about one or two &lt;i&gt;monkeys&lt;/i&gt;!</description>
		<content:encoded><![CDATA[<p>It took me a while to &#8216;get&#8217; that comment. I thought you were on about one or two <i>monkeys</i>!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Aldridge</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2499</link>
		<dc:creator>David Aldridge</dc:creator>
		<pubDate>Mon, 02 Oct 2006 15:26:11 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2499</guid>
		<description>Just one or two .. it's not like I'm going to go crazy on them, like some people ;)</description>
		<content:encoded><![CDATA[<p>Just one or two .. it&#8217;s not like I&#8217;m going to go crazy on them, like some people ;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Doug Burns</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2498</link>
		<dc:creator>Doug Burns</dc:creator>
		<pubDate>Mon, 02 Oct 2006 15:04:33 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2498</guid>
		<description>&lt;i&gt;I’ll post a chart or two, Doug&lt;/i&gt;

Not more graphs! ;-)

&lt;i&gt;As soon as I’ve taken the darned kids to daycare – I can hear them arguing about a monkey right now.&lt;/i&gt;

I know, I know, I find myself in that situation all the time - arguing about a monkey.</description>
		<content:encoded><![CDATA[<p><i>I’ll post a chart or two, Doug</i></p>
<p>Not more graphs! ;-)</p>
<p><i>As soon as I’ve taken the darned kids to daycare – I can hear them arguing about a monkey right now.</i></p>
<p>I know, I know, I find myself in that situation all the time - arguing about a monkey.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Aldridge</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2497</link>
		<dc:creator>David Aldridge</dc:creator>
		<pubDate>Mon, 02 Oct 2006 14:53:13 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2497</guid>
		<description>PQ represents a very particular type of multiuser workload -- requests of the same type, the same size, ultimately requesting a similar amount of data, and no two threads requesting the same data. As long as we acknowledge that then we know the extent of the test's validity. That would be the manifestation of the goal, I guess -- the disks and the i/o scheduler have no direct insight into the nature of the goal at all other than that.

We can then look at what variations might occur and theorise what effect that might have. eg. 

* some reads requesting the same data as others, and hitting disk/controller cache.
* some index range scans issuing a very large number of small read requests, and each of those requests being delayed in favour of multiblock reads and hence suffering from performance degradation.

I'll post a chart or two, Doug, which show good support for the way that the anticipatory scheduler keeps the disks reading at close to their theoretical capacity even in this type of multiuser environment. As soon as I've taken the darned kids to daycare -- I can hear them arguing about a monkey right now.</description>
		<content:encoded><![CDATA[<p>PQ represents a very particular type of multiuser workload &#8212; requests of the same type, the same size, ultimately requesting a similar amount of data, and no two threads requesting the same data. As long as we acknowledge that then we know the extent of the test&#8217;s validity. That would be the manifestation of the goal, I guess &#8212; the disks and the i/o scheduler have no direct insight into the nature of the goal at all other than that.</p>
<p>We can then look at what variations might occur and theorise what effect that might have. eg. </p>
<p>* some reads requesting the same data as others, and hitting disk/controller cache.<br />
* some index range scans issuing a very large number of small read requests, and each of those requests being delayed in favour of multiblock reads and hence suffering from performance degradation.</p>
<p>I&#8217;ll post a chart or two, Doug, which show good support for the way that the anticipatory scheduler keeps the disks reading at close to their theoretical capacity even in this type of multiuser environment. As soon as I&#8217;ve taken the darned kids to daycare &#8211; I can hear them arguing about a monkey right now.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Doug Burns</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2488</link>
		<dc:creator>Doug Burns</dc:creator>
		<pubDate>Mon, 02 Oct 2006 08:58:03 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2488</guid>
		<description>Interesting blog, Dave.

"I have some preliminary results, by the way. The anticipatory scheduler is showing much better performance and, maybe as importantly, stability of performance with variations in degree of parallelism, even up to degrees that represent gross over-parallelism (DoP of 30 for data on 4 disks)."

I think the major possible failing in this type of test is if it's based on a single user, because once those other tricky users start wanting to do stuff too, it's going to undermine the purity of the tests. i.e. contiguous storage and the best scheduler might be very important to improving a single job but the blasted things tend to be multi-user ;-) (Oh, and I don't see that as the same as PX being 'multi-user' because there's co-ordination of the activity towards a single end-goal)</description>
		<content:encoded><![CDATA[<p>Interesting blog, Dave.</p>
<p>&#8220;I have some preliminary results, by the way. The anticipatory scheduler is showing much better performance and, maybe as importantly, stability of performance with variations in degree of parallelism, even up to degrees that represent gross over-parallelism (DoP of 30 for data on 4 disks).&#8221;</p>
<p>I think the major possible failing in this type of test is if it&#8217;s based on a single user, because once those other tricky users start wanting to do stuff too, it&#8217;s going to undermine the purity of the tests. i.e. contiguous storage and the best scheduler might be very important to improving a single job but the blasted things tend to be multi-user ;-) (Oh, and I don&#8217;t see that as the same as PX being &#8216;multi-user&#8217; because there&#8217;s co-ordination of the activity towards a single end-goal)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Aldridge</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2440</link>
		<dc:creator>David Aldridge</dc:creator>
		<pubDate>Fri, 29 Sep 2006 18:48:12 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2440</guid>
		<description>&#62;&#62; many ERP or financial systems still have batch jobs, particularly end-of-period processing.

One of the issues I'm also looking at is whether it is possible to change the scheduling algorithm for a device dynamically -- ie. without unmounting/mounting. It would enable the devices to be put into OLTP and batch-sympathetic i/o scheduling modes at different times of the day/month.

I have some preliminary results, by the way. The anticipatory scheduler is showing much better performance and, maybe as importantly, stability of performance with variations in degree of parallelism, even up to degrees that represent gross over-parallelism (DoP of 30 for data on 4 disks).</description>
		<content:encoded><![CDATA[<p>&gt;&gt; many ERP or financial systems still have batch jobs, particularly end-of-period processing.</p>
<p>One of the issues I&#8217;m also looking at is whether it is possible to change the scheduling algorithm for a device dynamically &#8212; ie. without unmounting/mounting. It would enable the devices to be put into OLTP and batch-sympathetic i/o scheduling modes at different times of the day/month.</p>
<p>I have some preliminary results, by the way. The anticipatory scheduler is showing much better performance and, maybe as importantly, stability of performance with variations in degree of parallelism, even up to degrees that represent gross over-parallelism (DoP of 30 for data on 4 disks).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: joel garry</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2439</link>
		<dc:creator>joel garry</dc:creator>
		<pubDate>Fri, 29 Sep 2006 18:24:34 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2439</guid>
		<description>Regarding Noon's point 2 and conclusion, and the DW context response:  I'm strongly skewed towards OLTP, but in reality many ERP or financial systems still have batch jobs, particularly end-of-period processing.  These jobs likely have at least some full table (or partition) scans, and are likely very visible to management in terms of total clock time run.  So this sort of performance issue becomes important, perhaps (if one puts costed and ordered business requirements into the equation) more so than the OLTP operations, which are going to thrash anyways by their nature.

I've been idly wondering if using RMAN as a benchmark for sequential read operations v. parallelism might be a reasonable idea.  For example, to test whether 30 2G files v. 1 60G file make a difference because of round-robinning - imp into each, then see how long RMAN takes to backup each with various numbers of channels.

Wish I had a hobby!  :-)</description>
		<content:encoded><![CDATA[<p>Regarding Noon&#8217;s point 2 and conclusion, and the DW context response:  I&#8217;m strongly skewed towards OLTP, but in reality many ERP or financial systems still have batch jobs, particularly end-of-period processing.  These jobs likely have at least some full table (or partition) scans, and are likely very visible to management in terms of total clock time run.  So this sort of performance issue becomes important, perhaps (if one puts costed and ordered business requirements into the equation) more so than the OLTP operations, which are going to thrash anyways by their nature.</p>
<p>I&#8217;ve been idly wondering if using RMAN as a benchmark for sequential read operations v. parallelism might be a reasonable idea.  For example, to test whether 30 2G files v. 1 60G file make a difference because of round-robinning - imp into each, then see how long RMAN takes to backup each with various numbers of channels.</p>
<p>Wish I had a hobby!  :-)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Aldridge</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2436</link>
		<dc:creator>David Aldridge</dc:creator>
		<pubDate>Fri, 29 Sep 2006 15:23:04 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2436</guid>
		<description>Gandolf,

I don't have to worry about it myself, I just have to worry about what algorithm is managing it. I know that disk head movement represents an inefficiency that can theoretically be reduced by the choice of the correct scheduler, so all I need to do is choose the right one and provide the conditions for it to perform optimally (ie. large sequential reads, reasonably contiguous data to read).

With regard to a SAN, I suggest that you make it very clear to the administrators that they now have an application with predictable access patterns -- say large sequential reads of logically contiguous data. If their answer to problems is that they have enough disks and cache to cope with anything that can be thrown at it then consider how much better performance could be if they would actually optimise the storage in line with the application's use of it. Recall also how the attitude to RAID5 has changed over the years -- no longer is the "enough disks and cache" thought to be an adequate response to the problems inherent in the technology.

By the way, this is my hobby! :D Sad but true.</description>
		<content:encoded><![CDATA[<p>Gandolf,</p>
<p>I don&#8217;t have to worry about it myself, I just have to worry about what algorithm is managing it. I know that disk head movement represents an inefficiency that can theoretically be reduced by the choice of the correct scheduler, so all I need to do is choose the right one and provide the conditions for it to perform optimally (ie. large sequential reads, reasonably contiguous data to read).</p>
<p>With regard to a SAN, I suggest that you make it very clear to the administrators that they now have an application with predictable access patterns &#8212; say large sequential reads of logically contiguous data. If their answer to problems is that they have enough disks and cache to cope with anything that can be thrown at it then consider how much better performance could be if they would actually optimise the storage in line with the application&#8217;s use of it. Recall also how the attitude to RAID5 has changed over the years &#8212; no longer is the &#8220;enough disks and cache&#8221; thought to be an adequate response to the problems inherent in the technology.</p>
<p>By the way, this is my hobby! :D Sad but true.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Aldridge</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2435</link>
		<dc:creator>David Aldridge</dc:creator>
		<pubDate>Fri, 29 Sep 2006 15:16:51 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2435</guid>
		<description>Pete,

If ASM scatters blocks around in a way that promotes excessive head movement, in comparison to the regular management, then I'd expect to see a performance issue. If there was a performance issue then that would be a black mark against its use. However, I don't think it does (though I may be wrong).

Does the issue of the slowing of one read matter when other processes are being equally served at the same time? That's a question at the very heart of i/o scheduling (and queueing theory, which is what all this is about) -- what is meant by "equally served"? If it means that the scheduler finishes with one read request and then immediately moves the disk heads to satisfy another, then moves the heads back to satisfy another request from the first process then that might seem equitable, but it's exactly analogous to a single check-in agent handling multiple check-ins at the airport at the same time.

Consider passenger A and passenger B, both waiting to be served. To check in each passenger takes five minutes, so passenger A is checked in in five minutes and passenger B waits for five minutes then gets checked in and is gone after a total wait of ten minutes. If, in an effort to be equitable to both parties, the check-in agent flits between the two then the total time to check them both in is now eleven minutes (taking into account a total latency of one minute due to walking between the desks), and they both wait the full eleven minutes to be finished. Not equitable at all!</description>
		<content:encoded><![CDATA[<p>Pete,</p>
<p>If ASM scatters blocks around in a way that promotes excessive head movement, in comparison to the regular management, then I&#8217;d expect to see a performance issue. If there was a performance issue then that would be a black mark against its use. However, I don&#8217;t think it does (though I may be wrong).</p>
<p>Does the issue of the slowing of one read matter when other processes are being equally served at the same time? That&#8217;s a question at the very heart of i/o scheduling (and queueing theory, which is what all this is about) &#8212; what is meant by &#8220;equally served&#8221;? If it means that the scheduler finishes with one read request and then immediately moves the disk heads to satisfy another, then moves the heads back to satisfy another request from the first process then that might seem equitable, but it&#8217;s exactly analogous to a single check-in agent handling multiple check-ins at the airport at the same time.</p>
<p>Consider passenger A and passenger B, both waiting to be served. To check in each passenger takes five minutes, so passenger A is checked in in five minutes and passenger B waits for five minutes then gets checked in and is gone after a total wait of ten minutes. If, in an effort to be equitable to both parties, the check-in agent flits between the two then the total time to check them both in is now eleven minutes (taking into account a total latency of one minute due to walking between the desks), and they both wait the full eleven minutes to be finished. Not equitable at all!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Aldridge</title>
		<link>http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2434</link>
		<dc:creator>David Aldridge</dc:creator>
		<pubDate>Fri, 29 Sep 2006 15:05:39 +0000</pubDate>
		<guid isPermaLink="false">http://oraclesponge.wordpress.com/2006/09/28/linux-26-kernel-io-schedulers-for-oracle-data-warehousing-part-i/#comment-2434</guid>
		<description>Noons ...


For general purpose/OLTP databases I think that your points are very valid, but don't neglect the context -- data warehousing, where jumping through hoops to achieve goals that make no sense in other worlds is practically a way of life. This is by no means a "one size fits all" topic -- everything here is qualified by "Data Warehousing only".

The scan doesn't have to be 100% contiguous through the entire segment, we just have to avoid the kind of excessive disk head movement that would harm performance. From the Oracle side that means allocating reasonably large extents -- at least 32-64 MB I'd think, although in practice we can easily go larger than that, into the GB range for large data sets (the practical upper limit would be the lesser of the partition size and the ETL single load volume, divided by the number of devices over which they are spread, off the top of my head).

We also want to avoid ASSM -- in the data warehouse we almost certainly want to promote a physical row order to either promote efficient compression or to lower the clustering factor for the indexes on some set of columns. ASSM does not provide a solution to any problem we encounter in DW, so I would not use it. PCTFREE 0, PCTUSED 99, COMPRESS, and use direct path to load the data.

I don't think that parallel query is such a problem either -- Oracle will attempt to allocate ranges to each PQ slave based on datafile if possible, and in any case will allocate ranges of rowids that make each slave's target data set logically contiguous. Whatever mysterious algorithms govern the allocation for PQ slaves must surely allocate ranges with a view to logical contiguity (if that is a word) or the result would be terribly messy.

Multiple data files ought also to be no problem. Large extents and allocating new extents in multiples of the number of data files appears to take care of that (on LMT), although I haven't run a rigorous test on it. Maybe that's next :D

I don't think that freelists, freelist groups and ASSM are relevant to PQ.

Now it could well be that it's difficult to just say, "Scheduler X is the best for databases" (in fact I'm sure it is), but if we characterise databases by access patterns ("small random access with reads and writes mixed", "large sequential access with reads during the day and writes at night") then there ought to be a theoretical best choice, and we ought to be able to run a test to demonstrate that. In the "real world" we also ought to be able to benchmark applications and look at mean response times over the course of a week or a month when running different schedulers -- maybe our theory gets supported and shows us an average response time 30% lower with scheduler X than scheduler Y. The important thing I think is to be aware that there are these options and to understand what the choices imply in terms of how they interact with our applications' access patterns.

Here's something else to consider -- in 2.6.18 the default scheduler changes from anticipatory to CFQ http://linux.inet.hr/cfq_to_become_the_default_i_o_scheduler.html What impact is that going to have on your application? Maybe it'll be an improvement, maybe it'll counteract benefits experienced elsewhere. Maybe when you upgrade through that boundary you want to start passing "elevator=as" to the kernel at boot time to prevent the scheduler change.</description>
		<content:encoded><![CDATA[<p>Noons &#8230;</p>
<p>For general purpose/OLTP databases I think that your points are very valid, but don&#8217;t neglect the context &#8212; data warehousing, where jumping through hoops to achieve goals that make no sense in other worlds is practically a way of life. This is by no means a &#8220;one size fits all&#8221; topic &#8212; everything here is qualified by &#8220;Data Warehousing only&#8221;.</p>
<p>The scan doesn&#8217;t have to be 100% contiguous through the entire segment, we just have to avoid the kind of excessive disk head movement that would harm performance. From the Oracle side that means allocating reasonably large extents &#8212; at least 32-64 MB I&#8217;d think, although in practice we can easily go larger than that, into the GB range for large data sets (the practical upper limit would be the lesser of the partition size and the ETL single load volume, divided by the number of devices over which they are spread, off the top of my head).</p>
<p>We also want to avoid ASSM &#8212; in the data warehouse we almost certainly want to promote a physical row order to either promote efficient compression or to lower the clustering factor for the indexes on some set of columns. ASSM does not provide a solution to any problem we encounter in DW, so I would not use it. PCTFREE 0, PCTUSED 99, COMPRESS, and use direct path to load the data.</p>
<p>I don&#8217;t think that parallel query is such a problem either &#8212; Oracle will attempt to allocate ranges to each PQ slave based on datafile if possible, and in any case will allocate ranges of rowids that make each slave&#8217;s target data set logically contiguous. Whatever mysterious algorithms govern the allocation for PQ slaves must surely allocate ranges with a view to logical contiguity (if that is a word) or the result would be terribly messy.</p>
<p>Multiple data files ought also to be no problem. Large extents and allocating new extents in multiples of the number of data files appears to take care of that (on LMT), although I haven&#8217;t run a rigorous test on it. Maybe that&#8217;s next :D</p>
<p>I don&#8217;t think that freelists, freelist groups and ASSM are relevant to PQ.</p>
<p>Now it could well be that it&#8217;s difficult to just say, &#8220;Scheduler X is the best for databases&#8221; (in fact I&#8217;m sure it is), but if we characterise databases by access patterns (&#8220;small random access with reads and writes mixed&#8221;, &#8220;large sequential access with reads during the day and writes at night&#8221;) then there ought to be a theoretical best choice, and we ought to be able to run a test to demonstrate that. In the &#8220;real world&#8221; we also ought to be able to benchmark applications and look at mean response times over the course of a week or a month when running different schedulers &#8212; maybe our theory gets supported and shows us an average response time 30% lower with scheduler X than scheduler Y. The important thing I think is to be aware that there are these options and to understand what the choices imply in terms of how they interact with our applications&#8217; access patterns.</p>
<p>Here&#8217;s something else to consider &#8212; in 2.6.18 the default scheduler changes from anticipatory to CFQ <a href="http://linux.inet.hr/cfq_to_become_the_default_i_o_scheduler.html" rel="nofollow">http://linux.inet.hr/cfq_to_become_the_default_i_o_scheduler.html</a> What impact is that going to have on your application? Maybe it&#8217;ll be an improvement, maybe it&#8217;ll counteract benefits experienced elsewhere. Maybe when you upgrade through that boundary you want to start passing &#8220;elevator=as&#8221; to the kernel at boot time to prevent the scheduler change.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
