Changes from V8
===============
- Implemented bdi-like congestion semantics for io groups also. Now once an
  io group gets congested, we don't clear the congestion flag until the number
  of requests goes below nr_congestion_off. This helps get rid of the buffered
  write performance regression we were observing with the io controller
  patches.

Gui, can you please test it and see if this version is better in terms of your buffered write tests.

- Moved some of the functions from blk-core.c to elevator-fq.c. This reduces CONFIG_GROUP_IOSCHED ifdefs in blk-core.c and the code looks a little cleaner.

- Fixed the add_front issue so that we go left on the rb-tree when add_front is specified in case of preemption.

- Pulled in v11 of the io tracking patches and modified the config option so that blkio is not compiled in if CONFIG_TRACK_ASYNC_CONTEXT is not enabled.

- Fixed some block tracepoints which were broken because of per group request list changes.

- Fixed some logging messages.

- Got rid of extra call to update_prio as pointed out by Jerome and Gui.

- Merged the fix from Jerome for a crash while changing prio.

- Got rid of redundant slice_start assignment as pointed by Gui.

- Merged an elv_ioq_nr_dispatched() cleanup from Gui.

- Fixed a compilation issue if CONFIG_BLOCK=n.

What problem are we trying to solve
===================================
Provide a group IO scheduling feature in Linux along the lines of other
resource controllers like cpu.

IOW, provide a facility so that a user can group applications using cgroups
and control the amount of disk time/bandwidth received by a group based on
its weight.

How to solve the problem
========================
Different people have solved the issue differently. At least three patchsets
are now available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of the IO rate of a group
and throttles the processes in the group if it exceeds the user-specified
limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as a device mapper
driver which provides fair access in terms of amount of IO done (not in
terms of disk time, as CFQ does).

One sets up one or more dm-ioband devices on top of a physical/logical block
device, configures the ioband device and passes in information like grouping
etc. This device then keeps track of bios flowing through it and controls
the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here I have viewed the problem of the IO controller as a hierarchical group
scheduling issue (along the lines of CFS group scheduling). Currently one can
view Linux IO schedulers as flat, where there is one root group and all the
IO belongs to that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes; I
have extended it to support group IO scheduling. I also took some of the
code out of CFQ and put it in a common layer so that the same group
scheduling code can be used by noop, deadline and AS to support group
scheduling.

Pros/Cons
=========
There are pros and cons to each of the approaches. Following are some
thoughts.

- IO throttling is a max bandwidth controller and not a proportional one.
  Additionally it provides fairness in terms of amount of IO done (and not
  in terms of disk time, as CFQ does).

  Personally, I think that a proportional weight controller is useful to more
  people than just a max bandwidth controller. In addition, the IO scheduler
  based controller can also be enhanced to do max bandwidth control, if need
  be.

- dm-ioband also provides fairness in terms of amount of IO done, not in
  terms of disk time. So a seeky process can still run away with a lot more
  disk time. An interesting question is how fairness among groups should be
  viewed and what is more relevant: should fairness be based on amount of IO
  done, or on amount of disk time consumed, as CFQ does? The IO scheduler
  based controller provides fairness in terms of disk time used.

- IO throttling and dm-ioband are both second level controllers. That is,
  these controllers are implemented in layers higher than the io schedulers.
  So they control the IO at a higher layer based on group policies, and the
  IO schedulers later take care of dispatching these bios to disk.

  Implementing a second level controller has the advantage of being able to
  provide bandwidth control even on logical block devices in the IO stack
  which don't have any IO scheduler attached to them. But it can also
  interfere with the IO scheduling policy of the underlying IO scheduler and
  change the effective behavior. Following are some of the issues which I
  think will be visible in a second level controller in one form or other.

  Prio with-in group
  ------------------
  A second level controller can potentially interfere with the behavior of
  different prio processes with-in a group. bios are buffered at a higher
  layer in a single queue, and the release of bios is FIFO and not
  proportionate to the ioprio of the process. This can result in a
  particular prio level not getting its fair share.

  Buffering at a higher layer can also delay read requests for more than the
  slice idle period of CFQ (default 8 ms). That means it is possible that we
  are waiting for a request from a queue while it is buffered at the higher
  layer, and then the idle timer will fire. The queue will lose its share,
  and at the same time overall throughput will be impacted, as we lost those
  8 ms.

  Read Vs Write
  -------------
  Writes can overwhelm readers, hence a second level controller's FIFO
  release will run into issues here. If a single queue is maintained then
  reads will suffer large latencies. If there are separate queues for reads
  and writes then it will be hard to decide in what ratio to dispatch reads
  and writes, as it is the IO scheduler's decision when and how much
  read/write to dispatch. This is another place where a higher level
  controller will not be in sync with the lower level io scheduler and can
  change the effective policies of the underlying io scheduler.

  Fairness in terms of disk time / size of IO
  -------------------------------------------
  A higher level controller will most likely be limited to providing
  fairness in terms of size of IO done, and will find it hard to provide
  fairness in terms of disk time used (as CFQ provides between various prio
  levels). This is because only the IO scheduler knows how much disk time a
  queue has used. I am not sure how useful it is to provide fairness in
  terms of sectors when CFQ has been providing fairness in terms of disk
  time. A seeky application will still run away with a lot of disk time and
  bring down the overall throughput of the disk more than usual.

  CFQ IO context Issues
  ---------------------
  Buffering at a higher layer means submission of bios later with the help
  of a worker thread. This changes the io context information at the CFQ
  layer, which assigns the request to the submitting thread. The changed io
  context info again leads to idle timer expiry issues and to a process not
  getting its fair share and reduced throughput.

  Throughput with noop, deadline and AS
  -------------------------------------
  I think a higher level controller will result in reduced overall
  throughput (as compared to the io scheduler based io controller) and more
  seeks with noop, deadline and AS. The reason is that IO with-in a group is
  likely to be related and relatively close, as compared to IO across
  groups. For example, the thread pool of kvm-qemu doing IO for a virtual
  machine. In the case of higher level control, IO from the various groups
  will go into a single queue at the lower level and the IO may end up
  interleaved (G1, G2, G1, G3, G4....), causing more seeks and reduced
  throughput. (Agreed that merging will help up to some extent, but
  still....). Instead, in the case of a lower level controller, the IO
  scheduler maintains one queue per group, hence there is no interleaving of
  IO between groups. And if IO is related with-in a group, then we should
  get a reduced number/amount of seeks and higher throughput.

Latency can be a concern but that can be controlled by reducing the time slice length of the queue.

- The IO scheduler based controller has the limitation that it works only
  with the bottom-most devices in the IO stack, where the IO scheduler is
  attached. So the question is how important/relevant it is to also control
  bandwidth at higher level logical devices. The actual contention for
  resources is at the leaf block device, so it probably makes sense to do
  any kind of control there and not at the intermediate devices. Secondly,
  it probably also means better use of available resources.

  For example, assume a user has created a linear logical device lv0 using
  three underlying disks sda, sdb and sdc. Assume there are two tasks T1 and
  T2 in two groups doing IO on lv0, and that the weights of the groups are
  in the ratio 2:1, so T1 should get double the BW of T2 on the lv0 device.

			T1    T2
			  \  /
			  lv0
			 / | \
		      sda sdb sdc

  Now if IO control is done at the lv0 level, it can happen that T1 is doing
  IO only to sda while T2's IO is going to sdc. In this case there is no
  need for resource management, as the two IO streams don't contend anywhere
  it matters. If we try to do IO control at the lv0 device, it will not be
  an optimal usage of resources and will bring down overall throughput.

IMHO, the IO scheduler based IO controller is a reasonable approach to solve
the problem of group bandwidth control, and can do hierarchical IO scheduling
more tightly and efficiently. But I am all ears to alternative approaches and
suggestions on how things can be done better.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Open Issues
===========
- Currently for async requests like buffered writes, we get the io group
  information from the page instead of the task context. How important is it
  to determine the context from the page? Can we put all the pdflush threads
  into a separate group and control system-wide buffered write bandwidth
  that way? Any buffered writes submitted by the process directly will go to
  the right group anyway.

  If that is acceptable, then we can drop all the code associated with async
  io context, which should simplify the patchset a lot.

Testing
=======
I have divided the testing results into three sections:

- Latency
- Throughput and Fairness
- Group Fairness

Because I have enhanced CFQ to also do group scheduling, one of the concerns
has been that existing CFQ should not regress, at least in the flat setup. If
one creates groups and puts tasks in them, then this is a new environment and
some properties can change, because groups have the additional requirement of
providing isolation.

In the third column, the 10 readers have been put into 10 groups instead of
running in the root group. The small-file reader runs in the root group.

Notes: It looks like read latencies here remain the same as with vanilla CFQ.

Test3: read small files with multiple writers (8) running
=========================================================
Again running the small file reader test, now with 8 buffered writers running
with prio 0 to 7. Latency results are in seconds. I tried to capture the
output with multiple configurations of the IO controller to see the effect.

f(0/1) --> refers to the "fairness" tunable. This is a new tunable, part of
CFQ. If set, we wait for requests from one queue to finish before a new queue
is scheduled in.

group --> writers are running in individual groups and not in the root group.
map   --> buffered writes are mapped to groups using the info stored in the page.

Notes: Except for columns 6 and 7, where writers are in separate groups and
we map their writes to their respective groups, latencies seem to be fine. I
think the latencies are higher for the last two cases because now the reader
can't preempt the writer.

			   root
			 /  |  |  \
			R  G1  G2  G3
			    |   |   |
			    W   W   W

Test4: Random Reader test in presence of 4 sequential readers and 4 buffered writers
====================================================================================
Used fio this time to run one random reader and see how it fares in the
presence of 4 sequential readers and 4 writers. I have just pasted the output
of the random reader from fio.

Notes:
- It looks like overall throughput is 1-3% less in the case of the io
  controller.
- Bandwidth distribution between the various prio levels has changed a bit.
  CFQ seems to have a 100ms slice length for prio 4, and this slice increases
  by 20% for each prio level as priority increases and decreases by 20% as
  priority decreases. So the IO controller does not seem to be doing too
  badly in meeting that distribution.

Group Fairness
++++++++++++++
Test7 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two
virtual machines into two different cgroups with weights 1000 and 500
respectively. The virtual machines created ext3 file systems on the
partitions exported from the host and did buffered writes. The host sees the
writes as synchronous, and the virtual machine with the higher weight gets
double the disk time of the virtual machine with the lower weight. Used the
deadline scheduler in this test case. Some more details about the
configuration are in the documentation patch.

Test8 (Fairness for synchronous reads)
======================================
- Ran two "dd" in two cgroups with cgroup weights 1000 and 500 (with the CFQ
  scheduler and /sys/block/<device>/queue/fairness = 1). The higher weight dd
  finishes first, and at that point my script reads the cgroup files
  io.disk_time and io.disk_sectors for both groups and displays the results.

The first two fields in the time and sectors statistics represent the major
and minor numbers of the device. The third field represents disk time in
milliseconds and number of sectors transferred respectively.

This patchset tries to provide fairness in terms of disk time received.
group1 got almost double of group2's disk time (at the time the first dd
finished). These time and sector statistics can be read from the io.disk_time
and io.disk_sectors files in the cgroup. More about this in the documentation
file.

Test9 (Reader Vs Buffered Writes)
=================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. The IO controller can provide isolation between readers
and buffered (async) writers.

First I ran the test without the io controller to see the severity of the
issue. Ran a hostile writer and then, after 10 seconds, started a reader and
monitored the completion time of the reader. The reader reads a 256 MB file.
Tested this with the noop scheduler.

sample script
-------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &

Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)

Now it was time to test whether the io controller can provide isolation
between readers and writers with noop. I created two cgroups of weight 1000
each and put the reader in group1 and the writer in group2 and ran the test
again. Upon completion of the reader, my scripts read the io.disk_time and
io.disk_sectors cgroup files to get an estimate of how much disk time each
group got and how many sectors of IO each group did.

For more accurate accounting of disk time for buffered writes with queueing
hardware, I had to set /sys/block/<disk>/queue/iosched/fairness to "1".

The above shows that by the time the first job (higher weight) finished,
group test1 got 17686 ms of disk time and group test2 got 9036 ms of disk
time. Similarly, the statistics for the number of sectors transferred are
also shown.

Note that the disk time given to group test1 is almost double group test2's
disk time.

AIO writes
----------
Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"

echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
	--output=/mnt/$BLOCKDEV/fio1/test1.log \
	--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
	--output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the test1 and
test2 cgroup files to determine how much disk time each group got until the
first fio job finished.

The above shows that by the time the first fio job (higher weight) finished,
group test1 got almost double the disk time of group test2.

Test11 (Fairness for async writes, Buffered Write Vs Buffered Write)
====================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache) and possibly in the file
system layer also (btrfs, xfs etc), and are dispatched to lower layers not
necessarily in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
writing huge files. Very soon we cross vm_dirty_ratio and the dd threads are
forced to write out some pages to disk before more pages can be dirtied. But
it is not necessarily the dirty pages of the same thread that get picked. The
writeout can very well pick the inode of the lower priority dd thread. So
effectively the higher weight dd ends up doing writeouts of the lower weight
dd's pages, and we don't see service differentiation.

IOW, the core problem with buffered write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep its
queue continuously backlogged. In my testing, there are many 0.2 to 0.8
second intervals where the higher weight queue is empty, and in that duration
the lower weight queue gets lots of work done, giving the impression that
there was no service differentiation.

In summary, from the IO controller's point of view, async write support is
there. But because the page cache has not been designed so that a higher
prio/weight writer can do more writeout than a lower prio/weight writer,
getting service differentiation is hard; it is visible in some cases and not
in others.