The world won't listenhttp://blogs.igalia.com/berto
Alberto Garcia's blogWed, 25 May 2016 13:32:27 +0000en-UShourly1https://wordpress.org/?v=4.6.1I/O bursts with QEMU 2.6http://blogs.igalia.com/berto/2016/05/24/io-bursts-with-qemu-2-6/
http://blogs.igalia.com/berto/2016/05/24/io-bursts-with-qemu-2-6/#commentsTue, 24 May 2016 11:47:00 +0000http://blogs.igalia.com/berto/?p=613Continue reading I/O bursts with QEMU 2.6→]]>QEMU 2.6 was released a few days ago. One new feature that I have been working on is the new way to configure I/O limits in disk drives to allow bursts and increase the responsiveness of the virtual machine. In this post I’ll try to explain how it works.

The basic settings

First I will summarize the basic settings that were already available in earlier versions of QEMU.

Two aspects of the disk I/O can be limited: the number of bytes per second and the number of operations per second (IOPS). For each one of them the user can set a global limit or separate limits for read and write operations. This gives us a total of six different parameters.

I/O limits can be set using the throttling.* parameters of -drive, or using the QMP block_set_io_throttle command. These are the names of the parameters for both cases:

-drive

block_set_io_throttle

throttling.iops-total

iops

throttling.iops-read

iops_rd

throttling.iops-write

iops_wr

throttling.bps-total

bps

throttling.bps-read

bps_rd

throttling.bps-write

bps_wr

It is possible to set limits for both IOPS and bps at the same time, and for each case we can decide whether to have separate read and write limits or not, but if iops-total is set then neither iops-read nor iops-write can be set. The same applies to bps-total and bps-read/write.

The default value of these parameters is 0, and it means unlimited.

In its most basic usage, the user can add a drive to QEMU with a limit of, say, 100 IOPS with the following -drive line:

-drive file=hd0.qcow2,throttling.iops-total=100

We can do the same using QMP. In this case all these parameters are mandatory, so we must set to 0 the ones that we don’t want to limit:

I/O bursts

While the settings that we have just seen are enough to prevent the virtual machine from performing too much I/O, it can be useful to allow the user to exceed those limits occasionally. This way we can have a more responsive VM that is able to cope better with peaks of activity while keeping the average limits lower the rest of the time.

Starting from QEMU 2.6, it is possible to allow the user to do bursts of I/O for a configurable amount of time. A burst is an amount of I/O that can exceed the basic limit, and there are two parameters that control them: their length and the maximum amount of I/O they allow. These two can be configured separately for each one of the six basic parameters described in the previous section, but here we’ll use ‘iops-total’ as an example.

The I/O limit during bursts is set using ‘iops-total-max’, and the maximum length (in seconds) is set with ‘iops-total-max-length’. So if we want to configure a drive with a basic limit of 100 IOPS and allow bursts of 2000 IOPS for 60 seconds, we would do it like this (the line is split for clarity):

With this, the user can perform I/O on hd0.qcow2 at a rate of 2000 IOPS for 1 minute before it’s throttled down to 100 IOPS.

The user will be able to do bursts again if there’s a sufficiently long period of time with unused I/O (see below for details).

The default value for ‘iops-total-max’ is 0 and it means that bursts are not allowed. ‘iops-total-max-length’ can only be set if ‘iops-total-max’ is set as well, and its default value is 1 second.

Controlling the size of I/O operations

When applying IOPS limits all I/O operations are treated equally regardless of their size. This means that the user can take advantage of this in order to circumvent the limits and submit one huge I/O request instead of several smaller ones.

QEMU provides a setting called throttling.iops-size to prevent this from happening. This setting specifies the size (in bytes) of an I/O request for accounting purposes. Larger requests will be counted proportionally to this size.

For example, if iops-size is set to 4096 then an 8KB request will be counted as two, and a 6KB request will be counted as one and a half. This only applies to requests larger than iops-size: smaller requests will be always counted as one, no matter their size.

The default value of iops-size is 0 and it means that the size of the requests is never taken into account when applying IOPS limits.

Applying I/O limits to groups of disks

In all the examples so far we have seen how to apply limits to the I/O performed on individual drives, but QEMU allows grouping drives so they all share the same limits.

The Leaky Bucket algorithm

I/O limits in QEMU are implemented using the leaky bucket algorithm (specifically the “Leaky bucket as a meter” variant).

This algorithm uses the analogy of a bucket that leaks water constantly. The water that gets into the bucket represents the I/O that has been performed, and no more I/O is allowed once the bucket is full.

To see the way this corresponds to the throttling parameters in QEMU, consider the following values:

iops-total=100
iops-total-max=2000
iops-total-max-length=60

Water leaks from the bucket at a rate of 100 IOPS.

Water can be added to the bucket at a rate of 2000 IOPS.

The size of the bucket is 2000 x 60 = 120000.

If iops-total-max is unset then the bucket size is 100.

The bucket is initially empty, therefore water can be added until it’s full at a rate of 2000 IOPS (the burst rate). Once the bucket is full we can only add as much water as it leaks, therefore the I/O rate is reduced to 100 IOPS. If we add less water than it leaks then the bucket will start to empty, allowing for bursts again.

Note that since water is leaking from the bucket even during bursts, it will take a bit more than 60 seconds at 2000 IOPS to fill it up. After those 60 seconds the bucket will have leaked 60 x 100 = 6000, allowing for 3 more seconds of I/O at 2000 IOPS.

Also, due to the way the algorithm works, longer burst can be done at a lower I/O rate, e.g. 1000 IOPS during 120 seconds.

Acknowledgments

As usual, my work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the QEMU development team.

Enjoy QEMU 2.6!

]]>http://blogs.igalia.com/berto/2016/05/24/io-bursts-with-qemu-2-6/feed/1Improving disk I/O performance in QEMU 2.5 with the qcow2 L2 cachehttp://blogs.igalia.com/berto/2015/12/17/improving-disk-io-performance-in-qemu-2-5-with-the-qcow2-l2-cache/
http://blogs.igalia.com/berto/2015/12/17/improving-disk-io-performance-in-qemu-2-5-with-the-qcow2-l2-cache/#commentsThu, 17 Dec 2015 15:39:27 +0000http://blogs.igalia.com/berto/?p=591Continue reading Improving disk I/O performance in QEMU 2.5 with the qcow2 L2 cache→]]>QEMU 2.5 has just been released, with a lot of new features. As with the previous release, we have also created a video changelog.

I plan to write a few blog posts explaining some of the things I have been working on. In this one I’m going to talk about how to control the size of the qcow2 L2 cache. But first, let’s see why that cache is useful.

The qcow2 file format

qcow2 is the main format for disk images used by QEMU. One of the features of this format is that its size grows on demand, and the disk space is only allocated when it is actually needed by the virtual machine.

A qcow2 file is organized in units of constant size called clusters. The virtual disk seen by the guest is also divided into guest clusters of the same size. QEMU defaults to 64KB clusters, but a different value can be specified when creating a new image:

qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

In order to map the virtual disk as seen by the guest to the qcow2 image in the host, the qcow2 image contains a set of tables organized in a two-level structure. These are called the L1 and L2 tables.

There is one single L1 table per disk image. This table is small and is always kept in memory.

There can be many L2 tables, depending on how much space has been allocated in the image. Each table is one cluster in size. In order to read or write data to the virtual disk, QEMU needs to read its corresponding L2 table to find out where that data is located. Since reading the table for each I/O operation can be expensive, QEMU keeps a cache of L2 tables in memory to speed up disk access.

The L2 cache can have a dramatic impact on performance. As an example, here’s the number of I/O operations per second that I get with random read requests in a fully populated 20GB disk image:

L2 cache size

Average IOPS

1 MB

5100

1,5 MB

7300

2 MB

12700

2,5 MB

63600

If you’re using an older version of QEMU you might have trouble getting the most out of the qcow2 cache because of this bug, so either upgrade to at least QEMU 2.3 or apply this patch.

(in addition to the L2 cache, QEMU also keeps a refcount cache. This is used for cluster allocation and internal snapshots, but I’m not covering it in this post. Please refer to the qcow2 documentation if you want to know more about refcount tables)

Understanding how to choose the right cache size

In order to choose the cache size we need to know how it relates to the amount of allocated space.

The amount of virtual disk that can be mapped by the L2 cache (in bytes) is:

disk_size = l2_cache_size * cluster_size / 8

With the default values for cluster_size (64KB) that is

disk_size = l2_cache_size * 8192

So in order to have a cache that can cover n GB of disk space with the default cluster size we need

l2_cache_size = disk_size_GB * 131072

QEMU has a default L2 cache of 1MB (1048576 bytes) so using the formulas we’ve just seen we have 1048576 / 131072 = 8 GB of virtual disk covered by that cache. This means that if the size of your virtual disk is larger than 8 GB you can speed up disk access by increasing the size of the L2 cache. Otherwise you’ll be fine with the defaults.

How to configure the cache size

Cache sizes can be configured using the -drive option in the command-line, or the ‘blockdev-add‘ QMP command.

There are three options available, and all of them take bytes:

l2-cache-size: maximum size of the L2 table cache

refcount-cache-size: maximum size of the refcount block cache

cache-size: maximum size of both caches combined

There are two things that need to be taken into account:

Both the L2 and refcount block caches must have a size that is a multiple of the cluster size.

If you only set one of the options above, QEMU will automatically adjust the others so that the L2 cache is 4 times bigger than the refcount cache.

Although I’m not covering the refcount cache here, it’s worth noting that it’s used much less often than the L2 cache, so it’s perfectly reasonable to keep it small:

-drive file=hd.qcow2,l2-cache-size=4194304,refcount-cache-size=262144

Reducing the memory usage

The problem with a large cache size is that it obviously needs more memory. QEMU has a separate L2 cache for each qcow2 file, so if you’re using many big images you might need a considerable amount of memory if you want to have a reasonably sized cache for each one. The problem gets worse if you add backing files and snapshots to the mix.

Consider this scenario:

Here, hd0 is a fully populated disk image, and hd1 a freshly created image as a result of a snapshot operation. Reading data from this virtual disk will fill up the L2 cache of hd0, because that’s where the actual data is read from. However hd0 itself is read-only, and if you write data to the virtual disk it will go to the active image, hd1, filling up its L2 cache as a result. At some point you’ll have in memory cache entries from hd0 that you won’t need anymore because all the data from those clusters is now retrieved from hd1.

Let’s now create a new live snapshot:

Now we have the same problem again. If we write data to the virtual disk it will go to hd2 and its L2 cache will start to fill up. At some point a significant amount of the data from the virtual disk will be in hd2, however the L2 caches of hd0 and hd1 will be full as a result of the previous operations, even if they’re no longer needed.

Imagine now a scenario with several virtual disks and a long chain of qcow2 images for each one of them. See the problem?

I wanted to improve this a bit so I was working on a new setting that allows the user to reduce the memory usage by cleaning unused cache entries when they are not being used.

This new setting is available in QEMU 2.5, and is called ‘cache-clean-interval‘. It defines an interval (in seconds) after which all cache entries that haven’t been accessed are removed from memory.

This example removes all unused cache entries every 15 minutes:

-drive file=hd.qcow2,cache-clean-interval=900

If unset, the default value for this parameter is 0 and it disables this feature.

Further information

In this post I only intended to give a brief summary of the qcow2 L2 cache and how to tune it in order to increase the I/O performance, but it is by no means an exhaustive description of the disk format.

Acknowledgments

My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the QEMU development team.

Enjoy QEMU 2.5!

]]>http://blogs.igalia.com/berto/2015/12/17/improving-disk-io-performance-in-qemu-2-5-with-the-qcow2-l2-cache/feed/8I/O limits for disk groups in QEMU 2.4http://blogs.igalia.com/berto/2015/08/14/io-limits-for-disk-groups-in-qemu-2-4/
http://blogs.igalia.com/berto/2015/08/14/io-limits-for-disk-groups-in-qemu-2-4/#commentsFri, 14 Aug 2015 10:22:29 +0000http://blogs.igalia.com/berto/?p=566Continue reading I/O limits for disk groups in QEMU 2.4→]]>QEMU 2.4.0 has just been released, and among many other things it comes with some of the stuff I have been working on lately. In this blog post I am going to talk about disk I/O limits and the new feature to group several disks together.

Disk I/O limits

Disk I/O limits allow us to control the amount of I/O that a guest can perform. This is useful for example if we have several VMs in the same host and we want to reduce the impact they have on each other if the disk usage is very high.

The I/O limits can be set using the QMP command block_set_io_throttle, or with the command line using the throttling.* options for the -drive parameter (in brackets in the examples below). Both the throughput and the number of I/O operations can be limited. For a more fine-grained control, the limits of each one of them can be set on read operations, write operations, or the combination of both:

One additional parameter named iops_size allows us to deal with the case where big I/O operations can be used to bypass the limits we have set. In this case, if a particular I/O operation is bigger than iops_size then it is counted several times when it comes to calculating the I/O limits. So a 128KB request will be counted as 4 requests if iops_size is 32KB.

iops_size (throttling.iops-size): Size of an I/O request (in bytes).

Group throttling

All of these parameters I’ve just described operate on individual disk drives and have been available for a while. Since QEMU 2.4 however, it is also possible to have several drives share the same limits. This is configured using the new group parameter.

The way it works is that each disk with I/O limits is member of a throttle group, and the limits apply to the combined I/O of all group members using a round-robin algorithm. The way to put several disks together is just to use the group parameter with all of them using the same group name. Once the group is set, there’s no need to pass the parameter to block_set_io_throttle anymore unless we want to move the drive to a different group. Since the I/O limits apply to all group members, it is enough to use block_set_io_throttle in just one of them.

In this example, hd1, hd2 and hd4 are all members of a group named foo with a combined IOPS limit of 6000, and hd3 and hd5 are members of bar. hd6 is left alone (technically it is part of a 1-member group).

Next steps

I am currently working on providing more I/O statistics for disk drives, including latencies and average queue depth on a user-defined interval. The code is almost ready. Next week I will be in Seattle for the KVM Forum where I will hopefully be able to finish the remaining bits.

]]>http://blogs.igalia.com/berto/2015/08/14/io-limits-for-disk-groups-in-qemu-2-4/feed/12QEMU and open hardware: SPEC and FMC TDChttp://blogs.igalia.com/berto/2012/11/28/qemu-and-open-hardware-spec-and-fmc-tdc/
http://blogs.igalia.com/berto/2012/11/28/qemu-and-open-hardware-spec-and-fmc-tdc/#commentsWed, 28 Nov 2012 08:31:45 +0000http://blogs.igalia.com/berto/?p=526Continue reading QEMU and open hardware: SPEC and FMC TDC→]]>Working with open hardware

Some weeks ago at LinuxCon EU in Barcelona I talked about how to use QEMU to improve the reliability of device drivers.

At Igalia we have been using this for some projects. One of them is the Linux IndustryPack driver. For this project I virtualized two boards: the TEWS TPCI200 PCI carrier and the GE IP-Octal 232 module. This work helped us find some bugs in the device driver and improve its quality.

Now, those two boards are examples of products available in the market. But fortunately we can use the same approach to develop for hardware that doesn’t exist yet, or is still in a prototype phase.

Why we use QEMU

So we are developing the device driver for this hardware, as my colleague Samuel explains in his blog. I’m the responsible of virtualizing it using QEMU. There are two main reasons why we want to do this:

Limited availability of the hardware: although the specification is pretty much ready, this is still a prototype. The board is not (yet) commercially available. With virtual hardware, the whole development team can have as many “boards” as it needs.

Testing: we can test the software against the virtual driver, force all kinds of conditions and scenarios, including the ones that would probably require us to physically damage the board.

While the first point might be the most obvious one, testing the software is actually the one we’re more interested in.

Writing the virtual hardware

Writing a virtual version of a particular piece of hardware for this purpose is not as hard as it might look.

First, the point is not to reproduce accurately how the hardware works, but rather how it behaves from the operating system point of view: the hardware is a black box that the OS talks to.

Second, it’s not necessary to have a complete emulation of the hardware, there’s no need to support every single feature, particularly if your software is not going to use it. The emulation can start with the basic functionality and then grow as needed.

The FMC TDC, for example, is an FMC card which is in our case connected to a PCIe bridge called SPEC (also available in the Open Hardware repository).

We need to emulate both cards in order to have a working system, but the emulation is, at the moment, treating both as if they were just one, which makes it a bit easier to have a prototype and from the device driver point of view doesn’t really make a difference. Later the emulation can be split in two as I did with with TPCI200 and IP-Octal 232. This would allow us to support more FMC hardware without having to rewrite the bridging code.

There’s also code in the emulation to force different kind of scenarios that we are using to test if the driver behaves as expected and handles errors correctly. Those tests include the simulation of input in the any of the lines, simulation of noise, DMA errors, etc.

And we have written a set of test cases and a continuous integration system, so the driver is automatically tested every time the code is updated. If you want details on this I recommend you again to read Miguel’s post.

]]>http://blogs.igalia.com/berto/2012/11/28/qemu-and-open-hardware-spec-and-fmc-tdc/feed/2Igalia at LinuxCon Europehttp://blogs.igalia.com/berto/2012/11/05/igalia-at-linuxcon-europe/
http://blogs.igalia.com/berto/2012/11/05/igalia-at-linuxcon-europe/#respondMon, 05 Nov 2012 22:08:05 +0000http://blogs.igalia.com/berto/?p=517Continue reading Igalia at LinuxCon Europe→]]>I came to Barcelona with a few other Igalians this week for LinuxCon, the Embedded
Linux Conference and the KVM Forum.

We’ll be around the whole week so you can come and talk to us anytime. You can find us at our booth on the ground floor, where you’ll also be able to see a few demos of our latest work and get some merchandising.

In the past months we have been working at Igalia to give Linux support to IndustryPack devices.

IndustryPack modules are small boards (“mezzanine”) that are attached to a carrier board, which serves as a bridge between them and the host bus (PCI, VME, …). We wrote the drivers for the TEWS TPCI200 PCI carrier and the GE IP-OCTAL-232 module.

My mate Samuel was the lead developer of the kernel drivers. He published some details about this work in his blog some time ago.

The drivers are available in latest Linux release (3.6 as of this writing) but if you want the bleeding-edge version you can get it from here (make sure to use the staging-next branch).

IndustryPack emulation for QEMU

Along with Samuel’s work on the kernel driver, I have been working to add emulation of the aformentioned IndustryPack devices to QEMU.

The work consists on three parts:

TPCI200, the bridge between PCI and IndustryPack.

The IndustryPack bus.

IPOCTAL-232, an IndustryPack module with eight RS-232 serial ports.

I decided to split the emulation like this to be as close as possible to how the hardware works and to make it easier to reuse the code to implement other IndustryPack devices.

The emulation is functional and can be used with the existing Linux driver. Just make sure to enable CONFIG_IPACK_BUS, CONFIG_BOARD_TPCI200 and CONFIG_SERIAL_IPOCTAL in the kernel configuration.

I submitted the code to QEMU, but it hasn’t been integrated yet, so if you want to test it you’ll need to patch it yourself: get the QEMU source code and apply the TPCI200 patch and the IP-Octal 232 patch. Those patches have been tested with QEMU 1.2.0.

And here’s how you run QEMU with support for these devices:

$ qemu -device tpci200 -device ipoctal

The IP-Octal board implements eight RS-232 serial ports. Each one of those can be redirected to a character device in the host using the functionality provided by QEMU. The ‘serial0‘ to ‘serial7‘ parameters can be used to specify each one of the redirections.

Example:

$ qemu -device tpci200 -device ipoctal,serial0=pty

With this, the first serial port of the IP-Octal board (‘/dev/ipoctal.0.0.0‘ on the guest) will be redirected to a newly-allocated pty on the host.

LinuxCon Europe

Having virtual hardware allows us to test and debug the Linux driver more easily.

I came here in 1996 to study Computer Science. Here I discovered UNIX for the first time, and spent hours learning how to use it. It’s funny to see now those old UNIX servers being displayed in a small museum in the auditorium where the main track takes place.

It was also here where I learnt about free software, installed my first Debian system, helped creating the local LUG and met the awesome people that founded Igalia with me. Then we went international, but our headquarters and many of our people are still here so I guess we can still call this home.

So, needless to say, we are very happy to have GUADEC here this time.

I hope you all are enjoying the conference as much as we are. I’m quite satisfied with how it’s been going so far, the local team has done a good job organising everything and taking care of lots of details to make the life of all attendees easier. I especially want to stress all the effort put into the networkinfrastructure, one of the best that I remember in a GUADEC conference.

At Igalia we’ve been very busy lately. We’re putting lots of effort in making WebKit better, but our work is not limited to that. Our talks this year show some of the things we’ve been doing:

My name is Alberto Garcia and this is my first (well, second) post in Planet Debian. So I’ll introduce myself.

I’m a free software developer and enthusiast from Galicia, Spain. I studied computer science at the Corunha University, where I first heard about GNU/Linux and Debian. After leaving university I co-founded Igalia with a group of friends. We’ve been working on lots of different things during all these years, but some projects we’ve been particularly involved in include GNOME, Maemo/MeeGo and WebKit.

Although I’ve been using Debian for more than a decade now (my first -and still running!- installation was in 1997) and I’m quite familiar with the distribution, it wasn’t until a few years ago that I started maintaining packages officially.

Apart from software, my other main hobby is music. When I’m not using my computer you’ll find me at some concert (preferably small bands: I hate long queues, crowded places and being far from the stage). It’s no coincidence that my first Debian package was a Last.fm player

I don’t have much more to add. I’d like to thank Ana for adding me to the planet, and I’m proud to be part of the Debian community!

]]>http://blogs.igalia.com/berto/2011/11/13/hello-planet-debian/feed/3FileTea now available in Debianhttp://blogs.igalia.com/berto/2011/11/10/filetea-now-available-in-debian/
http://blogs.igalia.com/berto/2011/11/10/filetea-now-available-in-debian/#commentsThu, 10 Nov 2011 10:09:16 +0000http://blogs.igalia.com/berto/?p=414Continue reading FileTea now available in Debian→]]>In the past few weeks I’ve been preparing the Debian packages of FileTea and its companion EventDance. They’re finally available.

FileTea is a free, web-based file sharing system that just works. It only requires a browser, and no user registration is needed. If you want to know more about it, you can read my previous blog post. For a more detailed description, read Nathan Willis’s excellent article on LWN.net. There have been a few changes since that article (HTTPS support in particular) but it’s still the best one you can find on the net.

Igalia still provides a FileTea server at http://filetea.me/, that you can use to share your files and see how it works. We plan to keep offering this service, but you don’t need to trust it/depend on it anymore: now you can apt-get install filetea and have your own.