A few weeks back, a strange bird call started waking me up. Red-whiskered bulbuls are supposed to be fairly common, but I’d never heard one before, nor seen one up close.

There were two of them making rounds throughout the day, and they would frequently visit one plant in my terrace garden. I took that as a sign that they were building a nest, and ensured they weren’t disturbed when they visited. I tried taking a few pictures, but couldn’t manage good ones: these birds are very shy, and they’re very alert to any movement or human presence. There’s a better photo at Wikipedia.

Exactly one week after they first arrived, the nest was built, and they weren’t making as much noise as before. That made me curious. I checked out their nest, and was quite delighted to see an egg:

Red whiskered bulbul’s nest with egg

Red whiskered bulbul’s nest with egg — camera flash on.

The next day I woke up to mayhem: lots of bird noises near the plant. I didn’t feel like disturbing anything there. When the commotion died down, I went to check, and saw the egg was missing. That was a sad end to the week-long activity around the nest. I initially suspected the pigeons, permanent residents on the terrace, of having caused the damage. However, later in the day, a big crow came by near the same plant (crows had never come onto the terrace before). Wonder what this all means, and where the egg vanished.

The Linux Plumbers Conf wiki seems to have made the discussion notes for the 2012 conf read-only as well as visible only to people who have logged in. I suspect this is due to the spam problem, but I’ll put those notes here so that they’re available without needing a login. The source is here.

a paper on the ‘net showed perf improving from 5-6 Gb/s to wire speed, using emulation of this tech.

intel have numbers on their slides.

they used sr-iov 10gbe; measured vmexit

interrupt window: when hypervisor wants to inject interrupt, guest may not be running. hyp. has to enter vm. when guest is ready to receive interrupt, it comes back with vmexit. problem: as you need to inject interrupt, more vmexits, guest becomes busier. so: they wanted to eliminate them.

read case: if you have something in advance (apic page), hyp can just point to that instead of this exit dance

more than 50% exits are interrupt-related or apic related.

new features for interrupt/apic virt

reads are redirected to apic page

writes: vmexit after write; not intercepted. no need for emulation.

virt-interrupt delivery

extend tpr virt to other apic registers

eoi – no need for vm exits (using new bitmap)

this looks different from amd

but for eoi behaviour, intel/amd can have common interface.

intel/amd comparing their approaches / features / etc.

most notably, intel have support for x2apic, not for iommu. amd have support for iommu, not for x2apic.

for apic page, approaches mostly similar.

virt api can have common infra, but data structures are totally different. intel spec will be available in a month or so (update: already available now). amd spec should be available in a month too.

how do you handle disk access? network is easier – n/w stack resumes on failover. if you don’t do failover in a state where you know disk is in a consistent state, you can get corruption.

Two solutions

For NAS, do same compares as with responses (this can also trigger checkpoints).

On local disks, buffer the original state of changed pages, revert to the original, and then checkpoint with the primary node’s disk writes included. This is equivalent to how the memory image is updated. (This was not described completely enough during the session.)

that sounds dangerous. client may have acked data, etc.

will have to look closer at this. (More complete explanation above counters this)

how often do you get mismatch?

depends on workload. some were like 300-400 good packets, then a mismatch.

during that, are you vulnerable to failure?

no, can failover at any point. internal state doesn’t matter. Both VMs provide consistent request streams from their initial state and match responses up to the moment of failover.

allow IOMMU driver to define device visibility – not per-device, but the whole group exposed

more modular

what’s different from pci device assignment

x86 only

kvm only

no iommu grouping

relies on pci-sysfs

turns kvm into a device driver

current status

core pci and iommu drivers in 3.6

qemu will be pushed for 1.3

what’s next?

qemu integration

legacy pci interrupts

more of a qemu-kvm problem, since vfio already supports this, but these are unique since they’re level-triggered; host has to mask interrupt so it doesn’t cause a DoS till guest acks interrupt

like to bypass qemu directly – irqfd for edge-triggered. now exposing irqfd for level

(lots of discussion here)

libvirt support

iommu grps changed the way we do device assignment

sysfs entry point; move device to vfio driver

do you pass group by file descriptor?

lots of discussion on how to do this

existing method needs name for access to /sys

how can we pass file descriptors from libvirt for groups and containers to work in different security models?

The difficulty is in how qemu assembles the groups and containers. On the qemu command line, we specify an individual device, but that device lives in a group, which is the unit of ownership in vfio and may or may not be connectable to other containers. We need to figure out the details here.

POWER support

already adding

PowerPC

freescale looking at it

one api for x86, ppc was strange

error reporting

better ability to inject AER etc to guest

maybe another ioctl interrupt

What are we going to be able to do if we do get PCIe AER errors to show up at a device, what is the guest going to be able to do (for instance can it reset links).

We’re going to have to figure this out and it will factor into how much of the AER registers on the device do we expose and allow the guest to control. Perhaps not all errors are guest serviceable and we’ll need to figure out how to manage those.

Scaling at layer 2 is limited by the need to support broadcast/multicast over the network

overlay networks

when migrating across domains (subnets), have to re-number IP addresses

when migrating need to migrate IP and MAC addresses

When migrating across subnets, one might need to re-number or find another mechanism

solution is to have a set of tunnels

every end-user can view their domain/tunnel as a single virtual network

they only see their own traffic, no one else can see their traffic.

standardization is required

being worked on at IETF

the MTU seen by the VM is not the same as what is on the physical network (because of headers added by extra layers)

vxlan adds udp headers

one option is to have a large(r) physical MTU so it takes care of this; otherwise there will be fragmentation
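For instance, with vxlan adding roughly 50 bytes of headers, the physical interface’s MTU could be bumped above the usual 1500 so encapsulated guest frames still fit. A minimal sketch, assuming iproute2 and a placeholder interface name:

# Raise the physical NIC's MTU so 1500-byte guest frames plus the
# tunnel headers (vxlan adds ~50 bytes) fit without fragmentation.
ip link set dev eth0 mtu 1600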

Proposal

If the guest does path MTU discovery, let the tunnel end point return the ICMP error to reduce the guest’s view of the MTU.

Even if the guest has not set the DF (don’t fragment) bit, return an ICMP error. The guest will handle the ICMP error and update its view of the MTU on the route.

have the hypervisor co-operate so guests do path MTU discovery and things work fine

no guest changes needed, only hypervisor needs small change

(discussion) Cannot assume much about guests; guests may not handle ICMP.

Some way to avoid flooding

extend to support an ‘address resolution module’

Stephen Hemminger supported the proposal

Fragmentation

can’t assume much about guests; they may not like packets getting fragmented if they set DF

fragmentation highly likely since new headers are added

The above comment is wrong, since if DF is set we do path MTU discovery and the packet won’t be fragmented. Also, fragmentation, if done, is on the tunnel. The VMs don’t see fragmentation, but it is not performant to fragment and reassemble at the end points.

Instead, the proposal is to use path MTU discovery to make the VMs send packets that won’t need to be fragmented.

PXE, etc., can be broken

Distributed Overlay Ethernet Network

DOVE module for tunneling support

use 24-bit VNI

patches should be coming to netdev soon enough.

possibly using checksum offload infrastructure for tunneling

question: layer 2 vs layer 3

There is interest in the industry to support overlay solutions for layer 2 and layer 3.

Several applications need random numbers for correct and secure operation. When an ssh server gets installed on a system, public and private key pairs are generated; random numbers are needed for this operation. The same goes for creating a GPG key pair. Initial TCP sequence numbers are randomized. Process PIDs are randomized. Without such randomization, we’d get a predictable set of TCP sequence numbers or PIDs, making it easy for attackers to break into servers or desktops.

On a system without any special hardware, Linux seeds its entropy pool from sources like keyboard and mouse input, disk IO, network IO, and any other sources whose kernel modules indicate they are capable of adding to the kernel’s entropy pool (i.e., the interrupts they receive are from sufficiently non-deterministic sources). For servers, keyboard and mouse input is rare (most don’t even have a keyboard or mouse connected). This makes getting true random numbers difficult: applications requesting random numbers from /dev/random have to wait for indefinite periods to get the randomness they desire (like creating ssh keys, typically during firstboot).
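To get a feel for this, the kernel exposes the state of its entropy pool via procfs; a quick sketch to check how much entropy is currently available (values are in bits):

# How much entropy the kernel currently has, and the pool's total size
cat /proc/sys/kernel/random/entropy_avail
cat /proc/sys/kernel/random/poolsize

On an idle server or VM, entropy_avail tends to hover at low values, which is exactly when reads from /dev/random block.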

Applications that need random numbers instantaneously, but can make do with slightly lower-quality random numbers, have the option of getting their randomness from /dev/urandom, which doesn’t block to serve random numbers; it’s just not guaranteed that the numbers one receives from /dev/urandom truly reflect pure randomness. Indiscriminate reading of /dev/urandom will reduce the system’s entropy levels and will starve applications that need true random numbers. Random numbers in a system are a scarce resource, so applications should only fetch them when they are needed, and only read as many bytes as needed.

There are a few random number generator devices that can be plugged into computers. These can be PCI or USB devices, and are fairly popular add-ons on servers. The Linux kernel has a hwrng (hardware random number generator) abstraction layer to select an active hwrng device among the several that might be present, and ask that device for random data when the kernel’s entropy pool falls below the low watermark. The rng-tools package comes with rngd, a daemon that reads input from hwrngs and feeds it into the kernel’s entropy pool.
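A sketch of how this looks in practice, assuming the rng-tools package is installed and the kernel has detected a hardware RNG (the sysfs paths and the /dev/hwrng node are the usual ones, but may vary):

# See which hardware RNGs the kernel found, and which one is active
cat /sys/class/misc/hw_random/rng_available
cat /sys/class/misc/hw_random/rng_current

# Run rngd, reading from the hwrng device and feeding the kernel's pool
rngd -r /dev/hwrng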

Virtual machines are similar to server setups: there is very little going on in a VM’s environment for the guest kernel to source random data from. A server that hosts several VMs may still have a lot of disk and network IO happening as a result of all the VMs it hosts, but a single VM by itself may not be doing enough to generate entropy for its applications. One solution to sourcing random numbers in VMs, therefore, is to ask the host for a portion of the randomness it has collected, and feed that into the guest’s entropy pool. A paravirtualized hardware random number generator exists for KVM VMs. The device is called virtio-rng, and as the name suggests, it sits on top of the virtio PV framework. The Linux kernel gained support for virtio-rng devices in kernel 2.6.26 (released in 2008). The QEMU-side device was added in the recent 1.3 release.

On the host side, the virtio-rng device (by default) reads from the host’s /dev/random and feeds that into the guest. The source of this data can be modified, of course. If the host lacks any hwrng, /dev/random is the best source to use. If the host itself has a hwrng, using input from that device is recommended.
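A sketch of the QEMU command-line options involved (the syntax introduced around QEMU 1.3; the binary name varies by distribution, and the VM’s disk and other options are left out for brevity):

# Back a virtio-rng device with the host's /dev/random (the default source)
qemu-kvm -m 1024 \
    -object rng-random,filename=/dev/random,id=rng0 \
    -device virtio-rng-pci,rng=rng0

# If the host has a hardware RNG, point the backend at it instead:
#   -object rng-random,filename=/dev/hwrng,id=rng0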

Newer Intel architectures (IvyBridge onwards) have an instruction, RDRAND, that provides random numbers. This instruction can be directly exposed to guests: guests probe for the presence of the instruction (using CPUID) and use it if available. This doesn’t need any modification to the guest. However, there’s one drawback to exposing this instruction to guests: live migration. If not all hosts in a server farm have the same CPU, live-migrating a guest from a host that exposes this instruction to another that doesn’t will not work. In this case, virtio-rng in the host can be configured to use RDRAND as its source, and the guest can continue to work as in the previous example. This is still sub-optimal, as we’ll be passing random numbers to the guest (as in the case of /dev/random), instead of real entropy. The RDSEED instruction, to be introduced later (Broadwell onwards), will provide entropy that can be safely passed on to a guest via virtio-rng as a source of true random entropy, eliminating the need to have a physical hardware random number generator device.
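A quick way to see whether the instruction is visible inside a guest is to look at the CPU flags; exposing it is typically a matter of the CPU model chosen for the guest (a sketch; ‘-cpu host’ simply passes the host’s CPU features through):

# Inside the guest: check whether the virtual CPU advertises rdrand
grep -m1 -o rdrand /proc/cpuinfo && echo "rdrand available"

# On the host, exposing the host CPU model passes rdrand through
# (when the host CPU has it), e.g.:  qemu-kvm -cpu host ...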

It looks like QEMU/KVM is the only hypervisor that has support for exposing a hardware random number generator to guests. (One could pass through a real hwrng to a guest, but that doesn’t scale and isn’t practical for all situations, e.g. live migration.) Fedora 19 will have QEMU 1.4, which has the virtio-rng device, and even older guests running on top of F19 will be able to use the device.

For more information on virtio-rng, see the QEMU feature page, and the Fedora feature page. LWN.net has an excellent article on random numbers, based on H. Peter Anvin’s talk at LinuxCon EU 2012.

Updated 2013 May 22: Added info about RDSEED and the Fedora feature page, corrected a few typos.

I’ve been using the Fedora 18 pre-release for a couple of months now, and am generally happy with how it works. I filed quite a few bugs; some got resolved, some didn’t. Here’s a list of things that don’t work as they used to in the past, along with workarounds that may help others:

Bug 878619 – Laptop always suspends on lid close, regardless of g-s-t policy: I used to set the action on laptop lid close to just lock the screen instead of suspending the machine; I would suspend via the function keys or the menu item when I wanted to. However, with GNOME 3.6 in F18, the ‘suspend’ menu item has gone away, replaced by ‘Power Off’. The developers have now removed the dconf settings to tweak the lid close action (via gnome-tweak-tool or dconf-editor). As described in GNOME Bug 687277, this can be worked around by taking a systemd inhibitor lock:
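A minimal sketch of taking such a lock, assuming systemd-inhibit is available and logind is handling the lid switch; the --who/--why strings are arbitrary, and the lock is held only while the command runs, so start it from a login script:

# Hold a handle-lid-switch inhibitor lock so closing the lid doesn't suspend;
# the screen lock and other policies still apply.
systemd-inhibit --what=handle-lid-switch \
    --who="lid-close tweak" --why="lock instead of suspend" \
    sleep infinity &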

Bug 878412 – Cannot assign shortcuts to switch to workspaces 5+: I use keyboard shortcuts (Ctrl+F<n>) to switch workspaces. Till F16, I could assign shortcuts to as many workspaces as were currently in use. Curiously, with F18, shortcuts can only be assigned to workspaces 1 through 4. This was a major productivity blocker for me, and an ugly workaround is to create a shell script that switches workspaces via window manager commands: install ‘wmctrl’, and create custom shortcuts to switch workspaces by invoking ‘wmctrl -s <workspace-1>’. wmctrl counts workspaces from 0, so to switch to workspace 5, invoke ‘wmctrl -s 4’.
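For example, a tiny wrapper script like the following can be bound to custom shortcuts (Ctrl+F5 and up); the script name is just a suggestion:

#!/bin/bash
# switch-workspace.sh N - switch to workspace N (1-based), working around
# GNOME 3.6's four-workspace shortcut limit by calling wmctrl.
ws="$1"
wmctrl -s $(( ws - 1 ))   # wmctrl numbers workspaces from 0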

Bug 878736 – Desktop not shown after unlocking screensaver: This one is due to some focus-stealing apps and gnome-shell’s new screensaver not working well together. I use workrave, an app that helps me keep my eyesight and wrists in relatively good shape. Other people have complained that even SDL windows (games, qemu VMs, etc.) interact badly with the new screensaver. As my workaround, I’ve set workrave to not capture focus for now.

Bug 878981 – “Alt + Mouse click in a window + mouse move” doesn’t move windows anymore: The modifier key has now changed to the ‘Super’ key, so Super + mouse click + mouse move works the same way using the Alt key did earlier. I’m still missing the window resize modifier that KDE offers (modifier key + right-click + mouse move).

Other than these, a couple of bugs that affect running F18 in virtual machines:

Bug 864567 – display garbled in KVM VMs on opening windows: Using any display driver other than cirrus for the guest works fine.

Bug 810040 – F17/F18 xen/kvm/vmware/hyperv guest with no USB: gnome-shell fails to start if fprintd is present: I mentioned this earlier as well: remove fprintd in the VM, or add ‘-usb’ to the qemu command line.

We have post of Mystery Shopper in your area. All you need is to act like a customer, you be will surveying different outlets like Walmart, Western Union, etc and provide us with detailed information about their service.

You will get $200.00 per one task and you can handle as many tasks as you want. Each assignment will take one hour and it wont affect your present occupation because it is flexible.

Before any task we will give you with the resources needed. You will be sent a check or money order, which you will cash and use for the task. Included to the check would be your assignment payment, then we will provide you details through email. You just need to follow instruction given to you as a Secret Shopper.

If you are interested, please fill in the details below and send it back to us to john_paul2_john@aol.com for approval.

First Name:
Last Name:
Full Address:
City, State and Zip code:
Cell and Home Phone Numbers:
Email:

Hope to hear from you soon.

Head of Operations,
John Paul.

I can’t resist going shopping — and being paid for it! Posted this here in case anyone else missed this email due to “bad” spam filters. We don’t have Walmart here yet, but we certainly do have Western Union.

PS: If you’re interested in treasure hunts: can you spot who’s actually sending these messages?

If you have enabled git information in the shell prompt (like branch name, working tree status, etc.) [1], an upgrade to F18 breaks this functionality. What’s worse, __git_ps1 (a shell function) isn’t found, and a yum plugin goes looking for a matching package name to install, making running any command on the shell *very* slow.
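A likely fix, assuming the underlying cause is that git’s prompt helper moved out of the bash-completion script into a separate file (the path below is where Fedora’s git package usually installs it; adjust if yours differs), is to source it explicitly from ~/.bashrc:

# Define __git_ps1 again by sourcing git's prompt helper directly
if [ -f /usr/share/git-core/contrib/completion/git-prompt.sh ]; then
    . /usr/share/git-core/contrib/completion/git-prompt.sh
fi

With the function defined locally, the command-not-found handler no longer goes searching for a package on every prompt, so the shell is fast again.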

Avi Kivity announced he is stepping down as (co-)maintainer of the KVM Project at the recently-concluded KVM Forum 2012 in Barcelona, Spain. Avi wrote the initial implementation of the KVM code back at Qumranet, and has been maintaining the KVM-related kernel and qemu code for about 7 years now.

In his keynote speech, he mentioned he’s founding a startup with a friend, and hopes to create new technology as exciting as KVM. He also mentioned they’re in stealth mode right now, so questions about the new venture didn’t get any answers.

He returned to the stage on the second day of the Forum to talk about the new memory API work he’s been doing in qemu, and in his typical dry humour, he mentioned he was supposed to vanish in a puff of smoke after his keynote, but the special effects machinery didn’t work, so he was back on stage. Avi later rued the lack of laughter at this joke, and that made him very sad. To offer him some consolation, it was pointed out that not everyone knew of his departure, as many had missed his keynote. He quipped “that’s even worse than not getting laughs”.

His leadership, as well as his humour, will be missed. Personally, he’s helped me grow during the last few years we’ve worked together. But I’m sure whatever he’s working on will be something to look forward to, and we’re not really bidding him adieu from the tech world.

I’ve tried several RSS feed readers, offline as well as online: Akregator, Liferea, and rss2email being the ones I tried for a long time. One drawback with these offline tools is that they may miss feed items when I’m offline for prolonged periods (travel, vacations, etc.). Also, they’re tied to one device; I can’t switch laptops and have the feeds stay in sync. I tried Google Reader for a while as well, for a solution in the “cloud”, which worked for a while, but not anymore.

So I started searching for an online feed reader, preferably with hosting services, since I didn’t want to keep up with updates to the software. I found several free readers, and Tiny Tiny RSS seemed like a really good option. The developer hosts an online version of the reader, which I used for quite a while. (That online service is soon going to be discontinued.) I was quite content with that option, but when OpenShift was launched, I thought I’d try hosting tt-rss myself: it initially began as an experiment in using OpenShift. Then, when I moved this blog to OpenShift, I realised it didn’t really take much effort to host the blog, and that I could switch my primary instance of tt-rss from the developer-hosted instance to my own. It turned out to be really easy, and here I’ll share my recipe.

After this initial setup, I copied all the files from the ttrss src dir to the php/ directory of the OpenShift repo:

cp -r ~/src/Tiny-Tiny-RSS/* ~/openshift/ttr/php/

Next is to add all the files to the git repo:

cd ~/openshift/ttr/
git add php
git commit -m 'Add tt-rss sources'

Now to set up the environment on the server for tt-rss to work in, e.g. creating the directories where tt-rss will store its feed icons, temporary files, etc. This is needed because the OpenShift git directory is transient: it’s deleted and re-created whenever a ‘git push’ is done. So, to store persistent data between git pushes, we need to use the OpenShift data directory. Create an app build-time action hook to set up the proper directory structure each time the app is built (i.e. after a git push); a sketch of such a hook follows. Learn more about the different build hooks here.
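A minimal sketch of such a hook, saved as .openshift/action_hooks/build and made executable. The directory names (‘ico’ for icons, plus cache and lock directories) are assumptions based on this setup and tt-rss’s usual writable directories; adjust them to match your config:

#!/bin/bash
# Build hook: runs on the gear after each git push.
# Keep tt-rss's writable directories in the persistent data dir and
# symlink them into the freshly re-created repo checkout.
set -e
for d in ico cache lock; do
    mkdir -p "${OPENSHIFT_DATA_DIR}/${d}"
    ln -sfn "${OPENSHIFT_DATA_DIR}/${d}" "${OPENSHIFT_REPO_DIR}/php/${d}"
done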

The last icons bit is a modification from the default of ‘feed-icons’. If you’re setting up a new repo, there’s no need to deviate from the default, but when I deployed my tt-rss instance, the default icons directory was ‘icons’, which unfortunately clashes with Apache’s idea of what $URL/icons is. So I used ‘ico’. Remember to modify the build hook above to create the appropriate symlink if this ICONS_URL is changed.

These config settings are the ones specific to OpenShift. Modify the others to suit your needs.

Lastly, add a cron job to update the feeds at an hourly interval:

cd ~/openshift/ttr
mkdir .openshift/cron/hourly

I created a new file, called update-feeds.sh, in the new .openshift/cron/hourly directory, and added the following to it:
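What goes into it is essentially a call to tt-rss’s feed updater; a sketch, assuming the php/ layout from above and tt-rss’s standard update.php entry point:

#!/bin/bash
# Hourly cron job: ask tt-rss to fetch new items for all feeds.
cd "${OPENSHIFT_REPO_DIR}/php" && php ./update.php --feeds --quiet

Remember to make the script executable before committing it.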

The 2012 edition of the Linux Plumbers Conference concluded recently. I was there, running the virtualization microconference. The format of LPC sessions is to have discussions around current as well as future projects. The key words are ‘discussion’ (not talks — slides are optional!) and ‘current’ and ‘future’ projects — not discussing work that’s already done; rather discussing unsolved problems or new ideas. LPC is a great platform for getting people involved in various subsystems across the entire OS stack in one place, so any sticky problems tend to get resolved by discussing issues face-to-face.

The virt microconf had A LOT of submissions: 17 topics to be discussed in a standard time slot of 2.5 hours for one microconf track. I asked for a ‘double track’, making it 5 hours of time for 17 topics. Still difficult, but by reducing a few topics to ‘lightning talks’, we could get a somewhat decent 20 minutes per topic. I contemplated between rejecting topics, thus increasing the time each discussion would get, and keeping all the topics and asking people to wrap up in 20 minutes. I went for the latter: getting more stuff discussed (and hence, more problems / issues ‘out there’) is a better use of the time, IMO. That would also ensure that people stayed on-topic and focussed.

There was also a general change in the way microconfs were scheduled this time: the microconfs were not given a complete 2.5-hour slot. Rather, they were given 3 slots of 45 minutes each. This let the schedule pages show the topics of the microconfs being discussed at that time, so attendees could pick and choose the discussions they wanted to attend, rather than seeing a generic ‘Virtualization Microconf’ slot. I think this was a good idea. Individual microconf owners could request modifications to this scheme, of course, and some microconfs just chose to run the entire session in one slot, or reserved one whole day in a room, etc. For the virt microconf, I went with six separate slots, scheduled in a way to avoid conflicts with other virt-related topics in other sessions, giving a total of 4.5 hours for 17 topics.

I segregated the CFP submissions so I could schedule related discussions in one slot, to avoid jumping between subjects and to also help concentrate on specifics in an area. Two submissions, one on security and one on storage, were by themselves, so I clubbed them into one ‘security and storage‘ session. The others were nicely aligned, so we could have ‘x86‘, ‘MM‘, ‘ARM‘, ‘Networking‘ and ‘lightning talks’ topics in separate slots. Since there were 4 network-related talks, I asked for a double slot (two 45-min slots back-to-back), and clubbed the lightning talks in the same session, which was scheduled to be the last session for the virt microconf.

Given this, I would say the microconf went quite well: the notes and slides are up at the LPC 2012 virt microconf wiki, and we got good discussions going for most of the topics, given the time constraints. Of course, a major benefit of going to conferences is meeting people outside of the sessions, in the hallways and at social events, and the discussions continued there as well. I did factor this extra time into the ‘reject vs. take all of them’ decision mentioned earlier. From what I heard, the beer at the social events failed to stop technical discussions, so it all worked out for the best.

Each microconf owner (or a representative) had to do a short summary at the end of the LPC, for the benefit of the people not present for some sessions. I did the virt summary in roughly these words:

We had a quite productive virtualization microconference. We received a lot of submissions, and accepted them all, which meant we had to limit the time for each discussion in the slots; but we could divide the slots by general topic, effectively increasing the discussion time for the larger topic.

We had a healthy representation from the KVM as well as Xen sides. For example, in the MM topic, we discussed NUMA awareness for KVM as well as Xen. Dario Faggioli presented the Xen side, and Andrea Arcangeli spoke on the Linux/KVM side, about AutoNUMA. It has been contentious on the mailing lists, but from the Kernel Summit discussions, it looked like some agreement would be reached soon. Xen uses a similar approach to AutoNUMA, and they would end up pushing their patches soon as well. Daniel Kiper spoke about integrating the various balloon drivers in the kernel to remove code duplication.

Both AMD and Intel publicly announced new hardware features for interrupt virtualization for the first time here, and it was interesting to see them compare notes and find out what the other is doing and how: for example, do they support IOMMU? x2apic? Etc.

New ARM architecture support work was presented by Marc Zyngier for the KVM effort, and Stefano Stabellini for the Xen effort. Much of the work seems to be done, and patches are in a shape to be applied for the next merge window. There are a few open issues, and they were discussed as well.

We had quite a few talks in the networking session. Alex Williamson spoke about VFIO, which seemed to get mentioned a lot throughout the conference in multiple sessions. This is a new way of doing device assignment, and progress looks positive, with the kernel side already merged in 3.6 and qemu patches queued up for 1.3. Alex Graf then talked about ‘semi-assignment’, a way to do device assignment (or PCI passthrough) while also getting proper migration support. The effort involves writing device emulation for each supported device, and the approach wasn’t too popular. IBM and Intel engineers have been doing virtio-net scalability testing, and John Fastabend spoke about some optimisations, which were generally well received. We should expect patches and more benchmarks soon. Vivek Kashyap spoke about network overlays, and how creating a tunnel for VM networks can help with VM migration across networks.

We also had a session on security, by Paul Moore, who gave an overview of the various methods to secure VMs, specifically the new seccomp work.

Lastly, we had Bharata Rao talk about introducing a glusterfs backend for qemu’s block layer, which gives more flexibility in handling disk storage for VMs.

The organisers are collecting feedback, so if you were there, be sure to let them know of your experience, and what we could do better in the coming years.

The GNOME default of ‘hibernate’ (suspend-to-disk) on very low battery power isn’t optimal for many laptops: hibernate is known to be broken on several hardware setups, it frequently results in file system corruption, and it just causes pain. That, combined with the GNOME power manager’s weird behaviour of putting the system into hibernate even when the battery isn’t low, annoyed me enough to go hunting for a way to change the default.

The GUI doesn’t expose a ‘sleep’ setting; it just offers hibernate and shutdown. So here’s a tip to make the system go to the sleep state (suspend to RAM) instead, which is a much better-behaved default for me.

Install dconf-editor, and go to

org.gnome.settings-daemon.plugins.power

and modify the

critical-battery-action

to suspend.
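If you prefer the command line to dconf-editor, the same change can be made with gsettings (same schema and key as above):

# Switch the critical-battery action from hibernate to suspend
gsettings set org.gnome.settings-daemon.plugins.power critical-battery-action 'suspend'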

For the curious, the weird behaviour of the GNOME power manager I mentioned above is noted in these bug reports: