February 4, 2016

So recently I got pointed to an aging blocker bug that needed attention, since it negatively affected some rawhide users: they weren’t able to launch certain applications. Three known broken applications were gnome-terminal, nautilus, and gedit. Other applications worked, and even these 3 applications worked in wayland, but not Xorg. The applications failed with messages like:

left in the log. These messages means that the programs are unable to create a connection to the X server. There are only a few reasons this error message could get displayed:

— The socket associated with the X server has become unavailable. In the old days this could happen if, for instance, the socket file got deleted from /tmp. Adam Jackson fixed the X server a number of years ago, to also listen on abstract sockets to avoid that problem. This could also happen if SELinux was blocking access to the socket, but users reported seeing the problem even with SELinux put in permissive mode.

— The X server isn’t running. In our case, clearly the X server is running since the user can see their desktop and launch other programs

— The X server doesn’t allow the user to connect because that user wasn’t given access, or that user isn’t providing credentials. These programs are getting run as the same user who started the session, so that user definitely has access.

— GDM doesn’t require users to provide separate credentials to use the X server, so that’s not it either.

— $DISPLAY isn’t set, so the client doesn’t know which X server to connect to. This is the only likely cause of the problem. Somehow $DISPLAY isn’t getting put in the environment of these programs.

So the next question is, what makes these applications “special”? Why isn’t $DISPLAY set for them, but other applications work fine? Every application has a .desktop file associated with it, which is a small config file giving information about the application (name, icon, how to run it, etc). When a program is run by gnome-shell, gnome-shell uses the desktop file of that program to figure out how to run it. Most of the malfunctioning programs have this in their desktop files:

DBusActivatable=true

That means that the shell shouldn’t try to run the program directly, instead it should ask the dbus-daemon to run the program on the shell’s behalf. Incidentally, the dbus-daemon then asks systemd to run the program on the dbus-daemon’s behalf. That has lots of nice advantages, like automatically integrating program output to the journal, and putting each service in its own cgroup for resource management. More and more programs are becoming dbus activatable because it’s an important step toward integrating systemd’s session management features into the desktop (though we’re not fully there yet, that initiative should become a priority at some point in the near-to-mid future). So clearly the issue is that the dbus-daemon doesn’t have $DISPLAY in its activation environment, and so programs that rely on D-Bus activation aren’t able to open a display connection to the X server. But why?
When a user logs in, GDM will start a dbus-daemon for that user before it starts the user session. It explicitly makes sure that DISPLAY is in the environment when it starts the dbus-daemon so things should be square. They’re obviously not, though, so I decided to try to reproduce the problem. I turned off my wayland session and instead started up an Xorg (actually I used a livecd since I knew for sure the livecd could reproduce the problem) and then looked at a process listing for the dbus-daemon:

This wasn’t run by GDM ! GDM uses different command line arguments that these when it starts the dbus-daemon. Okay, so if it wasn’t getting started by GDM it had to be getting started by the systemd during the PAM conversation right before GDM starts the session. I knew this, because there isn’t really thing other than systemd that runs after the user hits enter at the login screen before gdm starts the user’s session. Also, the command line arguments above in the dbus-daemon instance say ‘–systemd-activation’ which is pretty telling. Furthermore, if a dbus-daemon is already running GDM will avoid starting a second one, so this all adds up. I was surprised that we were using the so called “user bus” instead of session bus already in rawhide. But, indeed, running

show’s we’re clearly starting the dbus-daemon before GDM starts the session. Of course, this poses the problem. The dbus-daemon can’t possibly have DISPLAY set in its environment if it’s started before the X server is started. Even if it “wanted” to set DISPLAY it couldn’t even know what value to use, since there’s no X server running yet to tell us the DISPLAY !

So what’s the solution? Many years ago I added a feature to D-Bus to allow a client to change the environment of future programs started by the dbus-daemon. This D-Bus method call, UpdateActivationEnvironment, takes a list of key-value pairs that are just environment variables which get put in the environment of programs before they’re activated. So, the fix is simple, GDM just needs to update the bus activation environment to include DISPLAY as soon as it has a DISPLAY to include.

February 1, 2016

So in my last blog post I mentioned Matthias was getting SIGBUS when using wayland for a while. You may remember that I guessed the problem was that his /tmp was filling up, and so I produced a patch to stop using /tmp and use memfd_create instead. This resolved the SIGBUS problem for him, but there was something gnawing at me: why was his /tmp filling up? I know gnome-terminal stores its unlimited scrollback buffer in an unlinked file in /tmp so that was one theory. I also have seen, in some cases firefox downloading files to /tmp. Neither explanation sat well with me. scrollback buffers don’t get that large very quickly and Matthias was seeing the problem several times a day. I also doubted he was downloading large files in firefox several times a day. Nonetheless, I shrugged, and moved on to other things…

…until Thursday. Kevin Fenzi mentioned on IRC that he was experiencing a 12GB leak in gnome-shell. That piqued my interest and seemed pretty serious, so I started to troubleshoot with him. My first question was “Are you using the proprietary nvidia driver?”. I asked this because I know the nvidia driver has in the past had issues with leaking memory and gnome-shell. When Kevin responded that he was on intel hardware I then asked him to post the output of /proc/$(pidof gnome-shell)/maps so we could see the make up of the lost memory. Was it the heap? or some other memory mapped regions? To my surprise it was the memfd_create’d shared memory segments from my last post ! So window pixel data was getting leaked. This explains why /tmp was getting filled up for Matthias before, too. Previously, the shared memory segments resided in /tmp after all, so it wouldn’t have taken long for them to use up /tmp.

Of course, the compositor doesn’t create the leaked segments, the clients do, and then those clients share them with the compositor. So we probed a little deeper and found the origin of the leaking segments; they were coming from gnome-terminal. My next thought was to try to reproduce. After a few minutes I found out that typing:

$ while true; do echo; done

into my terminal and then switching focus to and from the terminal window made it leak a segment every time focus changed. So I had a reproducer and just needed to spend some time to debug it. Unfortunately, it was the end of the day and I had to get my daughter from daycare, so I shelved it for the evening. I did notice before I left, though, one oddity in the gtk+ wayland code: it was calling a function named _gdk_wayland_shm_surface_set_busy that contained a call to cairo_surface_reference. You would expect a function called set_something to be idempotent. That is to say, if you call it multiple times it shouldn’t add a new reference to a cairo surface each time. Could it be the surface was getting set “busy” when it was already set busy, causing it to leak a reference to the cairo surface associated with the shared memory, keeping it from getting cleaned up later?

I found out the next day, that indeed, was the case. That’s when I came up with a patch to make sure we never call set_busy when the surface was already busy. Sure enough, it fixed the leak. I wasn’t fully confident in it, though. I didn’t have a full big picture understanding of the whole workflow between compositor and gtk+, and it wasn’t clear to me if set_busy was supposed to ever get called when the surface was busy. I got in contact with the original author of the code, Jasper St. Pierre, to get his take. He thought the patch was okay (modulo some small style changes), but also said that part of the existing code needed to be redone.

The point of the busy flag was to mark a shared memory region as currently being read by the compositor. If the buffer was busy, then the gtk+ couldn’t draw to it without risking stepping on the compositors toes. If gtk+ needed to draw to a busy surface, it instead allocated a temporary buffer to do the drawing and then composited that temporary buffer back to the shared buffer at a later time. The problem was, as written, the “later time” wasn’t necessarily when the shared buffer was available again. The temporary buffer was created right before the toolkit staged some pixel updates, and copied back to the shared buffer after the toolkit was done with that one draw operation. The temporary buffer was scoped to the drawing operation, but the shared buffer wouldn’t be available for new contents until the next frame event some milliseconds later.

So my plan, after conferring with Matthias, was to change the code to not rely on getting the shared buffer back. We’d allocate a “staging” buffer, do all draw operations to it, hand it off to the compositor when we’re done doing updates and forget about it. If we needed to do new drawing we’d allocate a new staging buffer, and so on. One downside of this approach is the new staging buffer has to be initialized with the contents of the previously handed off buffer. This is because, the next drawing operation may only update a small part of the window (say to blink a cursor), and we need the rest of the window to properly get drawn in that. This read back operation isn’t ideal, since it means copying around megabytes of pixel data. Thankfully, the wayland protocol has a mechanism in place to avoid the costly copy in most cases:

→ If a client receives a release event before the frame callback
→ requested in the same wl_surface.commit that attaches this
→ wl_buffer to a surface, then the client is immediately free to
→ re-use the buffer and its backing storage, and does not need a
→ second buffer for the next surface content update.

So that’s our out. If we get a release event on the buffer before the next frame event, the compositor is giving us the buffer back and we can reuse it as the next staging buffer directly. We would only need to allocate a new staging buffer if the compositor was tardy in returning the buffer to us. Alright, I had plan and hammered out a patch on friday. It didn’t leak, and from playing with the machine for while, everything seemed to function, but there was one hiccup: i set a breakpoint in gdb to see if the buffer release event was coming in and it wasn’t. That meant we were always doing the expensive copy operation. Again, I had to go, so I posted the patch to bugzilla and didn’t look at it again until the weekend. That’s when I discovered mutter wasn’t sending the release event for one buffer until it got replaced by another. I fixed mutter to send the release event as soon as it uploaded the pixel data to the gpu and then everything started working great, so I posted the finalized version of the gtk+ patch with a proper commit message, etc.

There’s still some optimization that could be done for compositors that don’t handle early buffer release. Rather than initializing the staging buffer using cairo, we could get away with doing a lone memcpy() call. We know the buffer is linear and each row is right next to the previous in memory, so memcpy might be faster than going through all the cairo/pixman machinery. Alternatively, rather than initializing the staging buffer up front with the contents of the old buffer, we could wait until drawing is complete, and then only draw the parts of the buffer that haven’t been overwritten. Hard to say what the right way to go is without profiling, but both weston on gl and mutter support the early release feature now, so maybe not worth spending too much time on anyway.

January 26, 2016

So every year or two I try to get my blog going again, end up doing a post or two and then drop off the map again. 2016’s new years resolution is to blog more, so here’s one more go… I’ll just give a summary of some of things I did in the last few days (mostly rambling, hope that’s okay).

spontaneous sigbus in applications under wayland — Matthias was getting spontaneous crashes from wayland clients when using his desktop for an extended period of time. Normally when it hit him, several apps would go down at once. Stacktrace showed it dying in pixman in an SSE function. My first thought was some sort of memory alignment problem in the image buffer since I know SSE2’s has alignment restrictions. I read through the code as carefully as I could and did come up with some clean ups but nothing to suggested incorrect alignment. pixman handled 4 byte aligned image data fine, and we were using 4 bytes per pixel. My next guess was that perhaps the filesystem backing the shared memory was getting full. GDK was using /tmp, which gets used for a lot of other random things, so it was at least a plausible theory. Indeed, if I ran /usr/bin/fallocate to allocate a large file and fill up my own /tmp I could reproduce the failure. So my first inclination was to switch to shm_open() instead of open() so that /dev/shm would get used in place of /tmp. The downside to that, is there’s no shm_open() equivalent to mkstemp, so the patch included some hairy code for generating a non-clashing shm segment name. My next idea was to use memfd_create which landed in the kernel a couple of years ago, and still has that new car smell. Turns out there’s no glibc wrapper for it yet, though a patch was posted. I chatted with the libc maintainer and the author of the patch, it just fell through the cracks for non-technical reasons, so will hopefully land in the future. In the mean time, I did the patch using syscall() directly. Normally when resorting to something as ad hoc as a raw syscall() call, I would do configure checks and fallback code, but since this is all wayland anyway, and not relevant on legacy systems, we can get away without that baggage.

another round at a bug nvidia users encounter at the login screen — Last year I got a report of a very strange bug: for users of proprietary nvidia driver, after they type their password and hit enter, the login screen would just freeze up and login wouldn’t proceed until they switched vts or hit escape. This was an extremely strange bug that had multiple facets explained here

. Unfortunately, one of the four fixes ended up causing a mysterious crasher for a user, and it wasn’t strictly needed as long as one or more of the other fixes went in, so I reverted it until I could investigate the crash in more detail. Not having the patch did have a downside, though: It meant an animation stayed running in the background using up CPU. Recently, it’s been release time, and I wanted a fix in, so I threw the patch back to master and chatted with the reporter on IRC to debug with him the crasher. After some back and forth we discovered the problem and fixed a memory leak, too.

A couple of weeks ago someone on IRC asked for my help debugging a really strange mouse input issue. They told me their pointer wasn’t working very well, and I asked them to switch VTs and switch back to see if that helped. My thought was that perhaps, the X server wasn’t getting access to the input device back from logind and that switching VTs and switching back would fix it. He reported happily that it did fix the problem, and then went into greater detail explaining the symptoms. It wasn’t that the mouse didn’t work at all, it was that it only worked for one application at a time (usually firefox or gnome-terminal but also other applications). That application could change tabs and click, but its window couldn’t be dragged around and no other windows could gain focus. keyboard hot keys worked, but he couldn’t open up the activities overview, and no system modal dialogs would display. This sounded a lot like a stuck pointer grab, but it was weird that it was happening in more than one application, since stuck pointer grabs are normally application specific bugs. After switching VTs he couldn’t reproduce the problem anymore, so I told him to come grab me in person (we work in the same building) if it happened again. Earlier today he said it happened again, so I visited his cubicle. After some debugging, I discovered I could unbreak the system temporarily by gdb attaching to the X server and calling a function to clear all grabs. Then as long as I clicked on window title bars, the system stayed functional, but as soon as I clicked inside the application content area, the application clicked would get a stuck grab and the problem would manifest. I was able to reproduce the problem with xterm, so I knew it wasn’t toolkit specific. I was a little stumped, so I did a chvt call and got him going again, then went back to my cube and thought about it for a few minutes. That’s when I remembered two things:

when pressing the mouse button the x server should implicitly give the client a pointer grab until the button release (by design)

when changing vts, the X server sends mouse release events for every button of every input device

So the conclusion I drew was that, perhaps, the X server wasn’t releasing the implicit pointer grab when the button was released. To test that theory, I needed to see what the behavior of the system would be like if the mouse button was held down while he was using it. I asked him (with his now functional system) to plug a second mouse in, hold down the button from it and then interact with the system using his first mouse. Sure enough, the system behaved in the same broken way as when the problem manifested. So this odd problem really was as if the X server somehow missed a button release, or thought the mouse button was stuck down. It was at this point he realized he had a third, wireless mouse turned on, and smooshed into his backpack. oh. I guess that mystery is solved.

Compositor stack consolidation — one issue we have right now with wayland, is mode setting is split across two parts of the stack. Some of the drm calls happen in cogl and some of the drm calls happen in mutter. we actually get the drm fd from mutter, pass it into cogl, and then pull it back out from cogl at various times. We figure out output setup in mutter using drm apis, then stuff the configuration in cogl abstraction, then unpack it and apply it using drm apis back in cogl. We do cursor handling in mutter directly. It’s pretty haphazard where what happens, and it would be good to consolidate the drm code in one place. Mutter seems like the best place since it offers the dbus api for doing output configuration, and since it’s the more actively maintained component. I think consolidating the code will make it much easier down the line to rework how we do rendering to use EGL extensions instead of GBM for instance. One idea, since cogl isn’t that well maintained, and since it has a lot of code for other platforms that we’ll never need, is to ship a sort of cogl lite in the mutter tree. The problem, though, is clutter is well maintained, and it relies on cogl, too. Well it turns out Jonas was thinking about the same problem, and he came to the realization that there’s a lot of compositor functionality mutter needs that would be hard to add to clutter and not be useful to normal clutter clients. So he proposed Merging clutter and cogl forks into mutter. So I’m working on that, stay tuned.

October 18, 2013

So, a few weeks ago I wrote a post about how I preordered the ATIV Book 9 Plus. After that post, I eagerly waited weeks for the pre-order date which then came and went. I contacted Amazon and they said they didn’t know what was going on, but that it was going to be delayed a full month. They upgraded my shipping to 1-day for free, but that didn’t really satisfy me.

So, I visited the Samsung website and they said they had them in stock and that they “ship in 2-3 days”. The cost was $200 more than amazon, though. Nonetheless, I bit the bullet and put the order through. I kept my amazon order open as a backup, but fully expected to get it from Samsung. Then the next day I get an email from Samsung saying they were back ordered and they didn’t know when they would be shipping my laptop. Almost immediately after getting that email, I got an email from Amazon saying they were wrong, my order wouldn’t be a month delayed, and that it was shipping! I cancelled my order with Samsung and eagerly awaited my laptop. Incidentally, a month later Samsung mailed me the cancelled laptop anyway, and I had to mail it back to them. Chaz, the guy on the phone, said that he had never seen an order that was clearly marked cancelled ship a full month later before in his entire time at Samsung.

Anyway, now a quick review of the laptop:

The build quality is top notch. It’s light but feels dense. It feels very high end

The touchpad is great. It feels very smooth. Out of the box if you run your finger from one side to the other it travels the whole screen. The two buttons are hidden underneath the pad, so they’re there if you need them, but not in the way when you don’t. It’s large. best touchpad i’ve ever used

The touch screen is nice. I wasn’t sure how much i’d use the touch screen, but I find myself using it every day. It’s almost “automatic” now without me thinking about it. It sort of became ingrained in me the same way the hot corner on the activities overview did. I even find myself touching my monitor at work now without realizing it. I get frustrated when an app doesn’t support swipe scrolling

The keyboard backlight doesn’t work if you install using EFI. This is a limitation of the samsung-laptop kernel module

The keyboard is quite good. The buttons are large and spaced well. They have reasonable travel, and a nice tactile feel. The arrow keys are in a good place, and the backspace key is full sized.

I don’t normally use the speaker on my laptop much (if i’m using sound, it’s usually through headphones, but the speakers on this laptop are the best i’ve personally had in a laptop before. This despite them being bottom facing. They’re loud and vibrant. I’m sure there are better multimedia laptops out there, but this was a pleasant surprise

The software isn’t 100% ready for the screen resolution. A lot of effort has gone into making hidpi screens work well from Alex, Emmanuele, etc, but there are still some rough spots. gnome-shell doesn’t support hidpi directly, yet, just font scaling, so it ellipsizes some text it shouldn’t. When logging in with a lower than native mode, the text gets scaled way too big. Firefox needs a layout.css.devPixelsPerPx = 2 option to be readable. Just some niggles here and there

It doesn’t always wake up from suspend

There are various video driver problems. Sometimes tiling artifacts show up on screen when scrolling or mousing over an icon in the dash. Occasionally the screen shows solid white instead the scan out buffer at boot up (has happened twice so far in the weeks i’ve owned it).

Xorg shows a huge list of resolutions but omits the two most useful ones aside from 3200×1800 (the 16:9 ones: 1080p and 1600×900). It’s easily fixable with a file in xorg.conf.d, though. And most of the time I run in 3200×1600 anyway. Note, though, 1080p and 1600×900 both look fantastic too. It’s not like with most screens where if you go non-native everything turns fuzzy. I’m not sure why.

My model only has 4GB of ram and a 128GB ssd. There is a newer model available for preorder now with 8GB and 256GB SSD though.

Battery is pretty fantastic. I get like 7-8 hours on it unplugged

Overall, I really like the laptop. There are some issues, but for the most part it’s a big upgrade.

September 8, 2013

So, my blogging has started to peter out again and in the interest of preventing that from happening, I decided to do another smartcard post.

I have a few smartcards: an old cyberflex e-gate card, a couple of Gemalto GemPlus java cards, and a military-style Common Access Card (CAC). When developing, I usually only use one of the gemalto tokens, because the e-gate card is obsolete and no longer supported in Fedora, while the Common Access Card is touchy and will eat itself as a security measure if it’s accessed the wrong way too many times.

Sometimes, though, it’s nice to be able to do testing without having to constantly plug and unplug a card. I used to do this using usb passthrough from a smartcard attached to the host to a qemu guest and then simulate insertion and removal via the qemu command console.

Recently, I discovered a better way by using virt-viewer and the SPICE protocol. It has the ability to emulate a Common Access Card in the guest merely by using certificates generated on the host. No smartcard required. This post is going to talk about the details of that.

Note, smartcards usually have one certificate for encryption and one certificate for signatures. There are no hard and fast rules, though. A card can, in theory, have a bunch of certificates. In addition to the aforementioned encryption and signing certificates, CAC cards also have another certificate called an “ID certificate”. The U.S. military assigns a number for every person signed up, called the EDIPI. It’s essentially a UUID used as an identifier used by various services. CAC cards store this number in the third certificate. It’s normally not used for authentication (though it actually can be used for authentication given the right configuration). The only reason I’m bringing up the ID certificate is because virt-viewer needs 3 certificates to emulate a CAC card, and I wanted to explain why it’s common to find 3 certs on CAC cards.

Before we jump to the certificate setup, though, we need to get the guest in order. First step is to shut the guest down and add a “smartcard” device in virt-manager and set that device to “passthrough” mode. This change allows the guest to create the virtual smartcard when virt-viewer passes along the right information. That information, is of course, the 3 certs I mentioned above.

So to generate the certs, we first, need to create a place to store the certificates.

At this point you’ll be asked for a password. This password becomes the smartcard PIN code on the guest system.
You could also just use the system NSS database (/etc/pki/nssdb) on the host directly, but then the smartcard would have a blank PIN (unless you put a password on the system NSS database).

The CA certificate is a toplevel certificate that the other certificates can chain up to. In a real deployment, a group of provisioned smartcards would have one common CA certificate. That CA certificate gets imported on the machines that allow that group of smartcards access. A machine can look at the smartcard, link it back to the CA certificate and know that that smartcard is trusted and allowed. This prevents having to import every certificate from every smartcard onto every machine. Instead, just the one CA certificate can be imported and the rest of the certificates are implicitly trusted by association with the CA certificate. Of course, in our situation it doesn’t matter much. We could just generate the three certificates without using a CA cert and import each one manually, but that’s neither here nor there.

So, now that we have the CA cert from the host, we need to export it to a file for the guest:
host$ certutil -L -r -d sql:$PWD -n fake-smartcard-ca -o fake-smartcard-ca.cer

The order of the passed in certs is important. It always goes ID, Signature, and then Encryption.

At this point the guest has a virtual smartcard. It can be “inserted” and “removed” by hitting shift-F8 and shift-F9, respectively (make sure you have the outer virt-viewer app focused and not the guest inside virt-viewer).

You can verify everything is working by first getting a list of certs:
guest# certutil -d /etc/pki/nssdb -L -h all

It should ask for your smartcard PIN (which is the password you assigned to the certificate database early on), and then show you all three certs and the manually imported CA cert.

You can look at the individual certs, too, and make sure they match what you generated on the host (e.g.):
guest# certutil -d /etc/pki/nssdb -L -n "John Doe:CAC Email Encryption Certificate"

If pam_pkcs11 is installed and configured, the smartcard should work for login as well. By default it will associate with the user that has “John Doe” in the GECOS field of the password file. There are other ways to configure how pam_pkcs11 maps the card to a user account, too, but I won’t get into that here.

This kind of setup isn’t appropriate for every situation, but it is still very useful, in a lot of cases, for quickly doing smartcard related development/testing.

August 18, 2013

So right around the time the Chromebook pixel came out, I started thirsting for a new laptop. It seemed to get so much right: gorgeous hidpi screen, sleek form factor, solid build materials, and a really good touchpad. It had some problems, too, though. For one, it was designed to heavily leverage “the cloud”, so it came with a small, non-upgradeable hard drive. It didn’t come with any USB 3.0 ports, either, so using an external drive wasn’t a super desirable option either. The keyboard boldly drops some keys that are actually kind of important in Fedora. But the real killer for me, was that in order to run Fedora on it, you had to hit a key at boot, every single boot. These are all understandable tradeoffs given the intended point of the laptop, but I wasn’t planning to use it the way it was intended to be used (as a ChromeOS cloud based machine).

So I hung tight. Then one day the Toshiba KIRABook was announced; another attractive looking, HiDpi machine (though it has an ugly shiny ring around the touchpad). I very nearly pulled the trigger on that one in May, but it suffered from being based on the soon to be superceded 3rd generation intel chipset. The Haswell line of intel chips, was set to be released in early June, along with (I assumed) a long line of Haswell based laptops. After discussing things with my wife, we both decided it would make sense to wait a bit.

Over the next few weeks a few hidpi contenders were announced: The Fujitsu Lifebook UH90, the ASUS Zenbook Infinity, and the Samsung ATIV Book 9 Plus all got presented at expos or announced in press releases. I waited for one of them to become available for retail. And waited. And waited. June went by, July went by, and August creeped in. Today, according to a Samsung Press Release a while back, the ATIV Book 9 plus is supposed to available for pre-order on the samsung website. Indeed the [Add To Cart] button is now there on the product page! Clicking it fails to actually add it the cart though. Oh well, probably a glitch they’ll fix tomorrow when the work week starts up again.

It’s also available on Amazon for pre-order now for a couple hundred dollars cheaper (or really the same price if you add in the SquareTrade extended warranty). It unfortunately only comes with 4GB or ram and 128GB M.2 SSD. I took the plunge anyway. I’ve gotten too impatient and can’t wait anymore! Low ram aside, it’s actually a pretty sweet looking machine. It should be arriving in the next 2 to 3 weeks, unless I get buyers’ remorse and cancel my order.

August 16, 2013

August 16, 2013

So, today I’ve decided I want to talk a little about smartcard support in the login screen and unlock dialog.

In the GNOME 2 days users could log into their machines with a smartcard by installing the GDM simple-greeter smartcard plugin. It managed tweaking the UI of the greeter, and calling out to the pam_pkcs11 PAM module (by means of a gdm daemon d-bus interface). Once they were logged in, they could configure their system to lock or log out when the smartcard used to login was removed.

In a GNOME 3 world, the login screen is handled by gnome-shell, so the smartcard plugin is no longer applicable. Early on, we brought support to the gnome-shell based login screen for fingerprint login, but we punted on smartcard support since it’s more niche and we had more pressing issues. So, getting smartcards functiong again is something that’s been on my TODO list for quite a while.

The first step was to resurrect the old gnome-settings-daemon smartcard plugin. It was disabled due to bitrot (see bug 690167). It was previously used soley in the user session for handling the aforementioned screen locking/forced logout on smartcard removal. In a GNOME 3 world, the login screen session uses all the standard gnome components, including gnome-settings-daemon, so we have the oppurtunity to do a nicer job architecturally, by leveraging the settings daemon for the login screen’s smartcard needs as well.

The login screen doesn’t actually need to know a lot about smartcards; all it really needs to know is when a smartcard is inserted or removed, so it knows when to fire up pam_pkcs11 to do the heavy lifting and so it knows when to reset the login screen back afterward. So, the gnome-settings-daemon’s smartcard plugin needed to advertise what smartcards are known and when they get inserted and removed. This is a perfect use case for the D-Bus ObjectManager interface. Every smartcard token gets represented by its own D-Bus Object, and that object implements a org.gnome.SettingsDaemon.Smartcard.Token interface that has an IsInserted property (among other properties). Thankfully, a lot of the murky details of implementing this sort of thing are code generated using the gdbus-codegen tool. But, of course, once the gnome-settings-daemon support landed, we needed to hook it up in the shell.

But before I could do that, we had to do some code clean up. You see gnome-shell has a login screen and an unlock dialog, that both look and act very similar but aren’t implemented using the same code. Adding new features (such as this) would require a lot of duplicated code and effort. Another long punted TODO item was to consolidate as much of the two codebases as made sense. Now most of the two features are implemented in terms of a common AuthPrompt class (see bug 702308 and bug 704707).

Once the two codebases were consolidated we needed to track the smartcard tokens exported from gnome-settings-daemon. As mentioned it uses the D-Bus ObjectManager interface. But gnome-shell doesn’t have good built-in support for using that interface. So the next step was to add that (See the patch here). Once it was added, then there just needed to be a thin wrapper layer on top implement the smartcard management. See bug 683437 for all the details.

August 15, 2013

So one thing that inevitably happens when working on enterprise distros is the number of patches to a package builds up over time. It makes sense; part of what a customer pays for is backports and fixes. The other day someone sent me a mail asking how I deal with the patches in a package. I figured it would be useful to write it out here, too.

First, (much like in Fedora), RHEL packages are stored as git repositories. They’re stored as a .spec file, the patches, and a file with a checksum for the source tarball. The patches themselves may be done in many different styles (git-am formatted patches, patches generated with the gendiff command, or just manualy diff’s). Each patch file may have multiple patches in it together. For instance, in gnome-screensaver for RHEL6, there is a patch file called “fix-unlock-dialog-placement.patch” that has 2 patches in it, one that makes the unlock dialog always show on the primary monitor, and another that forces the mouse pointer to stay on the primary monitor. So in my workflow, it’s one file per fix or feature, but potentially more than one patch in a file.

As issues are corrected, older patches may need to be updated to fix bugs in them, and those changes may require rebasing subsequent patches from other files so they still apply on top.

The easiest way to work with the patches is to apply them to the upstream source in a git repository, so the standard git rebasing machanisms can be used.

That’s where this script i hacked together sometime ago comes in for me. The way it works is I clone the upstream git repostory, then run
git checkout -b rhel 2.28.3
(where 2.28.3 is the upstream tag that tarball is generated from)

then run

git spec-apply ../path/to/spec/and/patches/dir/gnome-screensaver.spec
(where gnome-screensaver.spec is the name of the spec)

It then goes through and applies all the patches in the spec file one-by-one to the branch, and creates sub-branches for every patch file. If i’m going to work on one particular feature or fix, I jump to that branch to fix things up. When i’m finished I run git format-patch to regenerate the patch file. If necessary, I rebase the patches that apply on top as well and regenerate them too.

So that’s my workflow. I’m not really advocating the script for others to use. There’s a much more complete program that Federico wrote that works on SRPMS instead of expanded spec file + patches trees. I haven’t yet tried to fold it into my workflow, but want to eventually.