Saturday, October 2, 2010

Thoughts on Linux multitouch

Note: I am speaking for myself here, in my role as input maintainer for X.Org. The opinions here are my own and may not be the same as those of my employer, other X.Org developers or anyone else alive or not.

Two weeks ago, I was in Toulouse, France, at a multitouch workshop organised by Stéphane Chatty. XDS followed later in the same week. The workshop had a nice mix of people: Benjamin Tissoires, whom I credit with kicking off much of the multitouch work with his evdev hacks; Pengfei, a PhD student of Stéphane; and Chase Douglas from Canonical, who is involved with their multitouch efforts. Ping Cheng from Wacom and Gowri Ries and Pascal Auriel from Stantum represented the hardware side, and Zeno Albisser and Denis Dzyubenko from Nokia the toolkit side (Qt). We talked multitouch for two days and I think we got a lot done - not in code but in concepts and general design questions. The current state is essentially that both hardware and software guys are waiting on us, the X server, to integrate multitouch.

The workshop also happened shortly after the Ubuntu multitouch announcement and ended up being an excellent setting to discuss what was, is, and will be happening with regard to multitouch support on Linux. This post is a long one, so take your time and don't just skim through. I will focus more on concepts and approaches, not on hard technical stuff. That is for a later post, mainly because I've been on vacation since and I have not yet caught up with new patches.

Multitouch basics

Let's start from the basics. Traditional input devices such as mice provide one physical device to control one virtual input point (the cursor or the keyboard focus). At its most basic, multitouch is the ability to use multiple input points on a single physical device. This has some implications on technical issues but many more implications on the UI design. It is also the main reason why multi-pointer (which we have supported since server 1.7) is not a replacement for multitouch.

Multitouch comes in roughly two different forms: direct touch and what I call dependent touch. Direct touch is simple: a touch point directly relates to the position on the device, i.e. a touch in the upper right corner of the touch device will usually result in some action in the upper right corner of the screen. Dependent-touch devices are e.g. multi-touch aware touchpads but also devices like the Apple Magic Mouse. In their case, the touch is usually interpreted relative to the pointer position of that same device.

The usage of such devices differs vastly too. A direct-touch device will likely be used as such, with plenty of direct object manipulations at their respective position. This again has UI implications: a common design challenge for direct-touch user interfaces is to ensure that all objects are within reach of the user. By the way, from a purely technical point of view, the "touch" part of direct-touch is irrelevant.

A dependent-touch device has a stronger dependency on gestures; a simple touch on such a device doesn't necessarily have a meaning. Touch capabilities on these devices provide an augmentation of traditional input actions, similar to extra buttons or knobs. As usual, the world isn't black and white only and this is only a rough description with plenty of corner cases.

The lack of context

The hardest thing about multitouch is the difference between the data we get and the context the user assumes. Touch devices see touches appear and disappear but cannot (yet) identify appearing and disappearing touches and their context across time. I'll give you one example of the mismatch between the data we get and the context a user may assume: In a drawing application, the following events happen: First, a touch appears on the colour selector and selects the colour red, then the touch disappears. Second, a touch appears on the colour selector and selects the colour green, then the touch disappears. Then two touchpoints appear on and move across the canvas.

Possible scenarios that cause the above datastream are:

The user has erroneously selected red, then corrected to colour green, now is painting with two fingers in green.

The user has selected red with one finger, green with another finger and now wants to paint with two different colours.

The user has erroneously selected red with one finger, corrected to green and now wants to paint in green and in the colour already assigned to the other finger.

Two users selected two different colours and now paint simultaneously.

Which one it is is impossible to know from the data stream. Hence the difficulty of creating good multitouch interfaces: the data we get only carries a fragment of the information that the user assumes from the context. It is largely a UI design challenge to create a UI where the mismatch does not lead to problems.
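To make the mismatch concrete, here is a minimal Python sketch. It is not taken from any driver or server code; the tuple layout, the user and finger labels and the `device_stream` helper are all made up for illustration. It models two of the scenarios above and shows that once the information the hardware cannot see is stripped away, the resulting event streams are identical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TouchEvent:
    kind: str  # "down" or "up"
    x: int
    y: int


def device_stream(user_actions):
    """Drop everything the hardware cannot see: who touched, and with which finger."""
    return [TouchEvent(kind, x, y) for (_user, _finger, kind, x, y) in user_actions]


# Hypothetical positions: the red swatch is at (10, 5), the green one at (20, 5).
# Scenario: one user picks red by mistake, then corrects to green.
one_user = [
    ("alice", "index", "down", 10, 5), ("alice", "index", "up", 10, 5),
    ("alice", "index", "down", 20, 5), ("alice", "index", "up", 20, 5),
]

# Scenario: two users each pick their own colour.
two_users = [
    ("alice", "index", "down", 10, 5), ("alice", "index", "up", 10, 5),
    ("bob", "index", "down", 20, 5), ("bob", "index", "up", 20, 5),
]

# Both scenarios look exactly the same to the software.
same = device_stream(one_user) == device_stream(two_users)
```

Here `same` is `True`: whatever the application does next, it has to decide without knowing which of the two stories actually happened.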

Multitouch evolutions

One interesting comment from Stéphane was that he and the Intuilab were building more multi-touch UIs a few years ago than he is now. Back then, the interfaces were for specialists who could afford training and thus complex interactions. These days, multitouch is mostly for ad-hoc interaction without any previous training on the UI. Thus, the UI must be "intuitive" and easy to understand. The result of that is that modern multitouch is more often than not just two-finger touch. This also matches the (little) experience I've had with the iPhone so far.

Another thing I noticed again on our tour through the Intuilab is that multitouch interfaces are mainly aimed at consuming content. Some interfaces to assemble existing content are used (photos, slides, etc.) but real content creation seems rare and still relies heavily on settings with more traditional input devices.

A tour through the ENAC LII showed another interesting change - a system for ATC training that was using touchscreens exclusively a few years ago has since been switched to use Wacom Cintiq tablets with styli. As with any device, multitouch devices are no silver bullet: they are great for some tasks but not a complete replacement for all other devices. Using them appropriately is the key to success. Bill Buxton's often-cited mantra of "Everything is best for something and worst for something else" applies here as well.

Apple, Microsoft and other multitouch products

Why is it taking us so long when there's plenty of multitouch offerings out there already? The simple answer is: we are not working on the same problem.

If we look at commercial products that provide multitouch, Apple's iPhones and iPads are often the first ones that come to mind. These provide multitouch but in a very restrictive setting: one multi-touch aware application running in full-screen. Doing this is surprisingly easy from a technical point of view: all you need is a new API that you write all new applications against. It is of course still hard to make it a good API and design good user interfaces for the new applications, but that is not a purely technical problem anymore. Apple's products also provide multitouch in a new setting, an environment that's closer to an appliance than a traditional desktop. They have a defined set of features, different form factors, and many of the user expectations we have on the traditional desktop do not exist. For example, hardly anyone expects Word or OpenOffice to run as-is on an iPhone.

The main problem we face with integrating multitouch support into the X server is the need to support the traditional desktop. Multitouch must work across multiple application windows, with some pointer emulation to be able to use legacy applications on a screen. I have yet to see a commercial solution that provides this; even the Microsoft Surface applications I've played with so far only emulate this within very restrictive settings. (Disclaimer: I do not own an iPhone, iPad or Surface table so there is the chance that I've missed some new development.) We cannot simply ask all developers to rewrite their applications as multi-touch aware and completely ditch the existing ones. Note that e.g. Android can do this, for the same reasons that Apple could.

Having multitouch work across any application window, with fluid transitions between pointer/keyboard interfaces and multitouch interfaces, is tricky. And of course, we need to stay compatible with the core protocol and its extensions. Funnily enough, many of the issues we face with multitouch are similar to the ones I faced years ago during the MPX development phase. This is one reason why I am more confident but also quite cautious about it: I already know how it can go wrong.

Replace multitouch with multiple users and you have exactly the same requirements I had for MPX (and XI2). Why exactly these four requirements I'll explain separately one day, but the short answer is: zero transition cost between multitouch and non-multitouch interfaces.

Gestures

Whenever multitouch is mentioned, gestures are brought up quickly too. Gestures are prevalent in this context because they appear to be a natural extension of touch input. They are, but only to a very limited extent. The set of natural gestures is surprisingly small and goes barely beyond the well-known tap, pinch, rotate and swipe. As an exercise, think of the natural gestures for copy and paste. Then see how many of your friends come up with exactly the same gesture. Moreover, gesture discussions often focus on multi-finger gestures. In the not-too-distant future off-the-shelf hardware will support arbitrary object shapes and that's when hand and arm gestures will become relevant too.

Gestures do not rule out touch. Especially on direct-touch devices it is likely that a UI will use gestures as shortcuts to complex operations but the vast majority of interactions just use plain touch input for object selection, moving objects, and invoking menu commands. So it is important to think of gestures as an abstraction of some touch input but they are by no means the sole vehicle for delivering touch input. From a technical point of view, this means that gesture recognition must not be in the way of touch recognition.

Gesture detection introduces a delay and that is unavoidable. What can be avoided is the effect of this delay on the UI. With good gesture design the delay is irrelevant. Take the example of a system with two gestures: a one finger hold gesture and a four finger hold gesture (i.e. an action is triggered when one or four fingers are put onto a position). Assume that the timeout for the gesture is 2 seconds. When a user puts a single finger on the device, the system now has to wait 2 seconds before it can react. If within 2 seconds the user does not put three more fingers down, the single-finger action can be taken. Decreasing the timeout lands the UI designer at the blurry line between a laggy UI and too-hard-to-trigger gestures.

To solve this issue, the system must be able to recognize which gesture is performed before the user expects feedback. Consider the gesture set of tapping one finger, moving one finger and a four finger hold gesture. Tapping and moving are easy to detect and they don't suffer from an artificial delay. The only time the timeout is needed is when the user puts a finger down but doesn't move or release it immediately. Now the system has to wait for more fingers. However, since a single finger not moving is not a defined gesture, the user expects no feedback and the timeout is irrelevant.
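The second gesture set can be sketched as a tiny recogniser. This is a minimal Python illustration under my own assumptions, not the design of any real gesture library: the `(time, kind)` event format and the `recognise` function are hypothetical, and the 2 second timeout is simply the figure used in the example above. The point to look at is the decision time that is returned: tap and move are decided the moment the event arrives, and only the hold has to wait for the timeout.

```python
HOLD_TIMEOUT = 2.0  # seconds, as in the example above


def recognise(events, hold_timeout=HOLD_TIMEOUT):
    """Classify one touch sequence and report when the decision can be made.

    events: list of (time, kind) tuples sorted by time, where kind is
    "down", "move" or "up". Returns (gesture, decision_time); gesture is
    None if the sequence matches nothing in the gesture set.
    """
    fingers = 0
    start = None  # time the first finger went down
    for time, kind in events:
        if start is not None and time - start >= hold_timeout:
            break  # the hold timeout expired before this event arrived
        if kind == "down":
            fingers += 1
            if start is None:
                start = time
        elif kind == "move" and fingers == 1:
            return ("move", time)  # decided immediately, no waiting
        elif kind == "up":
            if fingers == 1:
                return ("tap", time)  # decided immediately, no waiting
            fingers -= 1
    if start is None:
        return (None, None)
    if fingers == 4:
        return ("four-finger hold", start + hold_timeout)
    # E.g. a single finger resting on the device: not a defined gesture,
    # so nobody is waiting for feedback while the timeout runs.
    return (None, start + hold_timeout)


tap = recognise([(0.0, "down"), (0.1, "up")])
move = recognise([(0.0, "down"), (0.05, "move")])
hold = recognise([(0.0, "down"), (0.2, "down"), (0.3, "down"), (0.4, "down")])
resting = recognise([(0.0, "down")])
```

With these inputs, `tap` is `("tap", 0.1)`, `move` is `("move", 0.05)`, `hold` is `("four-finger hold", 2.0)` and `resting` is `(None, 2.0)`. The timeout only ever delays the two cases where the user is not expecting an immediate reaction anyway.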

Integrating gesture support is thus not a matter of simply defining a few gestures. It's a matter of selecting the right ones that do not expose the user to a laggy experience. The FOSS software stack is notoriously segregated and thus true gesture integration must be done across window managers, desktop environments and applications. Coincidentally, this is my main worry with the Unity gesture support: I feel the gestures have a potential to overlap in some cases and thus be detrimental to the user experience. I hope I'm wrong here.

X.Org multitouch support

When it comes to input event handling, the X server's job is to convert a set of hardware events into a set of events meaningful in a GUI. This includes converting the hardware information into screen-relative information and picking (selecting where to send the event to). I've already outlined the four requirements for multitouch above.
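As a rough illustration of those two steps, here is a minimal Python sketch. It is emphatically not how the X server implements them (the real thing has to deal with a window tree, grabs, multiple screens and much more); the `Window` class, the `to_screen` and `pick` helpers and the axis range of 0 to 4095 are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px, py):
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def to_screen(value, axis_min, axis_max, screen_size):
    """Scale a device axis value into the pixel range [0, screen_size - 1]."""
    return (value - axis_min) * (screen_size - 1) // (axis_max - axis_min)


def pick(windows, px, py):
    """Return the topmost window under the point; windows are listed bottom to top."""
    for window in reversed(windows):
        if window.contains(px, py):
            return window
    return None


# Hypothetical setup: a 1920x1080 screen and a device reporting 0..4095 on both axes.
windows = [
    Window("desktop", 0, 0, 1920, 1080),
    Window("paint", 1000, 0, 920, 600),
]

sx = to_screen(4000, 0, 4095, 1920)
sy = to_screen(100, 0, 4095, 1080)
target = pick(windows, sx, sy)
```

A touch reported by the device at (4000, 100) ends up at screen position (1874, 26), and `target.name` is `"paint"` because that window sits on top of the desktop at that position.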

Originally, I thought multitouch could be integrated into our current event system but a few months back I gave up on that. Whenever I tried to sort out the corner cases, there were some situations where the core protocol would be unresolvable. The current aim is that we add new events and they will live side-by-side with pointer and keyboard events, not necessarily influencing those. This means that clients must add support for multitouch before they can make use of it; even XI2-enabled clients will not provide that functionality without modifications (pointer emulation for single-finger touches is still considered a required feature, though).

Beyond that, things are still in flux. At the time of this writing, there are two protocol proposals that are largely overlapping. Daniel Stone's version has taken a few features from mine as I have taken some of his but the basic approach is the same (and we came up with it independently, which is a good sign). Daniel's proposal is also backed by an implementation though I have yet to look at these patches. Either way, what will be happening is a new set of events and a few requests to detect which devices have multitouch. Passive grabs on touch points will be possible but that still needs to be sorted out. Once it is, I'll post more details.

The summary here is that things are moving, at least in my case largely due to Canonical's announcement which lit a fire under some southern body parts and finally made me spend more time thinking about multitouch.

Ubuntu's multitouch support

One thing that is likely of interest to many is how all this will integrate with Ubuntu's multitouch support. Ubuntu has committed to strong gesture support. This is commendable but quite limiting in my opinion as true multitouch is much more than gestures. The current UI seems aimed more at dependent-touch devices like the Apple Magic Trackpad where gestures augment traditional input, less so at a fully multitouch-aware desktop. Fair enough, so far there aren't a lot of multitouch applications out there but for system-level design we need to keep the medium and long term in mind too.

The current integration is a hack (and the developers don't try to hide this), but most good things start as hacks so there's not much to say here. I am certainly impressed that they are willing to deliver a UI based on what is still a moving target. The few issues I had were with the design, so let's focus on that.

The originally proposed X Gesture Extension tried to get a recogniser into the server (not a 100% accurate statement, but close enough). It only allowed for one recogniser in the system. I do not think that we'll ever manage to find the one true gesture recogniser; we haven't managed to find the one true window manager or toolkit either. I talked at length to Chase about this and the revised architecture diagrams Chase presented at the workshop and XDS do not call for gesture recognition in the server anymore.

With the new approach the recognition happens purely client-side and is done by a library or a daemon. This allows for multiple different gesture recognisers to be active at the same time, a scenario that is quite likely to happen (I do envision GTK, Qt, Mozilla, etc. all wanting their own system). Whether that's a good thing for the UI is another matter: consistency for gestures is important and especially Ping Cheng was not happy at the prospect of having multiple, possibly inconsistent, systems. We need to find some common ground here between desktop environments and toolkits. Geis tries to address this, though it is at quite an early stage. The main issue with geis is a political one. The implementation requires copyright assignment and puts up quite a barrier to entry. The utouch team tries to get around this by separating the interface specification and the implementation, but that's just calling for a fork or duplicated efforts before there's anything to fork or duplicate. So while I think geis is the right approach and much needed, I fear the political decisions will hold it back unnecessarily. My suggestion is to ditch the copyright assignment, because something as central as a generic gesture interface should be developed together with toolkit developers, X.Org developers, application developers, etc. And that's just easier when contributing is as simple as possible.

As for the UI itself - any UI decisions already made are unlikely to be an issue from a technical perspective. Gestures are an important part of multitouch support and we will make it possible to provide gestures. All the technical parts are still in flux, but the current abstractions Ubuntu has put in place are likely good enough to adjust to a new baseline. So while I cannot speak for Ubuntu at all, my guess is that in future versions the bottom bits of the stack will be swapped out but the top bits (the user-visible ones) will largely remain the same.

Conclusion

The above should provide a general overview of some of the issues with multitouch and the challenges we face. I've skipped over a few and only briefly touched on others for brevity; it's quite easy to write full novels about all this.

I will post more technical details about the kernel's multitouch API and how we plan to use it in X.Org for our protocol extensions once the latter is at least partially nailed down. For now, the concepts described above hopefully framed your mind in terms of multitouch and some of the concepts we're looking at.

"* The user has erroneously selected red, then corrected to colour green, now is painting with two fingers in green.

* The user has selected red with one finger, green with another finger and now wants to paint with two different colours.

* The user has erroneously selected red with one finger, corrected to green and now wants to paint in green and in the colour already assigned to the other finger.

* Two users selected two different colours and now paint simultaneously."

How about adding the expectation that the hardware shall implement a mechanism to identify a touch gesture with an id number? The hardware would then be able to report that the first touch has id "abc" and the second touch has id "def".

Now "abc" is associated with red and "def" is associated with green, and the two-touch gesture will report the id & position of each touch to ensure the proper color is drawn.

For simple multi-touch hardware that doesn't implement this mechanism, assume that all touch events have the same id.

Since all touches have the same id, then the user(s) selected red, then green, and are now drawing two green lines.
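The suggestion above can be sketched in a few lines of Python. This is only an illustration of the commenter's idea, assuming hardware that could report such ids at all; the `Canvas` class, its methods and the `"shared"` fallback id are hypothetical names invented for this example.

```python
DEFAULT_ID = "shared"  # hypothetical id used when the hardware cannot tell touches apart


class Canvas:
    def __init__(self):
        self.colour_by_id = {}  # touch id -> currently selected colour
        self.strokes = []       # (colour, x, y) for every painted point

    def select_colour(self, colour, touch_id=DEFAULT_ID):
        self.colour_by_id[touch_id] = colour

    def paint(self, x, y, touch_id=DEFAULT_ID):
        """Paint with the colour associated with this touch id, if any."""
        colour = self.colour_by_id.get(touch_id)
        if colour is not None:
            self.strokes.append((colour, x, y))
        return colour


# Hardware that reports ids: each touch keeps its own colour.
with_ids = Canvas()
with_ids.select_colour("red", touch_id="abc")
with_ids.select_colour("green", touch_id="def")
first = with_ids.paint(100, 100, touch_id="abc")
second = with_ids.paint(200, 100, touch_id="def")

# Simple hardware: every touch shares one id, so the last selection wins.
without_ids = Canvas()
without_ids.select_colour("red")
without_ids.select_colour("green")
both = (without_ids.paint(100, 100), without_ids.paint(200, 100))
```

With ids, `first` is `"red"` and `second` is `"green"`; without them, `both` is `("green", "green")`, which is the two-green-lines fallback described above.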

I can understand if this expectation seems a bit absurd, however, some multi-touch surfaces are demo'd on youtube that use physical objects and cameras to identify the objects (size, color, # of dots on d6 dice, etc). Who knows, maybe future touch surfaces will be able to see the prints on your fingers as you touch and convert that into a reliable id number.

Did you end up finding a solution to the problem of working out how to map multi-touch events to windows when those events originated outside of the windows? I lost track of the discussion at some point, but couldn't help wondering whether it could be solved by a multi-touch-aware window manager. Said window manager could decide on an area around the window which "belonged" to that window and visually highlight it however seemed appropriate. That would have eliminated a lot of ambiguity (for the user too) and got rid of any need for grabbing touch points.

Sorry if that sounds like a design discussion, but since you have probably already taken your own decisions there you can just treat it like musings.

Very interesting post, especially for an HCI researcher like me who also works with multitouch devices. While reading your post, that ATC training part especially caught my attention. Are there any further reasons known to you why they switched from multitouch devices to tablets? Thanks!

@Sergio: there are no specs that I know of, at least not in regards to low-level MT support. For gestures there's quite a bit of literature out there.

@Justin: Unfortunately, HW that can identify fingers is quite far on the horizon. Identifying objects is relatively simple compared to identifying fingers.

@Michael: No final decision yet but it will likely be that a touch will never be sent to a different window than the one underneath, unless there's a specific grab. Coincidentally, this seems to be the MS Surface behaviour too.

The main problem with an area to grab is that especially given the fat finger problem, it's really hard to make this work properly. As with many other things, it's something that needs to be addressed in the UI design to avoid the problem in the first place.

@Jürgen: IIRC, the main reason was the lack of ghost touches. Touch screens provide erroneous data if a user leans on them or even just rests the hand on them. Leaving the hand up at all times also causes fatigue. While this can be worked around in vertical settings or tabletop settings, it wasn't possible in this particular setup which was essentially a workstation with a near-horizontal screen on the desk in front of the user.

Another point mentioned was the hover ability of the tablets, so there can be some UI features based on the hover.

Great post, thanks for summarizing the state of the art. It would be helpful if you could elaborate a bit on the Ubuntu gesture work. If not an X gesture extension, how do events get delivered to applications? I understand the different parts of the utouch stuff -- geis, grail -- but not how it all fits together for getting gestures from evdev through the X server to applications.