An Empirical Comparison of On-Screen Keyboards

Abstract

Eight different designs for an on-screen keyboard were compared. The
factors manipulated included QWERTY vs. alphabetic layout, key-cap
appearance vs. none, and mouseover feedback vs. none. We found that a design
featuring two complete QWERTY keyboards (upper and lower case) yielded
the best performance, but the designs that looked the most like a standard
keyboard were preferred.

Introduction

On-screen keyboards might be used in a variety of situations where it
is not possible, practical, or even wise to use a normal keyboard. Perhaps
the most widely known situation is on a PDA that does not have a keyboard.
Tablet PCs are another example. But even with desktop PCs there are cases
where an on-screen keyboard might be more appropriate to use. One example
is for someone who has difficulty using a normal keyboard due to a disability
or injury. A less widely known situation is one where the security of
the PC in use might be compromised, and the user needs to log in to a remote
system to check email or perform other transactions. This might be
particularly true when using a public PC.

So-called "keyboard sniffers" record the keystrokes typed on
a PC for future analysis. These can be small hardware devices that are
installed between the keyboard and the computer, or they can be "spyware"
software that records keystrokes in the background. In either case, the
user may not be aware of the existence of the keyboard sniffer. By analyzing
the log of keystrokes from a keyboard sniffer, someone might be able to
detect repeated patterns which represent a user ID and password being
used to access a remote system. One way to thwart a keyboard sniffer is
to use an on-screen keyboard for the entry of very sensitive items, such
as a password for entry to a remote system. In this case, the sensitive
information is actually entered via clicks at certain coordinates on the
screen rather than via the keyboard, so the keyboard sniffer has nothing
to record.

There have been a few previous studies investigating certain types of
on-screen keyboards (e.g., MacKenzie & Zhang, 1999; Sears et al., 2001;
Zhai et al., 2001), but none have compared a broad range of on-screen keyboards
to each other.

Methodology of the Study

Overview

The goals of the study were to assess user performance as well as subjective
preference for a variety of on-screen keyboards in the entry of a password.
We conducted it as an online study on our company's Intranet in order
to get a large number of participants. We let the participants choose
the password they would use for the study from a set that we provided.
This was done to reduce the variability in the passwords used and to prevent
the use of extremely simple (and non-secure) passwords. The passwords
that we provided all met standard criteria for highly secure passwords,
meaning that they contained all of the following types of characters:

Upper- and lower-case English character(s)

Digit(s) (0-9)

Special character(s), e.g., #$%^&*(

Specifically, all of the passwords we provided contained two randomly
selected English words with one upper-case letter each, two digits, and
one or two special characters. The following passwords were provided for
the participants to choose from:

Bagel_Ocean38

4PlaneBoot5?

!Pixel7Flute2

9Draft#Busy6

Leafy4Speak>7

!Heavy*List19
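
The composition rule described above can be checked mechanically. The
following is a minimal sketch, not part of the original study software.
The function names are hypothetical, and "special character" is assumed
here to mean any character that is not an ASCII letter or digit.

```python
import string

# The six passwords offered to participants, as listed in the text.
PASSWORDS = [
    "Bagel_Ocean38",
    "4PlaneBoot5?",
    "!Pixel7Flute2",
    "9Draft#Busy6",
    "Leafy4Speak>7",
    "!Heavy*List19",
]


def composition(password):
    """Count upper-case, lower-case, digit, and special characters."""
    upper = sum(ch in string.ascii_uppercase for ch in password)
    lower = sum(ch in string.ascii_lowercase for ch in password)
    digits = sum(ch in string.digits for ch in password)
    # Assumption: everything that is not a letter or digit is "special".
    special = len(password) - upper - lower - digits
    return {"upper": upper, "lower": lower, "digits": digits, "special": special}


def matches_study_pattern(password):
    """Two upper-case letters, two digits, and one or two special characters."""
    c = composition(password)
    return (
        c["upper"] == 2
        and c["lower"] > 0
        and c["digits"] == 2
        and 1 <= c["special"] <= 2
    )


for pw in PASSWORDS:
    print(pw, composition(pw), matches_study_pattern(pw))
```

The check only counts character classes; it does not verify that the
letters form two English words.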

The On-screen Keyboards

As shown in Figures 1 through 8, we created eight prototypes of various
on-screen keyboards that utilized different layouts, visual appearance,
ordering of the characters, and visual feedback. Two of the on-screen
keyboards (#6 and #7) included a random feature, intended to offer a
possible security advantage by having the mouse clicks occur in slightly
different locations on each login.

Study Procedure

Each participant was randomly assigned to a condition set
that consisted of the 8 on-screen keyboards randomly ordered so that any
practice or fatigue effects would be distributed across the conditions.
This was a within-subjects design, with each participant "logging
in" four times with each on-screen keyboard condition.
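
The ordering scheme can be illustrated with a short sketch. It assumes a
fresh random permutation of the eight conditions for every participant;
the study may instead have drawn from a fixed pool of pre-built condition
sets, which the text does not specify. The names `build_session` and
`ATTEMPTS_PER_CONDITION` are hypothetical.

```python
import random

CONDITIONS = list(range(1, 9))   # the eight on-screen keyboards
ATTEMPTS_PER_CONDITION = 4       # each keyboard is used for four logins


def build_session(rng):
    """Return (condition, attempt) pairs for one participant.

    The conditions are shuffled per participant so that practice and
    fatigue effects are spread across conditions; the four attempts
    with a given keyboard stay together.
    """
    order = rng.sample(CONDITIONS, k=len(CONDITIONS))
    return [
        (condition, attempt)
        for condition in order
        for attempt in range(1, ATTEMPTS_PER_CONDITION + 1)
    ]


session = build_session(random.Random())
print(session[:4])   # the four attempts with the first assigned keyboard
```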

Their first task was to select the password they would use for the rest
of the study from a drop-down list displaying the six "strong security"
passwords listed earlier. They then had to type that password. The program
checked to make sure that the user truly had typed the password correctly,
including letter case. If not, they were given error feedback and not
allowed to proceed until they typed the password correctly. This was to
ensure that the user understood the proper way to enter the password.
Their chosen password was displayed to them on all subsequent pages below
the area where they had to enter it. After this initial check, no feedback
was given to the users about the accuracy of their entries (either in
the control condition or the on-screen keyboard conditions).

Next, users were taken to the control condition, for normal entry of
their chosen password four times via the keyboard. Users then attempted
to log in four times with each on-screen keyboard, and after each condition
they were taken to a rating page. Here they were asked to rate how easy
they thought that on-screen keyboard was to use on a 5-point scale.

Data Collected

For each condition we automatically recorded how long it took for each
login attempt, the actual password entered on each attempt, and whether
it was correct. After each condition we recorded the user-provided rating
of the ease/difficulty of using that on-screen keyboard.

Results

Since data from the participants was recorded at various points during
their completion of the study, we have data from different numbers of
participants at these various points, as follows:

672 people started the study.

548-581 completed each of the eight on-screen keyboard conditions.

In analyzing the data, we adopted the following rules for excluding some
data from the analyses:

We excluded any trials where the user obviously did not make a valid
attempt at entering the password chosen, such as any trials where the
user keyed only 1 or 2 characters or keyed random characters.

We dropped the time data for any trials where the user entered the
password in less than 1 second. This only applied to the Control Condition,
and was indicative of the user having copied the password to the clipboard
and pasted it on the subsequent Control trials.

We dropped the time data for any trials where the user took 100 seconds
or longer. This was to detect any cases where the user got interrupted
or otherwise distracted.
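
The exclusion rules above could be applied along the following lines.
This is a sketch under stated assumptions, not the analysis code used in
the study: the `Trial` record and the function name are hypothetical, and
the check for "random characters" is omitted because the text does not
say how such entries were identified.

```python
from dataclasses import dataclass

MIN_CHARS = 3         # entries of only 1 or 2 characters are not valid attempts
MIN_SECONDS = 1.0     # faster than this suggests copy-and-paste
MAX_SECONDS = 100.0   # this long or longer suggests an interruption


@dataclass
class Trial:
    entered: str      # the text the user submitted
    seconds: float    # time taken for the login attempt


def apply_exclusions(trial):
    """Return (keep_trial, usable_time).

    keep_trial is False when the whole trial is excluded; usable_time is
    None when only the time measurement is dropped.
    """
    if len(trial.entered) < MIN_CHARS:
        return False, None
    if trial.seconds < MIN_SECONDS or trial.seconds >= MAX_SECONDS:
        return True, None
    return True, trial.seconds


print(apply_exclusions(Trial("Bagel_Ocean38", 19.5)))
```

Keeping the two outcomes separate reflects the distinction in the rules:
a trial with an implausible time still contributes to the error data.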

Time Data

Figure 10 shows the mean time that each login attempt took the participants
for each of the on-screen keyboards as well as the Control condition.

As expected, users were significantly faster entering the
password via the normal keyboard (Control condition) than via any of the on-screen
keyboards. In fact, they were about twice as fast with the normal keyboard.
All of the on-screen keyboards yielded somewhat similar times, with means
ranging from 18 to 22 seconds. The two on-screen keyboards with a random
feature (Conditions 6 and 7) took slightly longer than any of the others.
Conditions 2, 4, and 8 yielded slightly shorter times than some of the
other on-screen keyboards.

Figure 11 shows the average time for each condition across the four login
attempts. All of the on-screen keyboard conditions showed a significant
practice effect, with the greatest improvement in time coming between
the first and second logins. The Control condition showed almost no practice
effect.

Figure 11. Time per login attempt for each trial and
each condition.

Error Data

Figure 12 shows the error rate for each of the conditions. These data
represent the percentage of login attempts in which the user entered a
password that did not exactly match their chosen password, including
letter-case errors,
substitution errors, etc. It appears that case errors were the most common
type (e.g., "Bagel_OCean38" instead of "Bagel_Ocean38").
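
One simple way to separate case errors from other errors is to compare
the entry and the target both exactly and case-insensitively. The sketch
below is illustrative only; the function name and category labels are
assumptions, not taken from the study.

```python
def classify_entry(entered, target):
    """Classify one login attempt against the chosen password."""
    if entered == target:
        return "correct"
    # Same characters in the same order, differing only in letter case.
    if entered.lower() == target.lower():
        return "case error"
    return "other error"


print(classify_entry("Bagel_OCean38", "Bagel_Ocean38"))
```

An entry that contains both a case error and a substitution falls into
the "other error" category under this scheme.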

Figure 12. Error rates in password entry for the conditions.
Error bars represent the 95% confidence interval for the mean.

Overall, the error rates were relatively high. This could
have been due to several factors:

The participants were not using a password that they were already
familiar with.

The passwords that they could choose from were all highly secure passwords,
with upper- and lower-case letters, digits, and special characters.
In spite of security guidelines, most people do not use passwords this
complex.

The participants were not given feedback about the accuracy of their
entries after the very first entry, which did not count toward the
results and served only to ensure that the user knew the correct password.

The error rates for all of the on-screen keyboards were
significantly higher than for normal entry via the keyboard (the Control
condition). But the error rate for one on-screen keyboard, Condition 4,
was significantly lower than the error rate for all the other on-screen
keyboards.

Subjective Ratings

Each on-screen keyboard received two subjective ratings: one
immediately after using it and another at the end of the study, after
participants had used all of the keyboards. The mean ratings given
immediately after using each keyboard are shown in Figure 13.

Figure 13. Mean ratings of ease of use, on a 5-point
scale, given immediately after using each on-screen keyboard. Higher values
are better. Error bars represent the 95% confidence interval for the mean.
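
For readers unfamiliar with the error bars, a 95% confidence interval
for a mean can be computed as sketched below. The text does not state
how the intervals were calculated; this sketch assumes the normal
approximation (mean plus or minus 1.96 standard errors), which is
reasonable for samples of several hundred participants. The ratings in
the example are made up for illustration and are not study data.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, lower, upper) using the normal approximation."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Made-up ratings on the 5-point scale, for illustration only.
example_ratings = [3, 4, 5, 4, 4]
print(mean_ci95(example_ratings))
```

With a sample as small as the one in the example, a t-based interval
would be more appropriate than the 1.96 multiplier.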

Conditions 2 and 8, which are the two normal on-screen QWERTY
keyboards (larger and smaller), received significantly better ratings
than all other on-screen keyboards except Condition 4. Conditions 6 and
7, which are the two with a random feature, received significantly worse
ratings than most of the other on-screen keyboards.

Discussion

As often happens in usability studies, the performance data
and subjective ratings show quite different patterns. Since the time data
for the various on-screen keyboards did not show differences of any practical
significance, we have focused more on the error data as the primary performance
measure. It is clear from the error data that Condition 4 came out significantly
better than all of the other on-screen keyboards. This condition used
two complete QWERTY keyboards, one for upper case and one for lower case,
stacked vertically. Key-cap appearances were not used, but visual highlighting
on mouseover was used. The preference data, on the other hand, shows clearly
that the users preferred the two on-screen keyboards which look the most
like a traditional keyboard (Conditions 2 and 8). This is perhaps due
to familiarity. Interestingly, though, Condition 4 (which yielded the
best performance data) yielded the next highest ratings (after 2 and 8).

Overall, we believe Condition 4 is the best alternative.
We also believe that the following general conclusions can be drawn from
this study:

None of the on-screen keyboards studied came close to the level of
performance for the Control condition (normal entry via the keyboard).
Most users are simply too highly skilled at using a normal keyboard.
Those motor skills do not translate well to an on-screen keyboard.

The random features studied here (Conditions 6 and 7) negatively impacted
both performance and subjective preference. Any benefit that such features
might offer from a security perspective would have to be weighed against
that.

Taken together, the two on-screen keyboards that included mouseover
highlight (3 and 4) came out better than most of the other on-screen
keyboards. Many participants even commented that the highlighting helped.

The on-screen keyboards that required the use of an on-screen Shift
key (Conditions 2, 5, 6, 7, and 8) yielded higher error rates, as a
group, than those that did not.

Although the on-screen keyboard with the lowest error rate was one
that used a QWERTY layout, the next-lowest error rate was for one that
used an alphabetic layout (Condition 3). Therefore, we do not believe
that a QWERTY layout is always better than an alphabetic layout for
on-screen keyboards.