Automatic web usability evaluation:
what needs to be done?

Giorgio Brajnik

giorgio@dimi.uniud.it

Abstract

Website redesign and maintenance are likely to absorb more and more resources as web technologies and uses keep evolving at the current pace. Usability evaluation methods need to be run after each change in order to ensure a decent quality level. One way to control the complexity and cost of website maintenance is to use tools that perform automatic usability evaluations.

I present a survey of tools that analyze websites, illustrating what kind of automatic tests they perform and which usability factors the tests are most closely related to. The survey then leads to an analysis of the remaining gaps and of openings for research.

1. Introduction

It is well known that the average quality of websites is poor, “lack of
navigability” being the #1 cause of user dissatisfaction [Fleming, 1998;
Nielsen, 1999].

On the one hand, web technologies evolve extremely fast, enabling sophisticated tools to be deployed and complex interactions to take place. Moreover, the life cycle of a website is extremely short: a website is maintained at a higher rate than other software products because of market pressure and the lack of distribution barriers. In addition, the scope of maintenance often becomes so wide that a complete redesign takes place.

On the other hand, the quality of a website is rooted in its usability, which usually results from the adoption of user-centered development and evaluation approaches [Newman and Lamming, 1994; Fleming, 1998; Rosenfeld and Morville, 1998; Nielsen, 1999]. Usability testing is thus a necessary and repeated step during the life cycle of a website.

To test the usability of a website, a developer can adopt two kinds of methods: usability inspection methods (e.g. heuristic evaluation [Nielsen and Mack, 1994]) or user testing [Nielsen, 2000]. Heuristic evaluation is based on a pool of experts who inspect and use (part of) a website and identify usability problems that they assume will affect end users. With user testing, a sample of the user population of the website is selected and asked to use (part of) the website and report things that they think did not work or were not appropriate.

Even though the cost (in terms of time and effort) of both methods is not particularly high, and their application improves website quality and reduces the overall development cost, they are not systematically performed at detailed levels on every part of a website after each maintenance or development step.

It is clear that as change actions on a website increase rapidly in number and variety, more and more resources need to be deployed to ensure that website quality does not decrease (and hopefully increases). It is also clear that any tool that can, at least in part, automate the usability evaluation and maintenance processes will help to fill this ever-widening gap.

The goal of this paper is to present a brief survey of what these tools do and how they contribute to the usability evaluation problem. The analysis shows that gaps exist between what these tools achieve and what is required to ensure usability. While some of these gaps are inherently unsolvable, others can probably be filled, provided that additional research is carried out to identify effective techniques.

2. A software engineering view of a website

A website is an interactive software system. It interacts with at least
two different kinds of users: end users trying to achieve some goal and
developers/maintainers striving to keep the system working and improving
it.

End users can be characterized in terms of:

goals and tasks: e.g. seeking information, choosing where to buy a specific product, buying it, writing a book review, etc.

context: user behavior during information seeking processes is strongly affected by users’ culture, language, previous knowledge of the field, and experience in using the web.

technology: end users interact with the website through a layer of technology that is not under the control of the web designer: browsers, protocols, plug-ins, operating system platforms, interaction devices (screens, speaking devices, pens, reduced telephone keyboards, etc.), network connections.

Information seeking through browsing is a process that almost all websites must support. Unfortunately, it is also a difficult task to model and support, because it encompasses complex cognitive, social and cultural processes [Allen, 1996], spanning the interpretation of textual, visual and audio messages, the selection of relevant information, and learning.

On the other hand we have developers and maintainers. Among their activities, a prominent role is played by: corrective maintenance (i.e. fixing problems with the website behavior or inserting missing content), adaptive maintenance (i.e. upgrading the site with respect to new technologies, like new browsers’ capabilities), perfective maintenance (i.e. improving the site behavior or content), and preventive maintenance (i.e. fixing problems in behavior or content before they affect users). A large fraction of these activities is aimed at detecting system failures (that is, departures from the required behavior), analyzing them and identifying faults (that is, representations within the system of human errors that occurred during development: bugs).

Maintenance is meant to improve the quality
of the website. ISO9126 defines quality as “the totality of features
and characteristics of a software product that bear on its ability to satisfy
stated or implied needs” and it includes properties like maintainability,
robustness, reliability and usability that are particularly important for
websites.

Usability can be defined (ISO9241) as “the effectiveness, efficiency and satisfaction with which specified users achieve specified goals in particular environments”, where:

effectiveness means “the accuracy and completeness with which users achieve specified goals”,

efficiency means “the resources expended in relation to the accuracy and completeness of goals achieved”, and

satisfaction means “the comfort and acceptability of the work system to its users and other people affected by its use”.
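One possible operationalization of the first two measures for a single task is sketched below. It is purely illustrative: the function names and the reduction of “resources” to task time are my own simplifying assumptions, not part of the standard.

    # Illustrative operationalization of the ISO9241 definitions above.
    # Effectiveness: completeness/accuracy of goal achievement;
    # efficiency: effectiveness in relation to the resources expended.
    def effectiveness(goals_achieved: int, goals_attempted: int) -> float:
        return goals_achieved / goals_attempted

    def efficiency(goals_achieved: int, goals_attempted: int,
                   time_spent_s: float) -> float:
        # Resources are reduced here to task time, the simplest choice.
        return effectiveness(goals_achieved, goals_attempted) / time_spent_s

    print(effectiveness(8, 10))        # 0.80 of attempted goals achieved
    print(efficiency(8, 10, 120.0))    # achievement per second of work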

General properties like these are not independent: for example, a robustness failure of a website (e.g. some browser incompatibility) will also result in a usability failure (e.g. user inability to complete a task, and dissatisfaction).

In order to be operationalized, these properties need to be decomposed into more detailed ones that can be assessed in a simpler and perhaps more standard way. For example, maintainability can be decomposed into the complexity of the DHTML code, its size, the number of absolute URLs, etc.
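As a purely illustrative sketch (mine, not taken from any of the surveyed tools), internal attributes of this kind can be computed directly from the page source; the function name, the chosen attributes and the example page are assumptions made for the sake of the example.

    # Sketch: computing two internal maintainability attributes of a page.
    # The attributes (code size, number of absolute local URLs) follow the
    # decomposition suggested in the text; everything else is illustrative.
    import re

    def maintainability_attributes(html: str, site_host: str) -> dict:
        """Return simple internal attributes computed from the HTML source."""
        # Code size: a crude proxy for the complexity of the (D)HTML code.
        size_bytes = len(html.encode("utf-8"))
        # Absolute URLs that point inside the site itself: these should be
        # relative, otherwise moving the site breaks the links.
        urls = re.findall(r'(?:href|src)\s*=\s*["\'](http[^"\']+)["\']',
                          html, flags=re.IGNORECASE)
        absolute_local = [u for u in urls if site_host in u]
        return {"size_bytes": size_bytes,
                "n_absolute_local_urls": len(absolute_local)}

    page = '<a href="http://www.example.org/about.html">About</a>'
    print(maintainability_attributes(page, "www.example.org"))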

The same applies to usability. It can be described in terms of usability factors (like speed of use, error rate, ease of error recovery, etc.) which in turn can be reduced to other lower-level properties. The most important properties for website usability include those related to “navigability” (most of them taken from [Fleming, 1998]):

1. consistency (of presentation and of navigation mechanisms)

2. adequate feedback (e.g. about the freshness of content)

3. natural organization of information

4. contextual navigation (users can tell where they are and where they can go)

5. efficient navigation (economy of time and actions)

6. clear and meaningful labels

7. robustness with respect to the users’ technology

8. flexibility (alternative access paths and media)

9. support of users’ goals

10. maintainability

11. other general properties (e.g. correct spelling, searchability)

“Support of users’ goals” (item 9) can be further decomposed if we narrow users’ goals. For e-commerce sites, for example, other relevant attributes can be:

how security is handled and how easy it is to get information about
it

similarly for privacy

how easy and effective it is to find the desired item

how easy and effective it is to search the catalog for an item not known
a priori

how easy and effective it is to preview an item

what the return policies are and how they are communicated

The Web Accessibility Initiative [W3C, 2000] is an effort by the W3C to improve website accessibility. It publishes a set of guidelines [WAI, 1999] where accessibility is defined as the ability of a website to be used by someone with disabilities.
An accessible website:

transforms gracefully: its pages remain usable despite physical, sensory or technological constraints on the user’s side;

makes content understandable and navigable: it presents its content in a clear and simple language, and provides understandable mechanisms to navigate within and between pages.

While usability implies accessibility (at least when an unconstrained user population is considered), the converse is not necessarily true. For example, a missing link to the home page may be a fault affecting usability, while it does not affect accessibility.

All these properties (whether related to usability or to accessibility) may be further decomposed into more detailed ones that refer to specific attributes of the website implementation. In fact, such a decomposition has to be done in order to support usability inspection methods and to identify and fix faults. For example, to determine how flexible a website is, we need to inspect the implementation (or perhaps the design specifications) to determine whether there is a textual version of the page, whether there are textual links that duplicate those embedded in images, etc.

Some of these lower-level properties refer to attributes that depend only on how the website has been designed and developed (e.g. textual duplicates of links embedded in images): these are internal attributes. Others depend on the website and its usage (e.g. how meaningful a label is): these are external attributes. The latter is always the case for properties referring to content, which require some sort of interpretation assigning meaning to symbols before they can be assessed.

While evaluating the usability of a website requires both internal and external attributes, only the former are amenable to automatic tests. External attributes can be evaluated only via semi-automatic means that entail a human evaluation step. However, tools can provide useful assistance by filtering and ranking content that is potentially relevant (for example, by adopting statistical techniques developed in Information Retrieval [Belkin and Croft, 1987]).
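For instance, whether a link label is meaningful is an external attribute that no tool can decide; but a tool can rank labels so that the most suspicious ones are shown to the human evaluator first. The following minimal sketch assumes two invented heuristics (a stop list of generic phrases and a penalty for one-word labels); it does not reproduce the behavior of any surveyed tool.

    # Sketch: semi-automatic support for an external attribute.
    # Link labels cannot be judged "meaningful" automatically, but they can
    # be ranked so that a human evaluator sees the most suspicious first.
    GENERIC_LABELS = {"click here", "here", "more", "link", "this page"}

    def suspicion(label: str) -> float:
        """Higher score = more likely to be a meaningless label."""
        text = label.strip().lower()
        score = 0.0
        if text in GENERIC_LABELS:
            score += 2.0   # known generic phrase
        if len(text.split()) <= 1:
            score += 1.0   # one-word labels carry little context
        return score

    labels = ["click here", "2000 annual report", "more"]
    for label in sorted(labels, key=suspicion, reverse=True):
        print(f"{suspicion(label):.1f}  {label!r}")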

3. Automatic tools for usability evaluation

Tools that support the developer/maintainer
in finding usability faults and fixing them can be classified according
to:

location: web-based vs off-line

type of service: failure identifiers (they discover potential failures via simulation of user actions, like filling a form; sometimes they rank failures according to severity); fault analyzers (they find failures and highlight their causes, i.e. faults, usually by systematically analyzing the source code of the website; sometimes they rank the list of faults according to severity); analysis and repair tools (they also assist the developer in fixing the faults)

information source: automatic usability analysis can be performed on the basis of the actual implementation of a website (sources), on webserver logs, or on data acquired during user testing (user testing data); this paper deals only with tools analyzing website sources

scope, i.e. the set of attributes that are considered during the automatic analysis.

The surveyed tools (the full list appears in the table below) include, for example:

WebCriteria: available from WebCriteria [WebCriteria, 2000]; web-based; comparative evaluation of a website with respect to a benchmark derived from similar well-established websites; failure identifier.

A classification of tests based on scope includes, among others (a sketch of some of these tests is given after this list):

forced downloading: links embedding an image in their label cannot
be followed without downloading the image

tables/frames/font resizing: relative sizes should be used

support of users’ goals

form coding: forms should have “submit”, “reset” buttons

maintainability

relative links: URLs that are local to the website should be relative

other

spelling: spell-check the content of pages

different media: report on the number of different media that
are used in pages/website

keywords/description: pages should have appropriate META information
to be searchable by search engines

site popularity: how many other websites point to the one under
analysis

marquee,blink: avoid animated features
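Many of the tests listed above reduce to simple checks on the HTML source. The sketch below is my own illustration (not code from any surveyed tool) of two of them: “form coding” (forms should have “submit” and “reset” buttons) and “marquee/blink” (animated features should be avoided).

    # Sketch of two of the automatic tests listed above, using the standard
    # html.parser module. Illustrative only; not from any surveyed tool.
    from html.parser import HTMLParser

    class UsabilityChecker(HTMLParser):
        def __init__(self):
            super().__init__()
            self.faults = []
            self.form_inputs = None   # input types seen in the current form

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("marquee", "blink"):
                self.faults.append(f"animated feature <{tag}>")
            elif tag == "form":
                self.form_inputs = set()
            elif tag == "input" and self.form_inputs is not None:
                self.form_inputs.add(attrs.get("type", "text").lower())

        def handle_endtag(self, tag):
            if tag == "form" and self.form_inputs is not None:
                for required in ("submit", "reset"):
                    if required not in self.form_inputs:
                        self.faults.append(f"form without {required!r} button")
                self.form_inputs = None

    checker = UsabilityChecker()
    checker.feed('<blink>new!</blink><form><input type="text"></form>')
    print(checker.faults)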

Item 3 in the previous list (natural organization of information) has not been reduced to any lower-level attribute, since it refers to an external property that cannot be assessed without human intervention. The situation for “adequate feedback” (n. 2), “support of users’ goals” (n. 9) and “maintainability” (n. 10) is similar, though slightly more positive.

The following table summarizes the tests performed by the tools considered: for each test, the number of tools that perform it; for each tool, the total number of tests it performs.

Range of tests performed by reviewed tools (number of tools, out of the 11 reviewed, performing each test)

TEST                            TOOLS PERFORMING IT
1.1 underline                                     1
1.2 link label consistency                        1
1.3 email consistency                             1
1.4 color consistency                             1
1.5 background consistency                        1
1.6 nav-bar consistency                           1
2.1 freshness                                     2
4.1 noframes validity                             4
4.2 link to home                                  1
4.3 logical path                                  1
4.4 self-referencing pages                        1
4.5 frame titles                                  3
4.6 local links validity                          5
4.7 external links validity                       5
5.1 site depth                                    2
5.2 table coding                                  1
5.3 image coding                                  5
5.4 download time                                 7
5.5 recycled graphics                             1
5.6 hidden elements                               1
6.1 informative labels                            2
6.2 explicit mailto                               1
6.3 missing page title                            2
6.4 table headers                                 1
6.5 form prompts                                  1
7.1 browser compatibility                         3
7.2 safe colors                                   2
7.3 link targets                                  2
7.4 HTML validity                                 6
7.7 portable faces                                1
7.8 color contrast                                1
8.1 image ALT                                     7
8.2 other ALT                                     6
8.3 imagemap links                                3
8.4 auto-refresh                                  2
8.5 forced downloading                            1
8.6 resize                                        1
9.1 form coding                                   3
10.1 relative links                               1
11.1 spell checking                               3
11.2 different media                              2
11.3 keywords/description                         4
11.4 site popularity                              1
11.5 marquee/blink                                4

TOTAL tests per tool: MacroBot 1, MetaBot 2, WebCriteria 4, A-Prompt 9, Bobby 8, NetMech 9, WebGarage 8, LinkBot 8, DrHtml 10, WebSAT 12, LIFT 34

4. Analysis

The table shows a relatively sparse set of features. In particular, there is no tool dealing with the external properties related to item 3 (“natural organization of information”). The same holds for other items pinpointed in the previous section, such as “adequate feedback” and “maintainability”. Adequate feedback requires an interaction based on pages whose information conveys such meaning, relatively complex programmatic actions that are more difficult to analyze automatically (for example because they are written in JavaScript instead of plain HTML), or both. Maintainability, on the other hand, does not affect usability and is therefore probably not related to the goal of those tools.

The most frequently adopted tests are the download time of a page, the presence of alternative textual descriptions, the validation of HTML and links, and the presence of search keywords and document descriptions. These are, unsurprisingly, the tests with the best cost/benefit ratio: they are easy to implement and accurate, in the sense that they rarely fail, either by missing actual faults (false negatives) or by reporting non-existing faults (false positives).

There are areas in the table that are poorly covered: “consistency” (n. 1), “contextual navigation” (n. 4) and “clear and meaningful labels” (n. 6). The tests encompassed within these items are clearly more difficult to implement than the previously discussed ones. Furthermore, they are also less accurate, as they relate to properties that are partly external: consistency, clarity and meaningfulness are, like beauty, in the eye of the beholder. Nonetheless, these tests could be used as heuristic tools, highlighting aspects that are potential problems. By adopting proper ranking strategies, these aspects can be shown to the tool user without necessarily overloading him or her.

5. The test effectiveness problem

While these tools offer a test suite that is reasonably wide and open, at the moment there is no standard way to assess the usability of the tools themselves. This is particularly true for their effectiveness, that is, how accurate the tests they run are. Determining the means to measure and evaluate test effectiveness is an important requirement, from both research and pragmatic viewpoints. In fact, a standard tool evaluation methodology:
could be used to assess the validity of each test and consequently of each tool;

could be used to compare the effectiveness of different tools;

could be used to define standard levels of effectiveness, which might then translate into standard usability levels for websites that have passed certified tests;

could provide insights for a proper interpretation of the results produced by tests (what the consequences of the problems identified and fixed by tools can be).

The research on web usability and accessibility guidelines [WAI, 1999; Scapin et al., 2000] is a first step towards such a methodology, but more is needed.

Given the fast evolution pace of web technologies and uses, an evaluation methodology can probably only be based on experiments comparing test results with results obtained through other usability evaluation methods, namely usability inspection methods and user testing.

It should specify a set of tests (by identifying possible usability failures and related faults), how test effectiveness is to be measured, and how the experiment should be performed (what kind of user testing, what kind of questionnaires or data acquisition methods should be adopted, etc.) in order to be valid. The Goal-Question-Metric approach [Fenton and Lawrence Pfleeger, 1997] could be followed as a framework to define such a methodology.
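As a purely hypothetical instance of such a framework: a goal could be “evaluate how well a given tool detects missing textual equivalents”; a question derived from it, “what fraction of the missing-ALT faults uncovered by heuristic evaluation does the tool also report?”; and the corresponding metric, that fraction measured over a sample of pages.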

Notice that even though many tests are likely to yield false positives, the major consequence of this is a reduced productivity of the maintainer (who has to cope with incorrect information). In my view, it is more important to define effectiveness in terms of the number of false negatives, that is, cases where the automatic tool was unable to identify a fault that was instead uncovered by other means.

Test sites could be set up in which specific faults are injected with the purpose of exercising certain tests. Tools could then be evaluated on the basis of the number of injected faults that they uncover.
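A minimal sketch of such a score follows; the injected and reported fault identifiers are invented for illustration. Following the argument above, recall (the complement of the false negative rate) is the quantity to emphasize.

    # Sketch: scoring a tool against a test site with injected faults.
    # Fault identifiers are invented for illustration.
    injected = {"missing-alt", "no-submit-button", "absolute-local-url", "blink"}
    reported = {"missing-alt", "blink", "slow-page"}

    true_positives = injected & reported
    false_negatives = injected - reported   # injected faults the tool missed
    false_positives = reported - injected   # reports not matching injected faults

    recall = len(true_positives) / len(injected)
    print(f"recall = {recall:.2f}, missed: {sorted(false_negatives)}")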

6. Conclusions

In this paper a brief survey of automatic usability evaluation tools for websites has been presented. These tools consider a large set of properties that depend only on attributes of the websites themselves (and not on the context in which websites are used; in particular, content is not considered).

Especially those tools supporting repair actions (in addition to the identification of usability faults) have the potential to dramatically reduce the time and effort needed to perform maintenance activities.

Several tests are still missing, even though they appear viable with currently available technology. In other cases, in order to advance the state of the art in automatic usability evaluation, the test effectiveness problem needs to be formulated and solved. This is the problem of defining a standard methodology for evaluating the effectiveness of these tools, which in turn requires that appropriate models of usability be defined.

1. The tool list is based on a subjective selection of the tools that are described on the web and that appear to offer significant evaluation services (as of the end of May 2000).

2. The test list was compiled on the basis of information about the tools gathered from the web in May 2000; I considered only the tests that can be performed automatically. In many cases a test belongs to more than one category: I listed it in the category that I believe fits best.