Data analysis as a discourse

At the Libre Graphics Meeting 2008 in Wroclaw, just
before Michael Terry presents his project ingimp to an audience of curious
GIMP developers and users, we meet up to talk more about
‘instrumenting GIMP’ and about the way Terry thinks data
analysis could be done as a form of discourse. Michael Terry is a computer
scientist working at the Human Computer Interaction Lab of the University
of Waterloo, Canada and his main research focus is on improving usability
in Open Source software. We speak about ingimp, a clone of the popular
image manipulation programme GIMP, but with an important difference:
ingimp allows users to record data about their usage into a central
database, and subsequently makes this data available to anyone. This
conversation was also published in the Constant publication Tracks in
electr(on)ic fields.

Maybe we could start this conversation with a description of
the ingimp project you are developing and why you chose to work on
usability for GIMP?

So the project is ‘ingimp’, which is an
instrumented version of GIMP, it collects information about how the
software is used in practice. The idea is you download it, you install it,
and then with the exception of an additional start up screen, you use it
just like regular GIMP. So, our goal is to be as unobtrusive as possible
to make it really easy to get going with it, and then to just forget about
it. We want to get it into the hands of as many people as possible, so
that we can understand how the software is actually used in practice.
There are plenty of forums where people can express their opinions about
how GIMP should be designed, or what’s wrong with it, there are
plenty of bug reports that have been filed, there are plenty of usability
issues that have been identified, but what we really lack is some
information about how people actually apply this tool on a day to day
basis. What we want to do is elevate discussion above just anecdote and
gut feelings, and to say, well, there is this group of people who appear
to be using it in this way, these are the characteristics of their
environment, these are the sets of tools they work with, these are the
types of images they work with and so on, so that we have some real data
to ground discussions about how the software is actually used by people.
You asked me now why GIMP? I actually used GIMP extensively for my PhD
work. I had these little cousins come down and hang out with me in my
apartment after school, and I would set them up with GIMP, and quite often
they would start off with one picture, they would create a sphere,
a blue sphere, and then they played with filters until they got something
really different. I would turn to them looking at what they had been doing
for the past twenty minutes, and would be completely amazed at the results
they were getting just by fooling around with it. And so I thought, this
application has lots and lots of power, I’d like to use that power
to prototype new types of interface mechanisms. So I created JGimp, which
is a Java based extension for the 1.0 GIMP series, that I can use as a
back-end for prototyping novel user interfaces. I think that it is a great
application, there is a lot of power to it, and I had already an
investment in its code base so it made sense to use that as a platform for
testing out ideas of open instrumentation.

What is special about ingimp, is the fact that the data you
generate is made by the software you are studying itself. Could you
describe how that works?

Every bit of data we collect, we make available: you can go
to the website, you can download every log file that we have collected.
The intent really is for us to build tools and infrastructure so that the
community itself can sustain this analysis, can sustain this form of
usability. We don’t want to create a situation where we are creating
new dependencies on people, or where we are imposing new tasks on existing
project members. We want to create tools that follow the same ethos as
Open Source development, where anyone can look at the source code, where
anyone can make contributions, from filing a bug to doing something as
simple as writing a patch, where they don’t even have to have access
to the source code repository, to make valuable contributions. So
importantly, we want to have a really low barrier to participation. At the
same time, we want to increase the signal-to-noise ratio. Yesterday I
talked with Peter Sikking, an information architect working for GIMP, and
he and I both had this experience where we work with user interfaces, and
since everybody uses an interface, everybody feels they are an expert, so
there can be a lot of noise. So, not only did we want to create an open
environment for collecting this data, and analysing it, but we also want
to increase the chance that we are making valuable contributions, and that
the community itself can make valuable contributions. Like I said, there
is enough opinion out there. What we really need to do is to better
understand how the software is being used. So, we have made a point from
the start to try to be as open as possible with everything, so that anyone
can really contribute to the project.

ingimp has been running for a year now. What are you
finding?

I have started analysing the data, and I think one of the
things that we realised early on is that it is a very rich data set; we
have lots and lots of data. So, after a year we’ve had over 800
installations, and we’ve collected about 5000 log files,
representing over half a million commands, representing thousands of hours
of the application being used. And one of the things you have to realise
is that when you have a data set of that size, there are so many different
ways to look at it that my particular perspective might not be enough.
Even if you sit someone down, and you have him or her use the software for
twenty minutes, and you videotape it, then you can spend hours analysing
just that twenty minutes of videotape. And so, I think that one of the
things we realised is that we have to open up the process so that anyone
could easily participate. We have the log files available, but people really
didn’t have an infrastructure for analysing them. So, we created
this new piece of software called 'StatsJam', an extension to MediaWiki,
which allows anyone to go to the website and embed SQL-queries against the
ingimp data set and then visualise those results within the Wiki text. So,
I’ll be announcing that today and demonstrating that, but I have
been using that tool now for a week to complement the existing data
analysis we have done. One of the first things that we realized is that we
have over 800 installations, but then you have to ask, how many of those
are really serious users? A lot of people probably just were curious, they
downloaded it and installed it, found that it didn’t really do much
for them and so maybe they don’t use it anymore. So, the first thing
we had to do is figure out which data points we should really pay
attention to. We decided that a person should have saved an image, and
they should have used ingimp on two different occasions, preferably at
least a day apart, where they’d saved an image on both of the
instances. We used that as an indication of what a serious user is. So
with that filter in place, then the '800 installations' drops down to
about 200 people. So we had about 200 people using ingimp, and looking at
the data this represents about 800 hours of use, about 4000 log files, and
again still about half a million commands. So, it’s still a very
significant group of people. 200 people is still a lot, and that’s a
lot of data, representing about 11000 images they have been working on,
there’s just a lot.
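
As a rough illustration of the 'serious user' filter described above, here is a minimal Python sketch. The record layout (installation id, session start, whether an image was saved) and the sample data are assumptions made for this example, not ingimp's actual schema, and the 'preferably at least a day apart' condition is treated here as a strict requirement.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, simplified records: (installation_id, session_start, saved_image)
sessions = [
    ("a", datetime(2008, 1, 1, 10, 0), True),
    ("a", datetime(2008, 1, 3, 9, 30), True),
    ("b", datetime(2008, 1, 1, 10, 0), True),
    ("b", datetime(2008, 1, 1, 11, 0), True),
    ("c", datetime(2008, 1, 2, 8, 0), False),
    ("c", datetime(2008, 1, 5, 8, 0), True),
]

def serious_users(sessions, min_gap=timedelta(days=1)):
    """Return ids with at least two image-saving sessions min_gap or more apart."""
    saves = defaultdict(list)
    for user, start, saved in sessions:
        if saved:
            saves[user].append(start)
    # A single saving session gives a gap of zero, so it never qualifies.
    return {u for u, times in saves.items() if max(times) - min(times) >= min_gap}

print(sorted(serious_users(sessions)))  # ['a']
```

Only installation "a" passes: "b" saved twice but within an hour, and "c" saved an image in only one of its sessions.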

From that group, what we found is that use of ingimp is really short
and versatile. So, most sessions are about fifteen minutes or less, on
average. There are outliers, there are some people who use it for longer
periods of time, but really it boils down to them using it for about
fifteen minutes, and they are applying fewer than a hundred operations
when they are working on the image. I should probably be looking at my
data analysis as I say this, but they are very quick, short, versatile
sessions, and when they use it, they use less than 10 different tools, or
they apply less than 10 different commands when they are using it. What
else did we find? We found that the two most popular monitor resolutions
are 1280 by 1024 and 1024 by 768. So, those represent collectively 60% of
the resolutions, and really 1280 by 1024 represents pretty much the
maximum for most people, although you have some higher resolutions. So one
of the things that’s always contentious about GIMP, is its window
management scheme and the fact that it has multiple windows, right? And
some people say, well you know this works fine if you have two monitors,
because you can throw out the tools on one monitor and then your images
are on another monitor. Well, about 10% to 15% of ingimp users have two
monitors, so that design decision is not working out for most of the
people, if that is the best way to work. These are things I think that
people have been aware of, it’s just now we have some actual
concrete numbers where you can turn to and say, now this is how people are
using it. There is a wide range of tasks that people are performing with
the tool, but they are really short, bursty tasks.

Every time you start up ingimp, a screen comes up asking you
to describe what you are planning to do and I am interested in the kind of
language users invent to describe this, even when they sometimes
don’t know exactly what it is they are going to do. So inventing
language for possible actions with the software, has in a way become a
creative process that is now shared between interface designer, developer
and user. If you look at the ‘activity tags’ you are
collecting, do you find a new vocabulary developing?

I think there are 300 to 600 different activity tags that
people register within that group of ‘significant users’. I
didn’t have time to look at all of them, but it is interesting to
see how people are using that as a medium for communicating to us. Some
people will say, ‘Just testing out, ignore this!’ Or, people are
trying to do things like insert HTML code, to do like a cross-site
scripting attack, because, you have all the data on the website, so they
will try to play with that. Some people are very sparse and they say
‘image manipulation’ or ‘graphic design’ or
something like that, but then some people are much more verbose, and they
give more of a plan, ‘This is what I expect to be doing.’ So, I
think it has been interesting to see how people have adopted that and
what’s nice about it, is that it adds a really nice human element to
all this empirical data.

I wanted to ask you about the data, without getting too
technical, could you explain how these data are structured, what do the
log files look like?

So the log files are all in XML, and generally we compress
them, because they can get rather large. And the reason that they are
rather large is that we are very verbose in our logging. We want to be
completely transparent with respect to everything, so that if you have
some doubts or if you have some questions about what kind of data has been
collected, you should be able to look at the log file, and figure out a
lot about what that data is. That’s how we designed the XML log
files, and it was really driven by privacy concerns and by the desire to
be transparent and open. On the server side we take that log file and we
parse it out, and then we throw it into a database, so that we can query
the data set.
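
The pipeline described above (compressed XML log, parsed on the server, loaded into a queryable database) could be sketched roughly as follows. The XML element and attribute names, the table layout, and the use of gzip and SQLite are all assumptions for illustration; the interview does not show the real ingimp log format or server setup.

```python
import gzip
import sqlite3
import xml.etree.ElementTree as ET

# Invented log format for illustration only; not the real ingimp schema.
SAMPLE_LOG = b"""<log session="s1">
  <command name="crop" time="12.5"/>
  <command name="scale" time="40.0"/>
  <command name="crop" time="55.2"/>
</log>"""

def load_log(compressed, conn):
    """Decompress one log, parse it, and insert one row per command."""
    root = ET.fromstring(gzip.decompress(compressed))
    session = root.get("session")
    rows = [(session, c.get("name"), float(c.get("time")))
            for c in root.iter("command")]
    conn.executemany("INSERT INTO commands VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commands (session TEXT, name TEXT, time REAL)")
inserted = load_log(gzip.compress(SAMPLE_LOG), conn)

# Once the data is in a database, questions become SQL queries.
counts = conn.execute(
    "SELECT name, COUNT(*) FROM commands GROUP BY name ORDER BY name"
).fetchall()
print(inserted, counts)  # 3 [('crop', 2), ('scale', 1)]
```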

Now we are talking about privacy… I was impressed by
the work you have done on this; the project is unusually clear about why
certain things are logged, and other things not; mainly to prevent the
possibility of ‘playing back’ actions so that one could
identify individual users from the data set. So, while I understand there
are privacy issues at stake I was wondering… what if you could look
at the collected data as a kind of scripting for use? Writing a
choreography that might be replayed later?

Yes, we have been fairly conservative with the type of
information that we collect, because this really is the first instance
where anyone has captured such rich data about how people are using
software on a day to day basis, and then made all that data publicly
available. When a company does this, they will keep the data internally,
so you don’t have this risk of someone outside figuring something
out about a user that wasn’t intended to be discovered. We have to
deal with that risk, because we are trying to go about this in a very open
and transparent way, which means that people may be able to subject our
data to analysis or data mining techniques that we haven’t thought
of and extract information that we didn’t intend to be recording in
our file, but which is still there. So there are fairly sophisticated
techniques where you can do things like look at audio recordings of typing
and the timings between keystrokes, and then work backwards with the
sounds made to figure out the keys that people are likely pressing. So,
just with keyboard audio and keystroke timings alone you often have
enough information to be able to reconstruct what people are actually
typing. So we are always sort of wary about how much information is in
there. While it might be nice to be able to do something like record
people’s actions and then share that script, I don’t think
that that is really a good use of ingimp. That said, I think it is
interesting to ask, could we characterize people’s use enough, so
that we can start clustering groups of people together and then providing
a forum for these people to meet and learn from one another? That’s
something we haven’t worked out. I think we have enough work cut out
for us right now just to characterize how the community is using it.

It was not meant as a feature request, but as a way to
imagine how usability research could flip around and also become
productive work.

Yes, totally. I think one of the things that we found when
bringing people in to assess the basic usability of the ingimp software
and ingimp website, is that people like looking at things like what
commands other people are using, what the most frequently used commands
are, and part of the reason that they like that, is because of what it
teaches them about the application. So they might see a command they were
unaware of. So we have toyed with the idea of then providing not only the
command name, but then a link from that command name to the documentation
– but I didn’t have time to implement it, but certainly there
are possibilities like that, you can imagine.

Maybe another group can figure something out like that?
That’s the beauty of opening up your software plus data set of
course. Well, just a bit more on what is logged and what not… Maybe
you could explain where and why you put the limit and what kind of use you
might miss out on as a result?

I think it is important to keep in mind that whatever
instrument you use to study people, you are going to have some kind of
bias, you are going to get some information at the cost of other
information. So if you do a video taped observation of a user and you just
set up a camera, then you are not going to find details about the monitor
maybe, or maybe you are not really seeing what their hands are doing. No
matter what instrument you use, you are always getting a particular slice.
I think you have to work backwards and ask what kind of things do you want
to learn. And so the data that we collect right now, was really driven by
what people have done in the past in the area of instrumentation, but also
by us bringing people into the lab, observing them as they are using the
application, and noticing particular behaviours and saying, hey, that
seems to be interesting, so what kind of data could we collect to help us
identify those kinds of phenomena, or that kind of performance, or that
kind of activity? So again, the data that we were collecting was driven by
watching people, and figuring out what information will help us to
identify these types of activities. As I’ve said, this is really the
first project that is doing this, and we really need to make sure we
don’t poison the well. So if it happens that we collect some bit of
information, that then someone can later say, ‘Oh my gosh, here is the
person’s file system, here are the names they are using for the
files or whatever’, then it’s going to make the normal user
population wary of downloading this type of instrumented application.
The thing that concerns me most about Open Source developers
jumping into this domain is that they might not be thinking about how you
could potentially impact privacy.

I don’t know, I don’t want to get paranoid. But
if you are doing it, then there is a possibility someone else will do it
in a less considerate way.

I think it is only a matter of time before people start
doing this, because there are a lot of grumblings about, ‘we should be
doing instrumentation, someone just needs to sit down and do it’. Now
there is an extension out for Firefox that will collect this kind of data
as well, so you know…

Maybe users could talk with each other, and if they are
aware that this type of monitoring could happen, then that would add a
different social dimension…

It could. I think it is a matter of awareness, really, so
when we bring people into the lab and have them go to the ingimp website,
download and install it and use it, and go check out the stats on the
website, and then we ask questions like, what kind of data are we
collecting? We have a lengthy consent agreement that details the type of
information we are collecting and the ways your privacy could be impacted,
but people don’t read it.

So concretely… what information are you recording, and
what information are you not recording?

We record every command name that is applied to a document,
to an image. Where your privacy is at risk with that, is that if you write
a custom script, then that custom script’s name is going to be
inserted into a log file. And so if you are working for example for Lucas
or DreamWorks or something like that, or ILM, in some Hollywood movie
studio and you are using ingimp and you are writing scripts, then you
could have a script like 'fixing Shrek’s beard', and then that is
getting put into the log file and then people are going to know that the
studio uses ingimp. We collect command names, we collect things like what
windows are on the screen, their positions, their sizes, we take hashes of
layer names and file names. We take a string and then we create a hash
code for it, and we also collect information about how long is this
string, how many alphabetical characters, numbers, things like that, to
get a sense of whether people are using the same files, the same layer
names time and time again, and so on. But this is an instance where our
first pass at this, actually left open the possibility of people taking
those hashes and then reconstructing the original strings from that.
Because we have the hash code, we have the length of the string, all you
have to do is generate all possible strings of that length, take the hash
codes and figure out which hashes match. And so we had to go back and
create a new scheme for recording this type of information where we create
a hash and we create a random number, we pair those up on the client
machine but we only log the random number. So, from log to log then, we
can track if people use the same image names, but we have no idea of what
the original string was. There are these little 'gotchas', things to look
out for, that I don’t think most people are aware of, and this is
why I get really concerned about instrumentation efforts right now,
because there isn’t this body of experience of what kind of data
should we collect, and what shouldn’t we collect.
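
The weakness of the first scheme, and the idea behind the replacement, can be sketched in a few lines of Python. This is an illustration under assumptions, not ingimp's code: SHA-256 stands in for whatever hash was actually used, the search is limited to lowercase letters to keep it fast, and the `Pseudonymizer` class with its in-memory table is hypothetical (a real client would have to persist the table locally to track names 'from log to log').

```python
import hashlib
import itertools
import secrets
import string

def sha(text):
    # SHA-256 is an assumption; the interview does not name the hash function.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def brute_force(target_hash, length, alphabet=string.ascii_lowercase):
    """Recover a string from its hash when its length is also logged."""
    for chars in itertools.product(alphabet, repeat=length):
        candidate = "".join(chars)
        if sha(candidate) == target_hash:
            return candidate
    return None

class Pseudonymizer:
    """Client-side table pairing each hash with a random token.

    Only the token is written to the log; the table never leaves the machine.
    """
    def __init__(self):
        self._tokens = {}

    def token_for(self, name):
        return self._tokens.setdefault(sha(name), secrets.token_hex(8))

# First scheme: hash plus length is enough to enumerate short names.
recovered = brute_force(sha("cat"), 3)

# Revised scheme: equal names give equal tokens, but a token says nothing
# about the original string.
p = Pseudonymizer()
t1, t2, t3 = p.token_for("cat"), p.token_for("cat"), p.token_for("dog")
print(recovered, t1 == t2, t1 == t3)
```

Because the token is random rather than derived from the name, someone holding only the published logs has nothing to enumerate against.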

As we are talking about this, I am already more aware of
what data I would allow to be collected. Do you think by opening up this
data set and the transparent process of collecting and not collecting,
this will help educate users about these kinds of risks?

It might, but honestly I think probably the thing that will
educate people the most is if there was a really large privacy error and
that it got a lot of news, because then people would become more aware of
it because right now – and this is not to say that we want that to
happen with ingimp – but when we bring people in and we ask them
about privacy, ‘Are you concerned about privacy?’, and they say
‘No’, and we say ‘Why?’ Well, they inherently trust us, but
the fact is that Open Source also lends a certain amount of trust to it,
because they expect that since it is Open Source, the community will in
some sense police it and identify potential flaws with it.

Is that happening? Are you in dialogue with the Open
Source community about this?

No, I think probably five to ten people have looked at the
ingimp code – realistically speaking I don’t think a lot of
people looked at it. Some of the GIMP developers took a gander at it to
see how could we put this upstream, but I don’t want it upstream,
because I want it to always be an opt-in, so that it can’t be turned
on by mistake.

You mean you have to download ingimp and use it as a
separate program? It functions in the same way as GIMP, but it makes the
fact that it is a different tool very clear.

Right. You are more aware, because you are making that
choice to download that, compared to the regular version. There is this
awareness about that. We have this lengthy text based consent agreement
that talks about the data we collect, but less than two percent of the
population reads license agreements. And, most of our users are actually
non-native English speakers, so there are all these things that are
working against us. So, for the past year we have really been focussing on
privacy, not only in terms of how we collect the data, but how we make
people aware of what the software does. We have been developing wordless
diagrams to illustrate how the software functions, so that we don’t
have to worry about localisation errors as much. And so we have these
illustrations that show someone downloading ingimp, starting it up, a
graph appears, there is a little icon of a mouse and a keyboard on the
graph, and they type and you see the keyboard bar go up, and then at the
end when they close the application, you see the data being sent to a web
server. And then we show snapshots of them doing different things in the
software, and then show a corresponding graph change. So, we developed
these by bringing in both native and non-native speakers, having them look
at the diagrams and then tell us what they meant. We had to go through
about fifteen people and continual redesign until most people could
understand and tell us what they meant, without giving them any help or
prompts. So, this is an ongoing research effort, to come up with
techniques that not only work for ingimp but also for other
instrumentation efforts, so that people can become more aware of the
implications.

Can you say something about how this type of research
relates to classic usability research and in particular to the usability
work that is happening in GIMP?

Instrumentation is not new, commercial software companies
and researchers have been doing instrumentation for at least ten years,
probably ten to twenty years. So, the idea is not new but what is new, in
terms of the research aspects of this, is how do we do this in a way where
we can make all the data open? The fact that you make the data open,
really impacts your decision about the type of data you collect and how
you are representing it. And you need to really inform people about what
the software does. But I think your question is… how does it impact
the GIMP’s usability process? Not at all, right now. But that is
because we have intentionally been laying off to the side, until we got to
the point where we had an infrastructure, where the entire community could
really participate with the data analysis. We really want to have this to
be a self-sustaining infrastructure, we don’t want to create a
system where you have to rely on just one other person for this to
work.

What approach did you take in order to make this project
self-sustainable?

Collecting data is not hard. The challenge is to understand
the data, and I don’t want to create a situation where the community
is relying on only one person to do that kind of analysis, because this is
dangerous for a number of reasons. First of all, you are creating a
dependency on an external party, and that party might have other
obligations and commitments, and might have to leave at some point. If
that is the case, then you need to be able to pass the baton to someone
else, even if that could take a considerable amount of time and so on. You
also don’t want to have this external dependency, because of the
richness in the data, you really need to have multiple people looking at
it, and trying to understand and analyse it. So how are we addressing
this? It is through this StatsJam extension to the MediaWiki that I will
introduce today. Our hope is that this type of tool will lower the barrier
for the entire community to participate in the data analysis process,
whether they are simply commenting on the analysis we made or taking the
existing analysis, tweaking it to their own needs, or doing something
brand new.

In talking with members of the GIMP project here at the Libre Graphics
Meeting, they started asking questions like, ‘So how many people are
doing this, how many people are doing this and how many this?’
They’ll ask me while we are sitting in a café, and I will be
able to pop the database open and say, ‘A certain number of people have
done this’, or, ‘no one has actually used this tool at all’.
The danger is that this data is very rich and nuanced, and you can’t
really reduce these kinds of questions to an answer of ‘N people do
this’; you have to understand the larger context. You have to
understand why they are doing it, why they are not doing it. So, the data
helps to answer some questions, but it generates new questions. It gives
you some understanding of how the people are using it, but then it
generates new questions of, ‘Why is this the case? Is this because
these are just the people using ingimp, or is this some more widespread
phenomenon?’ They asked me yesterday how many people are using this colour
picker tool – I can’t remember the exact name – so I
looked and there was no record of it being used at all in my data set. So
I asked them when this came out, and they said, ‘Well it has been
there at least since 2.4.’ And then you look at my data set, and you
notice that most of my users are in the 2.2 series, so that could be part
of the reason. Another reason could be, that they just don’t know
that it is there, they don’t know how to use it and so on. So, I can
answer the question, but then you have to sort of dig a bit deeper.

You mean you can’t say that because it is not used, it
doesn’t deserve any attention?

Yes, you just can’t jump to conclusions like that,
which is again why we want to have this community website, which shows the
reasoning behind the analysis. Here are the steps we had to go through to
get this result, so you can understand what that means, what the context
means, because if you don’t have that context, then it’s sort
of meaningless. It’s like asking, what are the most frequently used
commands? This is something that people like to ask about. Well really,
how do you interpret that? Is it the numbers of times it has been used
across all log files? Is it the number of people that have used it? Is it
the number of log files where it has been used at least once? There are
lots and lots of ways in which you can interpret this question. So, you
really need to approach this data analysis as a discourse, where you are
saying, here are my assumptions, here is how I am getting to this
conclusion, and this is what it means for this particular group of people.
So again, I think it is dangerous if one person does that and you come
to rely on that one person. We really want to have lots of people looking
at it, and considering it, and thinking about the implications.
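
The three readings of 'most frequently used command' mentioned above can give three different answers on the same data. A small Python sketch, using invented records of the form (user, log file, command) rather than real ingimp data:

```python
from collections import Counter

# Invented records for illustration: (user, log_file, command)
events = [
    ("u1", "log1", "crop"), ("u1", "log1", "crop"), ("u1", "log1", "crop"),
    ("u1", "log2", "crop"),
    ("u2", "log3", "scale"), ("u3", "log4", "scale"),
    ("u2", "log3", "blur"), ("u3", "log4", "blur"), ("u3", "log5", "blur"),
]

# 1. Number of times the command was applied, across all log files.
total_uses = Counter(cmd for _, _, cmd in events)

# 2. Number of distinct people who used the command at least once.
distinct_users = Counter(cmd for _, cmd in {(user, cmd) for user, _, cmd in events})

# 3. Number of log files in which the command appears at least once.
distinct_logs = Counter(cmd for _, cmd in {(log, cmd) for _, log, cmd in events})

print(dict(total_uses))      # {'crop': 4, 'scale': 2, 'blur': 3}
print(dict(distinct_users))
print(dict(distinct_logs))
```

Here 'crop' leads by raw count (4 uses) but comes from a single person, while 'blur' appears in the most log files (3) and ties with 'scale' on number of users (2 each). Which ranking counts as the answer depends on the assumptions stated alongside it.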

Do you expect that this will impact the kind of interfaces
that can be done for GIMP?

I don’t necessarily think it is going to impact
interface design, I see it really as a sort of reality check: this is how
communities are using the software and now you can take that information
and ask, do we want to better support these people or do we…For
example on my data set, most people are working on relatively small images
for short periods of time, the images typically have one or two layers, so
they are not really complex images. So regarding your question, one of the
things you can ask is, should we be creating a simple tool to meet these
people’s needs? All these people are doing is cropping and
resizing, fairly common operations, so should we create a tool that strips
away the rest of the stuff? Or, should we figure out why people are not
using any other functionality, and then try to improve the usability of
that? There are so many ways to use data I don’t really know how it
is going to be used, but I know it doesn’t drive design. Design
happens from a really good understanding of the users, the types of tasks
they perform, the range of possible interface designs that are out there,
lots of prototyping, evaluating those prototypes and so on. Our data set
really is a small potential part of that process. You can say, well
according to this data set, it doesn’t look like many people are
using this feature, let’s not focus too much on that, let’s
focus on these other features or conversely, let’s figure out why
they are not using them… Or you might even look at things like how
big their monitor resolutions are, and say well, given the size of the
monitor resolution, maybe this particular design idea is not feasible. But
I think it is going to complement the existing practices, in the best
case.

And do you see a difference in how interface design is done
in free software projects, and in proprietary software?

Well, I have been mostly involved in the research community,
so I don’t have a lot of exposure to design projects. I mean, in my
community we are always trying to look at generating new knowledge, and
not necessarily at how to get a product out the door. So, the goals or
objectives are certainly different. I think one of the dangers in your
question is that you sort of lump a lot of different projects and project
styles into one category of 'Open Source'. 'Open source' ranges from
volunteer driven projects to corporate projects, where they are actually
trying to make money out of it. There is a huge diversity of projects that
are out there; there is a wide diversity of styles, there is as much
diversity in the Open Source world as there is in the proprietary world.
One thing you can probably say, is that for some projects that are
completely volunteer driven like GIMP, they are resource strapped. There
is more work than they can possibly tackle with the number of resources
they have. That makes it very challenging to do interface design, I mean,
when you look at interface code, it makes up 50% or 75% of a code base.
That is not insignificant, it is very difficult to hack and you need to
have lots of time and manpower to be able to do significant things. And
that’s probably one of the biggest differences you see for the
volunteer driven projects, it is really a labour of love for these people
and so very often the new things interest them, whereas with a commercial
software company developers are going to have to do things sometimes they
don’t like, because that is what is going to sell the product.