As the train I'm on ambles its unheated way through the
unseasonably
wintry English countryside, it's time for another
“weekly”
exciting
entrepreneurial update. Actually I should be properly
working,
not just talking about working, but there's a file I need
for that
elsewhere, and Three's
mobile
Internet coverage evaporates about 3 minutes outside
Waterloo station
– if only there were a
company dedicated to bettering mobile data
infrastructure... So,
here I am, with means, motive and opportunity to write a
diary entry.

Since I last wrote, I have fought with R's handling of
categorical
variables in linear models; the eventual outcome was a score
draw.
The notion of a contrast is a useful one. Very often, when we
have a heap of conditions under which we observe some value, what
we're interested in is not so much the predicted value given some
condition as the difference between the value under one condition
and the value under another. The canonical example is probably the
difference between a group receiving a trial treatment and a group
receiving a control or placebo; hence the default contrast for
unordered categorical variables in R is called the treatment
contrast (contr.treatment).
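As a small illustration of treatment contrasts, here is a minimal sketch using R's built-in PlantGrowth data set. The data set and the name m.trt are stand-ins chosen for the example, not anything from the analysis described in this entry.

```r
# Treatment contrasts for a 3-level factor: each column compares one
# level with the baseline level (row 1).
contr.treatment(3)

# In a default R session these are the contrasts used for unordered factors.
getOption("contrasts")

# PlantGrowth (built in) has a factor 'group' with levels ctrl, trt1, trt2.
# The non-intercept coefficients are differences from the first level.
m.trt <- lm(weight ~ group, data = PlantGrowth)
coef(m.trt)
```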

In my particular case, I wanted to know the difference between any
particular category and the average response – none of
the
categories I had in my system should have been privileged
over any of
the others, and there wasn't anything like a
“control”
group, so
comparing against the overall average is a reasonable thing
to want to
do, and indeed it is supported in R through the use of the
sum
contrast contr.sum. However, this reveals a slight technical
problem: the overall average plus one difference for each level of
a categorical variable comes to one more unknown than the
(effective) number of values; just as in simultaneous equations,
this is a Bad Thing. (Technically, the system becomes
underdetermined.) So, in solving the
system, one of
the differences is jettisoned; my problem was that I wanted
to
visualise that information for all the differences, whether
or not the
last one was technically redundant – particularly
since I wanted
to offer a guideline as to which differences were most
strongly
different from the average, and I would be out of luck if
the most
unusual one happened to be the one jettisoned. Obviously I
could
trivially compute the last difference, simply from the
constraint that
all the differences must sum to zero (and actually dummy.coef
does that for me); but what about its standard error?
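To make the redundancy concrete, here is a minimal sketch, again on the built-in PlantGrowth data as a stand-in (m.sum and d are names invented for the example). It fits a model with sum contrasts and uses dummy.coef to recover the difference for every level, including the one dropped from the fit.

```r
# Fit with sum contrasts on the factor 'group' (3 levels).
m.sum <- lm(weight ~ group, data = PlantGrowth,
            contrasts = list(group = "contr.sum"))

# The contrast matrix: the last level is coded as -1 in every column.
contr.sum(3)

# Only K - 1 = 2 differences are estimated alongside the intercept.
coef(m.sum)

# dummy.coef expands these to one difference per level.
d <- dummy.coef(m.sum)$group
d

# The sum-to-zero constraint holds up to rounding error.
sum(d)
```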

Enter se.contrast. This operator allows the user to construct an
arbitrary contrast, expressed most simply as a vector of
contributions to that contrast, and ask an aov object for the
standard error of that contrast. Some experimentation later, for a
linear model m for len observations, a particular factor variable
f, and a function class.ind to construct a matrix of class
indicator values (i.e. for a vector v_i of observations, construct
a matrix x_ij where x_ij is 1 if observation i came from condition
j, and zero otherwise), I
think that:

gives me a vector ses of the standard errors
corresponding to
the sum contrasts in my system, including the degenerate
one. (As
seems to be standard in this kind of endeavour, the effort
per net
line of code is huge; please do not think that I wrote these
five
lines of code off the top of my head. Thanks go to the denizens of
the r-help mailing list, and in particular to Greg Snow for his
answer to my question about this.)
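For illustration, here is a minimal sketch of this kind of recipe for a single factor. It is a reconstruction under stated assumptions rather than the original five lines: it uses the built-in PlantGrowth data as a stand-in, builds the indicator matrix with outer instead of class.ind, and the names ind, w and cmat are invented for the example.

```r
# Stand-in data: response 'weight', single factor 'group' (assumption).
f <- PlantGrowth$group
m <- aov(weight ~ group, data = PlantGrowth,
         contrasts = list(group = "contr.sum"))
K <- nlevels(f)

# Indicator matrix: ind[i, j] is 1 if observation i is in level j, else 0
# (the same shape of matrix that a class.ind helper would build).
ind <- outer(as.integer(f), seq_len(K), "==") * 1

# Scale each column by its level size, so that column j averages level j.
w <- sweep(ind, 2, colSums(ind), "/")

# Column j of cmat is "mean of level j minus the mean of all level means",
# written as one weight per observation; each column sums to zero.
cmat <- w - rowSums(w) / K
colnames(cmat) <- levels(f)

# One standard error per level, including the level dropped from the fit.
ses <- se.contrast(m, cmat)
ses

# Sanity check: the first K - 1 values should agree with the standard
# errors reported for the fitted sum-contrast coefficients.
summary.lm(m)$coefficients[-1, "Std. Error"]
```

Dividing by the level sizes rather than by the total number of observations keeps each column a proper contrast even when the design is unbalanced.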

So, this looks like total victory! Why have I described
this as only
a score draw? Well, because while the above recipe works
for a single
factor variable, in the case I am actually dealing with I
have all
sorts of interaction terms between factors, and between
factors and
numerical variables, and again I want to display and examine
all the
contrasts, not just some subset of them chosen so that the
system of
equations to solve is nondegenerate. This looked
sufficiently
challenging, and the analysis to be done looked sufficiently
peripheral to the current business focus, that it's been
shelved,
maybe for a rematch in the new year.