Report from NILM2014@London on comparing NILM algorithms

The first “NILM in London” workshop was held on Wednesday 3rd September. In this blog post, I’d like to try to summarise the discussion around comparing NILM algorithms.

At present, it is very hard (if not impossible) to objectively compare
any two NILM algorithms. This is true for both academia and industry.
The problem is that each research paper tends to use a different
dataset, different metrics, different appliances etc. The situation
improved considerably in 2011 with the release of the REDD
dataset. But we are still some distance from
being able to directly compare the performance of any pair of NILM
algorithms.

Dominik Egarter raised the point: How do we compare NILM algorithms
with fundamentally different assumptions and inputs? For example,
say algorithm A requires that the user list every appliance in the
home but algorithm B requires no information from the user.
Algorithm A gives an accuracy of 85% whilst algorithm B gives an
accuracy of 75%. Which is better? Should the algorithm which
requires more information be penalised in some way? Is it even fair
to directly compare them? Should we define a set of ‘NILM algorithm
classes’ and only compare algorithms within their own class? We
could come up with a set of ‘NILM algorithm classes’ by considering
specific scenarios and use-cases. For example, most domestic users
probably won’t be bothered to enumerate every appliance in their
home, so we could have a ‘zero user input’ class (which does not
necessarily mean ‘unsupervised’ in the machine learning sense
because the system could access generic appliance models trained
from, say, the public datasets).
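
To make the ‘algorithm classes’ idea a little more concrete, here is a minimal Python sketch. Everything in it (the class names, the example algorithms and their accuracies) is hypothetical, invented purely to illustrate tagging algorithms by the user input they require and only ranking within a class:

```python
# A hypothetical sketch of the 'NILM algorithm classes' idea. The class
# names, algorithms and accuracies below are invented for illustration.
from dataclasses import dataclass
from enum import Enum
from itertools import groupby


class UserInputClass(Enum):
    ZERO_USER_INPUT = 0       # no information required from the user
    APPLIANCE_LIST = 1        # user must enumerate every appliance
    LABELLED_SUBMETERING = 2  # user supplies sub-metered training data


@dataclass
class Algorithm:
    name: str
    input_class: UserInputClass
    accuracy: float  # on some agreed metric and dataset


algorithms = [
    Algorithm("A", UserInputClass.APPLIANCE_LIST, 0.85),
    Algorithm("B", UserInputClass.ZERO_USER_INPUT, 0.75),
]

# Rank algorithms only against others in the same class:
by_class = sorted(algorithms, key=lambda a: a.input_class.value)
for input_class, algos in groupby(by_class, key=lambda a: a.input_class):
    ranked = sorted(algos, key=lambda a: a.accuracy, reverse=True)
    print(input_class.name, [(a.name, a.accuracy) for a in ranked])
```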

Companies offering NILM are focussed on delivering a service which
satisfies their particular users’ needs. They might see
very little value in having a global ‘leader board’ of performance.
For example, when you hire a builder to modify your house, you don’t
consult some regional league table of builders. Instead you find the
local builder who can offer you everything you need, and you really
don’t care if they are a few percentage points behind some other
local builders on some particular metric.

Which metric(s) to use to compare NILM algorithms? We probably need
to use multiple metrics, because no single metric captures
everything we care about.
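
As a toy illustration (all the numbers are my own invention, not anything proposed at the workshop), here is a sketch in which the same predictions score well on on/off accuracy and F1 but mis-assign a large fraction of the appliance’s energy:

```python
# A toy illustration (all numbers invented) of why one metric is not
# enough: the same predictions can score well on on/off accuracy while
# mis-assigning a large fraction of the appliance's energy.
import numpy as np

rng = np.random.default_rng(0)

# Ground truth for one 2 kW appliance over 1000 samples, on ~10% of the time.
truth = np.where(rng.random(1000) < 0.1, 2000.0, 0.0)  # watts

# Prediction: finds every on-event but under-estimates its power,
# plus a few spurious 2 kW events.
pred = np.where(truth > 0, 1200.0, 0.0)
pred[rng.random(1000) < 0.02] = 2000.0

on_truth = truth > 10  # on/off state
on_pred = pred > 10

# Metric 1: plain accuracy of the on/off state.
accuracy = np.mean(on_truth == on_pred)

# Metric 2: F1 score of the 'on' class (robust to class imbalance).
tp = np.sum(on_pred & on_truth)
precision = tp / max(np.sum(on_pred), 1)
recall = tp / max(np.sum(on_truth), 1)
f1 = 2 * precision * recall / (precision + recall)

# Metric 3: relative error in total energy assigned to the appliance.
energy_error = abs(pred.sum() - truth.sum()) / truth.sum()

print(f"accuracy={accuracy:.2f}  f1={f1:.2f}  energy error={energy_error:.0%}")
```

On this toy data, the state-based metrics look excellent even though a substantial slice of the appliance’s energy is mis-assigned; each metric answers a different question.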

Do we need to spend lots of money collecting a ‘validation dataset’?
The idea being that, if we are trying to validate commercial NILM
services, then we probably need to keep the test data private (so
people don’t cheat!). But collecting a large dataset is
very expensive. If companies are not interested in a 3rd party NILM
validation tool then perhaps we do not need to bother to collect a
new dataset. Instead, if only academics are interested in competing
on a public ‘leader board’ then we probably don’t need a private
dataset, especially if academics are encouraged to release
their code. Computer Vision competitions like the ImageNet Large
Scale Visual Recognition
Challenge use public
data (I think).

However, companies might well be interested in privately assessing
how well their algorithms perform relative to some benchmark (and/or
the academic state of the art).

In terms of ‘benchmarks’, it might be nice to explore how each
metric responds to ‘naive’ approaches. For example, some metrics
will give surprisingly high ‘marks’ if you just predict that all
appliances are ‘off’ all the time! Another naive baseline would be a
simple ‘simulation’ which just samples from the probability density
function of each appliance’s activity for each time of day.
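
Here is a minimal sketch of the first of those naive baselines, assuming a toy appliance that is on 5% of the time (the numbers are made up for illustration):

```python
# A minimal sketch of the 'always off' baseline, assuming a toy appliance
# that is on 5% of the time (numbers invented for illustration).
import numpy as np

rng = np.random.default_rng(42)
on_truth = rng.random(10_000) < 0.05   # appliance on 5% of the time
always_off = np.zeros_like(on_truth)   # predict 'off' everywhere

accuracy = np.mean(always_off == on_truth)
recall = np.sum(always_off & on_truth) / np.sum(on_truth)

print(f"accuracy={accuracy:.1%}")      # ~95% 'marks' for doing nothing
print(f"recall of 'on'={recall:.1%}")  # 0%: it never detects the appliance
```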

(I didn’t take notes at the meeting so I have probably forgotten some
points. Please add anything I’ve missed / garbled to the comments!)
