Attack Resistant Trust Metric Metadata HOWTO

3 Jul 2002: Advogato
is now running an implementation of the generic metadata engine
described below. The code is available in the mod_virgule
CVS, hosted on casper.

Please note that PageRank is covered by US Patent 6,285,999. The implementation linked below is for
research only. Seriously.

1 Apr 2002

Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats.

-- Howard Aiken
quoted by Ken Iverson quoted by Jim Horning, 1979

I've done a fair amount of thinking about attack-resistant trust
metrics as part of my PhD research. I did a testbed implementation,
Advogato, and now feel that the
potential applications for this technology are much wider. I'm very
busy with real work, though, and much as I would love to see my newer
ideas implemented, I don't have time right now to do it myself.

Hence, I am writing this document to entice others with community
websites into implementing some of the ideas, especially the ones
having to do with generalized metadata.

Background

You don't have to be a mathematician or trust researcher to
do a good job implementing a trust metric. However, understanding the
basic ideas really helps.

The best writeup of Advogato's existing trust metric is fc.ps, my draft submission to
the FC '00 conference. This describes the trust metric implemented in
Advogato in fairly good detail.

For my newer thinking, my thesis-in-progress is the
best bet. I am trying to make the writing more accessible than your
average PhD thesis, so please don't be too intimidated. The metadata
work is in Chapter 6.

I can also heartily recommend the original PageRank
paper. It's fairly easy to read, and PageRank itself kicks ass.

For insight into actual implementations of trust metrics, I've
split out testbed implementations of both Advogato's trust metric and
PageRank into a small tarball. This
testbed should also be fairly fun to play with.

You have to understand basic graph theory: nodes and edges,
predecessor and successor sets, indegree and outdegree. If your
interest is implementing a trust metric, rather than trying to analyze
it, you don't have to understand network flows, random walks,
or eigenvectors. Of course, it never hurts to understand more.

User-visible aspects

The core of any trust metric is peer certifications. In
a community website context, these are assertions by users that
other users are c00l d00ds. In the graph model, each user is a node,
and a peer certification by A that B is c00l is an edge from A to B.
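As a concrete illustration, the graph model is easy to sketch in code. This is a hypothetical representation (the user names and data structure are invented, not Advogato's actual storage): each certification "A certifies B" is a directed edge from A to B, so a user's successor set and outdegree fall out directly.

```python
# Invented sketch of a trust graph: each entry maps a user to the set of
# users they have certified (their out-edges).
certs = {
    "alice": {"bob", "carol"},   # alice certifies bob and carol
    "bob":   {"carol"},
    "carol": set(),              # carol has issued no certifications
}

def successors(user):
    """The users that `user` has certified (targets of its out-edges)."""
    return certs.get(user, set())

def outdegree(user):
    return len(successors(user))

print(outdegree("alice"))  # 2
```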

In Advogato, these peer certifications are the sole input to the
trust metric. The output is a simple yes/no for each user. (Actually,
Advogato implements three levels, but these are just three runs of the
underlying algorithm). For the purposes of Advogato, that works pretty
well, but here I'm going to concentrate on the problem of generalized
metadata.

In this context, the user input consists of "assertions" in
addition to the peer certifications. It's entirely up to you to
determine the schema for these assertions. I'll give some concrete
examples to get you started.

Songs

One form of assertion is: "song X is Y out of 10". A key concept is
whether two assertions are mutually inconsistent: "X is 4 out of 10"
and "X is 7 out of 10" match the same template and are inconsistent,
whereas "X is 4 out of 10" and "Y is 7 out of 10" don't match and are
perfectly consistent.
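The template-matching idea can be made concrete with a small sketch. Here I assume each assertion is a (song_id, rating) pair; that representation is purely illustrative, not something the document prescribes.

```python
# Sketch of the mutual-inconsistency check, assuming each assertion is a
# (song_id, rating) pair.
def same_template(a, b):
    """Two "song X is Y out of 10" assertions share a template when they
    rate the same song."""
    return a[0] == b[0]

def inconsistent(a, b):
    # Same song but different ratings: mutually inconsistent.
    return same_template(a, b) and a[1] != b[1]

print(inconsistent(("X", 4), ("X", 7)))  # True
print(inconsistent(("X", 4), ("Y", 7)))  # False
```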

The big challenge here is identifying songs. Should "X" be an
ASCII string containing the name of the song? A cryptographic hash of
the .ogg file of the song? A URL where the song can be downloaded? All
of these are reasonable, and have their own tradeoffs.

Note that the links between these identifiers can also be metadata
assertions. Thus, all of these are reasonable:

Song hashing to 01234567ABCDEF is "Smoke on the Water"

Song hashing to 01234567ABCDEF can be downloaded from http://mp3server/...

"Smoke on the Water" can be downloaded from http://mp3server/...

"Smoke on the Water" has hash 01234567ABCDEF

http://mp3server/... is "Smoke on the Water"

http://mp3server/... hashes to 01234567ABCDEF

Note that the 2nd and 6th can be more-or-less automatically
determined, so probably don't need manual entry.
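The automatic determination is just a matter of hashing whatever the server hands back. A minimal sketch, with SHA-1 as an arbitrary choice of hash (the document doesn't specify one):

```python
# Once a file has been fetched from http://mp3server/..., its hash can be
# computed on the spot, so the "URL hashes to H" assertion needs no manual
# entry. SHA-1 here is an arbitrary, illustrative choice.
import hashlib

def song_hash(data: bytes) -> str:
    """Return a hex digest of the raw song file, usable as its identifier."""
    return hashlib.sha1(data).hexdigest()

print(song_hash(b"abc"))  # a9993e364706816aba3e25717850c26c9cd0d89d
```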

Other applications

There are plenty of other applications. Writing them up remains a
TODO.

Computing the trust metric

There is a very large space of trust metrics you could compute. Here,
I'll narrow it down to just one. This one is appropriate for
implementing on a community Web server. It's also eigenvector-based,
like PageRank. The rest of the space is worth exploring, but probably
would take more foo.

The goal of the algorithm is to compute a "confidence value" for each
user and each metadata assertion. That tells you right there where the
scaling problems are: the product of the number of users and the number
of assertions can't grow too large. If the number of users is small, or
the number of assertions is small, you're probably ok.

The algorithm starts with a crude approximation of these confidence
values, then refines them iteratively. Actually, it doesn't matter
what approximation you start with; the iteration converges to the same
result in the end. The number of iterations needed depends on the exact
data, but high dozens to low hundreds sounds right for typical usage.

Now for a little notation. Let R[i,j] be the confidence that user i
has in assertion j. For efficient implementation, it probably helps
for i and j to be numbers, but that's not absolutely required.

In the "step", we compute a new R'[i,j] based on the old R[i,j], the
trust graph, and the assertions local to each user i.

For each i:

Find the successors of i (ie, the people i has issued a peer
certificate for). Determine i's outdegree (ie the number of nodes in
this set). You probably want to reject i->i edges at this point.

For each assertion j:

If i has a local assertion matching j, set R'[i,j] to 1 if it's an
exact match, or 0 if it merely matches the template (ie, i's local
assertion is inconsistent with j).
Otherwise, compute the average of R[s,j] over all successors s of i
and multiply by a "damping factor", typically 0.85; this is the new
value of R'[i,j]. If i has no successors, use zero.

Optionally, divide each R'[i,j] by the sum over j of R'[i,j] (if the
sum is zero, use zero). This normalization step reduces the influence
of people who issue lots and lots of local assertions. Whether or not
it's a good idea isn't obvious, and probably depends on the specifics
of the application.

Now, copy R'[i,j] to R[i,j] for all i, j.

After on the order of 100 iterations of this step, R[i,j] contains the
confidence value for assertion j by user i. You can then display this
information in user i's custom home page. Obviously, the details on
how to present this information are up to you.
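The steps above can be sketched directly in Python. The users, assertions, and local-assertion table here are all invented for illustration (and real deployments would want the C or NumPy formulations discussed below), but the loop structure is exactly the step just described: pin local assertions to 1 or 0, otherwise take the damped average over successors.

```python
# Minimal sketch of the iteration, on made-up data. Assertions are
# (song_id, rating) pairs; two ratings of the same song are inconsistent.
DAMPING = 0.85

certs = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": []}
users = list(certs)
assertions = [("song1", 4), ("song1", 7)]

# local[i]: the assertions user i entered themselves
local = {"alice": [("song1", 4)], "bob": [], "carol": [("song1", 7)]}

def same_template(a, b):
    return a[0] == b[0]

def step(R):
    Rp = {}
    for i in users:
        succ = [s for s in certs[i] if s != i]   # reject i->i edges
        for j in assertions:
            mine = [a for a in local[i] if same_template(a, j)]
            if mine:
                # Exact match -> 1; template-only match -> 0.
                Rp[i, j] = 1.0 if j in mine else 0.0
            elif succ:
                # Damped average over successors.
                Rp[i, j] = DAMPING * sum(R[s, j] for s in succ) / len(succ)
            else:
                Rp[i, j] = 0.0
    return Rp

R = {(i, j): 0.0 for i in users for j in assertions}
for _ in range(100):
    R = step(R)

print(R["bob", ("song1", 4)])  # 0.425
```

On this tiny graph, bob (who entered no rating himself) ends up with confidence 0.85 * (1 + 0) / 2 = 0.425 in each of the two conflicting ratings his successors hold.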

Some practical issues

You will care about the performance of this computation. Google has
a massive cluster to compute their PageRank algorithm over a graph
of roughly 3 billion nodes and 20 billion edges. Advogato uses a much
smaller graph (a few thousand nodes) but has its own issues.

Generally, here are the parts you need to worry about:

1. Storing the peer certifications and local assertions.

2. Retrieving these into RAM.

3. Computing the trust metric.

4. Storing the results.

5. Retrieving the results for presentation.

The implementation of Advogato has some lessons. The big performance
bottlenecks were (2) and (5). I expected the trust metric computation
itself to dominate, but in fact it's trivial - about 180 ms to compute
over the Advogato graph.

Advogato stores an XML file (called "profile.xml") for each user. This
file contains the peer certifications and other info. Thus, step (2)
requires reading thousands of XML files. On a cold cache, this takes
about a minute. The trust graph, utterly uncompressed, is about 2
megabytes. I estimate that very simpleminded compression could bring
it down to 200k or so. bzip2 gets it down to 130k. Obviously, if the
trust graph were maintained in a single file, the step of reading it
into memory would be much faster. On the other hand, it would
complicate the step of updating the certs, and I'd also be concerned
about fragility.

Advogato also used to store the results in an XML file. This file is
read into RAM before rendering any page - it is used, among
other things, to assign colors to user ids. In XML format, which is
fairly verbose, it took a good fraction of a second and a megabyte to
do this. Since Advogato serves on the order of one page per second on
average, this was quite a bottleneck. Now the tmetric results are
written out in a simple ASCII format, and no longer represent a
performance bottleneck.
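The actual mod_virgule format isn't reproduced here, but the idea is simply one flat "name value" line per user, which loads far faster than verbose per-user XML. A hypothetical sketch:

```python
# Illustrative flat results format: one "name value" line per user.
# The real mod_virgule format may differ; this just shows the idea.
import os
import tempfile

def write_results(path, results):
    with open(path, "w") as f:
        for user, conf in sorted(results.items()):
            f.write(f"{user} {conf}\n")

def read_results(path):
    out = {}
    with open(path) as f:
        for line in f:
            user, conf = line.split()
            out[user] = float(conf)
    return out

path = os.path.join(tempfile.gettempdir(), "tmetric-results.txt")
write_results(path, {"alice": 1.0, "bob": 0.425})
print(read_results(path))  # {'alice': 1.0, 'bob': 0.425}
```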

The Advogato trust metric itself is written in C, which helps with the
speed. Writing the inner loops in an interpreted language like Python
could be a problem - I've noticed it's about two orders of magnitude
slower than optimized C for basic array manipulation.

In the case of Python, the NumPy extension may offer a happy medium
between ease of programming and performance. A vectorized formulation
of the trust metric algorithm may be more efficient in this case,
because the inner loops become straightforward vector adds and scales.
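One way to vectorize (a sketch under my own assumptions, with made-up data): store R as an (n_users x n_assertions) array, so one step becomes a matrix product with the row-normalized adjacency matrix followed by overwriting the locally asserted entries.

```python
# Vectorized sketch of the step: damped successor-averaging is a matrix
# product; local assertions are pinned afterward. All data is invented.
import numpy as np

damping = 0.85
A = np.array([[0, 1, 0],    # adjacency: row i certifies column s
              [1, 0, 1],
              [0, 0, 0]], dtype=float)
deg = A.sum(axis=1, keepdims=True)
# Row-normalize, leaving zero rows for users with no successors.
M = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)

# Which (user, assertion) entries are pinned by a local assertion,
# and to what value (1 = exact match, 0 = template-only match).
local_mask = np.array([[True, True], [False, False], [True, True]])
local_vals = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

R = np.zeros((3, 2))
for _ in range(100):
    R = damping * (M @ R)                    # damped average over successors
    R[local_mask] = local_vals[local_mask]   # local assertions stay fixed

print(R[1])  # user 1's confidences in the two assertions
```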

Another optimization

If your assertions are of the form "X is ranked Y out of 10", then you
have 10 possible assertions for each X. If you're only going to report
the mean value of these assertions, you can reduce the storage to 2
values per X. For each X, let R[i,X0] be the total confidence value for
i's ranking of X. Let R[i,X1] be the mean ranking scaled by R[i,X0]. If
i has a local ranking of X, then R[i,X0] is 1, and R[i,X1] is that
ranking.

To report the rankings, R[i,X0] is the confidence value, and
R[i,X1]/R[i,X0] is the mean ranking.

For the rocket scientists, this is basically a first-moment transform.
If you care about mean and standard deviation, I believe a
second-moment transform will do the trick. If this confuses you, don't
worry; it's not really needed.

Copyright and patent

All my ideas and algorithms on trust metrics are in the public domain,
and will remain so. I tend to release my trust metric code under GPL,
but if you're seriously working on a system with a different license
and would find the code useful, I'm open to releasing it under a
different license.

There may be others with patents on trust metrics. In particular, Google
is probably getting patents on PageRank. The trust metric presented here
is fairly different from PageRank, so there's a good chance it can be
implemented freely. In any case, if you're implementing trust metrics
for research purposes, patents don't apply.

What next?

The next step, should you choose to accept it, is up to you. If you
really want to implement a trust metric, and have trouble
understanding this HOWTO, let me know, and I'll try to flesh it out.