Onionoo feature: family ID

Description

Whereas each relay is uniquely identified (by its fingerprint), families do not have unique identifiers. We want to change that. As it is currently unclear whether family ids will be implemented (prop 242), we (Virgil Griffith and Sean Saito) propose the stopgap solution that, at least within Onionoo, each family is assigned a unique positive integer (UUID) derived from the "first_seen" field of the oldest relay within its "effective_family" (symmetric family). If two families have exactly the same oldest "first_seen" field, the families are assigned UUIDs based on asciibetical order of the oldest relay's fingerprint.

How it works
To compute the UUIDs, Onionoo would iterate over all effective families and record the earliest "first_seen" value within each effective family. The effective families are then sorted by their first_seen value (ties broken by asciibetical order), and assigned an index (UUID) starting from 1. This UUID number will be invariant under relays leaving / being added to the family. When a new effective_family is discovered it is assigned the next UUID.

Effective families
Relays of an effective family are defined as a set of relays where every relay has a
symmetric connection to every other relay, or, in other words, every relay points to every other relay. Work is currently being done (ticket #16276) to resolve asymmetric connections between relays (where one relay points to a relay but that relay does not point back).

Finally, upon receiving a request with a specific relay's fingerprint, Onionoo would return:

The symmetric family, a.k.a. the "effective family" of that relay. If there is none returns an empty list.

Before we discuss possible implementations of such a family ID, which is certainly algorithmically fun, I'd first want to talk more about the intended use case. What do you need a unique family ID for? Onionoo already has a family parameter that accepts a fingerprints and returns that relay plus all relays in that relay's family that also list that relay in their family. What you could do in your Roster application is accept any relay fingerprint of a user's relay family and simply hand that fingerprint over to Onionoo. Then you wouldn't have to introduce the concept of family IDs at all.

Now to a possible implementation of such a family ID, assuming there's indeed a valid use case for it: I don't think we can implement the UUID schema you describe. One issue is that two Onionoo instances, one of which having intermittend network problems, may have a slightly different view on the Tor network and therefore assign UUIDs differently. And what if a relay that is around for years is suddenly reconfigured to join a family? And more generally, what if two families are merged, or if a family falls apart? I'm afraid we'd find many more edge cases where the proposed numbering schema would fail as soon as we start implementing it. The concept seems too fragile to me.

There's also a problem with the definition of "effective family". You write that "Relays of an effective family are defined as a set of relays where every relay has a symmetric connection to every other relay, or, in other words, every relay points to every other relay." But that's not the case right now. Consider a relay A with family A, B, C. If B's family is A, B, and C's family is A, C, we'd right now say that A's effective family is A, B, C. Maybe that means that our current definition of "effective family" is flawed.

Here's an alternative idea which is, admittedly, not fleshed out yet. Can we combine relay fingerprints of family members somehow into a family fingerprint and use that to identify the family? Of course, that family fingerprint would change once family members leave or new members join, but maybe we can use some kind of distance metric and identify the family that is closest to that fingerprint: "The family fingerprint XY is unknown, but there are two families with similar fingerprints XZ and WY." I could trying out bit-wise operations or probabilistic data structures for family fingerprints and then using something like Hamming distance for the comparison.

Another option to consider here is to just go ahead and implement proposal 242. This might be a good opportunity to do the research and the coding, and it's not only Roster/Onionoo users that would benefit from that, but all relay family operators.

A relay might be part of multiple families, so it would have to return 0 to N family IDs?

The issue is that asymmetric connections (where relay A points to relay B but not the other way around) do exist. This does not mean that a relay belongs to multiple families, rather, based on some test scripts I've run, the operator seems to have forgotten to refer a new relay to all its family members.

Once this is implemented (thanks to karsten), an effective family will essentially be a graph theoretical perfect graph, where every pair of nodes has a symmetric connection. This would preclude any chances of a relay to be connected to two separate families, since an effective family can then be verified using the handshake theorem.

A relay might be part of multiple families, so it would have to return 0 to N family IDs?

The issue is that asymmetric connections (where relay A points to relay B but not the other way around) do exist. This does not mean that a relay belongs to multiple families, rather, based on some test scripts I've run, the operator seems to have forgotten to refer a new relay to all its family members.

I'm not referring to asymmetrically declared families (which I do not consider families at all) but multiple overlapping families that share some of their relays.

A relay might be part of multiple families, so it would have to return 0 to N family IDs?

The issue is that asymmetric connections (where relay A points to relay B but not the other way around) do exist. This does not mean that a relay belongs to multiple families, rather, based on some test scripts I've run, the operator seems to have forgotten to refer a new relay to all its family members.

Once this is implemented (thanks to karsten), an effective family will essentially be a graph theoretical perfect graph, where every pair of nodes has a symmetric connection. This would preclude any chances of a relay to be connected to two separate families, since an effective family can then be verified using the handshake theorem.

I'm afraid there's some confusion: #16276 adds the list of relays that are in a mutual family relationship with a given relay, but that doesn't mean that other members in that family must be in a mutual family relationship with each other.

I think what you envision is a "perfect family" where all members are in mutual family relationships with each other. But that's not what will be implemented.

Let me suggest something else: rather than hoping that I'll implement the right thing in Onionoo, let's first perform an analysis that takes, say, all Tor descriptors published in 2015 as input and computes families and family identifiers and evaluates how useful they are for your purpose. And once we come up with definitions that work well for you, I'll implement them in Onionoo. What do you think?

Let me suggest something else: rather than hoping that I'll implement the right thing in Onionoo, let's first perform an analysis that takes, say, all Tor descriptors published in 2015 as input and computes families and family identifiers and evaluates how useful they are for your purpose. And once we come up with definitions that work well for you, I'll implement them in Onionoo. What do you think?

Agreed. I'll devise a nice family definition for Roster's purposes and when those are settled will resume this discussion.

I'm afraid there's some confusion: #16276 adds the list of relays that are in a mutual family relationship with a given relay, but that doesn't mean that other members in that family must be in a mutual family relationship with each other.

I think what you envision is a "perfect family" where all members are in mutual family relationships with each other. But that's not what will be implemented.

I understand now. I apologize for the confusion.

Let me suggest something else: rather than hoping that I'll implement the right thing in Onionoo, let's first perform an analysis that takes, say, all Tor descriptors published in 2015 as input and computes families and family identifiers and evaluates how useful they are for your purpose. And once we come up with definitions that work well for you, I'll implement them in Onionoo. What do you think?