Social Network Signature for Entity Resolution

In the lazyweb department, I had an idea the other day that I thought I’d put out more broadly (lest someone else have the same thought, plus the thought to patent it). And that is the idea that one side-effect of the “social graph” is to create a unique identity signature. Who my friends are can be used for entity resolution. (Background: one problem in identity is figuring out whether two people who have the same name are in fact the same person. This is complicated by variations on the name. So there’s a whole set of questions: is T. O’Reilly the same as Tim O’Reilly the same as Timothy F. O’Reilly? Are you referring to my brother Sean O’Reilly or my father Sean O’Reilly? And when you find a reference to Tim O’Reilly, is it the Tim O’Reilly who blogs here, or the Sydney musician who also has a Wikipedia entry, or one of the hundreds or thousands of others who have the same name?)

Typically, you resolve identity conflicts by adding additional information: a phone number, an address, a social security number.

Now, clearly, there are far more cases where you might have easier access to this kind of real-world information than you would have access to someone’s friends list from Facebook. But it’s also true that a site like Facebook could offer an identity service by which they present a unique hash of someone’s friends list at a particular point in time as a unique credential that doesn’t actually require disclosure of any confidential information.
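To make the idea concrete, here is a minimal sketch of what "a unique hash of someone's friends list at a particular point in time" could look like. Everything in it is an assumption for illustration: the function name, the use of SHA-256, the friend identifiers, and the timestamp format are hypothetical and do not describe any actual Facebook service.

```python
import hashlib

def friends_list_signature(friend_ids, as_of):
    """Hash a snapshot of a friends list into a fixed-length signature.

    friend_ids: iterable of opaque friend identifiers (hypothetical values).
    as_of: string naming the point in time of the snapshot (hypothetical format).
    """
    # Sort and de-duplicate so the same set of friends always gives the same hash,
    # regardless of the order in which the list was retrieved.
    canonical = "\n".join(sorted(set(friend_ids)))
    payload = f"{as_of}\n{canonical}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

snapshot_time = "2024-01-01T00:00:00Z"  # illustrative timestamp only
sig_a = friends_list_signature(["alice", "bob", "carol"], snapshot_time)
sig_b = friends_list_signature(["carol", "alice", "bob"], snapshot_time)
sig_c = friends_list_signature(["alice", "bob"], snapshot_time)
```

In this sketch `sig_a` and `sig_b` match because the list is canonicalized before hashing, while `sig_c` differs because one friend is missing. The signature reveals nothing directly readable about the list, although a small or guessable list could still be recovered by someone willing to hash candidate lists.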

Of course, this is just a special case of a much broader situation, namely that our “identity” is in fact a function of everything we show to the world. Mechanisms might generate credentials by hashing our purchase history at Amazon, our search history at Google, or our surfing patterns in Firefox as easily as they could hash our social network. But the point is that it is possible to generate credentials that are as unique as fingerprints.

It would be kind of cool not to have to enter passwords, but for a site to “recognize” me because I was able to present a hash of my past interactions with the site, automatically recorded by both the site and my browser.

You could think of this as a kind of public key cryptography. Your private key would be the timestamp at which the hash was created, or the start and end times of the window that was used to create it. Your public key would be the hash itself.

I have been mulling the same concept in relation to standalone RIAs: passing authentication between client and server, how it could minimize storage and subsequent transfer of data on the server (for example, using the client for encrypted storage of purchase histories, rather than having such information accessible only via storage on the server), how it can be used to let users control their own data, etc. Beyond that, how it could be used to establish authentication between the server and a separate payment service gateway, and then between the payment service gateway and the client.

I remember hearing about this sort of approach for entity resolution (wasn’t called that) back at university 10 years ago… The issue arose back then when merging databases to form data warehouses, where the various people inside the databases had a variety of information about them in a variety of schemas – it was referred to as an evidence based approach.

Interesting datapoint: by the looks of things, evidence-based approaches have grown in popularity in the medical field. That matches what’s happened with text mining… Which raises the question: what else is waiting to re-emerge from the medical field? :-)

Mark

Check out Joshua Schachter’s (del.icio.us) use of Bloom filters in LOAF:

Very cool. Here’s a related generalization: as individuals move more of their private data to various services, privacy becomes an increasing concern. I’m exploring extending public key cryptography so that the user maintains control and rights over usage of their data, even while it is useful in all the good Web 2.0 ways. The signature of the user’s relationships with various services is also a graph, so similar to your social network signature, we have a provider network signature. Maybe we can simplify identity and ensure appropriate privacy at the same time.

Very interesting and definitely something worth implementing in some form or other. My fear is that there will be a whole bunch of people who are so far removed from such technicalities, but sufficiently active on the web, that we have to consider how to make any such service more accessible. If our goal is to make the web simpler for everyone, we need to abstract out some of the more complex technical details, otherwise we will continue to have a large chunk of the web completely removed from some of the more useful work being done today.

Nice. Made me think what else a social network signature would be useful for.

Could you spot which directions your employees are going? There were certainly visible changes in my public contacts and public contributions when I decided to switch from engineering to engineering management and then again when I decided to leave.

I guess I’m really thinking of developing signatures for profiling. I’m not ready to advocate for them, but it’s interesting to think about how they might be used (misused).

I had a friend recently who had a serious bout of depression that included completely withdrawing from his friends. There was a definite social signature there. Of course, I had the exact same signature for the first six months of building CrowdVine.

npdoty

When I was setting up a brokerage account recently over the phone, an automated system asked me a series of questions, informing me that all answers were publicly available. I was shocked, thinking that if it’s publicly available, how was it going to distinguish me from someone pretending to be me? But I was impressed by the effective series of questions (in which of these four towns have you owned property? in which of these four counties have you lived?).

Anyway, one of the questions asked me to identify the person I was associated with in a list of people’s names. So there are businesses taking advantage of your knowledge of friends to identify you. I’m not sure if the proliferation of social networks makes this more vulnerable to attack or easier to implement, but I really like the idea of using publicly available information of this kind as an identifier.

Here is what we are doing. We are going to tokenize all your identities one at a time (like SHA(phone:5715551212), SHA(facebook:12312312), etc). You then upload those identity tokens to an “identity service” which provides you with a unique id as a collapsed identity. Then other people upload their hashed identity tokens. Then you get the list of your Facebook friends’ identities, tokenize them yourself, and query the identity service for the unique id behind each of these tokens. You can then collapse each friend’s identities into a unique id through this service.

Thus you will now know that the Rich Kilmer in your address book with X number is the same as the friend Rich Kilmer on Facebook, etc, all without anyone having to expose their identities. Of course the best place to do this is from your PC and, of course, this is one of the things we are building right now.
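The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the comment does not say which SHA variant is used or how the service merges tokens, so SHA-256, the `IdentityService` class, and its `register` and `lookup` methods are all hypothetical names invented for this example.

```python
import hashlib
import uuid

def tokenize(kind, value):
    """Hash one identifier on the client, e.g. tokenize("phone", "5715551212")."""
    return hashlib.sha256(f"{kind}:{value}".encode("utf-8")).hexdigest()

class IdentityService:
    """Toy in-memory service (hypothetical) that only ever sees hashed tokens."""

    def __init__(self):
        self._token_to_id = {}

    def register(self, tokens):
        """Link a set of tokens to one unique id.

        If any token is already known, its id is reused and any other ids
        touched by this upload are merged into it.
        """
        known = {self._token_to_id[t] for t in tokens if t in self._token_to_id}
        unique_id = min(known) if known else str(uuid.uuid4())
        # Re-point tokens that belonged to any of the merged ids.
        for token, existing in list(self._token_to_id.items()):
            if existing in known:
                self._token_to_id[token] = unique_id
        for token in tokens:
            self._token_to_id[token] = unique_id
        return unique_id

    def lookup(self, token):
        """Return the unique id for a token, or None if it was never uploaded."""
        return self._token_to_id.get(token)

service = IdentityService()
# The owner of these identities uploads their own tokens.
rich_id = service.register([
    tokenize("phone", "5715551212"),
    tokenize("facebook", "12312312"),
])
# A friend tokenizes what they already know and asks whether the two match.
phone_id = service.lookup(tokenize("phone", "5715551212"))
facebook_id = service.lookup(tokenize("facebook", "12312312"))
unknown_id = service.lookup(tokenize("phone", "0000000000"))
```

Here `phone_id` and `facebook_id` come back equal, so the address book entry and the Facebook friend collapse into one person, while the service itself stores only hashes.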

Rich: There is a similar effort in microformats through MicroID – which may be what you were describing :)

Tim: In terms of entity resolution, a good existing means for a base vocabulary is to use Wikipedia topics and categories. That provides good mechanisms for disambiguation. Anywho, our firm uses Wikipedia URLs for entity resolution in the apophony API, which analyzes dialog in social media. This includes some social graph analysis (not shown in API notes yet).

Combining those three notions – microformats, Wikipedia vocabulary, cross-platform social graph – could provide good steps toward the kind of signatures you’re suggesting.

The only reason to hash private information, such as a friends list, is if later you will again compute a hash of a friends list and then compare the two hash values. The only purpose of the hash is to foreclose the option of repudiation: Was this your friends list? Yes? Then obviously that hash was your ID.

Where, in what you are imagining, is that second hash computation and comparison? What’s it for? Will someone build a big dictionary mapping hash keys to all observed friends lists?

If there’s no need for that second hash then you’re better off just using random numbers.

So, I’m not too clear on the intent of the idea.

There is a good part to the idea: that the user should speak to some privately operated software (such as a browser but there are other possibilities) and that privately operated software should be an “identity broker” with services. The cookie mechanism in browsers is a crude form of this but it could be much more sophisticated, given technology that makes it easier to “compose” web services at a deep level (rather than just “on the page,” mash-up style).

-t

Rich Kilmer

Paco: The point of what we are doing is collapsing a set of identity tokens into one UUID. So you will be able to ask the service:

What that means is the AIM user (richsaimaddress) is the same as the email (rich@foobar.com). The server in the middle never sees rich@foobar.com, so it has no identifying information. This allows me to collapse the varying identities I know you through into a single UUID and allows you to do the same for me (assuming we are friends on Facebook, on each other’s buddy lists, have each other in the address book, etc). Enabling this collapse means that I can collapse the address book data + Facebook data + AIM data, etc that I have access to into a unified view of you, all on the edge.

The time factor could trip up any hashes of attributions pertaining to the individual. If the facet in question is dynamic and evolving, the hash has to be resilient enough to accommodate changes. Consequently if anything like identity disambiguation is based on this scheme, the evolving attributions should be a basic assumption.
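One way to picture a signature that tolerates change is a MinHash-style sketch, where two snapshots are compared by similarity rather than by exact equality. This is a swapped-in technique offered as an assumption, not something proposed in the post; the function names and the sample friend identifiers below are hypothetical.

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the smallest value seen.

    items must be a non-empty collection of strings.
    """
    signature = []
    for seed in range(num_hashes):
        smallest = min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        )
        signature.append(smallest)
    return signature

def similarity(sig_a, sig_b):
    """Fraction of matching slots, which estimates the overlap of the two sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

before = [f"friend{i}" for i in range(50)]
after = before[:48] + ["newfriend1", "newfriend2"]  # two friends replaced over time
stranger = [f"other{i}" for i in range(50)]         # a completely different person

sim_same = similarity(minhash_signature(before), minhash_signature(after))
sim_diff = similarity(minhash_signature(before), minhash_signature(stranger))
```

A plain hash of the list would change completely after a single new friend, whereas here `sim_same` stays high and `sim_diff` stays near zero, so a matcher could accept signatures above some threshold instead of demanding equality.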

I think this is similar to the concept of attention profile, with the difference that hashing reduces everything to a single token.

As you might expect, there’s a lot of research already being done in the areas of entity resolution and antialiasing for use by the Law Enforcement and Intelligence communities. Social networking sites have added a rich layer of information to what used to be a much harder challenge.

The only missing component is “ease of use.” Plus the classic problems with key maintenance.

Greg Elin

I always wondered about this kind of approach for authenticating people or other things face-to-face by strangers. The post describes an approach of entity resolution via a common web of relationships. But the use need not be restricted to matching two entities. Take, for example, the assignment of a University ID to a student. How does the staff person issuing a new (or replacement) ID verify that the person is who they claim to be? Verification is usually via a token, like a driver’s license, that the office staff has no real way of authenticating either. Why can’t the token be public information about the individual that they could not easily fake? If the University knows your blog page and your photo is on your blog, isn’t that a better token, and more difficult to fake, than a driver’s license? (Or why not store the original ID photos for reference?)