Alice has a set $S$ of words. Bob has a set $T$ of words. They want to compute the intersection $S \cap T$ of their words, with the help of a semi-trusted third party Trent. Trent runs a central server. The server is normally well-intentioned and not malicious. However, we are concerned about the risk of server compromise: a hacker might be able to break into the server, download all data stored on the server, and even control the server for a limited period of time. Our primary security goal is to make sure that an attacker who manages to compromise the server cannot learn the set $S$ or $T$. Also, neither Alice nor Bob should need to receive a copy of the other person's set (apart from the intersection).

I am aware of a super-simple protocol for this problem. Trent picks a random symmetric key $k$ for a pseudorandom function (PRF). Alice and Bob apply the PRF with key $k$ to each of their words and upload the result to Trent's server. In other words, Alice computes the set $S^* = \{F_k(s) : s \in S\}$ locally and uploads $S^*$ to the server, and Bob uploads $T^* = \{F_k(t) : t \in T\}$ to the server. Now the server can help them find the intersection: the server computes $S^* \cap T^*$ and sends it to Alice and Bob, and this will be enough for each of them to recover the intersection $S \cap T$. This protocol is practical and has the benefit of being very easy to explain. It is basically an application of (keyed) one-way hashing. And, as long as the key $k$ is not stored on the central server, compromise of the central server does not lead to a violation of the confidentiality goals.
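To make the protocol concrete, here is a minimal sketch in Python. It assumes HMAC-SHA256 as the PRF $F_k$ (my choice for illustration, not something fixed by the protocol), and the function names `prf`, `blind`, and `server_intersect` are hypothetical. The key is simply generated in the script; how it reaches Alice and Bob is outside the sketch.

```python
import hashlib
import hmac
import secrets


def prf(key: bytes, word: str) -> bytes:
    """F_k(word), instantiated here with HMAC-SHA256 (an assumption)."""
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).digest()


def blind(key: bytes, words: set[str]) -> dict[bytes, str]:
    """Map each PRF tag back to its word; only the tags are uploaded."""
    return {prf(key, w): w for w in words}


def server_intersect(s_star: set[bytes], t_star: set[bytes]) -> set[bytes]:
    """What the server computes: it sees tags only, never k or the words."""
    return s_star & t_star


# The key k is held by Alice and Bob and is not stored on the server.
k = secrets.token_bytes(32)

alice = blind(k, {"apple", "banana", "cherry"})  # Alice's S, kept locally
bob = blind(k, {"banana", "cherry", "date"})     # Bob's T, kept locally

# Each party uploads only its tag set (S* and T*).
common_tags = server_intersect(set(alice), set(bob))

# Each party maps the returned tags back to its own words.
alice_result = {alice[t] for t in common_tags}
bob_result = {bob[t] for t in common_tags}
print(sorted(alice_result))
```

Each party keeps a local table from tag to word, so the returned tags can be translated back without either side seeing the other's non-matching words.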

My question: Is this super-simple protocol approximately optimal? Or is there some other protocol that provides even better security? I'm familiar with all of the sophisticated protocols for private set intersection. Do any of them offer any security advantages for this particular setting? I'm most interested in practical benefits, rather than theoretical/foundational considerations.

This arises from a real-world problem involving data matching (matching of voter registration lists between multiple states, if that's relevant).

I guess the elements of the sets are rather small and guessable? Otherwise, before applying the PRF, Bob and Alice could compute deterministic commitments to each of their elements, e.g. hash each one. Then an adversary who gets access to the key will only learn the commitments (hashes). But this is easy to brute-force for guessable elements.
– DrLecter, Mar 11 '14 at 8:02

@DrLecter, yes, the elements have relatively low entropy (say, 10-40 bits). Therefore, applying a deterministic unkeyed one-way hash function would be highly insecure, because it's so easy to brute-force them, as you say. The use of a (keyed) PRF helps with this particular problem -- but maybe there is an even better or more secure solution?
– D.W., Mar 11 '14 at 8:19

ok, my idea was to have two layers, first hash and then apply the PRF. Anyways, that does not help in this setting. I have to think about it :)
– DrLecter, Mar 11 '14 at 8:22

I also guess you assume that Alice and Bob have no shared secret, right?
– DrLecter, Mar 11 '14 at 8:28

@DrLecter, if it would help for Alice and Bob to have a shared secret, feel free to assume they have one. That's not unreasonable.
– D.W., Mar 11 '14 at 8:53

1 Answer

Your simple approach is not bad, but you might consider these modifications:

First, you don't need a PRF. Any form of keyed hash, or a simple hash over the concatenation of a key and the element, should be enough. Basically, any one-way function over the element and some sort of key should do the trick, and you can optimize for speed.

The key should not be chosen by Trent; instead, Alice and Bob run a key exchange. There is no need for Trent to know the key at all. That means he can determine whether two elements are equal but cannot recover the elements themselves. With a keyed hash, even knowledge of the possible elements does not help an attacker who compromises the server.

Both Alice and Bob send their hashed sets to Trent, preferably already sorted to reduce the cost of finding all equal hash values.

Trent sends the list of equal hash values back to both of them.

In a second step, you can either use Trent again or leave it up to Alice and Bob to make sure that they really have the same elements and not just the same hash values (different elements might have the same hash value). If you use Trent, Alice and Bob can run another key exchange, encrypt the elements with a symmetric cipher under this new key, and send them to Trent.
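The first round described above (keyed hash, sorted upload, matching on the server) can be sketched as follows. This is only an illustration under stated assumptions: HMAC-SHA256 stands in for the keyed hash, the shared key is generated directly rather than through a real key exchange, and the names `keyed_hash`, `sorted_tags`, and `match_sorted` are hypothetical.

```python
import hashlib
import hmac
import secrets


def keyed_hash(key: bytes, element: str) -> bytes:
    """Keyed one-way hash of an element (HMAC-SHA256 is an assumption)."""
    return hmac.new(key, element.encode("utf-8"), hashlib.sha256).digest()


def sorted_tags(key: bytes, elements: set[str]) -> list[bytes]:
    """What Alice or Bob uploads: the hashed set, already sorted."""
    return sorted(keyed_hash(key, e) for e in elements)


def match_sorted(a: list[bytes], b: list[bytes]) -> list[bytes]:
    """Server side: a linear merge-style scan over two sorted lists."""
    i = j = 0
    equal = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            equal.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return equal


# Stand-in for the result of a key exchange between Alice and Bob;
# the server never sees this value.
shared_key = secrets.token_bytes(32)

alice_upload = sorted_tags(shared_key, {"alpha", "bravo", "charlie"})
bob_upload = sorted_tags(shared_key, {"bravo", "charlie", "delta"})

# The server returns the list of equal hash values to both parties.
matches = match_sorted(alice_upload, bob_upload)
print(len(matches))
```

Because both uploads arrive sorted, the server needs only one pass over the two lists instead of sorting them itself or building a hash table.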