I'd like to find a way to anonymize IP addresses in web logs to ensure user privacy given the following requirements:

If given an IP address, there is no way to look up which requests came from the IP address.

Given the set of requests, it can be determined which requests came from the same IP address.

This way no one (including an attacker) can determine the original IP addresses from the log. An attacker (anyone with access to the log) would be able to determine which IPs are distinct, though.

I'm envisioning something like this. Each request is sent through a function that creates a unique identifier for each IP address (notice that requests 1 and 4 are from the same IP address, but result in different hashes).

Request 1 - 192.168.0.1 => EIDKSJD

Request 2 - 192.168.0.2 => GIEKDJS

Request 3 - 192.168.0.3 => CIDJSDJ

Request 4 - 192.168.0.1 => DJCUDJS

But there is another function that takes two outputs and determines if they came from the same IP.

same_parent(EIDKSJD, GIEKDJS) returns false

same_parent(EIDKSJD, DJCUDJS) returns true

This method has its downsides, specifically that searching for duplicates is not efficient. If there's a method that is efficient for searching and meets criteria 1 and 2 above, I would up-vote twice if I could :).

Can you please specify who you want to be able (and unable) to perform tasks 1 and 2? Should an attacker be able to determine if two requests came from the same IP, and do you want to be able to map request to IP address (and only you), or nobody at all?
– Thomas, Jun 10 '13 at 14:10

Thanks Thomas, I edited the question with this - "This way no one (including an attacker) can determine the original IP addresses from the log. An attacker (anyone with access to the log) would be able to determine which IPs are distinct, though."
– Nathan Murray, Jun 10 '13 at 15:22

Your proposed solution for 2. breaks 1. An attacker would simply run the IP-address through the function and invoke the "same_parent" functionality on the resulting string and an entry from the log...
– Maeher, Jun 10 '13 at 20:39

4 Answers

In general, against an attacker that has fully compromised your webserver, your full requirements are unsatisfiable.

The reason is that, in order to find out which requests have been made from a given IP address, the attacker can always simply fool your server into thinking it has received a request from that address, and then compare the resulting log entry to the others. (Of course, there may be more direct ways to achieve the same result, but this is a generic attack that works against any scheme.)

Conversely, against attackers who don't have full access to your webserver, but only to the logs, there's a simple solution: store a secret key on the server (in some way that will keep it out of the attacker's hands) and tag each log entry with HMAC(key, address). Without the key, an attacker cannot distinguish HMAC from a random function; in particular, they won't be able to tell anything useful about the HMAC tag except that each IP address corresponds to a particular tag.
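As a minimal sketch of the HMAC tagging described above, the following uses Python's standard `hmac` module. The names `SECRET_KEY`, `ip_tag`, and `same_ip` are hypothetical, and the choice of SHA-256 is an assumption; the answer does not name a hash function.

```python
import hashlib
import hmac
import secrets

# Assumption: the key is generated once and stored where log readers cannot reach it.
SECRET_KEY = secrets.token_bytes(32)


def ip_tag(key: bytes, address: str) -> str:
    """Return HMAC-SHA256(key, address) as a hex string for use as a log tag."""
    return hmac.new(key, address.encode("ascii"), hashlib.sha256).hexdigest()


def same_ip(tag_a: str, tag_b: str) -> bool:
    """Two log entries came from the same address exactly when their tags match."""
    return hmac.compare_digest(tag_a, tag_b)


tag_1 = ip_tag(SECRET_KEY, "192.168.0.1")
tag_2 = ip_tag(SECRET_KEY, "192.168.0.2")
tag_4 = ip_tag(SECRET_KEY, "192.168.0.1")
print(same_ip(tag_1, tag_2), same_ip(tag_1, tag_4))
```

Because the tag is deterministic under a fixed key, duplicates can be found with an ordinary index on the tag column. The scheme is only as private as the key is secret.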

However, it's generally pretty hard to ensure that attackers can't get the key (and can't otherwise fool the server into logging a fake request) even if they can somehow get access to the database. Just storing the key in, say, a hardware security module isn't enough, since the generic fake request attack still applies, so basically you need to harden your entire webserver.

You'll also want to make sure that your server isn't vulnerable to IP spoofing, e.g. by making sure that your TCP sequence numbers are unpredictable, and that requests are only logged after packets have been exchanged both ways between the client and the server. Otherwise that would provide yet another way to create fake log entries without having to directly compromise the webserver at all.

You can use a deterministic encryption scheme (which by definition does not achieve perfect secrecy, since it encrypts two equal plaintexts into the same ciphertext), as defined for deduplication:

You hash the IP address: $H(ip)$

You encrypt using the key $H(H(ip))$

Now nobody can see the IP address that the packet comes from, but anyone can determine whether two addresses are equal. Security flaws have been identified in this scheme.

In this work the authors kind of "randomize" the plaintext by transferring the deterministic scheme from the plaintext to a tag. It's not very clear yet what extra security it achieves. It goes like this:

Each user chooses a different key $L$

They compute $H(message)=K$

They send to the server: 1) $C_1 = E(L, M)$ and 2) $C_2 = L \oplus K$, in the form $C_1 \,\|\, C_2 \,\|\, H(K)$, where $H(K)$ serves as the tag

Now the plaintexts are randomized in the server space, but the server can use the tag to determine equalities in the plaintext space.
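The three values sent to the server ($C_1$, $C_2$, and the tag $H(K)$) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_encrypt` is a hypothetical stand-in for $E$ (an XOR with a SHA-256-derived keystream, not a vetted cipher), SHA-256 is assumed for $H$, and messages are limited to 32 bytes.

```python
import hashlib
import secrets


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption."""
    return hashlib.sha256(data).digest()


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR, truncated to the shorter input."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(key: bytes, message: bytes) -> bytes:
    """Toy stand-in for E(key, message). Illustration only, NOT secure for real use."""
    if len(message) > 32:
        raise ValueError("toy cipher handles at most 32 bytes")
    keystream = H(b"stream" + key)
    return xor(message, keystream)  # XOR is its own inverse, so this also decrypts


def encode(message: bytes):
    K = H(message)               # message-derived key
    L = secrets.token_bytes(32)  # fresh random key chosen per upload
    C1 = toy_encrypt(L, message)
    C2 = xor(L, K)
    tag = H(K)                   # deterministic tag used for equality checks
    return C1, C2, tag


def decode(C1: bytes, C2: bytes, K: bytes) -> bytes:
    L = xor(C2, K)
    return toy_encrypt(L, C1)


a = encode(b"192.168.0.1")
b = encode(b"192.168.0.1")
print(a[2] == b[2], a[0] == b[0])
```

Two uploads of the same message share a tag while their ciphertexts differ. Note that the tag is still computable by anyone who can guess the message, which matters for a space as small as IPv4.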

The deterministic encryption scheme doesn't meet the first requirement - given an IP address, an attacker can compute the ciphertext and then determine what requests came from that address.
– Michael, Jun 10 '13 at 16:28

You can authorize only owners of a secret key to encrypt with this key: E(H(H(key||ip)))
– curious, Jun 10 '13 at 19:01


The space of IPv4 is so small that I don't see how any deterministic scheme could be secure. I think the assumption that the attacker will have compromised the key is unreasonable.
– Steve Clay, Jun 10 '13 at 19:43

If you are willing to split the two requirements so that they are not met at the same place at the same time, there are some options available.

Option 1

Route the IP traffic through a translator on a reverse-proxy machine that you can't access, and that perhaps self-erases when the honeypot is triggered.

Option 2

Pre-generate a large collection of public/private key pairs. Store only the public keys on the server; store the private keys in a bank vault or similar. Encrypt each address instance with a different public key, regardless of IP. The key pair's UUID is recorded in the log to allow mapping to the private keys when needed.

Option 3

You may only need IP address log consistency over a short timeframe, as most serious events coming from the same IP occur over short timeframes. In this case, simply hash the addresses with a salt you don't record, and routinely regenerate this salt at the desired interval (e.g. 48 hours). This way, attackers of your logs only have a 48-hour window to match up IPs, and can only do so with those IPs visiting in the current 48-hour window.
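A minimal sketch of Option 3, assuming the salt lives only in process memory and that SHA-256 over `salt + address` is an acceptable hash (neither detail is specified above). The class name `RotatingSaltHasher` and its `clock` parameter are hypothetical; the injectable clock only exists to make the rotation easy to exercise.

```python
import hashlib
import secrets
import time


class RotatingSaltHasher:
    """Hash addresses with an unrecorded salt that is regenerated every interval."""

    def __init__(self, interval_seconds: float = 48 * 3600, clock=time.time):
        self._interval = interval_seconds
        self._clock = clock
        self._salt = secrets.token_bytes(32)  # never written to disk or to the log
        self._expires = clock() + interval_seconds

    def tag(self, address: str) -> str:
        now = self._clock()
        if now >= self._expires:
            # The old salt is overwritten, so older tags can no longer be reproduced.
            self._salt = secrets.token_bytes(32)
            self._expires = now + self._interval
        return hashlib.sha256(self._salt + address.encode("ascii")).hexdigest()


hasher = RotatingSaltHasher()
print(hasher.tag("192.168.0.1") == hasher.tag("192.168.0.1"))
```

Tags are comparable only within one interval. A process restart also discards the salt in this sketch, which may or may not be desirable.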

Longer answer: simply hashing IPs might not be a good idea, because IPv4 is just a 32-bit space, so it would be trivial to brute-force. In practice it is quite difficult to achieve both accountability and privacy, and usually a tradeoff must be chosen.

Random permutation maps each IP address to another address at random.
This allows some type of analysis, as we could determine if two hosts
in different records are the same.
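One way to read the random-permutation idea is a lazily built lookup table that assigns each address a random, unused pseudonym address on first sight. The sketch below assumes that reading; the class name `RandomPermutation` is hypothetical.

```python
import ipaddress
import secrets


class RandomPermutation:
    """Lazily assign each IPv4 address a distinct random pseudonym address."""

    def __init__(self):
        self._forward = {}  # real address (int) -> pseudonym (int)
        self._used = set()  # pseudonyms already handed out

    def pseudonym(self, address: str) -> str:
        ip = int(ipaddress.IPv4Address(address))
        if ip not in self._forward:
            while True:
                candidate = secrets.randbelow(2 ** 32)
                if candidate not in self._used:
                    break
            self._used.add(candidate)
            self._forward[ip] = candidate
        return str(ipaddress.IPv4Address(self._forward[ip]))


perm = RandomPermutation()
print(perm.pseudonym("192.168.0.1") == perm.pseudonym("192.168.0.1"))
```

The mapping table itself reveals every address, so it has to be protected or discarded just like a key.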

They also point out the rather easy-to-understand fact:

For fields 32 bits or smaller, dictionary attacks on hashes become
very practical. For example, it is well within the capability of a
modern adversary to create a table of hashes for every possible IPv4
address. The space of possible values being hashed is too small. Thus,
hash functions must be used carefully and only when the possibility of
collision in the mappings is acceptable.
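The dictionary attack is easy to demonstrate on a smaller range. The sketch below assumes logs that store an unsalted SHA-256 of the dotted-quad string (a hypothetical logging format) and tabulates a single /16 network so that it runs quickly; the full IPv4 space is the same loop over about 4.3 billion addresses.

```python
import hashlib
import ipaddress


def unsalted_hash(address: str) -> str:
    """Hypothetical log format: plain SHA-256 of the address string."""
    return hashlib.sha256(address.encode("ascii")).hexdigest()


def build_table(network: str) -> dict:
    """Map hash -> address for every address in the given network."""
    return {unsalted_hash(str(ip)): str(ip) for ip in ipaddress.ip_network(network)}


table = build_table("192.168.0.0/16")  # 65,536 entries
logged = unsalted_hash("192.168.0.1")  # value as it would appear in the log
recovered = table.get(logged)
print(recovered)
```

A keyed construction such as HMAC with a secret key prevents this precomputation as long as the key stays secret.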

For each HTTP client that followed the "test me" link at
panopticlick.eff.org, we recorded the fingerprint, as well as a 3-month
persistent HTTP cookie ID (if the browser accepted cookies), an HMAC of
the IP address (using a key that we later discarded), and an HMAC of
the IP address with the least significant octet erased
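The two HMACs in that quote can be sketched as follows. The quote does not say how the octet is "erased"; zeroing it is an assumption here, as are SHA-256 and the hypothetical names `erase_last_octet` and `hmac_tags`.

```python
import hashlib
import hmac
import ipaddress
import secrets


def erase_last_octet(address: str) -> str:
    """Zero the least significant octet, e.g. 192.168.0.77 becomes 192.168.0.0."""
    ip = int(ipaddress.IPv4Address(address))
    return str(ipaddress.IPv4Address(ip & 0xFFFFFF00))


def hmac_tags(key: bytes, address: str):
    """Return (HMAC of the full address, HMAC of the address with last octet erased)."""
    full = hmac.new(key, address.encode("ascii"), hashlib.sha256).hexdigest()
    prefix = hmac.new(
        key, erase_last_octet(address).encode("ascii"), hashlib.sha256
    ).hexdigest()
    return full, prefix


key = secrets.token_bytes(32)  # to be discarded once data collection ends
full_a, prefix_a = hmac_tags(key, "192.168.0.1")
full_b, prefix_b = hmac_tags(key, "192.168.0.2")
print(full_a == full_b, prefix_a == prefix_b)
```

The second tag groups clients on the same /24 without revealing which network it is. Once the key is discarded, nobody can recompute either tag for a given address.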