Description

Replace TCP port with TCP port hash: It may be less obvious that TCP ports need to be sanitized, but an unusual TCP port used by a high-value bridge might still stand out and provide yet another way to locate and block the bridge.

Each non-zero TCP port is replaced with H(port | bridge identity | secret)[:2] % 65535 + 1 written as decimal number. The input port is the 2-byte long binary representation of the TCP port. The bridge identity is the 20-byte long binary representation of the bridge's long-term identity fingerprint. The secret is a 33-byte long secure random string that changes once per month for all descriptors and statuses published in that month. H() is SHA-256. The [:2] operator means that we pick the 2 most significant bytes of the result. TCP ports that are 0 in the original descriptor are left unchanged.

In order to make this change we'll need to write and test the code and re-process all bridge descriptors since 2008. The last part is going to take at least a week, maybe longer.

Huh, good point, didn't think of that. How about we make the following two changes, one related to your suggestion and one unrelated?

Take out the % 65535 + 1 part to make this calculation a little less complicated, at the risk of accidentally changing 1 in 216 ports to 0.

Add clarifying sentence: "All calculations assume that inputs and outputs are in network byte order." Does that make sense, or is there a better sentence to add here? (We'll want to add a similar sentence to the IP address sanitizing part.)

New paragraph would be:

Each non-zero TCP port is replaced with H(port | bridge identity | secret)[:2] written as decimal number. The input port is the 2-byte long binary representation of the TCP port. The bridge identity is the 20-byte long binary representation of the bridge's long-term identity fingerprint. The secret is a 33-byte long secure random string that changes once per month for all descriptors and statuses published in that month. H() is SHA-256. The [:2] operator means that we pick the 2 most significant bytes of the result. All operations assume network byte order for their inputs and outputs. TCP ports that are 0 in the original descriptor are left unchanged.

Huh, good point, didn't think of that. How about we make the following two changes, one related to your suggestion and one unrelated?

Take out the % 65535 + 1 part to make this calculation a little less complicated, at the risk of accidentally changing 1 in 216 ports to 0.

I think that a 0 port has a special meaning (not configured) and we need to preserve that.
I'm comfortable with the extra complexity. But I'm not the one who has to code or maintain it, so it's up to you.

Add clarifying sentence: "All calculations assume that inputs and outputs are in network byte order." Does that make sense, or is there a better sentence to add here? (We'll want to add a similar sentence to the IP address sanitizing part.)

Hmm, network byte order is only meaningful for integers, and it's important only when they are hashed, or otherwise interpreted as an array of bytes. So it only affects the port (16 bit integer) and IPv4 address (32 bit integer).

All the other hash inputs and outputs have a defined order already - the order in memory.

So I'm not sure if this sentence would add more confusion - maybe it's just worth clarifying the integer inputs?

New paragraph would be:

Each non-zero TCP port is replaced with H(port | bridge identity | secret)[:2] written as decimal number. The input port is the 2-byte long binary representation of the TCP port. The bridge identity is the 20-byte long binary representation of the bridge's long-term identity fingerprint. The secret is a 33-byte long secure random string that changes once per month for all descriptors and statuses published in that month. H() is SHA-256. The [:2] operator means that we pick the 2 most significant bytes of the result. All operations assume network byte order for their inputs and outputs. TCP ports that are 0 in the original descriptor are left unchanged.

Alright, I came up with a slightly less dirty way to avoid producing 0 as sanitized ports: H(...)[:2] / 2^2 + 2^15 + 2^15. That takes the most significant 14 bits and puts them in the range from 49152 to 65535 which is reserved for private services. One might argue that this is similar to how we pick 10.x.x.x for sanitized IPv4 addresses, but really it's just less ugly than H(...)[:2] % 65535 + 1. Yell if you dislike it.

And I made another attempt to specify the network byte order thing. Though I realized that it doesn't really matter, because people cannot reproduce our results without having our secrets anyway, so they wouldn't even find out if they're running this code on a different architecture. Still good to be exact I guess.

So, I wrote some code. Please take a look at ​this single commit in my branch task-19317. teor, please feel free to ignore the Java changes, but if you could take a look at the HTML near the end of the patch which contains the new specification, that would be grand.

How much privacy is actually gained with all this?
Reducing the range gives some privacy, but how much given that the original ports are not randomly distributed in their range?
Just point me to an appropriate doc, if this question seems trivial.

Neat, didn't know about that. I'll change that as soon as we agree that the calculation even makes sense.

How much privacy is actually gained with all this?
Reducing the range gives some privacy, but how much given that the original ports are not randomly distributed in their range?
Just point me to an appropriate doc, if this question seems trivial.

This doesn't add much privacy, it's mostly a way to avoid that we obtain a result of 0. We could have taken different approaches, e.g. | 0b0000_0000_0000_0001 or | 0b1000_0000_0000_0000 or % (2^16 - 1) + 1. But similar to how we're sanitizing IPv4 addresses to 10.x.x.x rather than x.x.x.x, I thought we could sanitize TCP ports to just 49152 to 65535 which happen to be reserved. That avoids confusions like a port below 1024 which might show up in Atlas and confuse bridge operators. And I thought it's better to take the first 14 bits of the hash output rather than bits 2 to 16, therefore the / 2^2. But really, the main goal is not to obtain a 0 as result. And reducing the range of possible ports to 1/4 doesn't seem like a big issue to me. Hope this makes sense. Open to alternatives that make even more sense!

In general, I think what is defined above is ok. Some some port changes could be lost, but that's ok for the sanitation objective.
There are surely many ways to derive the obfuscated port number, we just have to choose one and the above seems fine.

I would suggest to also change the comments in code and default properties to state that both ip and port will be hashed.

And, should the unhashed default of port 1 be changed to some number above 1024, b/c of Atlas?

Some asides about the SanitzedBridgeWriter:
Now looking at the entire class I noticed first off all that I missed some hard-coded paths when doing the configuration change. => New issue #19424, which has time to wait for the port sanitation to finish.

Second, it might be better to access the secrets file in try-with-resources statements, maybe using the java7 idiom of Files.newBufferedReader in java.nio.file.

Please review the last two commits in ​my branch task-19317-2. The first commit (​ecb0538) has the same code as the ​commit in my earlier task-19317 branch, rebased to master. Note that this patch doesn't include changes based on the suggestions above, which we should make as soon as unit tests are in place. The second commit (​f608c94) splits tarballs into one per month and descriptor type. I'm already running this code in parallel to the main CollecTor instance, and I'm ready to switch over as soon as 1) the tarball samples I sent you look okay and 2) this branch is ready to be merged. By the way, I'd say we don't need to put out a release for this, but I can as well run a -dev version on the primary CollecTor instance.

The notational changes (talked about in comment:6 and comment:7) should be added to the suggestions of comment:8, which should be implemented after the tests. (I'll add a note to ticket #19755.)

The changes in the two commits look fine.

I'm wondering if there should be a more prominent warning that bridge-descriptors archive downloads will be different now? In a way the file structure for 'archive' is a 'protocol', which should be defined more clearly (eventually).

I randomly looked at the archives using command line tools; the probes were fine, tars containing only descriptors published in the given month, ports being in the above defined range.
So, I think this is fine to be used. As we have only one instance working with bridge-descriptors, I agree this can be done without a release.