Description

I am observing that my relay and bridge will update the microdesc consensus when they are restarted or catch SIGHUP, but not while they are running. In the case of the bridge, the consensus it serves eventually falls out of date, and clients that try to connect through it will hang on "I learned some more directory information, but not enough to build a circuit: We have no recent usable consensus" and never connect to the network.

The bridge and relay I have seen this happening on are running 0.2.9.4-alpha on OpenBSD. I also spun up a new bridge on Debian (also running 0.2.9.4-alpha) and it appears to have the same problem. This does not appear to happen with 0.2.8.9.

What it looks like is happening:

At startup (or reload) the relay fetches the microdesc consensus

1 minute later it tries to fetch it again (update_consensus_networkstatus_downloads() is called) and receives a 304 response as it hasn't been modified

download_status_increment_failure() gets called with a status_code of 304

update_consensus_networkstatus_downloads() gets called again, this time it stops at the call to connection_dir_count_by_purpose_and_resource() which returns 1 (equal to max_in_progress_conns)

download_status_increment_failure() gets called again, this time with a status_code of 0 (as a result each 304 response results in the fail count being increased by 2)

The previous steps repeat every minute for a few minutes until the failure count reaches 10 (exceeding the max fail count of 8)

At this point it still keeps retrying every minute but download_status_is_ready() doesn't return true as the failure count exceeds the max, so it skips it without trying to fetch it

Eventually the consensus falls out of date, but download_status_is_ready() still won't return true, so it won't try to fetch a new one
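To illustrate the gate being described, here is a minimal sketch. The function name, signature, and constant below are simplified stand-ins of my own, not Tor's actual download_status_is_ready():

```c
#include <assert.h>

/* Simplified stand-in for download_status_is_ready(): once the failure
 * count exceeds the schedule's maximum, the check never passes again,
 * so the download is skipped forever -- even after the consensus it is
 * guarding has expired. */
#define MAX_FAILURES 8

int download_is_ready_sketch(int n_failures, long next_attempt_at, long now) {
  if (n_failures > MAX_FAILURES)
    return 0;                     /* permanently gives up retrying */
  return now >= next_attempt_at;  /* otherwise, ready once the timer expires */
}
```

With a failure count of 10, as described above, the check fails unconditionally regardless of how stale the consensus is, which matches the observed behaviour.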

On 0.2.8.9 it makes a couple of attempts that fail with a 304 response, but download_status_is_ready() will eventually start returning false because the value of dts->next_attempt_at is greater than the current time. It seems that on 0.2.8.9 next_attempt_at is increased much more aggressively, first by 1 minute, then 10 minutes and then an hour, so it accumulates a failure count of 6 but then waits long enough that the next attempt succeeds.

On 0.2.9.4-alpha, it looks like the value of next_attempt_at is increased more slowly, by only seconds at a time, so it reattempts every minute and quickly reaches the failure limit.
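The two behaviours above can be sketched side by side. This is hedged: the 0.2.8.9 schedule values come from the observations in this ticket, but the small-increment formula is purely illustrative and is not the actual 0.2.9 code:

```c
#include <assert.h>

/* 0.2.8.9-style: the retry delay comes from a fixed schedule that grows
 * quickly (1 minute, 10 minutes, 1 hour), so a handful of failures is
 * enough to out-wait a fresh consensus. */
static const int schedule_0289[] = { 60, 600, 3600 };

int next_delay_0289(int n_failures) {
  int last = (int)(sizeof(schedule_0289) / sizeof(schedule_0289[0])) - 1;
  return schedule_0289[n_failures < last ? n_failures : last];
}

/* 0.2.9.4-style (illustrative formula only): a slow backoff that grows
 * by seconds at a time, so retries stay roughly a minute apart and the
 * failure cap is reached quickly. */
int next_delay_0294_sketch(int current_delay) {
  return current_delay + current_delay / 10 + 1;
}
```

Under the slow formula, a 60-second delay only grows to 67 seconds after a failure, so the one-minute retry cadence persists until the failure count hits the cap.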

Change History (37)

This looks very similar to my "mystery 1" in #19969 -- so I am going to bring that discussion over here so we can keep the bugs separate.

For me it happened on "Tor 0.2.9.3-alpha-dev (git-bfaded9143d127cb)" (which is alas not in a released version of Tor, because #20269 isn't merged yet, but suffice to say it's partway between 0.2.9.3 and 0.2.9.4). And this was just a client, not a bridge or relay.

teor asked in #19969 what my "consensus download_status_t has for all its fields, particularly the attempt and failure counts." Here is what gdb says:

teor also asked if my Tor has marked each of the directory authorities down. I believe the answer is yes -- all the entries in trusted_dir_servers have is_running set to 0, except the Bifroest entry (which makes sense). Here is an example to be thorough:

The fallback_dir_servers smartlist has 90 elements so I didn't check all of them, but for the first few, is_running was set to 1. That is, no, my Tor seemed to think that the fallback dirs were just fine to contact. It simply chose not to contact them.

To give some other context to folks reading this ticket, here is some more debugging detail (all from my Tor client that has been opting not to retrieve a new consensus for the past week+):

I set a breakpoint on fetch_networkstatus_callback, and learned that prefer_mirrors is 1, and we_are_bootstrapping is 1. should_delay_dir_fetches() was 0. It called update_networkstatus_downloads as expected.

Then I set a breakpoint on update_consensus_networkstatus_downloads. Again we_are_bootstrapping is 1. use_multi_conn alas is <optimized out>, but looking at the function, I assume it's 1 for me. The first round through the loop, for the vanilla consensus flavor, we_want_to_fetch_flavor() is no, so I move on to the second round through the loop. I don't know how c = networkstatus_get_latest_consensus_by_flavor(i); goes because "print c" also says <optimized out>, but it looks like it runs time_to_download_next_consensus[i] = now; next, so I can assume that c was NULL. Then it keeps going through the function until it calls update_consensus_bootstrap_multiple_downloads(). Looks plausible.

So I set a breakpoint on update_consensus_bootstrap_multiple_downloads, which was trickier than I would have wanted since it looks like my compiler inlined it into update_consensus_networkstatus_downloads. But it looks like it does make two calls to update_consensus_bootstrap_attempt_downloads -- one with dls_f, and the next with dls_a.

So I set a breakpoint on update_consensus_bootstrap_attempt_downloads. It sets max_dl_tries to 7, which makes sense since I see it there in config.c, set to default to 7.

rubiate, are you able to reproduce this bug consistently? If so, can you spin up a relay or bridge with commit 09a0f2d0b24 reverted, and see how that fares? My guess is that it will fare much better.

In the mean time, I've opened #20501 to look at the Tor network for relays that were bitten by this bug (seems like quite a few), and #20509 for doing something about getting them off the network and/or taking away their Guard flag so clients don't get stuck behind them and then be unable to use Tor.

So, if we're truly on exponential backoff, no maximum could be too large, right?

Technically, yes.

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

I also wonder, why are these failure counts so high?

Firstly, because they get incremented twice for each failure.

download_status_increment_failure() gets called with a status_code of 304
update_consensus_networkstatus_downloads() gets called again, this time it stops at the call to connection_dir_count_by_purpose_and_resource() which returns 1 (equal to max_in_progress_conns)
download_status_increment_failure() gets called again, this time with a status_code of 0 (as a result each 304 response results in the fail count being increased by 2)

10:36 Both start up
[...] both request the consensus every minute
10:41 they reach a fail count of 10

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Are you firewalled in some weird way? Are they trying to fetch it from fallbackdirs and those are surprisingly faily? Are they trying from directory authorities and our authorities are no good?

Anyway, it looks like 'revert' is the winner, but it would still be great to learn what is so helpful about your test environment that it triggers this bug so well.

Well, they're in Australia, so latency is high, and measured bandwidth is low. But I'm not sure that would cause so many failures. Maybe something with OpenBSD?

My relay at the same provider has AccountingMax set, and has disabled its DirPort, so it's much harder to interrogate. It's on FreeBSD, but on 0.2.8.7 (still waiting for a package update), and up to date with its consensuses.

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

Firstly, because they get incremented twice for each failure.

I haven't looked into that one, but if so, can we open a new ticket for this (what looks like separate) bug?

And secondly, because the laptop was offline for 12? hours?

Actually, I think I drove my consensus download failure count up to 8 over the course of about ten minutes -- it launches each new try within a second of when the last one failed:

My laptop was closed (asleep) for more than a day, so when it woke up its consensus was more than 24 hours old, so it immediately jumped to bootstrap mode for its downloads. Ten minutes later, it had given up permanently.

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

Proposal 210 is close, but it's been modified by at least you, me, and andrea since then.

Firstly, because they get incremented twice for each failure.

I haven't looked into that one, but if so, can we open a new ticket for this (what looks like separate) bug?

My laptop was closed (asleep) for more than a day, so when it woke up its consensus was more than 24 hours old, so it immediately jumped to bootstrap mode for its downloads. Ten minutes later, it had given up permanently.

That timing is off from what I would expect - when I designed it, it was:
Fallbacks: 0, 1, 5, 16, 3600, ...
Authorities: 10, 21, 3600, ...

But if we're skipping two on every failure, it could become:
Fallbacks: 0, (1 or 5 depending on exact failure timing), 3600, ...
Authorities: 10, 3600, ...
And if the client has all the authorities as down, I guess it won't even try them.
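The double-increment effect on those schedules can be sketched with a small example. The lookup helper below is hypothetical; only the schedule values come from this ticket:

```c
#include <assert.h>

/* teor's intended fallback retry schedule, in seconds. */
static const int fallback_schedule[] = { 0, 1, 5, 16, 3600 };

/* Hypothetical schedule lookup: delay for a given failure count,
 * clamped to the last entry. */
int fallback_delay(int n_failures) {
  int last = (int)(sizeof(fallback_schedule) / sizeof(fallback_schedule[0])) - 1;
  return fallback_schedule[n_failures < last ? n_failures : last];
}
```

Incrementing the count once per failure visits delays 0, 1, 5, 16, 3600; incrementing it twice per failure visits 0, 5, 3600, skipping every other entry exactly as described above.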

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

I logged #20534 for this. There are existing cases where we give up forever. We should tune them to do what we think we want.

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Ah, no, they both got a perfectly good consensus at startup. The "failures" every minute after that are from them having a fresh consensus, requesting a new one anyway and getting 304 Not Modified in response.

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Ah, no, they both got a perfectly good consensus at startup. The "failures" every minute after that are from them having a fresh consensus, requesting a new one anyway and getting 304 Not Modified in response.

So your reverted relay is bad: it retries 5 times every hour, when it should only try once an hour.
And your bad relay is also bad: it retries never, when it should at least try once an hour.

What's the current consensus on what the minimal set of fixes for 029 is here? I'd like to do something in the next couple of days for an 0.2.9.5-alpha, even if we expect that we'll have to fine-tune it a little more for 0.2.9.6-rc.

Do we need to do all of the subtickets in 0.2.9, do you think? Or do we take a simpler approach?

I have a branch bug20499_part1_029 that I think should be sufficient to make 0.2.9 work correctly here. It solves #20534 and #20536 (and #20587 too, since it was there). The other stuff, I think, could wait to 0.3.0? My understanding is limited though.

...
It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here.

We should think carefully about what we want the default maximum to be, and if we want it to be the same in every case:

*max = INT_MAX;

Perhaps the network could deal with slow zombies if they only retried every day/week/month. Or perhaps we really do want them to stop asking. Or perhaps we want clients to do one thing, and relays another.

Currently, some schedules do have INT_MAX as their maximum, others have 2, 12, or 73 hours.
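As a hedged sketch of the design space being discussed (the function and its overflow guard are my own illustration, not Tor's code), a doubling backoff clamped to a per-schedule maximum would look like:

```c
#include <assert.h>
#include <limits.h>

/* Doubling backoff clamped to a per-schedule maximum.  With
 * max == INT_MAX the delay grows without bound (the "no maximum"
 * option); with a finite max (e.g. 2, 12, or 73 hours) the download
 * keeps being retried at that interval forever instead of going quiet. */
int next_backoff_delay(int current_delay, int max) {
  if (current_delay > max / 2)
    return max;                  /* doubling would overflow or exceed max */
  int next = current_delay * 2;
  return next < max ? next : max;
}
```

The finite-max variant is what produces the "slow zombie" behaviour mentioned above: the relay never stops asking, it just asks at the maximum interval.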

All the other commits look fine, but I'm not sure they're enough to solve this issue.

...
It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here.

We should think carefully about what we want the default maximum to be, and if we want it to be the same in every case:

*max = INT_MAX;

Perhaps the network could deal with slow zombies if they only retried every day/week/month. Or perhaps we really do want them to stop asking. Or perhaps we want clients to do one thing, and relays another.

Currently, some schedules do have INT_MAX as their maximum, others have 2, 12, or 73 hours.

My thinking here was that we're backing off to infinite retries, so we might as well let the delays get arbitrarily large. We might want to cut the maximum back down if it turns out to be a problem in practice, but I'd like to be cautious about zombies for now.

All the other commits look fine, but I'm not sure they're enough to solve this issue.

Okay. We've merged a fairly large passel of fixes to try to address this. It is reasonably likely that we will need more fine-tuning, so this will be one of the things that makes 0.2.9.5-alpha an alpha release.