What do you guys think about Honza's proposed solution in comment 1?
Previously, we have already established the pattern that, if we want the Socket Transport Thread and the main thread to be synced, then we must first block the socket transport thread and then execute the main thread option. This is what CertErrorRunnableRunnable (not a typo) and other code does.
If nsHttpConnectionMgr were to follow this pattern, it would send an event to the socket transport service thread, and then use SyncRunnableBase (or similar) to execute the stuff on the main thread that must be executed on the main thread.
If Honza's proposed solution will not work, then I have another potential solution that would work, but which would be more effort. This would require all of the following:
(a) Dropping support for main-thread-only (i.e. JS) nsICertBadCertListener2 and nsISSLErrorListener. Every single nsICertBadCertListener2 and nsISSLErrorListener in mozilla-central and comm-central does "return true;" to suppress the prompts that got turned off by default in bug 682329. We can just stop calling bad cert handlers and SSL error listeners completely without having any notable negative effect on Firefox or Thunderbird themselves. The following addons on AMO would be affected:
ExQuilla for Microsoft Exchange (for Thunderbird)
MiTM Me - alternate (click-through) cert error UI
Lightning for Thunderburd -- emulating
Thunderbird's unsafe cert error exception
dialog box.
KeeFox (A password manager)--AFACT, this one is
*completely* and silently vulnerable to
MITM attacks because of how it implements
nsIBadCertListener2.
Maybe Chatzilla (didn't read the code)
Obviously, the authors of these extensions want to keep their custom override behavior. At the same time, they are all implementing behavior that we (Mozilla) already rejected as unsafe. (In fact, the interface change from nsIBadCertListener to nsIBadCertListener2 was done explicitly to avoid the behavior that these addons are implementing.)
(b) Dropping support for main-thread-only nsICertOverrideService. Our built-in implementation is already thread-safe. I verified on mxr.mozilla.org/addons that no extensions on AMO would be affected by this. The main effect of making this change is that addons like Convergence and Selenium would have to replace their JS-based implementations with native code implementations. IMO, this is not an unreasonable requirement, but bz objected to it before. (I am willing to literally write the C++ code for them.)
(c) Fixing 679144
(d) Reworking the logic that handles PKCS#11 login, so that we don't need SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with the master password before we start any SSL connections in FIPS mode. (This would solve other problems too, like bug 584014 and related bugs.)

(In reply to Brian Smith (:bsmith) from comment #2)
> If Honza's proposed solution will not work, then I have another potential
> solution that would work, but which would be more effort. This would require
> all of the following:
>
> (a) Dropping support for main-thread-only (i.e. JS) nsICertBadCertListener2
> and nsISSLErrorListener.
I have figured out a way to avoid this, I think, that doesn't require changing our support for nsICertBadCertListener2 and nsISSLErrorListener.
> (b) Dropping support for main-thread-only nsICertOverrideService.
Similarly, I was just wrong about this. It is a non-issue, AFAICT.
> (c) Fixing 679144
This will happen anyway.
> (d) Reworking the logic that handles PKCS#11 login, so that we don't need
> SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with
> the master password before we start any SSL connections in FIPS mode. (This
> would solve other problems too, like bug 584014 and related bugs.)
This would be the main remaining issue to solving this by reworking code in PSM.

>> (d) Reworking the logic that handles PKCS#11 login, so that we don't need
>> SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with
>> the master password before we start any SSL connections in FIPS mode. (This
>> would solve other problems too, like bug 584014 and related bugs.)> This would be the main remaining issue to solving this by reworking code
> in PSM.
In English: This would be the main issue blocking the solving of this by reworking code in PSM (vs. spinning the event loop).

(In reply to Brian Smith (:bsmith) from comment #3)
> I have figured out a way to avoid this, I think, that doesn't require
> changing our support for nsICertBadCertListener2 and nsISSLErrorListener.Bug 754374.
> > (d) Reworking the logic that handles PKCS#11 login, so that we don't need
> > SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with
> > the master password before we start any SSL connections in FIPS mode. (This
> > would solve other problems too, like bug 584014 and related bugs.)
>
> This would be the main remaining issue to solving this by reworking code in
> PSM.
The PKCS#11 password prompt is a modal dialog that should (AFAICT) block us from shutting down. That is bad UI, but AFAICT it means it is a non-issue for this bug.

I saw this several times, with exactly the same OS X stack trace as in MattN's attachment. Most of the times were while shutting down to install Aurora updates, but the most recent time was immediately after waking up from sleep.

Created attachment 667096[details][diff][review]
remove GetPreviousCert to avoid most common case of deadlock
Honza, here's the patch I wrote for this. It removes the "previous cert" caching.
I don't know yet what the performance impact of this would be. The case where it would be problematic is where the top-level document is EV and the session has been resumed. In that case, the SecureBrowserUI and/or browser.js will eventually call "IsExtendedValidation()" on the main thread, which will do a certificate validation, which can block on network I/O and/or disk I/O.
Note that this problematic case can happen anyway, because this optimization only works (if ever) in cases where you navigate from one page to another on the same site. And, it's only for EV, AFAICT.

(In reply to Brian Smith (:bsmith) from comment #15)
> Created attachment 667096[details][diff][review]
> remove GetPreviousCert to avoid most common case of deadlock
>
> Honza, here's the patch I wrote for this. It removes the "previous cert"
> caching.
>
> I don't know yet what the performance impact of this would be. The case
> where it would be problematic is where the top-level document is EV and the
> session has been resumed. In that case, the SecureBrowserUI and/or
> browser.js will eventually call "IsExtendedValidation()" on the main thread,
> which will do a certificate validation, which can block on network I/O
> and/or disk I/O.
>
> Note that this problematic case can happen anyway, because this optimization
> only works (if ever) in cases where you navigate from one page to another on
> the same site. And, it's only for EV, AFAICT.
Sorry, I missed this review requirement first.
Brian, when you start working on a bug or steal it from somebody (w/o asking him, btw) then at least please change the assignee field to let people around you know.
Then, why aren't you fixing this bug according my comment 1 that would be a general fix?

Created attachment 668452[details][diff][review]
v1
- ::Shutdown() no longer indefinitely waits for monitor notification
- introduced a new event handler called on the main thread that sets a flag ; posted from OnMsgShutdown() handler on the socket thread
- ::Shutdown() loops with ProcessNextEvent until the flag is set
(-w patch for simpler review coming)

(In reply to Honza Bambas (:mayhemer) from comment #17)
> Brian, when you start working on a bug or steal it from somebody (w/o asking
> him, btw) then at least please change the assignee field to let people
> around you know.
>
> Then, why aren't you fixing this bug according my comment 1 that would be a
> general fix?
This patch is in my queue for an unrelated reason. It just happens to resolve the most common cause of this bug. I just posted it to be helpful. I'm not trying to steal the bug.

(In reply to Brian Smith (:bsmith) from comment #20)
> This patch is in my queue for an unrelated reason. It just happens to
> resolve the most common cause of this bug. I just posted it to be helpful.
> I'm not trying to steal the bug.
Aha, then sorry for my overreaction.
Do you intend to remove more of this sync stuff? If so, then I think it would be good to move it to a different bug. For this bug, so close to release bounce, I think it is safer to take a simpler and more general patch in Necko then touch relatively non-transparent and sensitive PSM code.

(In reply to Honza Bambas (:mayhemer) from comment #21)
> Do you intend to remove more of this sync stuff? If so, then I think it
> would be good to move it to a different bug. For this bug, so close to
> release bounce, I think it is safer to take a simpler and more general patch
> in Necko then touch relatively non-transparent and sensitive PSM code.
If you think that spinning the event loop is not going to be problematic, then I agree that is a reasonable short-term solution.
Regardless of this bug, I want to remove GetPreviousCert, because it can actually be bad for performance (due to event dispatching/queuing overhead), because of the nsIDocShell dependency it adds, and because there is a better, more effective way to cache results. However, that better way to cache results requires several more patches to implement.
My intention is to eventually remove all the SyncRunnableBase stuff from PSM. I have a plan for that, which we should discuss. However, it requires make significant changes, including even UI changes for master password. So, definitely we shouldn't block progress on this bug on that.

(In reply to Brian Smith (:bsmith) from comment #22)
> (In reply to Honza Bambas (:mayhemer) from comment #21)
> > Do you intend to remove more of this sync stuff? If so, then I think it
> > would be good to move it to a different bug. For this bug, so close to
> > release bounce, I think it is safer to take a simpler and more general patch
> > in Necko then touch relatively non-transparent and sensitive PSM code.
>
> If you think that spinning the event loop is not going to be problematic,
> then I agree that is a reasonable short-term solution.
I've been checking the code and it is clear we do the event loop spin short after manager shutdown anyway. During the looping I've introduced the manager is already in an unconditional shutdown state (there is no ref to sts that prevents using the manager mostly). So I don't feel there would be a regression unless well hidden.
And I also think this way of shutdown is more proper then indefinite waiting for a monitor notification. PSM may not be the only code that blocks the socket thread by a sync event for the main thread.
>
> Regardless of this bug, I want to remove GetPreviousCert, because it can
> actually be bad for performance (due to event dispatching/queuing overhead),
> because of the nsIDocShell dependency it adds, and because there is a
> better, more effective way to cache results. However, that better way to
> cache results requires several more patches to implement.
>
> My intention is to eventually remove all the SyncRunnableBase stuff from
> PSM. I have a plan for that, which we should discuss. However, it requires
> make significant changes, including even UI changes for master password. So,
> definitely we shouldn't block progress on this bug on that.
OK, I agree.

Is the reviewed patch covering the complete fix here? If so, please get this tested and landed to trunk. To take this fix on Beta we'll want it in earlier so there's time to watch for any regressions. That means having this assessed for the risks of uplift in the coming week.

(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #32)
> Is there anything in particular QA can do to verify this is fixed? Any
> advice on exploratory regression testing we can perform?
I don't think this is simple to reproduce. You could try visit an https page during shutdown, but the race is so close you probably won't be lucky.
I'd just leave this be verified by feedback from users.

So I still see this issue with Thunderbird 17.0 beta. Today I got an upgrade notice to the latest beta. After applying the downloaded update I triggered a restart and now a thunderbird process is hanging around for about 15 minutes. This is on OS X 10.7.5.

I killed the process already. But I will remember and do it when I see it the next time. It's something which happens occasionally after a couple of days using hibernation mode. So it will take a bit until I can come back with further information.