Created attachment 501467[details]
Stack trace of crash
When running the Firebug 1.7 test suite, I get erratic crashes during various of the tests. It looks like it's accessing freed memory:
ContentEnumFunc (aRule=0x7fb908cd94c0, aSelector=0x7fb908cd9470, aData=0x7fff621211b0) at /home/sfink/src/TM-singlestep/layout/style/nsCSSRuleProcessor.cpp:2368
(gdb) p *aRule
$6 = {<nsICSSRule> = {<nsIStyleRule> = {<nsISupports> = {_vptr.nsISupports = 0x5a5a5a5a5a5a5a5a}, <No data fields>}, <No data fields>}, <No data fields>}
(note the 0x5a... pattern.)
Steps to reproduce:
1. Install Firebug 1.7 and FBTest 1.7. I did it by checking out (via subversion) http://fbug.googlecode.com/ and making the appropriate symlinks from my extensions/ directory:
fbtest@mozilla.com -> /home/sfink/src/fbtest1.7
firebug@software.joehewitt.com -> /home/sfink/src/firebug1.7
2. Click on the firebug icon in the bottom right. Focus on the firebug pane and press Shift-T to bring up the test console.
3. In the console, do "Run All".
For me, this goes on for a while, then at some random point it crashes with a stack trace similar to the attached one.
I have gotten the crash in net/1456, net/activation and one of the script/ ones. You may want to expand the "net" test suite and then run just it, so you can see how far it makes it.

Ok I think I fixed this one on the Firebug side:
http://code.google.com/p/fbug/source/detail?r=8842
domUtils.getCSSStyleRules has some limitations which crash Firefox unless you do a certain dance before you call it. Someone added a call to get the style rules in net.js for a UI hack, but failed to dance first. That explains why it happens in net testing and why its intermittent.
I'll leave this open until we try a bit more. If we stop seeing the crash we win.

As noted in bug 536379 comment 14, that should no longer be needed. If it is, there's a bug (which appears to be the case), but I'm not sure if the information given is enough of a testcase to be useful in fixing it.

Well, information on whether the change made in comment 4 fixes the crash or not would be a good start.
Note that depending on what Firebug is doing, if they're messing with a _ua_ sheet they would run into issues still, right? Of course then the firebug-side fix wouldn't help either.

I'm valgrinding away, using the original (known-to-crash) 1.7. Unfortunately, everything's timing out a little before 30s and failing, which may interfere with the diagnostic utility a bit.
The valgrind log shows 17 invalid reads, but they're all during startup, and I think I get them independently of firebug or fbtest. (Though I haven't confirmed, and I need to wait a while for this run before I can.) Is there a better way to valgrind? I have --enable-valgrind, and --smc-check=yes (for JaegerMonkey). Looks like it's valgrind fighting with jemalloc (see bug 503249).

Created attachment 502114[details]
valgrind log of net/ section of tests
It made it all the way through the whole net/ section without crashing, but every test timed out. I just noticed the "No Timeout" button at the top, so I'm trying again with that.
I'm still running with jemalloc, so there's some noise. The startup stuff is over by line 442. The locations in the log are all QI-related, which is where the actual crash is happening, too.

Created attachment 502121[details]
Valgrind log of net/ section of Firebug tests, run twice
It finally managed to crash at the same place as the real (non-valgrind) run, with this message:
##!!! ASSERTION: Please fix QI so this performance optimization is valid: 'static_cast<nsIStyleRule*>(aRule) == iRule.get()', file /home/sfink/src/TM-singlestep/layout/style/nsCSSRuleProcessor.cpp, line 2370
I was a little surprised, because it was just sitting in the XHR test. I thought it would stay there forever, because I had turned off timeouts and XHR debugging is broken right now such that it'll never get the XHR response callback. But it did crash, yay. (Not sure if it made it past that test or not; after it crashes, you can't see the status window anymore and I didn't do the config that would dump it out to an fd.)
I've attached the full valgrind log, which is the same as the other but with more stuff.

Created attachment 502153[details]
valgrind log of net/ section of tests
Bleh. I recompiled with --disable-jemalloc and reran, getting the same crash. This time, the valgrind log showed no more than what you could figure out from attaching with gdb. Everything else seems to be a result of jemalloc.

Though I guess for only one of the addresses... the other was just a pointer to dead memory.
Plus, now the pattern is different. 0xdadadada is a PL_FREE_PATTERN/JS_FREE_PATTERN (and something msvc-ish, but you're on Linux).
The 0x5a5a5a5a pattern in comment 0 is jemalloc's freed memory marker.
For what it's worth, we're crashing enumerating the document rules (at least if your line 633 in nsStyleSet is the same as what I'm seeing here). So something is definitely fishy...

Ugh. I've rebased that repo a few times since then. It's a tracemonkey checkout, which currently has this:
(*aCollectorFunc)(mRuleProcessors[eDocSheet], aData);
following an if. Starting at line 631:
if (!skipUserStyles && !cutOffInheritance &&
mRuleProcessors[eDocSheet]) // NOTE: different
(*aCollectorFunc)(mRuleProcessors[eDocSheet], aData);
aRuleWalker->SetLevel(eStyleAttrSheet, PR_FALSE,
aRuleWalker->GetCheckForImportantRules());
if (mRuleProcessors[eStyleAttrSheet])
(*aCollectorFunc)(mRuleProcessors[eStyleAttrSheet], aData);
I will now attempt to learn enough hg arcana to see if I can return to my directory state at that time...

Doesn't matter what the directory state was. What matters is which line of that is on your callstack when you crash. But I bet it's still this line:
(*aCollectorFunc)(mRuleProcessors[eDocSheet], aData);

(In reply to comment #21)
> Doesn't matter what the directory state was. What matters is which line of
> that is on your callstack when you crash. But I bet it's still this line:
>
> (*aCollectorFunc)(mRuleProcessors[eDocSheet], aData);
Sorry, I'm dense. I don't follow what you're saying. I want to be able to say what line 633 was in the binary that I built and ran valgrind on. To figure that out, I need to know what rev was sitting in my directory at the time I built.
Or are you saying I just need to know the times that I pulled in changes to that specific file? That is true too, but I don't know how to figure that out either. (I can see timestamps for when changes to that file were committed, but not timestamps for when those changes hit my local repo, nor for when I updated my working directory with those changes.) But I'm probably misinterpreting you.
Anyway, I still don't know how to figure it out properly, but fortunately I have logging hooks that give me a log saying my previous pull was of 8e119f847f97ba1e29da8192ca5fc93248e6c960, which has the same code at line 633
as I pasted above.
That should be the right rev, but my log doesn't show which repo I pulled into, so I'm not 100% certain.

Heh. I guess I could have checked that first.
Just wanted to mention that I just did another valgrind run on roughly the same test suite for another issue, and oddly it managed to make it much farther without crashing. The valgrind log shows an unrelated invalid read (during GC triggered by CC), but nothing about this. It may not mean much, since the problem is sporadic to begin with, but I think I do have some firebug changes this time around so the problem may be legitimately masked right now. (I have the older version lying around, so I can repeat if needed.)
Let me know if I you need anything else from me to help.

In my experience and based on Steve's result, you should be able to save some time by clicking on the Net test collection.
Also you might want to remove the lines:
// This call must precede all getCSSStyleRules calls
Firebug.CSSModule.cleanupSheets(hrefLabel.ownerDocument, Firebug.currentContext);
from net.js, since these were added to workaround this bug, but Boris' thinks they do nothing. So nothing is probably better.

So far I've run all tests once under valgrind, and then net tests again. (The second time (the net tests only) I'd removed the lines in comment 28.)
I did this without crashing, and without any valgrind warnings that seem relevant to this crash bug (though I did discover bug 625249, which I saw on both runs; I don't *think* it's related to this bug).
I've been running on a clean build of
http://hg.mozilla.org/mozilla-central/rev/f9537a28ce19 with the line
"#define DEBUG_TRACEMALLOC_PRESARENA 1" in layout/base/nsPresArena.h uncommented.

I'm still getting it easily with 4f71ecca94fe (tracemonkey), firebug1.7 revision 8916, fbtest 8835.
I'm trying now with firebug1.7 8935, fbtest 8935.... yep, still crashes. (Takes only about 15 seconds or so.)
I am on linux64. Perhaps that matters?
I'm doing a build of the same thing on my desktop box. If I can reproduce there, I can give you a login.
<groan> No, it made it all the way through there. Ok, I'll poke around...

It's not the build; I fortunately have identical directory structures so I rsync'ed over the same bits and it still didn't crash.
It's something in the profile. When I copied that over, it started crashing. I may have some extra Firebug tracing output. I will narrow that down.
The crash is very slightly different -- instead of a null deref, it's a junk deref. But the stack and object involved is the same.

(In reply to comment #32)
> I'm on Linux64 as well.
>
> What's the "desktop box" that you can't repro on?
It's the same. Email me an SSH public key if you'd like access to it. (sfink@mozilla.com). I'm working on narrowing down the profile difference.

It sounds like the profile, or some piece of it, would be more useful than access to the box. (e.g., if the prefs file is sufficient to make a new profile show the problem, no need to bother reducing it to which pref...)

That's what I'm trying to get. But copying the firebug part of the prefs.js file didn't work (still didn't crash with the "bad" prefs), and copying the whole file somehow messes up fbtest. Still working on it.

Created attachment 503602[details]
Prefs file triggering crash
Well, I'm deep in "wtf?!" territory, but I was able to go from no crash -> crash with the attached prefs.js file.
I'm not sure why it messed up fbtest before, but something also deleted all files from both extensions. Odd. Anyway, can you give this prefs.js a try? I don't know if you need to tweak anything to get it to work (delete extensions.installCache, maybe?), but maybe you do?

(In reply to comment #37)
> Created attachment 503602[details]
> Prefs file triggering crash
>
> Well, I'm deep in "wtf?!" territory, but I was able to go from no crash ->
> crash with the attached prefs.js file.
Just FYI, you have FBS_CREATION set which produces a huge amount of tracing data: for every function compiled it issues about 4 records and decompiles the source twice.

I created a new profile with just released Firebug 1.7a8 and FBTest from SVN. I can't get it to crash. Added some tracing prefs, no crash yet.
Steve, I don't have the those entries in the profile that does crash.

Well I take it back. The extra loops *seem* to make it crash more in the profile where I crash. But the crash does not happen in those loops and it does not make the crash-free profile crash either.
So the profile does make a difference. Somehow.

The trigger seems to be somewhat subtle, and probably isn't directly associated with what's actually going wrong, so I doubt there's a whole lot of benefit in narrowing it down further. At this point, someone with a clue (dbaron) is in possession of both a reproducible test case and some juicy valgrind traces, so we're in fairly good shape.
dbaron, anything else we could provide that might help out?

I can repro the crash now, so I think I'm fine.
The reference counting looks ok, so it seems like a failure to clear rule cascades. That said, it goes through nsCSSStyleSheet::ReplaceStyleRule, which calls DidDirty, which clears rule cascades. So I'm not sure what went wrong.

The rule in question is in chrome://firebug/skin/net.css, which is an @import-ed sheet. The sheet seems to incorrectly have a null mParent, which makes ClearRuleCascades() ineffective. I think this may be because the way we rebuild a child list after a clone sets the child list correctly, but doesn't set the parent pointer correctly.

Except ChildSheetListBuilder should have taken care of that, so that's probably not the cause.
But, nevertheless, it has a non-null mOwnerRule and a null mParent and null mDocument.
If I follow its mOwnerRule->mSheet, it is in fact in its "parent"'s child list.

I bet the problem is that nsCSSStyleSheetInner::RemoveSheet only calls SetStyleSheetReference, which only sets the sheet on the import rule (and doesn't propagate that to setting the parent on the import rule's sheet). Then we later destroy the sheet that was removed, which nulls out the mParent on the child sheet that is really no longer its own.

I'd also note that a trick that made this substantially more easily reproducable was commenting out these lines:
@@ -1092,19 +1092,19 @@ nsStyleSet::NotifyStyleContextDestroyed(
// could be a style context from the new ruletree!
if (!aStyleContext->GetParent()) {
mRoots.RemoveElement(aStyleContext);
}
if (mInReconstruct)
return;
- if (mUnusedRuleNodeCount == kGCInterval) {
+ //if (mUnusedRuleNodeCount == kGCInterval) {
GCRuleTrees();
- }
+ //}
}

(In reply to comment #69)
> I verified via try server (using the 32-bit Linux build) that the 1.9.2 patch
> in fact fixes the crash in the crashtest.
And I did the same thing for a try server build of the 1.9.1 patch (against a current 1.9.1 nightly).
> And, for comparison, the two crashes in the current 1.9.2 nightly build that I
> got while testing the nightly and the try server build with the same steps
> were:
And the crashes from the 1.9.1 nightly were:
bp-d52db773-2733-42b6-b914-4e7f42110115bp-ee1c21a1-9d5e-4339-adc5-7120f2110115

(In reply to comment #71)
> The crash-test doesn't require Firebug or any special privs, right? (doesn't
> look like a chrome test.)
Nope, the crash test is just a Web page that crashes Firefox. That's why I want to get this fixed on branches.

The "3.6.15" we're releasing today does not fix this bug, the release containing this bug fix has been renamed to "3.6.16" and the bugzilla flags will be updated to reflect that soon. Today's release is a re-release of 3.6.14 plus a fix for a bug that prevented many Java applets from starting up.

Note

You need to
log in
before you can comment on or make changes to this bug.