Issues in Layout that do not fit into any other Layout component or which span multiple Layout components.

Bugs related to the top level presentation objects (pres shell, pres context, and document viewer), the frame constructor, and the base frame classes, as well as general issues with alignment and sizing, all belong here.

re comment 1:
spun off to Bug 526592
all the bugs on the dependency tree of this bug will show up as crash regressions in 3.6 when it ships, so we might want to take and bake as many of these fixes as we can.

as the number of users ramps up on the beta, it's likely that more code paths will get exposed to this problem, so we should keep an eye out for additions to the list in comment 0
bclary, here is what I used to get that list started.
% grep 0xfffffffff0de 20091103-crashdata.csv | awk -F'\t' '{print $1,$11,$8,$14}' | sort | uniq -c | sort -nr | more
(note the -F'\t' quoting: an unquoted -F\t gets reduced by the shell to -Ft, which would split fields on the letter "t" instead of on tabs)
we might also need to refine the grep search term to catch more frame-poisoned crash addresses

This doesn't exactly surprise me; the whole point of frame poisoning was, we knew there were a ton of use-after-free bugs in the layout engine, and we wanted to make sure that none of them were exploitable. But some of them were silent, because the code using the freed data was finding something not entirely unlike what it would have found if the data hadn't been freed yet, and carrying on. Now most of 'em are going to be prompt crashers.
As I said in bug 525579, the fix is, in general, to find the code that freed an object that still had outstanding references and make it not do that, or else to make those references go away.
I'd suggest a slight tweak to the grep pattern right now:
% egrep '0x7?f+f0de'
because x86-64 builds use a leading 7 to put the poison address in a region that's invalid at the hardware level (so we don't have to worry about what the OS might do). Shortly there is going to be code for all 32-bit Mac builds and 32-on-64 Linux and Windows that reserves the poison region at startup (there is no guaranteed-reserved address region on those OSes), though, and then there will not be a simple grep pattern anymore.
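A quick way to sanity-check that pattern (the sample addresses below are made up for illustration; only the poison pattern itself comes from this discussion):

```shell
# Two hypothetical 64-bit poison-style addresses (with and without the
# leading 7) and one ordinary address; only the first two should match.
printf '%s\n' 0x7ffffffff0dea7ff 0xfffffffff0dea800 0x00000000deadbeef \
  | grep -E '0x7?f+f0de'
```

This prints the two poison-style addresses and drops the ordinary one.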

(In reply to comment #4)
> This doesn't exactly surprise me; the whole point of frame poisoning was
Nor us, and it's great for upcoming releases. But we do need to look for these because 1) they might become topcrashers in 1.9.2 with poisoning, and 2) these become a red flag for anyone who knows we've done poisoning and wants to hack the millions of people who haven't yet upgraded to releases we haven't shipped yet :-)

I just realized that '0x7?f+f0de' is wrong; it won't match '0xf0de....', which is what you get with 32-bit pointers. '0x7?f+0de' would be better.
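Illustrating the difference with made-up sample addresses (the specific values are hypothetical; only the two patterns come from this comment):

```shell
# A 32-bit-style poison address, two 64-bit-style ones, and a non-poison one.
addrs='0xf0dea7ff
0x7ffffffff0dea7ff
0xfffffffff0de8000
0x0de12345'
# Old pattern: misses the 32-bit form, which has no extra f before f0de.
printf '%s\n' "$addrs" | grep -cE '0x7?f+f0de'   # counts 2
# Revised pattern: catches all three poison-style addresses.
printf '%s\n' "$addrs" | grep -cE '0x7?f+0de'    # counts 3
```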
> ... we do need to look for these
> because 1) they might become topcrashers in 1.9.2 with poisoning, and 2) these
> become a red-flag for anyone who knows we've done poisoning and wants to hack
> the millions of people who haven't yet upgraded to releases we haven't shipped
> yet :-)
Oh, entirely agreed.
For the record, I think it *would* be possible to backport the QueryFrame stuff + the poisoning stuff to a 3.5.x security release without breaking anything; but it would be a large, tedious project that I personally do not want to do. Hopefully 3.6 uptake is quick enough that we don't need it.

(In reply to comment #9)
>
> For the record, I think it *would* be possible to backport the QueryFrame stuff
> + the poisoning stuff to a 3.5.x security release without breaking anything;
> but it would be a large, tedious project that I personally do not want to do.
> Hopefully 3.6 uptake is quick enough that we don't need it.
While you may not want to do it, personally, it's your job to follow through with this if need be. Is everyone here comfortable relying on 3.6 uptake rate being fast enough?

tomcat has been able to reproduce the crash with the #67-ranked signature from the original list in comment 5. The crash content was found on a public site, with a random user hitting the content. The URL was harvested from Socorro, then run back through automation.
hurray for crash test automation! ;-)
It's the nsIFrame::GetView() crash in bug 528493
hoping for more like this one in the next round of automated testing.

This shows which of the crash signatures listed in chofmann's attachment have not been seen in 3.5.5, meaning they're likely new crashes. This does not include any of the crashes for which we have no function name, since with only a module name and address there's no way to know whether it's a known crash, as the address changes with releases...

(In reply to comment #22)
> Chris, how many of those crashes appeared before FP landed?
if the crash signatures are moving around, it might be hard to tell how many in my list are entirely new crashes and how many are just new signatures. Talking with jst about this, he mentioned:
JohnnyStenback: exactly, so for xul.dll@0xfoobar we really don't know
that would require looking at a bunch of crash reports by hand, or figuring out some tool that could help us understand signatures that move around.

one thing we could do here to reduce risk is to get a beta that has the fix for bug 526545 deployed to a large number of users.
then tomcat could run another URL crash testing run to find the click-to-crash URLs that might be running in the wild.
the first run with the less effective URLs we had collected didn't turn up much, but with the new fix we might do a much better job of finding reproducible frame poisoning crashes.
I really think we should get FP shipped as soon as possible, but the two things we need to do to make that happen quickly are to fix the most visible crashes and to have a strategy for backporting FP to 3.5.x quickly if someone discovers a way to turn the new 3.6 crashes into 3.5/3.0 exploits. this comment should probably go private before this bug ever gets opened up.

(In reply to comment #24)
> Damon, if it becomes necessary, it would be pretty easy to disable the
> poisoning in non-debug builds, or make it subject to a hidden pref, without
> backing the whole thing out.
A hidden pref *might* be a useful diagnostic tool in any event (apart from recognizing the poisoned frame signature). Note I'm happy we are finding these bugs.

I think chofmann should own the meta - he's much more up on our process for finding these crash reports and aggregating them into bugs than I am. For instance, I have no idea where the URL data he's posted in several of the bugs comes from.
I think it makes sense to block on the meta, since it's the general situation that we are not comfortable with shipping with, rather than any given one of the bugs. Based on my lack of luck reproducing them, I suspect any given one of the bugs is going to affect a relatively small user population - but the aggregate crashiness of them all is worrisome.
(Also, I would like to state for the record that I wish I hadn't signed on to this horrible albatross of a project in the first place, and the sooner I never have to think about it again, the better. Done complaining now.)

(In reply to comment #33)
> I think choffman should own the meta - he's much more up on our process for
> finding these crash reports and aggregating them into bugs, than I am. For
> instance, I have no idea where the URL data he's posted in several of the bugs
> comes from.
>
> I think it makes sense to block on the meta, since it's the general situation
> that we are not comfortable with shipping with, rather than any given one of
> the bugs. Based on my lack of luck reproducing them, I suspect any given one
> of the bugs is going to affect a relatively small user population - but the
> aggregate crashiness of them all is worrisome.
sure, I can own this meta bug, which I think means:
1) getting agreement on an evaluation process by which to measure the impact of bugs and determine what to do about them
2) making sure all, or at least the most important, bugs we know about get evaluated, and that we have some mechanism for tracking the completion of the evaluations
3) getting some agreement on criteria by which we can arrive at some shipping decisions
I made some suggestions about item #1 in one of the other dependency bugs. I'll restate here:
we should be looking for three kinds of bugs, and pursuing fixes for each of these as we find them:
a) click-to-crash or low-user-interaction bugs
b) high-volume fp crashes
and c) real simple and low-risk fixes
if we fix the "a" set of bugs, we reduce the risk of 3.5.x and 3.0 zero-days.
if we fix the 'b' set of bugs, we won't significantly backtrack on the stability progress made in other areas of 3.6.
and if we put together several fixes for the 'c' class of bugs, the aggregate will have the benefit of fixing one of the top FP crash bugs.
if everyone agrees we can move on to the next stage. does this part make sense?

for part two in comment 34 which is
> 2) make sure all, or the most important, bugs we know about all get evaluated
> and we have some tracking mechanism for tracking the completions of the
> evaluations
I made a suggestion in comment 27. That might play havoc with the schedule, but I think it puts us on a faster path to shipping something whose risks we understand better. Interested in comments and feedback on that suggestion....

the other complicating part of what is going on is the changes to the skip list.
those changes are going to let us isolate and analyze specific problems more easily and quickly, but it's also going to complicate things in the next few days as signatures move around. Having a good set of crash data from another high-volume beta will also let us sort out these signature, crash-volume, and crash-distribution changes faster.

https://bugzilla.mozilla.org/show_bug.cgi?id=526769 might be an example of
c) real simple and low-risk fixes that come from code inspection and a quick look at each bug on the fp crash list,
which get the benefit of fixing several low-volume long-tail crashes that add up to the same impact as fixing a top crash.

I don't think we should block on a meta-bug; if there are particular prominent crashes we can block on them.
What are the top unfixed crashes that were exposed by frame poisoning, and how do they fit in the overall crash-stats list?

(In reply to comment #41)
> I don't think we should block on a meta-bug;
I agree in general, but not in this specific case of the frame poisoning problems.
The blocking on this bug is about getting through the process of evaluating FP risks. When we have done the evaluation process to our satisfaction, then this bug should be non-blocking.
> if there are particular prominent crashes we can block on them.
I think this agrees with point b from comment 34:
> a) click to crash or low user interaction bugs
> b) fix high volume fp crashes
> and c) real simple and low risk fixes found through code inspection
I also suggested 'a' and 'c' be added to the criteria.
any comments on those?
Attachment in comment 42 has the list of signatures hit by FP changes.

(In reply to comment #43)
> (In reply to comment #41)
> > I don't think we should block on a meta-bug;
>
> I agree in general, but not this specific case of the frame poisoning problems.
> The blocking on this bug is about getting though the process of evaluting FP
> risks. When we have done the the evaluation process to our satisfaction, then
> this bug should be non-blocking.
What do you mean by "evaluating FP risks"?
I don't see anything in this bug that comes within an order of magnitude of suggesting that we should even consider disabling frame poisoning.

If there are particular evaluation tasks, particular changes you think we should make, or particular crashes that you think should block the release, please make sure there's a bug filed on them and nominate that bug to block the release.
This bug is a metabug and there's no clear way to tell if it's fixed. Therefore it can't be a blocker for a release. Minusing.

(In reply to comment #44)
> I don't see anything in this bug that comes within an order of magnitude of
> suggesting that we should even consider disabling frame poisoning.
And to expand on why I think this: this is a drop in the bucket compared to our total crashiness. Of the 14615 crashes yesterday on 3.6b3 (per 20091123-crashdata.csv.gz), 413 of them had "f0de" in the crash address, and of those, most were fixed, since the top 10 such signatures were:
289 nsAccessibilityService::GetAccessible(nsIDOMNode*, nsIPresShell*, nsIWeakReference*, nsIFrame**, int*, nsIAccessible**)
51 nsAccessibleTreeWalker::PopState()
13 nsCOMPtr_base::assign_from_qi(nsQueryInterface, nsID const&)
7 nsIFrame::GetStyleDisplay()
7 nsCOMPtr_base::assign_with_AddRef(nsISupports*)
7 nsAsyncInstantiateEvent::Run()
6 nsXULPopupManager::HidePopupsInDocShell(nsIDocShellTreeItem*)
5 nsINode::GetOwnerDoc()
5 nsAccessibleTreeWalker::UpdateFrame(int)
4 nsStyleContext::GetStyleDisplay()
The top two of these (bug 525579 and bug 527287) are fixed (and perhaps others, but I'm not counting those). That leaves 73 crashes (413 - (289 + 51)) out of, let's (optimistically) say 10000 unfixed crashes.
I have no problem whatsoever accepting a <1% increase in total crashiness (and likely to improve over time) in exchange for a substantial improvement in security through the mitigation of most of the potentially exploitable security bugs in layout code.
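A tally like the one in comment 46 can be produced with a pipeline along these lines. This is a sketch: the stand-in file, its column layout (signature, then crash address), and the sample rows are all made up here, and the real Socorro crashdata columns may differ.

```shell
# Build a tiny stand-in for the crashdata file: signature<TAB>crash address.
{
  printf 'nsAccessibleTreeWalker::PopState()\t0xfffffffff0dea7ff\n'
  printf 'nsAccessibleTreeWalker::PopState()\t0xfffffffff0dea7ff\n'
  printf 'nsIFrame::GetStyleDisplay()\t0xf0dea800\n'
  printf 'nsSomethingElse::Run()\t0x12345678\n'
} > sample-crashdata.tsv
# Keep only rows whose crash address matches the poison pattern, then count
# crashes per signature, highest volume first (the shape of the list above).
grep -E '0x7?f+0de' sample-crashdata.tsv \
  | awk -F'\t' '{print $1}' | sort | uniq -c | sort -nr
```

With this sample, PopState() tops the list with 2 poison-address crashes, and the row with the ordinary address drops out.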

(In reply to comment #44)
>
> What do you mean by "evaluating FP risks"?
>
making sure that at least some percentage of the 241 have had some eyes on them, to see if there is a quick fix, or some user comments, or tomcat URL testing that would lead to the discovery of easy steps to reproduce.
I'm not sure what that percentage is, but I'll make a proposal that it be 10%, or looking at the top 24 bugs. Maybe the number should be higher. Right now we are halfway to that 10% goal.
Now that things are in motion to do beta4 with the improved URL reporting, we should have better results from a tomcat automation run testing all the crash URLs from frame-poisoned crashes. I think that should be a blocking task tracked in this bug.
> I don't see anything in this bug that comes within an order of magnitude of
> suggesting that we should even consider disabling frame poisoning.
That's good. I don't think disabling frame poisoning in 3.6 is an option we should be considering either.

(In reply to comment #46)
> 13 nsCOMPtr_base::assign_from_qi(nsQueryInterface, nsID const&)
This is bug 528134. (That bug is not all of the crashes with that signature, but I think it is all or at least almost all of the crashes with that signature and a frame poisoning crash address.)

> 187. 1 0xfffffffff0dea800 Windows NT
seems to be the same. see bug 530965
It appears to be a set of crashes with signature [@ nsFrameManager::ReResolveStyleContext(nsPresContext*, nsIFrame*, nsIContent*, nsStyleChangeList*, nsChangeHint, int)]
that are both FP and non-FP crashes

(In reply to comment #47)
> making sure that at least some pct. of the 241 have had some eyes on it to see
> if there is a quick fix, or some user comments, or tomcat url testing that
> would lead to the discovery of easy steps to reproduce.
Why is evaluating these crashes more important than evaluating other, more common, crashes?

Perhaps because the crashes would represent regressions, from a user's point of view? Presumably, many of these crashes are cases where web pages now crash (safely), when they didn't used to crash at all (even though they exercised some dangerous code paths).

Looking at a week's worth of 3.6b3 poison-address crashes basically produced a very similar list of top crashes, with the addition of nsIContent::SetText(nsAString_internal const&, int), which is bug 515096.

> view? Presumably, many of these crashes are cases where web pages now crash
> (safely), when they didn't used to crash at all (even though they exercised
> some dangerous code paths).
I don't think that's a safe presumption. It's likely that many of these issues were already causing crashes, just with much less predictable stacks/signatures: frame poisoning makes the crash locations much more predictable and easier to track and fix!

from comment 52:
> 6 nsXULPopupManager::HidePopupsInDocShell(nsIDocShellTreeItem*)
that's bug 530877 (which is not security-closed). I filed a parallel bug 531075 to track the frame poisoning crashes that it looks like weave is tickling in at least some of those crash cases.
there are plans to promote weave more real soon, so that could expose more people to these crashes and to discovery of the fp address crashes. any thoughts on whether that one could also be turned into an exploit against 3.5.x?

(In reply to comment #54)
> > view? Presumably, many of these crashes are cases where web pages now crash
> > (safely), when they didn't used to crash at all (even though they exercised
> > some dangerous code paths).
>
> I don't think that's a safe presumption. It's likely that many of these issues
> were already causing crashes, just with much less predictable
> stacks/signatures: frame poisoning makes the crash locations much more
> predictable and easier to track and fix!
What I'm hoping for in this bug is to replace words like "presumably many" with some harder data.
If only 20 of the 200 fp address crashes can be recognized as a problem in 3.6, figured out as a pattern by security researchers, and then turned into reliable or semi-reliable crashes on 3.5.x, that could turn into a "month of frame poisoning crash bugs" zero-day fest on 3.5.
The race is on to test this theory. I'm hoping we stay ahead of the security researchers in understanding what is, and is not, possible with the 200 fp bugs.

(In reply to comment #56)
> If only 20 of the 200 fp address crashes can be recognized as a problem in 3.6,
> figured out as pattern by security researchers, and then turned into reliable
> or semi-reliable crashes on 3.5.x that could turn into "a month of frame
> poisoning crash bugs" zero-day fest on 3.5.
>
> The race is on to test this theory. I'm hoping we stay ahead of the security
> researchers in understanding what is, and is not, possible with the 200 fp
> bugs.
Why are the crashes that show up with frame-poisoning addresses more interesting for security researchers than other crashes with interesting-looking addresses, which we've had many of publicly in our crash database for years? (In many ways, I think they're less interesting, since the users with browsers exploitable by such bugs will start decreasing rapidly once we ship 3.6.)

They're more interesting in part because they're easier to find: they become deterministic crashes in 3.6, which are easier to find and analyze than the current non-deterministic ones. At that point, the task of exploit development against 3.5, the hypothesis says, is not much of a burden.

new list with only fp crashes seen by the 400k users of 3.6b4. this will weed out signatures for bugs fixed in the previous betas. 431 signatures seen overall in 3.6*, but only 146 seen in b4, so progress is being made.

(In reply to comment #61)
> Created an attachment (id=416784) [details]
> 3.6b4 fp crashes as of 2009 12 08 - 146 signatures
>
> new list with only fp crashes seen by the 400k users of 3.6b4 this will weed
> out signatures for bugs fixed in the previous betas. 431 signatures seen over
> all in 3.6*, but only 146 seen in b4, so progress is being made.
chofmann, do you want a new URL run for this?

The #1 crash, inDOMUtils::GetCSSStyleRules(nsIDOMElement*, nsISupportsArray**), looks like it should be fixed by bug 528134.
The #2 crash is bug 530877.
The #3 crash is bug 533016. Need more traction there.
A lot of the other stack signatures look like crashes during painting due to bad frame pointers, as if we were painting a corrupt frame tree. Bug 526216 is a randomly-reproducible crash of this type, so I want to look into that and maybe it will give us a clue about these others.