A leak scenario we see a lot is JS holding onto a dead Window object. This leads to holding onto a whole bunch of stuff.
We should consider cutting this chain somehow. For instance, in a "chrome holds onto content Window" scenario, the situation looks something like:
Chrome compartment |                       | Content compartment | XPCWrappedNative
                   |                       |                     | owning
Window JSObject  ->|-> XPConnect Wrapper ->|-> Window JSObject ->| nsGlobalWindow
It seems like we might be able to cut out some of the middle here (in particular the content compartment). A similar scenario might apply with windows of differing origins.

I think we might be able to do something here, but we need to be fairly careful in terms of when we sever the connection. Currently, when a window is closed or navigated (and not held in the bfcache) we clear the scope of the window, nuking "expando" properties (properties not declared in IDL but stuck on the window object via global variables, etc.); however, IDL-defined properties do continue to work thanks to XPConnect's lazy resolution.
So, by doing this, we could potentially break code that actually does want to maintain a reference to a window and is deliberately holding it alive, since we'd effectively be breaking the GC's contract.
That being said, the easiest (and possibly cleanest) place to implement this would probably be to detach the cross compartment wrapper in the JS engine (fastest way to glory would be to brain transplant it with a /dev/null proxy). We could even warn in the error console when we do so to try to give developers a chance to clean up their stuff.
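The "/dev/null proxy" idea above (what eventually shipped as SpiderMonkey's DeadObjectProxy, implemented in C++) can be sketched in ordinary JavaScript with a Proxy whose every trap throws. This is an illustration of the behavior only, not the real implementation:

```javascript
// Illustration: a rough JS model of a "dead" wrapper. Every trap throws,
// mimicking the "can't access dead object" behavior the real C++
// DeadObjectProxy produces after a wrapper is nuked.
function makeDeadWrapper() {
  const die = () => {
    throw new TypeError("can't access dead object");
  };
  return new Proxy(Object.create(null), {
    get: die,
    set: die,
    has: die,
    deleteProperty: die,
    ownKeys: die,
    getOwnPropertyDescriptor: die,
    defineProperty: die,
    getPrototypeOf: die,
  });
}

const dead = makeDeadWrapper();
let message = null;
try {
  dead.parentNode; // any property access throws
} catch (e) {
  message = e.message;
}
console.log(message); // "can't access dead object"
```

Because the proxy has no idea what kind of object it replaced, even "benign" reads like `parentNode` throw, which is exactly the fail-early behavior discussed later in this bug.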

Expandos can't be accessed cross-compartment anyways (with the exception of chrome touching content), right? (I suppose this changes in the compartment-per-global world :-/) I think we could get away with saying that chrome has to deal with the change and then only do this for cross-compartment wrappers.
This will get trickier with compartment-per-global unfortunately.

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #4)
> Expandos can't be accessed cross-compartment anyways (with the exception of
> chrome touching content), right?
document.domain can cause this too, even before compartment-per-global.
We could definitely do this for chrome -> content Xray wrappers only if we wanted to, though.

(In reply to Blake Kaplan (:mrbkap) from comment #5)
> We could definitely do this for chrome -> content Xray wrappers only if we
> wanted to, though.
I think that's a good place to start, especially since leaks caused by chrome are generally much longer lived than leaks caused by content.

A recap of the basic idea here:
When we tear down a Window, we reach into the chrome compartments and grab all of the cross-compartment wrappers. We iterate through them and find any that point to objects that are parented to the dying window. We then replace the cross-compartment wrapper's proxy handler with a special handler that throws on every access, and clear out the slot that pointed to the cross-compartment value.
There is some special handling here for Windows. We don't want to break the wrapper that points to the Window object when the inner window is torn down, but rather we want to break it only when the outer window is torn down. Thus our function needs to know whether or not it should break wrappers that point to the Window.
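The recap above can be sketched with `Proxy.revocable` as a stand-in for swapping in the dead-object handler. The names here (`nukeWrappers`, the `nukeGlobal` flag) just mirror the discussion in this bug; the real code is C++ operating on the engine's cross-compartment wrapper maps:

```javascript
// Sketch only: Proxy.revocable stands in for replacing a wrapper's proxy
// handler and clearing its private slot. Revoking makes every operation
// on the proxy throw a TypeError, much like DeadObjectProxy.
const NukeGlobalObject = true;
const DontNukeGlobalObject = false;

// Each record models a cross-compartment wrapper: the proxy itself, the
// window its target belongs to, and whether the target is the Window
// (global) itself.
function makeWrapper(target, ownerWindow, isGlobal) {
  const { proxy, revoke } = Proxy.revocable(target, {});
  return { proxy, revoke, ownerWindow, isGlobal };
}

function nukeWrappers(wrappers, dyingWindow, nukeGlobal) {
  for (const w of wrappers) {
    if (w.ownerWindow !== dyingWindow) continue;
    // Skip the wrapper to the Window itself unless asked to nuke globals.
    // Note the &&: with || here, DontNukeGlobalObject would skip every
    // wrapper -- the condition bug spotted in review below.
    if (w.isGlobal && nukeGlobal === DontNukeGlobalObject) continue;
    w.revoke(); // all further operations on w.proxy now throw TypeError
  }
}

// Example: tear down windowA's inner window, keeping its global wrapper.
const windowA = {}, windowB = {};
const wrappers = [
  makeWrapper({ tag: "a-node" }, windowA, false),
  makeWrapper(windowA, windowA, true),
  makeWrapper({ tag: "b-node" }, windowB, false),
];
nukeWrappers(wrappers, windowA, DontNukeGlobalObject);
// wrappers[0].proxy now throws on any access; the other two still work.
```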

Comment on attachment 616000 [details] [diff] [review]
Patch
I just noticed that if you pass DontNukeGlobalObject for nukeGlobal, then the function doesn't seem to do anything: it always executes the continue statement. Should the if condition use && instead of || ?
> 'CompartmentIter' doesn't exist, as far as I can tell.
Sorry, I guess it's 'CompartmentsIter'. It's in jscompartment.h.
> You mean iterating over the wrappers and then iterating over the array of
> wrappers we found to get rid of?
Yeah, I was thinking it might be easier to replace the call to |toNuke.append(v)| with something like this:
e.removeFront();
SetProxyPrivate(wobj, JSVAL_NULL);
SetProxyHandler(wobj, &DeadObjectProxy::singleton);
Then you could remove the toNuke loop. Unless I'm being dumb and there's some reason the extra loop is important.
Also, can you replace JS_TRUE and JS_FALSE with true and false? We're now doing that even for functions that return JSBool.

(In reply to Bill McCloskey (:billm) from comment #19)
> Comment on attachment 616000[details][diff][review]
> Patch
>
> I just noticed that if you pass DontNukeGlobalObject for nukeGlobal, then
> the function doesn't seem to do anything: it always executes the continue
> statement. Should the if condition use && instead of || ?
Ah crap, you're right.
> > 'CompartmentIter' doesn't exist, as far as I can tell.
>
> Sorry, I guess it's 'CompartmentsIter'. It's in jscompartment.h.
Yeah.
> > You mean iterating over the wrappers and then iterating over the array of
> > wrappers we found to get rid of?
>
> Yeah, I was thinking it might be easier to replace the call to
> |toNuke.append(v)| with something like this:
> e.removeFront();
> SetProxyPrivate(wobj, JSVAL_NULL);
> SetProxyHandler(wobj, &DeadObjectProxy::singleton);
> Then you could remove the toNuke loop. Unless I'm being dumb and there's
> some reason the extra loop is important.
I don't think it's important, I just copied from the other function that does wrapper transplanting.
> Also, can you replace JS_TRUE and JS_FALSE with true and false? We're now
> doing that even for functions that return JSBool.
Ok.

Also, I'd appreciate if this could be tested on top of compartment-per-global before landing so that we don't add further road blocks on that front.
The patches are here: https://github.com/bholley/mozilla-central/commits/cpgtrain
Currently there's 3 patches (one of which is some test fixes that have yet to land), and the two r=mrbkap patches. Current failures are a crash in the android reftest harness, and a DOMWindow/DocShell leak orange on some (but not all) platforms. I'll hopefully have answers for all that soon.

(In reply to Bobby Holley (:bholley) from comment #27)
> Also, I'd appreciate if this could be tested on top of
> compartment-per-global before landing so that we don't add further road
> blocks on that front.
>
> The patches are here:
> https://github.com/bholley/mozilla-central/commits/cpgtrain
>
> Currently there's 3 patches (one of which is some test fixes that have yet
> to land), and the two r=mrbkap patches. Current failures are a crash in the
> android reftest harness, and a DOMWindow/DocShell leak orange on some (but
> not all) platforms. I'll hopefully have answers for all that soon.
Yeah, this patch adds some new test failures. I'm not going to block landing this, but I'll work on fixing them.

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #31)
> Yeah, this patch adds some new test failures. I'm not going to block
> landing this, but I'll work on fixing them.
Um, can we? I've spent months getting Mochitest-O green with CPG, and this patch turns it orange as Halloween.
CPG is days away from landing, and a much larger effort than this bug. If you disagree with me here, please get jst's opinion before pushing.

I can work on these test failures while you work on the other remaining stuff. If we do get to a point where this stuff is the only remaining blocker, it's pretty easy to disable cutting chrome->chrome wrappers.
Also I already pushed this before you told me not to :-)
https://hg.mozilla.org/mozilla-central/rev/cc5254f9825f

(In reply to Bobby Holley (:bholley) from comment #27)
> Currently there's 3 patches (one of which is some test fixes that have yet
> to land), and the two r=mrbkap patches. Current failures are a crash in the
> android reftest harness, and a DOMWindow/DocShell leak orange on some (but
> not all) platforms.
Turns out the second issue here is an orangefactor thing and unrelated to cpg. See bug 734554 comment 71.

Adding dev-doc-needed since this is something developers should be aware of. At the least, it should be mentioned in Firefox 15 for Developers on MDC and the leak testing wiki pages, and probably somewhere on the pages about wrappers. If someone can spell out the exact ramifications of the final patch, I can do the docs.

This patch seems to be very effective at stopping memory leaks from add-ons.
But in terms of bugfixing in Firefox, could this be a case of not seeing the forest for the trees? I mean, isn't this patch going to hide some legitimate memory leaks in Firefox code and prevent us from fixing them?

> But in terms of bugfixing in Firefox, can it be a case of not seeing the forest for the trees, I
> mean, is this patch not going to hide some legitimate memory leaks in Firefox code and prevent from
> fixing them?
No, in fact it's going to fix the memory leaks in Firefox itself, too. For example, bug 675412 was fixed by this bug.

> is this patch not going to hide some legitimate memory leaks in Firefox code and prevent from fixing them?
It would be theoretically possible for there to be a large glob of memory held alive in chrome that shouldn't be, that would only be noticed because that glob of memory happened to hold alive content that shows up as a zombie compartment. But there is no guarantee that the chrome blob would hold alive content, so it may not be noticed even without this patch. Furthermore, in practice, chrome leaks have been a problem mostly because they keep alive content, not because they are so bad themselves.

> But in terms of bugfixing in Firefox, can it be a case of not seeing the
> forest for the trees, I mean, is this patch not going to hide some
> legitimate memory leaks in Firefox code and prevent from fixing them?
I understand your concern -- in a way this is less elegant or "pure" than just fixing the leaks. But experience has shown us that even in Firefox itself, "just fixing the leaks" is *really* hard. And "just fixing the leaks" in *add-ons* is an endless game of whack-a-mole. We'll never even get close.
With that in mind -- bring on the pragmatic solution! :)

This patch's landing has interesting implications for the top 100 add-on testing (bug 730737). Lots of leaks that will manifest in older versions won't manifest in versions that have this patch.
There's a similar question about AMO reviews -- should reviewers still look for zombie compartments? Most (all?) of them will no longer be possible. Perhaps our documentation about this should be updated?
I'm inclined to not take any action immediately, just to give the patch time to bake in nightlies for a while, and to see the effects. Speaking of which, do we have ideas how to gauge the effect of this patch in the wild? I guess telemetry will give us some ideas.

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #41)
> If we want to continue testing addons we should pick a fixed revision of
> Gecko to test against that's before this patch (how about the base of Aurora
> 14?)
I would like to continue testing them until this patch makes it into a Firefox release. Assuming it doesn't get backed out, that'll be Firefox 15, which will be released on August 28.
See also bug 745149 comment 8.

Hi, can you please clarify one thing for me :)
Imagine a scenario.
1. Hypothetical leaky chrome creates a reference from chrome to document.body of every website
2. I open ten twitter feeds
3. I close nine twitter feeds
This bug didn't change anything in this scenario: ten DOM trees remain in memory because they're referenced, right?
The "leak" (lifetime problem) described above is still bad. My concern is that it can no longer be found as a zombie compartment, so it might be harder to find or easier to ignore.
Are my worries justified in any way? :)

> This bug didn't change anything in this scenario: ten DOM trees remain in memory because
> they're referenced, right?
No. This bug is all about breaking the references. After step 3 in your steps, you have nine references in chrome that don't reference anything and throw if you try to do anything with them, and one reference to the document.body of the remaining twitter feed.

(In reply to Boris Zbarsky (:bz) from comment #44)
> No.
Ah! And I thought the references are only broken when the *compartment* is trying to disappear (which only happens after all ten pages are closed, not nine).
Thanks for answering :)

Lower-end systems (Windows XP and Linux) seem to experience "not responding" situations from chrome for several seconds on heavy sites.
https://bugzilla.mozilla.org/show_bug.cgi?id=105843
Would that bug also be a part of this one (a dependency)? The two would help each other: speeding up overall performance (preferring the cache instead of loading into memory again), reducing loading times, and leaving fewer zombie compartments.

Good news: I had an old version of McAfee SiteAdvisor 3.4.1 -- the one that leaked every content compartment ever opened -- lying around. I reinstalled it. With a release build (FF12) it still exhibited the problem, but with a Nightly build there were no zombie compartments! Mission accomplished.

Another add-on fixed by this change:
Bug 732782: AutoPager 0.7.1.4
https://bugzilla.mozilla.org/show_bug.cgi?id=732782
Before: Visit Google, click "yes" when offered the rule for the page, do a search,
scroll to the bottom of the page. google.com becomes a zombie compartment.
After: No zombie compartments!

FYI: It appears this patch is causing perfectly legitimate code to break. I'll be changing my plugin to compensate, but here is what I'm doing:
if (this.ElementCount && this.ElementCount.parentNode)
  this.ElementCount.parentNode.removeChild(this.ElementCount);
It's the this.ElementCount.parentNode call that is throwing the exception. I have a reference to this.ElementCount and it has already been removed from the DOM, however after a page change, accessing that ElementCount.parentNode causes "Can't access dead object."
I understand and agree with the intent of this fix, but I believe that the above code should not be throwing an exception, should it?

If elementCount or elementCount.parentNode points to some object in a closed window/tab, then throwing an exception is the right thing to do.
Removing an element from the DOM isn't enough if the element still has its ownerDocument pointing to the document in the closed window/tab.

> I have a reference to this.ElementCount and it has already been removed from the DOM
But it still keeps that DOM tree alive via its ownerDocument pointer. So your extension was in fact leaking the web page that this.ElementCount came from.

> What if I wanted to add it to the new document
Then you should be creating it in the chrome document and using importNode to create copies for content documents as needed.
> then why would its ownerDocument point to anything?
Because that's how ownerDocument is defined. It points to the document the node was created in (or possibly adopted into if it was adopted), even when it's not in the DOM tree. You can test this trivially in a webpage:
alert((new Image()).ownerDocument == document);
for example.

> Because that's how ownerDocument is defined. It points to the document the
> node was created in (or possibly adopted into if it was adopted), even when
> it's not in the DOM tree. You can test this trivially in a webpage:
When you say defined, are you referring to W3C standards? It doesn't make sense to me that, if I have removed an element from a document, checking whether it has a parentNode should throw an exception. It seems like an internal FF issue to me, since I may want to keep that element and add or adopt it into another window/document at a later time, and I shouldn't expect an exception just from checking whether it has a parent node.
Is there some other way to remove it properly from a document besides the removeChild() I'm doing there?
Don't get me wrong here, I don't want to cause memory leaks, I just don't see why what I am doing *should* do that, since my intention by the removeChild() was to divorce it from the document I had added it to (thereby leaving no pointer to that document to cause a memory leak.)

(In reply to Clint from comment #61)
> Don't get me wrong here, I don't want to cause memory leaks, I just don't
> see why what I am doing *should* do that, since my intention by the
> removeChild() was to divorce it from the document I had added it to (thereby
> leaving no pointer to that document to cause a memory leak.)
But that's not possible. Because of the way ownerDocument is defined, a node can never be completely divorced from the document (without associating it with a new document, via adoptNode).

I was just looking through the DOM Level 3 spec; it doesn't say anything about .ownerDocument having to stay defined, but I can see how it would be useful in many cases to know which document a node was last part of, and that even if removed, it is still part of that document.
In any case, is there a way to completely remove it from a document besides removeChild()?

> In any case, is there a way to completely remove it from a document besides removeChild()?
Kyle said in comment 62:
> Because of the way ownerDocument is defined, a node can never be completely divorced from [a]
> document (without associating it with a new document, via adoptNode).
So even removeChild() is not sufficient to "completely [divorce] a node from [its] document". A node is always associated with some non-null document.
If you really want to keep these nodes around, your plugin could adoptNode them into some document which sticks around forever.
> Seems an internal FF issue to me since I may want to keep that element and add it to or adopt it
> to another window/document at a later time and shouldn't expect an exception just by checking if
> it has a parent node.
Boris and Olli are two of Mozilla's foremost experts on web standards. I agree that the behavior here is counter-intuitive and understand that this seems like an internal FF issue, but you may have to believe them when they say that this is How The Web Works and that it isn't a bug.
The DOM spec [1] says that the ownerDocument is changed by the |adopt| algorithm. If you search through the spec, you'll notice that we never invoke the adopt algorithm with a document which could be null.
[1] http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#concept-node-document
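The point being made here can be modeled with plain objects (this is not a real DOM, just a sketch of the behavior): removeChild clears parentNode but never touches ownerDocument, so a detached node still pins its document; only an adoptNode-style step re-points ownerDocument at a document that sticks around.

```javascript
// Tiny object model of the ownerDocument behavior described above.
function createDocument(name) {
  return { name };
}
function createElement(doc) {
  return { ownerDocument: doc, parentNode: null };
}
function appendChild(parent, node) {
  node.parentNode = parent;
}
function removeChild(parent, node) {
  if (node.parentNode === parent) node.parentNode = null;
  // ownerDocument is deliberately left alone, per the DOM spec.
}
function adoptNode(newDoc, node) {
  removeChild(node.parentNode, node);
  node.ownerDocument = newDoc; // the only way ownerDocument ever changes
  return node;
}

const contentDoc = createDocument("content");
const chromeDoc = createDocument("chrome");
const el = createElement(contentDoc);
appendChild(contentDoc, el);

removeChild(contentDoc, el);
const nameAfterRemove = el.ownerDocument.name; // still "content": doc pinned

adoptNode(chromeDoc, el); // now safe to keep el around long-term
```

This is why an extension that wants to keep a node past the lifetime of a content page should adopt it into a long-lived (e.g. chrome) document rather than merely removing it from the DOM.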

Makes sense. I would just question whether an exception should be thrown when I'm trying to read whether the element has a parentNode or not. I see the dilemma, though: I was holding a reference to a document that was gone. Perhaps in those cases the ownerDocument (since it was destroyed) should be set to null.
In my particular instance, I was done with the node at that point, so I just delete this.ElementCount node now and I'm done with it.
You may want to re-visit the idea that an exception is being thrown during a benign attempt to check a valid property of an object though.

To be clear: the behavior implemented in this bug is technically a violation of web standards. Per standards, if you're holding a node, we should keep alive that node's document as well as all nodes in that document, that document's window, and so forth.
Of course that means that when you hold a node you leak the entire document the node came from, its window, and everything hanging off that window.
This can be a problem even for web sites, if they do a lot of navigation in subframes and hold on to elements from those subframes. But there we can't change anything without breaking web compat. For extensions, though, the situation is different. Yes, the change that was made breaks compat. It was judged to be ok to do that in order to avoid leaks.
Specifically, the fix that was implemented in this bug is to replace the node you're holding with a totally different object that's not a node at all, so the node can go away, which it has to do so that its document can go away. This totally different object throws on _all_ property accesses, period, because it has no idea it's taking the place of a node.
It _would_ be possible to try for some sort of more complicated fix that would make the parentNode check work, but then you might think you have a node... whereas you do not, and trying to do anything like inserting it into a DOM tree would fail. So the decision was made to go for a "fail early" approach so subtle bugs don't crop up. I realize that in your particular case that means things broke when they might not have otherwise, and I'm sorry about that.

I have an addon that holds a reference to simple storage:
var ss = require('simple-storage')
At some point in the future, when I check the quota:
if (ss.quotaUsage > 1) {
  // Do some clean up
}
I am now seeing:
TypeError: can't access dead object
Can I no longer hold a simple storage reference and must require('simple-storage') each time, to avoid this?

Yes, simple-storage is part of the Addon SDK. I thought that perhaps I was holding onto a reference when I shouldn't. But it seems to me that simple-storage, as part of the SDK, should at least remain alive for the lifetime of the addon until it is disabled or uninstalled.
I am also seeing exceptions in simple-storage on unload, besides during runtime.
Exception trace on disable:
error: An exception occurred.
Traceback (most recent call last):
File "about:addons", line 1, in oncommand
<?xml version="1.0"?>
File "chrome://mozapps/content/extensions/extensions.xml", line 1023, in set_userDisabled
this.mAddon.userDisabled = val;
File "resource:///modules/XPIProvider.jsm", line 7749, in
XPIProvider.updateAddonDisabledState(aAddon, val);
File "resource:///modules/XPIProvider.jsm", line 3831, in XPI_updateAddonDisabledState
BOOTSTRAP_REASONS.ADDON_DISABLE);
File "resource:///modules/XPIProvider.jsm", line 3711, in XPI_callBootstrapMethod
this.bootstrapScopes[aId][aMethod](params, aReason);
File "file:///Users/plin/Library/Application%20Support/Firefox/Profiles/v82uccod.dev/extensions/jid0-pPr9ydvwjKWJx1jvtqVelWfLgzo@jetpack/bootstrap.js", line 200, in shutdown
unload(loader, reason);
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/loader.js", line 307, in unload
notifyObservers(subject, 'sdk:loader:destroy', reason);
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/system/events.js", line 58, in
data: data
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/unload.js", line 74, in onunload
unload(reason);
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/unload.js", line 53, in unload
observers.forEach(function(observer) {
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/unload.js", line 55, in
observer(reason);
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/unload.js", line 64, in
unloaders.slice().forEach(function(unloadWrapper) {
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/unload.js", line 65, in
unloadWrapper(reason);
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/api-utils/lib/unload.js", line 38, in unloadWrapper
originalDestructor.call(obj, reason);
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/addon-kit/lib/simple-storage.js", line 130, in JsonStore_unload
this._write();
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/addon-kit/lib/simple-storage.js", line 157, in JsonStore__write
if (this.quotaUsage > 1)
File "resource://jid0-ppr9ydvwjkwjx1jvtqvelwflgzo-at-jetpack/addon-kit/lib/simple-storage.js", line 78, in
JSON.stringify(this.root).length / this.quota :
TypeError: can't access dead object
Is the dead object marking too aggressive and prematurely setting them unusable?

As I mentioned, I think that this is a larger issue than just simple-storage.
I also see these errors on a simple callback from xhr. For example:
var obj = { /* something */ };
var xhr = new XMLHttpRequest();
xhr.open("GET", url, true);
xhr.onreadystatechange = function() {
  // Manipulate obj
  obj.foo = 1;
};
xhr.send();
obj should remain live, but it has already been marked as a dead object. Am I crazy or is this behaving unexpectedly? This is a common pattern...

Sure. I did not want to open a new issue for each "dead object" error that I am seeing, as I believe that they are all related. It just seems that in a number of instances, legitimate live objects are getting marked as dead.
In the previous case, this has no relation to the DOM or Window, and runs within the main addon code. I am not seeing the error occur every pass through the code fragment, but I believe that to be more of a timing issue.
Is there, or should there be, a general "can't access dead object" issue?

> Sure. I did not want to open a new issue for each "dead object" error that I am seeing, as I believe
> that they are all related. It just seems that in a number instances, legitimate live objects are
> getting marked as dead.
If you want to file just one bug for all the issues which you believe are related, or use bug 796942 for that purpose, that's a fine start. We can always split the bug if we discover that the issues are unrelated. (Obviously, by "X and Y are related", I mean something more than "X and Y are both caused by bug 695480"; there are tons of ways to cause this error, but your issue is that you believe we're throwing that error when we shouldn't be.)