Currently there is a pretty fast path from wasm->JIT. However, the path from JIT->wasm goes through a generic C++ WasmCall function, which is probably 100x slower than a JIT JS->JS call.
An enabling fix to both is allowing wasm to share a JitActivation with other JIT code and killing WasmActivation. This involves teaching the JIT stack iterators (profiling and normal) to walk into and out of wasm code and removing various other dependencies on WasmActivation.
With that fixed, JS should be able to call wasm more efficiently than calling (non-inlined) JS, and wasm calling JS should be about the same as JS<->JS.

I'm looking at this now. A few thoughts about implementation:
- we can start by using the most generic path in the JIT to handle fast calls from jit->wasm, i.e. LCallGeneric. To do this, we need a way to get the address of the fast-jit-to-wasm entry ("fast entry") from the JSFunction.
- a wasm JSFunction already has a native, which is WasmCall. It's used to know if a function is a wasm/asm.js export, in which case the JSFunction is extended and has 2 slots (one for the instance, one for the function index). With both, we could in theory look up the FuncExport that leads to the fast entry code, but that would imply a VM call to do the binary search in the func exports table, which would probably kill performance, so this is excluded. (The VM call would handle tiering, though.)
- the fast entry address could be stored in a new extended slot of the JSFunction, as a private pointer. In practice, we could be okay with this, since wasm baseline code would hopefully escape to ion code quickly. It doesn't sound too appealing though, and it wouldn't handle tiering.
- another idea I had was to use another kind of jumpTable, but used to store the addresses of the fast entry to functions (say, jsJitEntryTable). Then, an extended slot (or JSJitInfo) could store the address of the jsJitEntryTable and use that instead to find the fast entry address before calling it.
With this last idea and some others discussed with Luke, we could limit the amount of changes needed to get from the JSFunction callee to the actual code pointer in JIT code, by:
- using JSJitInfo (or flags) to tell that a function is a wasm export
- having the JSFunction native field store the jsJitEntryTable address (= base + function index).
It would magically work because going from a JS JIT callee to the code takes two pointer dereferences (JSFunction -> JSScript -> ionOrBaselineEntry), and the wasm path would take two as well (JSFunction -> jit entry table -> fast entry).
Then, every place that knows how to handle a jit call to a regular JSFunction could also handle a fast call to wasm (so that's LCallGeneric, probably LCallKnown, and the arguments rectifier at least).
I'll try that next week. Happy to hear other ideas or comments about this one.

Overall, sounds great!
Technically, I think we might not need the jsJitEntryTable: if the fast jit-to-wasm entry of tier1 is stored in a JSFunction extended slot, the entry stub will call into a tier1 function prologue which will immediately call through Code::jumpTable() to reach the best tier. But if we're doing two loads to get the callee *anyway*, then what you are proposing avoids "wasting" that second load by putting it to good use for tiering.

Folded patch (to be split later).
- can handle simple calls from jit to wasm (with i32 parameters) on x64
- ~can transition from wasm to jit frames in non-profiling iteration (to be confirmed with more testing; just a single stack printed from wasm called from jit worked)
Remaining work:
- handle errors thrown by wasm
- implement OOL paths when the Value conversion to primitive fails
- consider the {int64,NaN} {parameter,return values} cases
- a lot more testing
Considering that a lot is missing, I won't throw any perf numbers in yet, but just as a teaser: things are getting *much* better.

And a new wip patch, which now also handles asm.js functions (except for functions that return SIMD; it might have been simpler to just forbid this, considering the discussions happening in bug 1416723). This is great, because I was afraid of poor test coverage of this feature, but now we have all the asm.js tests in addition to the wasm ones.
Almost all the tests are passing on x64 (except for one correctness issue in testZOOB and mysterious crashes on SIMD.js tests, which probably require some guarding or something). Next steps: check other platforms + async profiling iteration, then benchmark perf.

This is a subset of the big WIP patch that can be landed independently.
This changes the layout of JSFunction so that there's a way for wasm functions to declare they have an optimized JIT entry; this is half the work so that MacroAssembler::loadJitCodeRaw [1] also works with wasm functions (the other half is in the WIP patch).
The flag that's introduced is not used yet in this patch, so this can be separated from the bigger chunk. A few helper functions have been introduced too (some might be unused here, but they will be used in the next few patches, which I'm hoping to upload in the next few days).
With this new field in JSFunction, we also need to make sure that we don't conflate jitInfo with wasm's jit entry, so this introduces JSFunction::hasJitInfo() -> bool, which explains the size of this patch. (I've used the z-renaming technique to make sure I caught all uses of jitInfo().)
[1] https://searchfox.org/mozilla-central/source/js/src/jit/MacroAssembler.cpp#1674

First wip with profiling mode not blowing the ARM simulator up \o/ (see jit-tests/tests/wasm/ion-error-ool.js for instance: it tests that we do take the optimized entry; it tests what the stack trace should look like, both in profiling and non-profiling mode)

Comment on attachment 8942300 [details] [diff] [review]
wip.patch
Review of attachment 8942300 [details] [diff] [review]:
-----------------------------------------------------------------
Luke, if you had time to have a first look, it'd be great! I'll be polishing before review: cleaning out code, having more tests pass (the profiling tests have started using the jit entry), enabling more ways into the optimized entry (Function.prototype.apply, getters/setters, scripted proxies (?), etc), doing more perf analysis and see if we can make the entry stub even faster, etc.
I guess there are three (four if tests are separated) big different chunks of code that can be looked at ~independently:
- the creation of the new JumpTables, mainly in WasmCode/WasmGenerator/WasmJS/WasmModule
- the enabling of the fast entry in the jit/ files (BaselineIC/CodeGenerator/IonBuilder/MIR/Lowering/MacroAssembler)
- the fast entries themselves, with frame iteration (WasmStubs/WasmFrameIter, as well as Stack.cpp with minimal changes and a lot of symmetry with the wasm->jit path \o/)
I'll split the patch this way next week.
::: js/src/wasm/WasmFrameIter.cpp
@@ +894,5 @@
> + if (offsetInCode < codeRange->jitEntryPushedRA()) {
> + // We haven't pushed the jit return address yet, thus the jit
> + // frame is incomplete and we won't be able to unwind it; drop it.
> + return false;
> + } else if (offsetInCode < codeRange->jitEntryFPStart()) {
nit: else after return

I've done simple stats on these stubs, to find out where memory was taken:
- first of all, these programs have many more exported functions than imported functions: AngryBots has 19k exported functions vs 535 imported, Godot has 19.5k exported functions vs 244 imported. So we'll assume that having a lot of exported functions is the norm.
- the code size in stubs is generally taken by the arguments conversions, which test for/handle many JS::Value tags before giving up. In AngryBots, the median size of the subsection that just converts input arguments is 266 bytes (min: 0, max: 4327 bytes), with a total of ~7 MB just for these. OOL jumps (that call into the JitEntryOOL functions, which then pass to C++) take a total of ~1.3 MB. Since the overall memory overhead was ~10.6 MB for AB, most of it (~80%) is in these subsections.
- last interesting fact is that the number of unique signatures in exported functions is 234 for AngryBots and 177 for Godot.
- loading the TLS register from the JSFunction takes overall ~1 MB, and looks like the second thing that could be optimized after the arguments conversion.
Some (independent) ideas to reduce this memory size:
1. lazy stub generation, with a jit entry that only gets generated if the jit entry itself is used. Since the jit entry would be used as soon as the baseline JIT calls the function (i.e. after 10 calls), it might be worth just generating the jit entry when the interp entry itself is lazily generated.
2. have per-signature arguments conversion stubs (which could be per-process), since there are so few different unique exported function signatures (relative to the number of exported functions). Since this would introduce a new call (jit -> jit entry -> signature unboxing stub), it might kill the performance of the jit entry stub, though...
3. have the JSFunction store the TlsData pointer in a new extended private slot; this should divide the size of the TLS loading by a factor of two or three.
Now, the way ahead: I propose to split the current big patch into smaller patches for review (possibly landing with the optimization *disabled*, since there's a single function that controls it). Then we can work on memory improvements / lazy stub generation afterwards and enable it later.

Gary, Christian: can you please fuzz this patch? Anything with asm.js/wasm and --baseline-eager or --ion-eager are of particular interest to me.
It applies cleanly on top of mozilla-inbound 28a267774e17.

And the final patch that binds everything together:
- make sure that the JSScript::offsetOf{Raw,ArgsCheck} jit entries are at the start of the JSScript struct (it's statically asserted in the previous patch), so the assembly routine that loads code is the same for jit code (load script / load raw or argscheck) and for the wasm fast entry (load jit entry table entry / load raw or argscheck)
- use the native jit entry where we would call scripted functions

Comment on attachment 8944455 [details] [diff] [review]
rolledup.patch
Gary, Christian: can you please fuzz this patch? Anything with asm.js/wasm and --baseline-eager or --ion-eager are of particular interest to me.
It applies cleanly on top of mozilla-inbound 28a267774e17.

This (incomplete) patch reduces the total jit entry stubs size from ~10 MB to ~2 MB, by splitting out the arguments conversion from the jit entries to per-instance (atm, but they could be per-process later) stubs keyed by signature.
This also regresses perf compared to the previous patch by 10% on micro-benchmarks (since there's one more call / frame push / frame pop / ret), putting us in the ballpark of ~30% slower than v8 on the same micro-benchmarks.
Luke, does this sound like a good initial tradeoff for landing?

Landing the jitconverters patch would make sense if the lazy-stub approach were going to take a while, but it seems to me like lazy stubs might not take that long. Since lazy stubs will actually be a net size/compile-time improvement and allow us to maintain the optimal JIT->wasm call performance, I'm really in favor of that if it's a practical option.
(In reply to Benjamin Bouvier [:bbouvier] from comment #28)
> - first of all, these programs have many more exported functions than
> imported functions: AngryBots has 19k exported functions vs 535 imported,
> Godot has 19.5k exported functions vs 244 imported ones. So we'll assume
> this is the norm to have a lot of exported functions.
Yes; what creates so many exported functions is that every function used in an elem section of an imported/exported ("external") Table has to be marked as exported (since it might be called via Table.get()).

Comment on attachment 8944451 [details] [diff] [review]
4.stubs.patch
Review of attachment 8944451 [details] [diff] [review]:
-----------------------------------------------------------------
Looking really good; just need to take some more time to carefully read the logic in WasmStubs.cpp. Here's comments so far, mostly nits.
::: js/src/vm/Stack.cpp
@@ +1738,5 @@
> MOZ_ALWAYS_TRUE(wasm::StartUnwinding(state, &unwindState, &ignoredUnwound));
>
> void* pc = unwindState.pc;
> +
> + // In the prologue/epilogue, FP might have been fixed up to the caller's FP,
Could you rename 'ignoredUnwound' to 'unwound' and put this whole code block inside "if (unwound)"? This both asserts harder and makes it clear that, when StartUnwinding() is changed to not ever unwind but rather return <TlsData*, Code*, caller fp, caller pc> tuples (which I might have to do soon b/c of a corner case in my trapping patch), this logic can go away.
::: js/src/wasm/WasmFrameIter.h
@@ +69,5 @@
> const Code* code_;
> const CodeRange* codeRange_;
> unsigned lineOrBytecode_;
> Frame* fp_;
> + uint8_t* ionCallerFP_;
IIUC, ionCallerFP_ is set when unwoundAddressOfReturnAddress_ is set and only should be accessed when the iterator is done(), so could you rename this field unwoundIonCallerFP_ as well as the method and also assert done()?
@@ +216,5 @@
> GenerateJitExitPrologue(jit::MacroAssembler& masm, unsigned framePushed, CallableOffsets* offsets);
> void
> GenerateJitExitEpilogue(jit::MacroAssembler& masm, unsigned framePushed, CallableOffsets* offsets);
> void
> +GenerateJitEntryOolPrologue(jit::MacroAssembler& masm, unsigned framePushed,
nit: could you put a \n before this function decl?
::: js/src/wasm/WasmJS.h
@@ +146,5 @@
> + static const unsigned INSTANCE_SLOT = 0;
> + static const unsigned EXPORTS_OBJ_SLOT = 1;
> + static const unsigned EXPORTS_SLOT = 2;
> + static const unsigned SCOPES_SLOT = 3;
> + static const unsigned INSTANCE_SCOPE_SLOT = 4;
Instead of making them all public, how about keeping them all private and having a
static unsigned offsetOfInstanceSlot() {
return NativeObject::getFixedSlotOffset(WasmInstanceObject::INSTANCE_SLOT);
}
with the usual "// Read by JIT code:" comment.
::: js/src/wasm/WasmStubs.cpp
@@ +563,5 @@
> + unsigned offset = frameSize + JitFrameLayout::offsetOfCalleeToken();
> + masm.loadFunctionFromCalleeToken(Address(sp, offset), ScratchIonEntry);
> +
> + // ScratchValIonEntry := callee->getExtendedSlot(WASM_INSTANCE_SLOT)
> + // => ObjectValue(WasmInstanceObject)
As a possible follow-up optimization: if we kick the function-index out of the FunctionExtended object (being able to compute it on (rare) demand by looking up fun->u.wasm.jitEntry to find the CodeRange which would hold the function index), then we could store the TlsData* in the function object. Maybe this would let us match v8's numbers?
::: js/src/wasm/WasmTypes.h
@@ +1017,5 @@
> + uint32_t pushedRA;
> +
> + // The offset to the first instruction for which FP is set to the jit
> + // caller.
> + uint32_t setFP;
Would it be possible to give WasmFrameIter.h a new GenerateJitEntry(Prologue|Epilogue) (symmetric to GenerateJitExit(Prologue|Epilogue)) that is called by GenerateJitEntry in WasmStubs.cpp, so that the JIT entries and unwinding are symmetric with the other prologues/epilogues? Specifically, this would allow these two fields to be static constants used by StartUnwinding() just like the other prologues, and remove the need for JitEntryOffsets. This isn't a size savings because CodeRange has a fixed size, but I feel like the symmetry would be useful, especially if we build on this in the future with other types of JitEntries (monomorphic...) or other types of prologues.

Implemented new optimizations discussed:
- two loops: JS::Value tag checking first, then unboxing afterwards (avoids the exception / oolConvert stubs)
- remove FUNC_INDEX from the extended slots to store the TlsData pointer instead (the jsfun-associated bits are in the next patch). Delta: -3 loads or so
- also remove storing TlsReg on the stack. It needed to be there before because, in case of exception, we'd reload it from the stack to get the jit exception handler. Now we just repeat the TlsData load sequence, since it's very short. Delta: -1 load

Comment on attachment 8949103 [details] [diff] [review]
4.stub.patch
Review of attachment 8949103 [details] [diff] [review]:
-----------------------------------------------------------------
LGTM! Some optimization ideas below.
::: js/src/wasm/WasmStubs.cpp
@@ +558,5 @@
> + // fallthrough:
> +
> + masm.bind(&storeBack);
> + masm.boxDouble(scratchF, scratchV, scratchF);
> + masm.storeValue(scratchV, Address(sp, jitArgOffset));
Some perf suggestions:
* If this path is hot (maybe the int32 -> double case?) we could optimize it to do masm.storeDouble with our current boxing format. We could add masm.boxDouble(FloatRegister, const Address&) for that.
* For the cases where you're storing a known constant, you could do something like this, to bypass the FP registers entirely:
masm.storeValue(DoubleValue(JS::GenericNaN()), Address(...));
masm.jump(&next);
The same applies to the null/undefined -> int32 path.
* Instead of boxing in a ValueOperand, then storing the ValueOperand, you might be able to use masm.storeValue(JSValueType, payloadReg, Address) defined here:
https://searchfox.org/mozilla-central/rev/a539f8f7fe288d0b4d8d6a32510be20f52bb7a0f/js/src/jit/x64/MacroAssembler-x64.h#121

Comment on attachment 8949104 [details] [diff] [review]
5.use-jit-entry.patch
Review of attachment 8949104 [details] [diff] [review]:
-----------------------------------------------------------------
Very nice, some nits below. I'll go over this a second time in a few hours and r+ if I don't see anything serious.
::: js/src/jit/BaselineIC.cpp
@@ +1719,5 @@
> // Attach a stub if the script can be Baseline-compiled. We do this also
> // if the script is not yet compiled to avoid attaching a CallNative stub
> // that handles everything, even after the callee becomes hot.
> + if (((target->hasScript() && target->nonLazyScript()->canBaselineCompile()) ||
> + (target->isNativeWithJitEntry())) &&
Nit: remove the parentheses around target->isNativeWithJitEntry()
::: js/src/jit/IonCacheIRCompiler.cpp
@@ +1086,5 @@
>
> + // The getter currently has a jit entry or a non-lazy script. We will only
> + // relazify when we do a shrinking GC and when that happens we will also
> + // purge IC stubs.
> + MOZ_ASSERT(target->hasJitEntry());
Nit: the comment says "has a jit entry or a non-lazy script" but we assert just hasJitEntry(). So maybe we should say "wasm entry or a non-lazy script"?
@@ +2109,5 @@
>
> // Check stack alignment. Add sizeof(uintptr_t) for the return address.
> MOZ_ASSERT(((masm.framePushed() + sizeof(uintptr_t)) % JitStackAlignment) == 0);
>
> + // The setter currently has a jit entry or a non-lazy script. We will only
Same here.
::: js/src/jsfun.h
@@ +625,5 @@
> + MOZ_ASSERT(u.wasm.jitEntry_);
> + return u.wasm.jitEntry_;
> + }
> +
> + // AsmJS non optimized ctor store the func index in the jitinfo slot, since
Nit: s/ctor/ctors/?

Thanks for the quick review and feedback! A lot of great catches and comments here, all applied except for:
(In reply to Luke Wagner [:luke] from comment #55)
> ::: js/src/wasm/WasmFrameIter.cpp
> @@ +899,5 @@
> > +#if defined(JS_CODEGEN_ARM) || defined(JS_CODEGEN_MIPS32) || defined(JS_CODEGEN_MIPS64)
> > + if (offsetFromEntry < PushedRetAddr) {
> > + // We haven't pushed the jit return address yet, thus the jit
> > + // frame is incomplete and we won't be able to unwind it; drop it.
> > + return false;
>
> Why do we need to lose this frame? It seems like we aren't using
> returnAddress anywhere here and, if we were, we could get it from state.lr.
So the next thing that happens in profiling frame iteration is that wasm::ProfilingFrameIter notices it's in a jit entry and stores the unwound fp; it is then passed to jit::ProfilingFrameIter, but that one can't unwind the frame because it's incomplete. So it's better to drop the frame at this point. I'll tweak the comment to describe this better.
> ::: js/src/wasm/WasmStubs.cpp
> @@ +392,5 @@
> > +}
> > +
> > +// Load instance's TLS from the callee.
> > +static void
> > +GenerateJitEntryLoadTls(MacroAssembler& masm, unsigned frameSize)
>
> Since it's so short now, how about inlining it where it's useful to see the
> context (where we are relative to the prologue, etc).
Disagreed: it's a few lines of code that would get duplicated (it's called twice), and any ctags-like system will make it very easy to jump to the function and get back where you were in the first place.
> @@ +463,5 @@
> > +
> > + GenerateJitEntryLoadTls(masm, frameSize);
> > +
> > + if (fe.sig().hasI64ArgOrRet()) {
> > + masm.call(SymbolicAddress::ReportInt64JSCall);
>
> Instead of having ReportInt64JSCall and this special case, could we have
> CanBeJitOptimized() return false when hasI64ArgOrRet?
Yes; that means some churn in jsfun.h as well (which is in the next patch), so I'll just drop this change in the final folded patch.
> ::: js/src/wasm/WasmTypes.h
> @@ +1053,5 @@
> > enum Kind {
> > Function, // function definition
> > InterpEntry, // calls into wasm from C++
> > + ImportJitExit, // fast-path calling from wasm into jit code
> > + JitEntry, // calls into wasm from jit code
>
> nit: How about having the order be: InterpEntry, JitEntry, ImportJitExit,
> ImportInterpExit?
Even better: interp entry / jit entry / interp exit / jit exit, so "jit" always follows "interp".

Implemented your suggestions for the stub optimizations, asking for review since it's touching Assembler/MacroAssembler files for the new boxDouble variant. (it's a small loss for ARM, which is unfortunate, but oh well...)

(In reply to Benjamin Bouvier [:bbouvier] from comment #58)
> > Since it's so short now, how about inlining it where it's useful to see the
> > context (where we are relative to the prologue, etc).
>
> Disagreed: it's a few lines of code that would get duplicated (it's called
> twice),
Ah, I didn't see it was called twice.
> > > + JitEntry, // calls into wasm from jit code
> >
> > nit: How about having the order be: InterpEntry, JitEntry, ImportJitExit,
> > ImportInterpExit?
>
> Even better: interp entry / jit entry / interp exit / jit exit, so "jit"
> always follows "interp".
Oh hah, yes, that's what I meant to type :)

This implements Luke's suggestion to remove the fast path if int64 is in the signature; not much of it is really wasm-specific, and it touches more of the jsfun changes you've reviewed earlier.
It's a bit wacky because a JSFunction now has to distinguish between:
- asm.js vs wasm
- optimized vs not optimized
and all four combinations are possible. To avoid introducing a new JSFunction flag (we're running out of space!), I decided that testing for isWasmOrAsmJS() would consist of testing that the native is WasmCall() (true for both asm.js and wasm).
When we get rid of asm.js optimizations, this will become much simpler (we could replace the AsmJS kind by a Wasm kind at this time).

Comment on attachment 8949368 [details] [diff] [review]
6.optimize-stub-convert.patch
Review of attachment 8949368 [details] [diff] [review]:
-----------------------------------------------------------------
Looks great, but I think we can simplify boxDouble a lot by defining it in MacroAssembler-inl.h and forwarding to storeDouble. We only want a separate method name to document that we're boxing a Value (in case the boxing format changes).

Comment on attachment 8949398 [details] [diff] [review]
7.remove-i64-fast-path.patch
Review of attachment 8949398 [details] [diff] [review]:
-----------------------------------------------------------------
Looks reasonable and I like the code removal, but I'm a bit confused by the different asm.js/wasm optimized/unoptimized cases, see the comment below.
::: js/src/jsfun.h
@@ +190,5 @@
> bool isNative() const { return !isInterpreted(); }
>
> bool isConstructor() const { return flags() & CONSTRUCTOR; }
>
> /* Possible attributes of a native function: */
This is becoming complicated and I don't fully understand all cases. It would be great to have a comment here explaining the different types of natives, something like:
* Builtin: ...has field X, flag Y...
* Wasm optimized: has field A, used for B.
* Wasm unoptimized: has field C, used for D.
* Asm.js optimized: ...
* Asm.js unoptimized: ...
It would also be good to explain how isAsmJSNative and isBuiltinNative map to these different kinds.

As discussed with Luke:
- we keep the optimized path even for i64 functions
- asm.js doesn't get optimized, to give an incentive to move to wasm and to keep the code simple and clean.
The latter really simplifies jsfun.