Thursday, May 26, 2011

Four-ever, yours

Let's just come out and say it: after a number of sleepless days struggling with it, it does not look like there will be a "TenFourFox 5." Before you panic, that does not mean that TenFourFox will be unsupported or will not get new features! It just means we won't be building from the post-Mozilla 2.0 codebase anymore, and there is a bright silver lining to it.

Before we get to that, let's do a post-mortem of Firefox 5 on 10.4.11 PPC. This is nerdy, so if you just want to skip ahead to the revised roadmap, go for it.

As explained in previous entries, my main concern was that IPC is mandatory as of Mozilla 5.0: Chromium IPC must be built, and IPC in turn requires building a giant library called libxul. In Mozilla 2.0/Firefox 4.0 both were optional, and we built without them because Chromium IPC requires lots of 10.5-only OS functions and libxul was too big to link with the Tiger 32-bit linker.

It is possible to get Chromium IPC to compile "just enough" to make Mozilla's IPC happy, because Mozilla's IPC code does not use very much of Chromium at all. Chromium includes lots of system abstraction code, which Mozilla doesn't need because it uses NSPR, and as a result almost all of the stuff that didn't build can be safely stubbed out with empty functions. The exception is the launch code, which requires posix_spawn and friends (which 10.4 doesn't have either), but even that right now is only used by out-of-process plugins which were marked for death in "10.4Fx 5" anyway. It would be a problem when Mozilla goes to full one-process-per-tab, but that is not imminent, and we can work around that too. So that much is doable, and does compile.

Linking, however, turned out to be hugely troublesome. Tiger has two linkers: the regular Darwin /usr/bin/ld, which is full-featured but only 32-bit, and the 64-bit /usr/bin/ld64, which can deal with larger files but lacks many important features. In particular, a number of the indirect linking features Mozilla requires for dylibs are not supported in ld64, though it only gives warnings about them instead of errors. This matters because a linker can only deal with libraries that fit within its address space, and even with optimization and stripping libxul is too big to fit in 32-bit space.

The first problem was Apple's libtool: it doesn't let you choose your linker. To get around this, I wrote a shim Perl script to act as a switchable ld, calling ld32 (the renamed original /usr/bin/ld) ordinarily unless LD64 is set to 1 in the environment. I then hacked the Python build argument generators to call another Perl shim that looks to see if libxul is being linked or linked against, and turns on 64-bit linking if so. That much works.
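For readers who want to see the shape of that logic, here is a minimal sketch in Python rather than Perl. It is not the actual shim from the repo: the helper names and the way libxul is detected in the argument list are assumptions made for illustration, and only the LD64 variable and the ld32/ld64 paths come from the description above.

```python
import os

# Paths follow the post: ld32 is the renamed original /usr/bin/ld and
# ld64 is Tiger's 64-bit linker.
LD32 = "/usr/bin/ld32"
LD64 = "/usr/bin/ld64"


def choose_linker(env):
    """Return the linker the shim should hand off to."""
    # Only the exact value "1" switches to the 64-bit linker.
    return LD64 if env.get("LD64") == "1" else LD32


def touches_libxul(argv):
    """Hypothetical check: does any argument name libxul?

    Matches a file argument whose name contains "libxul" (linking it)
    or a -lxul flag (linking against it).
    """
    return any("libxul" in os.path.basename(a) or a == "-lxul" for a in argv)


def env_for_link(argv, env):
    """Copy env, setting LD64=1 when libxul is linked or linked against."""
    new_env = dict(env)
    if touches_libxul(argv):
        new_env["LD64"] = "1"
    return new_env


def run_shim(argv, env):
    """Replace the current process with the chosen linker (never returns)."""
    linker = choose_linker(env)
    os.execve(linker, [linker] + list(argv), env)
```

Installed as /usr/bin/ld, a script like this would end with run_shim(sys.argv[1:], os.environ). Keeping the decision in small functions means it can be checked without a linker present.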

The next problem, however, was that debug libxul can't be linked even by the 64-bit Tiger linker: it crashes with a malloc() error, strongly suggesting this is an operating system limitation. The size of this library, by the way, approaches 750 megabytes and I'm not even exaggerating. Since I doubt Mozilla would distribute an app that large :), something is already wrong.

But that wasn't all. For G4 and G5 optimized builds, the WebM accelerator assembly code requires special linker settings and ld64 doesn't support them either, so they couldn't be built as is. (ld32 couldn't even start, by the way.) Losing this code would be a serious blow to us, but would be doable if we could get the basic browser at least to stand up.

Unfortunately, we couldn't. The G3 version did, finally, appear to link in ld64 with a crapload of warnings, but when I started it up in the debugger it crashed immediately in darwin_gcc3_preregister_frame_info. This is a very low-level symbol in the C runtime (crt1.o) and suggests that dyld could not resolve the links. This could be for any number of reasons: the app may have to be in 64-bit mode for dyld to resolve references in a library of that size in 10.4 (a show-stopper), dyld may not be able to handle a library of that size at all in 10.4 (also a show-stopper), the linker may actually have failed without realizing it (also a show-stopper), or I may have munged the code so badly while making it acceptable to the compiler that it can't enter its main loop (and since we can't build the debug libxul, we have no good way of debugging it). It could even be all of the above.

Mind you, I'm not totally admitting defeat, but I am pretty confident now that we cannot build Firefox 5 as written on 10.4.11 PPC, at least not with our current tools. It should be possible to build it on 10.5.8 PPC, and there are some early builds on El Furbe of 4.2 (which was 5's original version number), but I don't know how well these work and his buildbot seems to have stopped again. That doesn't help us, though, because we support G3 Macs and 10.4, and we always will. More to the point, I personally only run 10.4 and can't support a "TenFiveFox." If someone out there wants to apply the TenFourFox enhancements and run off 10.5-only builds, I will gladly direct people their way, but here we stay on 10.4.

I know Tobias and some others (David Fang?) have built some custom tools themselves. I am doubtful that the tools are the actual problem (I suspect the real issue is with 10.4's dyld), though I can't test that myself because I don't want to corrupt my working build system with unofficial tools, but here's your chance to put your money where your mouth is. If you can get it to link with your custom linkers, you will have done us all a great favour, won our userbase's everlasting regard, and proven that they work. Remember, it's got to work in 10.4 to qualify, but as long as the browser can start, there is hope. Here's how:

Make sure your system is set up to build the current TenFourFox as a prerequisite (see the wiki) and that it can, in fact, build TenFourFox 4.0.

Clone the mozilla-beta repo.

Download the current set of Firefox 5 patches and serially, in numerical order, hg import them into the repo.

Move your /usr/bin/ld to /usr/bin/ld32, and move SHIM.ld in the repo root to /usr/bin/ld (you need to do this as root, obviously). You don't need to undo this at the end unless you want to as it is designed to be transparent. Replace this with your custom linkers as indicated, of course.
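The "in numerical order" part of the patch step is easy to get wrong, because a plain alphabetical sort puts patch 10 before patch 2. The sketch below shows one way to build the hg import commands in the right order. The patch file names are hypothetical (it assumes each name starts with its number); check the names in the actual download before relying on it.

```python
import re


def patch_number(name):
    """Return the leading integer of a patch file name (hypothetical naming)."""
    m = re.match(r"(\d+)", name)
    if m is None:
        raise ValueError("no leading number in %r" % name)
    return int(m.group(1))


def import_commands(patch_names):
    """Build one `hg import` command per patch, in numerical order."""
    ordered = sorted(patch_names, key=patch_number)
    return [["hg", "import", name] for name in ordered]
```

Each command list could then be passed to subprocess.check_call from the repo root, which stops at the first patch that fails to apply.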

When Firefox 5 comes out, it will move to mozilla-release, and then in 6 weeks be obliterated -- no more numerical branches. At that point, assuming no one has cracked the problem we discussed above, we drop source parity and continue our development purely from Mozilla 2.0 at feature parity state.

First, let's discuss what this means practically. Camino is very popular with PowerPC users, and Camino 2.0 is based on Firefox 3.0 (Mozilla 1.9), which is three full generations behind TenFourFox. Despite this, it still manages to get the job done for a great many things; I myself am a former Camino user. Even Camino 2.1 will "only" be based on Firefox 3.6 (1.9.2). So even if we just phoned it in and did only security updates, functionally speaking TenFourFox would still be generationally ahead for several years. Remember how long WaMCom usefully persisted on OS 9 with nobody at all updating it?

But, of course, we will do feature updates (since we will be at feature parity), and because we are very close to Firefox 5 we can adopt quite a bit of the source code with only minimal changes and risk. We do need to pick features and support them with fixes, but the process should be much cleaner and less iffy than the case with Classilla where perilous backport glue code is sometimes necessary (and not always successful). We cannot take code that depends on IPC, of course, but layout, content and media should work, along with some careful graphics updates. However, we should not mess with the UI unless it's buggy, because this may saddle our slower users with more baggage than their computers can handle, and the current browser at least is a known quantity. We will also wait on JavaScript updates in case they break the nanojit, and IonMonkey is in the future, so we should see how that turns out.

And, since we aren't moving to TenFourFox 5, updates will keep working the way they do now, and the plugin code won't change. So you get to keep plugins after all. I hope you're happy. ;) My arguments over security still stand, however, and I still strongly recommend something like Flashblock.

This is the new feature parity roadmap; my intention is to have a release every six to eight weeks, with four to six weeks of dev time and two weeks of "pre" testing:

4.0.2 still comes out probably this weekend. It looks very successful; no major problems were reported with the second beta, and the JavaScript speed and stability increase was universal.

4.0.3 is the next feature release. This will turn on the nanojit for the browser chrome and hopefully add load-store to the nanojit (these were features intended for TenFourFox 5 originally), along with configurable WebM filtering settings. It will also include security fixes and Firefox 5 bug fixes that we can safely backport. It may or may not include some new safe-looking features that do not significantly change the browser, such as the Firefox 5 changes and enhancements to HTTP and XMLHttpRequest. I might defer these to a '4.0.4'.

4.1 is where the magic will start. The features I think we need to have include Firefox 5's improvements to canvas, CSS animation, WebSockets (when they finish screwing around with the IETF spec, likely by Firefox 6) and SSL False Start. These are all features likely to be leveraged by future sites. Similarly, we should try to adopt some of their SSE/NEON vectorized improvements, especially for text scanning, along with libjpeg-turbo. These features are still developing and we need to constantly watch for new bugs, so we should keep this set of new supported web features small and essential. If 4.1 is delayed, there will be a 4.0.4.

Because 4.1 changes the browser core in significant ways, we should not report ourselves as Firefox 4 anymore. We might adopt a built-in user agent changer like Classilla. More on that later.

So, the road moves on, just in a different direction. I'm going to start writing up new roadmaps and worklists in the meantime while 4.0.2 builds on the G5. And for the hacker builders, you've got eight weeks or so of Firefox 5 to see if you can get it running and succeed where I failed. In the meantime, TenFourFox remains "four-ever, yours."

13 comments:

Well, I'll try to build it as you indicated. But as I got my whole build environment upgraded to the version level of Xcode 3.2, I'll need to keep those build system hacks out. I've got ld64 (ld32 is deprecated and no longer included with Xcode 3.2), libtool and everything else (except dyld of course, because that is part of the OS) at the state of Xcode 3.2. It's also possible to compile and link against (more) recent C runtime libraries, e.g. crt1.o, and as those are statically linked in, that shouldn't cause incompatibilities. So I guess it might be possible.

No worries. I'd like to play with your ld64 if you have it somewhere to download, assuming (as you said earlier) that it is linked only against the 10.4SDK and has no other deps. I don't want to update the compiler at the same time; too many variables, and I can work around the compiler issues. I just can't work around the linker.

Well now at least we know where we are. If FF5 doesn't build in 10.4, that's the way it is. I would be interested in a "Ten5Fox" for 10.5.8, though, although I'd much rather have a feature parity Ten4Fox with a working nanojit and everything that doesn't make 2/3 of my Macs (G3s) obsolete. The roadmap for TFF4.x looks promising.

Plugins *are* important to me – I use the Schubert PDF plugin every day. For what I do to earn money, it doesn't matter if it's html or pdf, as long as I can view it side by side in tabs in my production browser. Other plugins are less important, but convenient. And of course, I have flashblock installed.

There is a high probability that plugins would not work in Firefox 5, even if we got the browser working, because the asynchronous launch code can't be made to build due to the posix_spawn requirement. I toyed with rewriting posix_spawn in terms of execve and vfork and dup2, but you don't need to be as big a nerd as I am to realize how iffy that would be. Note I didn't say plugin compatibility would be impossible: Mozilla does offer a way to blacklist plugins from running out of process and I patched it to always say all plugins are blacklisted and must run in-process. At least in theory this shouldn't need the asynch launch code. However, for obvious reasons I have no idea if that will work.
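To make the iffiness concrete, here is a rough sketch of the fork/dup2/execve pattern that a posix_spawn replacement would be built on. It is in Python purely for illustration and is not the code I experimented with: Python has no vfork, so os.fork stands in for it, and the spawn helper and its fd_map parameter are made-up names. A real replacement would be C inside the Chromium launch code and would also have to cover the spawn attributes and file actions this ignores.

```python
import os


def spawn(path, argv, env, fd_map=None):
    """Start `path` in a child process and return the child's pid.

    fd_map maps a descriptor number in the child to a descriptor that is
    open in the parent, e.g. {1: w} makes the child's stdout write to w.
    """
    pid = os.fork()
    if pid != 0:
        return pid  # parent: the caller is responsible for os.waitpid()
    # Child: only redirect descriptors and exec; anything else is risky here.
    try:
        for child_fd, parent_fd in (fd_map or {}).items():
            os.dup2(parent_fd, child_fd)
        os.execve(path, argv, env)
    finally:
        # Reached only if dup2 or execve raised; 127 is the shell's
        # conventional "could not execute" status.
        os._exit(127)
```

The fragile part is everything between the fork and the exec: the child must do almost nothing there, and with a true vfork it is even sharing the parent's memory until the exec happens.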

Like I say, I have not totally given up on Fx5, but even if the linking problem is solved, there is no guarantee that the browser will actually function or be functional enough. So the fallback strategy is to keep working on Fx4 in the meantime. Continuing Mozilla 2.0 development is definitely viable because we are only one generation behind Fx5, which makes importing and backporting much easier. There are certainly advantages to staying on the same base we're on, because we know it works.

I look forward to Tobias' experiment with interest and I'll examine his linker when I get a copy of it to see if it is suitable.

My worry with dyld is that if it turns out to be a runtime requirement, it's too much to ask most of our users to install an updated one. dyld is a pretty low-level component and if it goes bad, we run the risk of messing up OS X (reversibly, but certainly not conveniently). Using a different linker is one thing because it only affects builders and then only one component of the system, but if ld doesn't solve the problem, I think trying to fix dyld is going to be more trouble than it's worth.

Regardless, I do appreciate the thinking and the attempt to help, and the kind word. :)

Here's the list of libraries ld(64) is linked against. Actually in Xcode 3 ld64 was renamed to ld.

tousas-powerbook1998:~ tousa$ otool -L /usr/bin/ld
/usr/bin/ld:
    /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.4.0)
    /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.3.11)
    /Developer/usr/lib/libLTO.dylib (compatibility version 1.0.0, current version 4000.0.0)

Note that libLTO is lazily linked, meaning that its availability is optional. LTO stands for "Link Time Optimization", which can be used by LLVM-based compilers like llvm-gcc and clang. So we don't need it for gcc. I verified it actually works with libLTO both available and not available.

I sent the tarball of the compiled ld64 package to ClassicHasClass by e-mail.

Note that just updating ld might not help very much as libraries are usually linked with libtool, ranlib or ar which are part of the cctools package.