I realize it's not the focus of his test, but as someone who thinks often about how to take advantage of advanced vectorization techniques on modern processors, I was surprised by statements like this:

Moreover GCC -O2 defaults are (in my opinion unfortunately) still not enabling vectorization and unrolling which may have noticeable effects on benchmarks.

This led to enabling AVX and since the global constructor now gets some code auto-vectorized the binary crashed on invalid instruction during the build (my testing machine has no AVX).

No AVX? He wants to better take advantage of vectorization, but he's doing the testing on a processor that is 3 generations behind in vectorization support. AVX (128-bit) came out in 2011, and has since been followed by AVX2 (256-bit) and the (still limited-release) AVX-512.

Clock speeds have been fairly flat, and most of the improvements to recent processors have been microarchitectural. A lot of the optimization done by compilers ends up being architecture specific. Seeing which brand-new compiler best targets old hardware seems like it might produce misleading results.

I realize that not everyone has (or can have) the most recent hardware, but this seems like a case where it would be strongly in AMD and Intel's interest to make sure that people like Jan have better access to the improvements made in the last few years.

Intel still disables AVX instructions on their low-end Core architecture chips for market segmentation purposes. It is also entirely absent from their Atom and Celeron chips. AMD did not have AVX support until Ryzen, but they are still selling Piledriver-based CPUs on their AM4 platform.

Firefox can't blindly use AVX without checking for its presence or it will crash on these types of systems.

You're right, and this is a little confusing. The article says he's using an "AMD Opteron 6272", which seems like it should support AVX: http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206272%2.... So maybe the GCC bug he encountered is actually because he lacks AVX2 support? Or an incompatibility between early AMD and Intel support for AVX?

The author is trying to figure out why GCC underperforms Clang for Firefox builds. Seems inappropriate to worry about cutting-edge features that probably are available to 10% or so of computers that run Firefox (or a competing browser.)

Are you suggesting that 10% is a reasonable estimate of the percentage of Firefox instances running on hardware that supports AVX? I'm not sure of the real numbers but this strikes me as an extremely low estimate.

The closest I can find is Valve's hardware survey, which says that 87% of users are running on computers that support AVX. https://store.steampowered.com/hwsurvey (click on "Other Settings" at the bottom to expand).

Firefox is presumably lower than this, although I don't know by how much. Does Mozilla collect statistics on the capabilities of the computers that Firefox is being run on?

You can look at https://data.firefox.com/dashboard/hardware for similar numbers. CPU models are not broken out, but you can get a sense of what sort of CPUs are in use by looking at the GPU model: Intel powers 60%+ of the sampled population.

- first, how you set the -O2 defaults in your compiler. This is a delicate problem, since you need to find the right balance of code size, compile time, robustness of the generated code (do not trigger undefined behavior in super evil ways) and, of course, runtime. In benchmarks I have found that Clang has a bit of an edge in runtime, which is mostly vectorization (on x86-64)

- selection of the minimal ISA you support. For GCC, x86-64 still means the original Opteron, but distributions can easily (and some do) decide on better. Indeed AVX is a big win, but for a general-purpose distribution it is still too aggressive. You can provide AVX-optimized libraries where it matters

- selection of CPU tuning (i.e. generic/intel)

So I consider it a mistake that GCC traded vectorization for compile-time speed and reliability at -O2, because it can make an important difference in common workloads these days (not 10 years ago, say).

It is also clearly a bug for GCC to produce AVX instructions when not explicitly asked to :)

I also do testing on Zen, Core and some PowerPC. For the Firefox machine I use the Bulldozer box because I don't mind it spending long nights running builds & benchmarks, and I think this particular problem is not very CPU-specific.

Does GCC introduce multiversioned AVX functions on its own initiative?

Yes, this seems like one reasonable approach. The current approach of compiling to a "least common denominator" and then updating this denominator every decade or so seems insufficient. Instead of interpreting the absence of a "-march=" flag to mean optimize for nothing, maybe it can mean that multiple optimizations are automatically compiled and the appropriate one is selected at runtime. Alternatively, maybe we need to move away from the idea of compiling a single binary that runs on all platforms, and encourage greater use of platform specific compilations.

Performance improvements have diminishing returns, though, and for a consumer product like Firefox, a large number of users will be on older hardware. Ceteris paribus, it would be much better to get, say, a 10% improvement for the users on below-median hardware than for the users on above-median hardware, since it will matter a good deal more for the former group.

You're right, sorry. I concentrate on integer operations, which are mostly restricted to 128-bit on AVX, and had forgotten that it also supported 256-bit floating point. I'm not sure which (if either) would be most relevant to a web browser. I hope this error doesn't distract too much from my overall point.

I'm surprised that GCC doesn't include some autodetection mechanism for this itself.

As 'hubicka' mentions in another comment, GCC does have "multi-versioning" capability, but it doesn't use it by default. Instead, one needs to mark individual functions with GCC-specific attributes, asking for versions with different capabilities to be created: https://lwn.net/Articles/691932/. This isn't necessarily a bad approach, but the fact that it seems to be used rarely makes me wonder if some other, more default approach that works with unmodified code might be an improvement.

I'm not loving Firefox's move toward clang. For years we've been told that clang is great because we finally have a competitor for gcc, and that multiple interoperable compilers can only improve the ecosystem (which is undeniably true).

Now we have a big project deciding to move from a reasonably portable gcc build to a clang-specific LTO framework that required significant engineering effort to achieve and which apparently isn't easily portable to the equivalent gcc effort, requiring a gcc maintainer to jump in on their behalf to show equivalence.

The advantage of having multiple C compilers has always been broader platform support and competition to improve compiler speed, error messages, and codegen quality. Conspicuously, one complete non-advantage of having multiple C compilers is portability of a codebase between compilers; the C specification contains too much implementation-defined behavior (including undefined behavior) and is too anemic (requiring compilers to come up with nonstandard extensions to support things like assembly and SIMD) for compiler-portability to be anything other than a nightmare for large projects. C projects even have a hard time upgrading to new versions of the same compiler, which helps to explain why so many shops are still using positively ancient versions of GCC.

I had the good fortune to work on a code base that started as a blank emacs buffer and was compiled with both g++ and clang++ from the get go (and built on multiple platforms, with maximal warnings enabled). The two compilers surfaced different bugs in our codebase which, in addition to finding actual (semantic) bugs has been quite valuable for portability and maintainability.

(Also, as others have noted, the existence of clang has really been good for gcc).

Yeah, but to be fair the work to actually enable LTO was very significant (at least as far as we outside the community could see via stuff like the blog post here) and involved a ton of toolchain-specific hackery and work with the clang upstream.

Given that same level of effort (cf. the article we're discussing), it seems like you could have done as well or better by moving to a more recent gcc instead. Or better, by working with both projects to come up with a portable way to get LTO working.

I'm not really concerned with what you use to build (I mean, you have to pick some compiler at the end of the day), just with what seems to be "needlessly tight coupling" between clang/llvm and Firefox in a way that hurts the interoperable toolchain ecosystem.

What are you referring to by "a ton of toolchain-specific hackery" and "a portable way to get LTO working"? It seems like there are very specific things you have in mind, but I'm unclear what bits of work you're referencing. Unless you're thinking of the cross-language LTO work, which is still in progress and is of course clang/llvm-specific? I'd love to see that feature work with GCC, but it's simply not feasible at the present time.

Regardless, that feature being enabled when you're using suitable versions of clang/llvm/rustc doesn't preclude using LTO with other compilers.

"And even worse, it's generating worse and ridiculously bigger code. So they spend a lot of effort for worse result."

GCC aggressively size-optimizes cold regions, and LLVM doesn't bother.
This would be pretty easy to fix, but outside of binary size, one would need to prove it actually matters (the test harness here is a pretty darn old CPU).

The "don't catch serious problems in build system" is the bigger worry, imho.

I have re-tested on my Skylake notebook and updated the blog. It confirms the results from the darn old CPU I use as my benchmark machine. Maybe it is a bit more sensitive to the difference, which is expected for a non-server CPU.

GCC does "almost full LTO" with partitioning, while Clang does ThinLTO, which makes most of its code size/speed tradeoffs without considering whole-program context, so it may be interesting to get both alternatives closer in code size/performance metrics.

I have got a level 1 Firefox developer account and am looking into the official benchmarking infrastructure, which I have now updated to GCC 8 with LTO+PGO.

A goal being worked towards is making LLVM ThinLTO consider not just Clang output but Clang-generated LLVM IR together with rustc-generated LLVM IR. This is expected to enable inlining between C++ and Rust, making the FFI layer of C-linkage function calls between the two melt away.

One of the things that disturbed me from the article was how the Mozilla build chain just merrily ignores that profiling had failed, and moves on building stuff using that profile. That seems like quite dangerous behaviour. Surely that should be a failing step for a build, or at the very least a large warning should go out at the end "This build may be optimised based on complete nonsense because profiling failed"

> All this to render web pages. I think we must have made a wrong turn somewhere.

We've taken plenty of wrong turns, but none, I think, accounted for more than a rounding error in time or code needed to render web pages. Writing a browser is hard.

Hell, even writing a toy browser-like mockup isn't easy. I built an extremely bad renderer for an extremely simple class-provided XML-ish grammar in school. It only supported a handful of styling keywords (all inline/attribute-based), only one of which was positioning-related ("wrap to next fixed-height global line of display after this element").

It was really hard. Like, really hard. Even looking back on the code with the benefit of experience, it still would not be a breeze.

It supported a single fixed window size and a guaranteed-correct input file. Removing either of those constraints would have exploded the code size to the point that I doubt I could have done it alone then, and if I could now, it would take me an incredibly long time. Adding the full HTML spec would probably bring its SLoC count into the 100Ks, if not millions. Supporting re-renders and after-the-fact DOM updates would blow it far beyond that; making them fast might require me to go back to school, but who knows, maybe it's easier than my hunch. I suppose I could shave time by moving some of those hundreds of thousands of lines into the libraries that have evolved in the many years since browsers became popular, but it would still be a gargantuan undertaking.

And all of that is before the immense amount of person-hours which would be needed for:

- Supporting cascading styling of any kind, with or without embedding another language.

- Securing the request/response protocol, even if leveraging existing tools like OpenSSL to the max.

- Adding another Turing-complete and secure programming language for communicating with arbitrary local/networked resources and producing more requests or DOM updates.

It's hard.

TL;DR There are plenty of needlessly-complex tools and technologies out there. But I don't think web browsers are some of them. Even if you're anti-JS and anti-CSS, there is still an absolute shitload of complex, careful, hard-to-get-right interactions going on under the hood.

(As an aside, The loveliness of Elm is incidental to the point. If it looked like COBOL it would still make economic sense. Lots of people have developed DSLs for apps, the important thing about Elm is that it's a very elegant and well-thought-out domain-specific system for specifying apps. Elm is much less complex than HTML+CSS+JS+Frameworks/libs/NPM etc.)

At the moment, the delivery vehicle for Elm-specified apps is the Fractal Rube Goldberg Machine, yes.

But consider e.g. an Elm-to-GTK compiler, or Elm-to-TCL/Tk interpreter, whatever... The FRBM is just a reasonable first target platform.

I don't think I'm wrong here, or even saying anything controversial. Go look at what VPRI did with STEPS. Our code volume and complexity is too high by two or three orders of magnitude.

I don't see why I wouldn't just use Haskell and a native UI library right now, to similar effect, instead of waiting for all this to appear. The language is in a much more stable state than Elm, which already makes it more ideal in a business-context.

I'm not making your point for you, I just don't agree with you on what the actual problem is.

Programmers don't disagree that we could be using better approaches. The question is not why they don't exist, because they do exist. The question is why we don't or can't use them currently.

Most of the time, the reason is purely cultural, either due to management or legacy. I'd love to use Purescript and Haskell at my job, but I cannot. I don't get to make that choice. A new Elm transpiler won't solve a cultural problem.

I wouldn't run testing on a notebook. I mean, you can if boost characteristics and other variables are what you're testing... But the best bet for low-variance, consistent testing is a machine where you have set a static core clock speed and disabled C-states and other power-saving features. Remember, you are testing differences in compiler optimization. You don't want your system to be a variable.

My understanding is that profile-guided optimisation is largely based on the utility of small binaries, by optimising hotspots for speed and everything else for space, thereby alleviating cache-pressure. Is this wrong?

> You also won't easily predict the behavior due to reordering.

I wasn't thinking of anything as sophisticated as looking at specific flows, where I can well imagine things get unpredictable with reordering and speculative execution. Won't there be a reliable pattern of fitting better in cache if we shrink everything?

"Smaller binary enables use of a lower-level cache, no? [0]"
No.
It would, if all of the code were actually in memory at once and pulled in whatever sits next to it; i.e., if you couldn't pull in function A without also pulling in function B. That is mostly not true[1].

This is why reordering mostly brings load time benefits instead of run time benefits.

The utility of PGO is mostly about knowing where to spend your time optimizing, and knowing what to do. That's a generalization.
There are certainly cases in inline-heavy code where it helps get the speed part right too. A lot of that is more often about "it lets the compiler spend its inlining budget on inlining the stuff that matters" than "it stops the compiler from blowing out the cache".

I speak in generalities because there are always counterexamples.

There are cases where PGO makes things significantly worse, for example!

Last I remember (My job now means i don't have time to stay in the game), LLVM did not bother to optimize the cold regions for size, and GCC did.

[1] It depends on function sizes, page sizes, mlocking, section flags, and all sorts of fun things, but I'm just going to assert the truth of this in most cases to make it simpler.

I would be interested to know what cache-aware code layout optimizations are available in LLVM. I personally know of none. GCC is a bit simplistic in this sense (it reorders functions based on profile feedback and execution time), and I plan to change that for the next stage 1 (i.e. GCC 10).

(Both are, interestingly, well behind what commercial compilers do, and this is one of the very few areas where that is true. My suspicion is that it does not matter as much in practice as we want it to. Most forms of layout optimization are also very hard to perform on the C++ code you want to optimize, due to the inability to prove safety.)

Yep, I have had a code layout pass in my tree for a while, but because I was never really able to measure above-noise improvements, it is not in the tree yet. I hope to make more sense of it with the help of CPU counters, which have improved over time.

I'm not familiar enough with LLVM to really say, so I was just speculating: I vaguely remembered some kind of talk about cache optimisation and LLVM, so it's possible it was talking about the LLVM codebase rather than the passes available in LLVM.

I think I'm going to mess with this later myself. On Manjaro I usually compile Chromium with -O3 and -march=native, with no -mtune or any of that, but I never benchmarked it against anything. I'll do the same with Firefox. This is on Coffee Lake, BTW.

This particular workload does not show much difference between modern CPUs. I just tried the SunSpider benchmark on my Skylake and it has similar outcomes to those reported, but there is more noise since it is a notebook.

What I got is:

- GCC 8 build: 333 +- 3.3%
- Tumbleweed distro Firefox: 352 +- 3.4%
- Firefox 63 (GCC) official binary: 346 +- 5.6%
- Firefox 64 (LLVM) official binary: 342 +- 5.1%

but I do not completely trust the numbers, as re-running the benchmark leads to a different outcome each time.