Basically, the guide walks you through compiling every package in the system in the correct order.

And every package, including gcc and glibc, is compiled exactly once. This is a major difference from the many similar guides floating around the forums, which suggest rebuilding the system up to six times "just to be sure".

I argued in that article that my method of only compiling each package once is no worse than compiling it multiple times.

What was missing was an explanation of why I think that way.

And the purpose of this article is to provide interested readers with that missing explanation.

Myth 1: GCC gets better the more often it is recompiled

This myth is one of the reasons most alternative "compile the entire system" kinds of guides emerge the new GCC at least three times.

The rationale behind this:

When you emerge the new GCC for the first time, it will be compiled using your old compiler. That means it will be compiled by a potentially worse compiler than the new one. (Assuming each new version of a compiler generates better code than the older ones.)

So the new GCC is recompiled by itself, in order to get a new GCC that has also been generated by the new GCC's code generator.

Now we have a brand new GCC, generated by itself. Well - not actually by itself: The first new version of the GCC, which compiled the second version, was actually compiled by the old compiler. And maybe that old compiler somehow interfered with the source code of the new GCC, so that the first new GCC does not generate exactly the same code as a new GCC (compiled by itself) would have... So, in order to be sure, let the second GCC compile itself again to create a third GCC. This third version should be used then as the new compiler.

So much for the lore.

From a theoretical point of view, the supporters of this myth are mostly right. It is indeed necessary to recompile a compiler three times in order to get a new compiler which has been compiled by itself and proven to generate reliable code.

But what those people apparently don't know is that a single "emerge gcc" does all of that automatically!

Here is the explanation: The GCC makefiles, which are triggered by a single "emerge" operation, perform a "three-stage bootstrap".

This bootstrap works as follows:

Firstly, the old compiler is used to compile a first version of the new compiler.

This first version of the new compiler then compiles itself, creating a second compiler.

The second compiler is then used to compile itself again, creating the third and final compiler.

The second and third versions of the compiler should be exactly the same: They have both been compiled by a version of the new compiler, using the same source code as input.

The only reason for creating the third compiler is to verify that the second and third compilers are indeed the same.

This means that the first and the second compiler both generated the same code.

It further means that a C program compiled by the new compiler behaves exactly like the same program compiled by the old compiler.

That is, the new compiler knows "C" as well as the old one did. (At least as much knowledge as is required for the "test" source code being compiled. And the compiler's own source code, which is very large and complex, serves as the test subject here.)

If one knows the above facts about how the GCC makefiles work, it should be clear why the GCC generated by those makefiles is already optimal, and cannot get better by recompiling it over and over again: The first new compiler (built by the old one) generated the same code for GCC's own sources as the new compiler did when compiling itself. And even a third compiler, compiled by the new compiler, generated the same code. This will not change, no matter how many times gcc is re-emerged.

This is why it is useless to emerge GCC more than once. It will not get any better than it already is.

Myth 2: We need to recompile everything after a glibc update

Many people assume that updating the glibc has a similar effect to updating the GCC: As the new glibc may have updated header files and may operate on different internal data structures, all applications using those header files (= virtually all applications) must be rebuilt.

Although this concern may be justified for applications which used some "experimental features" of the old glibc, such changes will affect only very few applications.

This is because the glibc ABI (Application Binary Interface) is very stable and existing functions change rarely if ever.

Typically, a new glibc version will add additional functions and features, but does so in a way that older programs which do not use these new features are not affected in any way.

The only exception to this may be bugs that get fixed. But applications typically do not depend on the existence of bugs in the C library; rather, they assume there are none. (Admittedly, virus writers may depend on such bugs. Well, then that virus will no longer work with the new glibc. What a pity. But hey! If you are a virus writer, then you are using the wrong OS anyway!)

Another point to consider is that the glibc is usually dynamically linked to an application.

This basically means that the application contains a record which tells the dynamic linker: "put a pointer to the printf() function from the glibc library here, so I can call printf via this pointer".

As long as the established semantics of the printf() function do not change, this will work with any future glibc version as well.

Another reason why the glibc ABI is very stable is that it is (mostly) based on well-established standards, such as the ANSI "C" or POSIX standards.

Those standards define a set of functionality which will not just change overnight.

Thus, any new version of the glibc can be assumed to be reasonably backwards-compatible with existing applications.

Which means that for most existing dynamically linked executables, a new glibc can be used as a "drop-in replacement" for the older version against which they were linked. Without being re-built or re-linked, that is.

And what about statically linked executables?

Well, as those executables already contain all the code they need (extracted from the old glibc at the time when they were linked), they will not be affected by a glibc change at all. They will happily keep running, using their embedded copy of the old glibc just as before. You only have to re-link those executables if you want them to use the new glibc routines. (This may be the case if the new glibc contains important bug fixes.)

However, there is one thing that can break ABI compatibility: Version scripts.

Using version scripts, a library author can explicitly declare a new version of a library to be incompatible with previous versions.

But even in this case, it is not necessary to immediately re-link all executables using the new glibc: Just keep the old glibc version installed. Then the old applications can continue to use the old glibc, while newly compiled applications will use the new glibc. (I. e. a parallel installation, also called a "slot" installation in Gentooese. Hmm. I wonder if Gentoo supports slot installation of glibc? If not, then we are in serious trouble in such a scenario. But then all the other guides will fail as well! Such a glibc upgrade can only be performed easily by using some sort of cross-compiler/cross-linker, which is itself linked against the old glibc, but creates executables linked against the new glibc.)

As long as applications using those different library versions do not interact with each other on the ABI level, all will still be fine. (But plug-ins will typically no longer work when they are linked against a different version of the glibc than the application they belong to uses.)

Summing up: It is not true that every application has to be recompiled/relinked as soon as a new glibc version is available. It depends on whether the new glibc contains version scripts that are incompatible with the previous glibc. And even then, not all applications are affected, as long as both versions of the glibc remain installed.

And the question is: How often will this occur? Why should the glibc authors release a new version of the glibc that is totally incompatible (at version script level) with the old version?

One can expect such incompatibilities to happen if the major version number of glibc is incremented (i. e. as soon as we reach glibc 3.x).

But certainly not when updating from glibc 2.3.x to 2.4!

Such upgrades can safely be assumed to be drop-in-replacements.

The implication of this is: Most applications that depend on the glibc will keep running after a glibc update without requiring recompilation or even re-linking.

So much for the myths.

And now for the real problems!

There is indeed a problem when upgrading GCC from 3.3 to 3.4, or from 3.3/3.4 to GCC 4.

But those are not glibc problems; those are C++ problems.

In fact, (as far as I know) glibc should be totally unaffected by C++ problems, because (the shared library) glibc is a pure "C" library.

In a "C" library, symbols (as visible to the linker) have very simple names. They are typically the very names used in the "C" source code (on some older platforms prefixed with an underscore).

Thus, if you use the function printf in your "C" program, GCC generates a reference to the symbol "printf" (or "_printf" on underscore-prefixing platforms). The linker then finds this symbol in the glibc and uses it.

This simple scheme is possible in C because the language supports neither overloading nor C++-style namespaces. Each (externally visible) function in "C" has a single, unique name.

In C++ things are not so easy. There may exist multiple overloaded functions which all have the same name, but have different argument types. Or two otherwise identical names exist in different namespaces.

For that reason, C++ combines the argument types and namespace names with the name of the function to generate a unique symbol name which is seen by the linker. However, the actual C++ types cannot always be used literally when combining them into symbol names, due to restrictions on what a valid linker name may look like.

This leads to certain characters in the type names being replaced by others, and even weirder transformations being performed on the type names, until the combined name typically looks like rubbish to an unaware user.

This is also known as name mangling.

And because C++ types can get pretty complex, the mangled names are equally complex.

But the worst thing is: There are countless ways in which name mangling can be performed.

From time to time, compiler authors find "a better way" to do the name mangling, and the resulting compiler will then generate object files which cannot be linked against object files exporting mangled names from an older version of the compiler.

If this happens, the old object files have to be recompiled with the new GCC in order to use the same name mangling scheme.

But mangled names are not the only reason C++ object files may be incompatible with older versions of g++ (and thus require recompilation).

C++ compilers also generate type information structures and exception information structures which will be referenced from the generated code at runtime (dynamic_cast, typeid, try/catch etc.).

And similar to symbol name mangling, those information structures can also be implemented in various ways, and sometimes it may even be necessary (and not just for fun, because it "looks better" in the opinion of the compiler authors) to change them in order to implement some new feature of the compiler.

No matter what is changed - name mangling, exception information representation or type information representation - in all those cases C++ programs which are using those C++ features must be recompiled after a GCC compiler update.

And it seems such changes were made in GCC 3.4 as well as in GCC 4.

From what we have learned above, we can conclude: When upgrading from GCC 3 to GCC 4, we need to recompile most C++ programs, but it would be unnecessary to recompile most C programs.

However, how can we know which Gentoo packages contain C++ programs and which ones do not?

Of course, one could look into the source code of each package and search for C++ files... but this cannot be implemented automatically in a safe way.

The problem here is that it is impossible to distinguish between C and C++ source files in a safe way.

Of course there are naming conventions, such as "a C file has the extension .c" or "a C++ file has the extension .cpp, .C, .cxx or .c++", but that's all they are: Conventions.

Nothing can stop a weirdo C++ developer from using a .c file extension for his C++ source files.

And you cannot even tell from looking into the source files: Is a file containing the single line "extern int c;" a C source file or a C++ source file? It could be both.

Only the instructions in the Makefile determine as what language a source file will actually be compiled.

But what if our weirdo C++ programmer does not use a Makefile at all, but rather uses his super-sophisticated and absolutely non-standard Perl script for taking over the Makefile's job?

The conclusion is: You can't safely determine whether a package uses C++ by automated scanning of any kind. (Well, perhaps SKYNET could. It may have got enough A.I. for that. But then we would have other problems than recompiling our systems, I guess.)

It follows that the only way to be sure that all packages containing C++ source code will be recompiled after an incompatible g++ upgrade is to recompile all packages.

This also allows all packages to benefit from the better code generator of the new GCC, and is thus not per se an evil thing.

So, now we are back where we started: We know that we have to rebuild each and every package of the system.

But with the information from the paragraphs above, I can now argue how to do it and why it will work that way.

How my upgrade scheme works

The first step of my guide is to do an "emerge --sync". This ensures that the latest package versions are known to portage.

Then an "emerge --update --deep --newuse world" is run. This ensures that we also have the latest versions of all packages installed, including those in "system".

It also means that tools like "flex", "bison" etc which might be used by the compiler Makefiles are available in the most up-to-date versions. The fact that those tools are still linked against the old glibc should not have any effect, as I have shown when deconstructing myth # 2.

Then the new compiler should be emerged. Note that the system still uses the old compiler as the default compiler!

But as I have shown above (myth # 1), this new compiler will be as good as it can be, irrespective of the fact that the old compiler is still the system default compiler. And there is no point in recompiling the new gcc ever again.

The fact that the GCC contains some C++ libraries is special here: As g++ recompiles itself using itself in the second and third stages of its three-stage bootstrap, it actually becomes the very first application on your system ever to be compiled with the new C++ mangling schemes etc.!

This means it need not be recompiled again, because it has actually been the first package to be recompiled. The fact that it is not the system default compiler yet does not affect this.

However, the new compiler is linked against the old glibc. But this should also not affect it - see myth # 2 again: If a new version of the glibc becomes available later, it will be used as a drop-in-replacement, not requiring re-linking or even recompilation of the new GCC.

At this point, all tools are up-to-date, and the new compiler and the old compiler are both installed and operational.

Now my guide says to change the profile using "eselect profile" if desired, as well as to set the new GCC as the default compiler using "gcc-config".

After this, I suggest a reboot just to be sure the changes made by env-update have been propagated through all the shells and processes in the system.

Then, it is time to run my generator script.

What it does is simple: It runs both an "emerge --pretend --emptytree world" and an "emerge --pretend --emptytree system".

In both cases, Portage outputs the packages already in the correct order to be rebuilt.

However, the packages from system should be rebuilt first.

Another problem is that Portage analyzes both emerges separately, which means that the output for "world" contains several packages which are also part of the output for "system".

My script therefore filters out those duplicates, and combines the two lists into one ("system" list items coming first).

While doing this, it removes GCC from the combined list: GCC is already installed, and as explained earlier in this article, there would be no advantage in compiling it again.

As a special provision, my script ensures that the most important packages are emerged first: linux-headers, glibc and binutils.

linux-headers are emerged first, because they are required for glibc.

glibc is next, because it is without question the most important library in the system. (If you have upgraded the system profile before, this will emerge the new version of the glibc already.)

And by emerging glibc as soon as possible, any of the packages emerged after this will already be linked against this new glibc.

And although the new glibc can mostly be used as a drop-in-replacement for the old one (see myth # 2), emerging it as soon as possible will eliminate the few exceptions where a drop-in-replacement would not work. (Only our virus writer from myth # 2 will still not be happy. Where are the old bugs to be exploited?! Perhaps he will relocate to Redmond now, and the Gentoo community will lose a member. What a shame.)

Now you might throw in: "But linux-headers has not been linked against the glibc!"

But that is actually not a problem: The linux-headers package consists only of header files and does not generate any executable which would require linking.

Also note: Any packages emerged from this point on will be compiled by the new compiler and be linked against the new glibc.

Can it get any better? I say "no". There is no reason why the following packages should be recompiled more than once. They would again be compiled by the same new GCC and be linked against the same new glibc, yielding the very same executables as before.

Finally, binutils are emerged, because they contain the remaining essential system components required by the build system, such as "make", "as", "ld" and the like.

However, it should actually not be necessary to recompile them, because the existing versions of those tools were already built from the most current source code, and I bet they are all pure C applications not using C++.

But I may be wrong, and perhaps the profile update unlocked different binutils versions; so perhaps it actually makes some sense to recompile them. And let's not forget the better code generator of the new GCC! So let's just re-emerge them as well.

After those packages the filtered packages from the "system" list will be emerged, followed by the filtered packages from the "world" list.

My generator script will then write out a shell script containing all those emerges in the right order and with the right emerge options, and add a bit of magic to allow incrementally building the packages (see my guide for more details).

That's it.

That's all my script does.

If you are truly paranoid, you can re-emerge GCC and binutils again after all of this:

Perhaps the changing of the system profile unlocked new versions for some of the binutil tools.

It then would no longer be true that the "emerge --update --newuse --deep world" as performed at the beginning of my guide updated binutils to the newest version.

This means GCC has been built with older versions of the binutils, while newer versions of the binutils are now installed.

But the same is true for binutils themselves: At the time they were recompiled (after glibc), the newest binutils version from the old Gentoo profile was installed. After they were recompiled, the newest binutils version of the new Gentoo profile is installed. And those versions might differ.

By re-emerging GCC and binutils again, one can be sure that those packages are now built using the newest binutils version in any case.

Why I consider this to be paranoid: Even if new versions of tools like "cp", "make", "sed" etc. are used during the compilation of a package, apart from catastrophic bugs I cannot see how this should make a difference in the generated executables. (At least no functional difference. Perhaps a new linker will write a different version number into some identification fields of the various ELF sections. But who cares.)

So much for my explanation.

Comments welcome!

Greetings, Guenther

Last edited by Guenther Brunthaler on Sun Sep 03, 2006 11:33 pm; edited 1 time in total

Had I just provided a guide without that, wildly alleging things without explaining why I think the way I do, I would have left the readers in a position where they can either believe my allegations, or not.

But my intention has always been to create knowledge for the interested ones, not just belief. (I'm neither that company in Cupertino nor that from Redmond. I'm not interested in creating believers. I prefer to communicate with people who know what they are doing, and why.)


there already exist scripts which do this better - while this particular flow of logic could make your system run, it will not optimize it or make it stable.
For example, when you install a new version of binutils, you need to recompile glibc and not gcc - as the dynamic loader is in glibc and not in gcc; failing to do so, you are probably missing new features.
I'll not continue to dive into this, but the general upgrade guide has been shown to induce fewer bugs than some new methods, and there is emwrap, which probably does this much better.

_________________
"I knew when an angel whispered into my ear,
You gotta get him away, yeah
Hey little bitch!
Be glad you finally walked away or you may have not lived another day."
Godsmack

(GNU)make is not part of binutils. It's installed as a separate package.

You are right; thank you for the correction. I didn't know that... well, at least I didn't think of that. (Because now, after you told me, I remember very well having watched when "make" was emerged some time ago. Must have been advanced stupidity making me forget about it... )

Fortunately, "make" is a very stable tool regarding backwards-compatibility and does not generate any code of its own (in contrast to "binutils", which contains the assembler and linker), so it will not invalidate my guide if an older version of make is used for some of the first packages.

Guenther: Just as a general rule of thumb, it's a good idea to write in full paragraphs. One sentence per line is extremely annoying to read.

I did write full paragraphs! However, some paragraphs are so short that they only fill a single line.

My general idea about using paragraphs is: A new idea or concept - a new paragraph. Unfortunately, there are a lot of concepts to be communicated in the guide, leading to multiple rather short paragraphs.

I'm sorry about that, but I have no good idea how to change this with tolerable effort. Probably the best thing I could do is to re-write the guide from scratch, with focus on a better writing style from the very beginning.

But as the guide is still in a changing state, I will certainly wait for it reaching a stable state before I actually consider doing this.

Not that I don't believe you regarding the 3-stage gcc build (I knew that long ago), but do you know how to prove that this actually holds in (Gentoo) practice as well?

Or in other words: is there a way to compare the generated gcc libs and binaries after an "emerge gcc" and after an additional "emerge -e system"? I tried but failed, probably because some sort of irrelevant (time?) information is stored along with the machine code (or because the 3-stage compilation produces different results depending on the installed glibc...).

but do you know how to prove that this actually holds in (Gentoo) practice as well?

I admit I did not try to verify this "the hard way" as you did.

But as far as I could see, the Gentoo ebuild script just runs the autoconf-created installer of GCC, which *does* that three-stage build. Normally.

However, I have not checked whether the Gentoo devs have perhaps patched the GCC build script to omit some of the GCC build stages.

But why should they? Especially on Gentoo, the compiler is more important than most other system components. I doubt the Gentoo devs would have crippled its build process just for a small speed gain, risking unstable GCC operation afterwards.

Considering this, I guess we don't need to worry seriously about that issue.

Cinquero wrote:

Or in other words: is there a way to compare the generated gcc libs and binaries after an "emerge gcc" and after an additional "emerge -e system"?

Not easily.

Cinquero wrote:

I tried but failed, probably because some sort of irrelevant (time?) information is stored along with the machine code

Yes.

In fact, I tried to do the same a couple of years ago when I was working as a contractor for IBM.

In that past project, there was a rather complicated build chain involved, and a QA guy wanted me to check whether the binaries were identical after some subtle changes to the build chain.

To make it short: It was impossible.

I encountered the same issues you wrote about, and also additional ones.

It is true that compiler and linker add a lot of tags, version numbers, date/time information and the like to executables they produce.

But even worse, I had to learn that the compilers occasionally generated different code for equivalent source text differing only in comments!

I even disassembled the generated code to learn more about those differences.

As it turned out, the code was functionally identical, but for no apparent reason some of the functions just exchanged their order in the generated object file. (BTW, this was code generated by MS Visual C++ 6. Perhaps they let a pseudo-random generator, seeded by a hash of the source file, determine the order in which functions are emitted into the object file? Or perhaps it's just a bug. Or perhaps, as M$ usually likes to phrase it, "This behaviour is by design".)

Cinquero wrote:

(or because the 3-stage compilation produces different results depending on the installed glibc...).

This is theoretically possible, if the new C library defines different macros in its header files, which will then expand to different (but typically functionally equivalent) code.

However, I doubt this: Aside from fixed bugs, the libc ABI is very stable.

New versions of glibc typically add functionality, but will not affect the old functionality as used by existing applications.

that for testing purposes some of the gcc devs implemented build options to make the output comparable/stable....

Perhaps this might even be the case! GCC has such a plethora of documented options ... perhaps there are also a couple of internal, otherwise undocumented options exactly for that purpose! Who knows...

In past compiles, I have watched GCC, and it does do the three-stage compile. That's why it takes so long to build. To prove it, you can download gcc from the GCC homepage, build it manually, and compare the compile times with genlop or something.


Cheers.

Does it compile itself 3 times, or simply bootstrap itself in three stages? I don't believe that gcc builds itself more than once, although I have never looked at the compilation output; compilers bootstrap themselves. I also don't see any sense in a compiler building itself more than one time...

Does it compile itself 3 times, or simply bootstrap itself in three stages? I don't believe that gcc builds itself more than once, although I have never looked at the compilation output; compilers bootstrap themselves -

I also don't see any sense in a compiler building itself more than one time...

Rebuilding a compiler with itself is a really good sanity check. If something as complex as a compiler can rebuild itself, then it must be a working one. If it can't, there must be a bug in the current or the previous compiler.

_________________
Alle dingen moeten onzin zijn. ("All things must be nonsense.")

this does not show that the compiler was compiled three times, and still less does it show that the compiler did not bootstrap itself in three stages instead of being built three times...
so I searched Google for gcc bootstrap build:
one of the results:

Quote:

Building a native compiler

For a native build issue the command `make bootstrap'. This will build the entire GCC system, which includes the following steps:

* Build host tools necessary to build the compiler such as texinfo, bison, gperf.
* Build target tools for use by the compiler such as binutils (bfd, binutils, gas, gprof, ld, and opcodes)
if they have been individually linked or moved into the top level GCC source tree before configuring.
* Perform a 3-stage bootstrap of the compiler.
* Perform a comparison test of the stage2 and stage3 compilers.
* Build runtime libraries using the stage3 compiler from the previous step.

If you are short on disk space you might consider `make bootstrap-lean' instead. This is identical to `make bootstrap' except that object files from the stage1 and stage2 of the 3-stage bootstrap of the compiler are deleted as soon as they are no longer needed.

If you want to save additional space during the bootstrap and in the final installation as well, you can build the compiler binaries without debugging information with "make CFLAGS='-O' LIBCFLAGS='-g -O2' LIBCXXFLAGS='-g -O2 -fno-implicit-templates' bootstrap". This will save roughly 40% of disk space both for the bootstrap and the final installation. (Libraries will still contain debugging information.)

If you used the flag --enable-languages=... to restrict the compilers to be built, only those you've actually enabled will be built. This will of course only build those runtime libraries, for which the particular compiler has been built. Please note, that re-defining LANGUAGES when calling `make bootstrap' *does not* work anymore!


gcc (the C compiler) is built three times. Gentoo uses profiledbootstrap, so you can identify the three stages as xgcc, gcc -fprofile-generate, and gcc -fprofile-use. fortran, g++, objc etc. are only built once under the current bootstrap system. This changes in 4.2, where everything will be built three times.

Guenther: thanks!

_________________
by design, by neglect
for a fact or just for effect