Embedded Bitcode

Jonas Devlieghere

A little over a year ago, at WWDC 2015, Apple announced the ability to embed bitcode in Mach-O files.

Bitcode is the serialized form of the intermediate representation (IR) used by the LLVM compiler and contains all the information required to recompile an application. With the bitcode present alongside the machine code, Apple can further optimize applications by compiling and linking specifically for the user's target device. This is one approach to app thinning, which aims for smaller binaries and therefore more free space on your iDevice. It will most likely replace Apple's current approach, in which a developer uploads to the App Store a fat binary containing machine code for each target architecture.

Now you may wonder, like I did:

Why not just ship the machine code for the target device?

Why do we need bitcode?

The answer to the first question is "they do". It's the whole point of app thinning, and exactly what slicing does. Slicing happens regardless of whether bitcode is enabled.

Slicing is the process of creating and delivering variants of the app bundle for different target devices. A variant contains only the executable architecture and resources that are needed for the target device. — Source

Slicing doesn't require bitcode, so thinning was probably not the biggest motivation for embedding bitcode. This brings us to the answer to the second question.

First, there's optimization, which has always played a central role in the LLVM project. In fact, "lifelong analysis and transformation" (read: optimization) was the original reason for LLVM's existence.

Secondly, if a future iPhone boasts a new instruction set architecture (ISA), Apple can simply recompile all the existing apps in its store, without every developer having to release a new version. Apple suffered when moving from armv7 to arm64, as they had to wait for developers to make their apps compatible with the new device. It should be noted, though, that bitcode is usually not target-independent, i.e. it is tied to a certain architecture. So it remains to be seen how well this works in practice.

Finally, bitcode would allow for more advanced analysis of applications, for example for detecting the presence of malware or the use of private frameworks. Apps are already reviewed before they end up in the store, but bitcode, being of a higher level than machine code, will certainly make it easier to process applications.

What's new?

No big revelations this year at WWDC 2016, though I must admit I only looked at the slides of the LLVM & Swift keynote. Behind the scenes, Apple started the process of upstreaming the code related to embedded bitcode in early February this year. When finished, developers will no longer be limited to Apple's version of clang for making binaries with bitcode. However, in order to submit your application to the App Store, you'll still be required to use the default toolchain.

How does it work?

First off, you'll be able to pass the -fembed-bitcode option, like you do now with Xcode's clang. Values would range from no bitcode, to just the section (see the next paragraph) to everything.

Basically, a Mach-O object consists of multiple segments. Each segment can have multiple sections. By convention, segment and section names start with two underscores, followed by a name in all uppercase for segments and all lowercase for sections.
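To make the segment and section layout concrete, here is a minimal Python sketch that lists the sections of a Mach-O file. It assumes a thin (non-fat), 64-bit, little-endian file, and the struct sizes and offsets are those of mach_header_64, segment_command_64 and section_64 in <mach-o/loader.h>. The function name list_sections is made up for illustration. Note that in object files every section records its own segment name, which is why the sketch reads the segment name from the section entry rather than from the enclosing load command.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF   # magic of a 64-bit, little-endian Mach-O file
LC_SEGMENT_64 = 0x19       # load command that describes a 64-bit segment


def _name(raw: bytes) -> str:
    """Decode a fixed-width, NUL-padded 16-byte Mach-O name field."""
    return raw.split(b"\0", 1)[0].decode("ascii")


def list_sections(data: bytes):
    """Return (segment, section, file offset, size) for every section."""
    magic, _cpu, _sub, _ftype, ncmds, _cmdsz, _flags, _rsvd = struct.unpack_from(
        "<8I", data, 0
    )
    if magic != MH_MAGIC_64:
        raise ValueError("not a thin 64-bit little-endian Mach-O file")

    sections = []
    offset = 32  # size of mach_header_64
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<2I", data, offset)
        if cmd == LC_SEGMENT_64:
            # segment_command_64 is 72 bytes long; nsects lives at byte 64.
            (nsects,) = struct.unpack_from("<I", data, offset + 64)
            for i in range(nsects):
                base = offset + 72 + i * 80  # each section_64 is 80 bytes
                sectname, segname, _addr, size, fileoff = struct.unpack_from(
                    "<16s16sQQI", data, base
                )
                sections.append((_name(segname), _name(sectname), fileoff, size))
        offset += cmdsize
    return sections
```

Calling list_sections(open(path, "rb").read()) on an object file built with embedded bitcode should then include entries whose segment is __LLVM.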

Once upstreamed, embedded bitcode adds three sections under the LLVM segment to Mach-O object files:

A section __LLVM, __bitcode for storing the (optimized) bitcode. This is plain, binary bitcode, not wrapped in an archive as there's only one file for each object.

A section __LLVM, __cmdline for storing the clang command-line options that are required to rebuild the object. Only options that actually affect code generation and are not already properly stored in bitcode attributes would end up here.

An empty section __LLVM, __asm to differentiate objects without bitcode from those built from assembly. The reason for creating a section rather than a new Mach-O command is that there aren't many of the latter left. Furthermore, the developers entertain the idea of wrapping assembly in Module-Level Inline Assembly and storing the resulting bitcode here.
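Since the __bitcode section holds plain bitcode, its content can be recognized by its magic number. The sketch below relies on the two magic values from the LLVM bitcode file format: a raw bitcode stream starts with the bytes 'B', 'C', 0xC0, 0xDE, and the optional bitcode wrapper starts with 0x0B17C0DE stored little-endian. The function name classify_bitcode and its return values are hypothetical.

```python
RAW_MAGIC = b"BC\xc0\xde"            # 'B', 'C', 0xC0, 0xDE
WRAPPER_MAGIC = b"\xde\xc0\x17\x0b"  # 0x0B17C0DE stored little-endian


def classify_bitcode(payload: bytes) -> str:
    """Classify a section payload by looking at its first four bytes."""
    if payload.startswith(RAW_MAGIC):
        return "raw"
    if payload.startswith(WRAPPER_MAGIC):
        return "wrapped"
    return "unknown"
```

A payload that comes back as "unknown" is either not bitcode at all or, as with the marker-only builds, an empty placeholder.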

The section names are slightly different for other executable file formats such as ELF and PE. The first two sections are named .llvmbc and .llvmcmd respectively. I didn't see a patch yet for the third section, but I think it's safe to assume it will be consistently labeled .llvmasm.

Swift and Apple LLVM already create these sections for their object files. However, swiftc currently stores its command-line arguments in a section with a slightly different name: __LLVM, __swift_cmdline.

At link time, object files are linked together by the linker (ld) into a Mach-O executable. The bitcode from the different object files is taken from the __bitcode section and put into separate files. These files are each added to a xar archive, together with the command-line arguments from the __cmdline section. The latter are stored as metadata in the archive's table of contents (TOC). The resulting archive is put in the __bundle section of the __LLVM segment. As a result, for executables to support bitcode, changes to the linker are required for Mach-O, ELF and PE on their respective platforms.

Use otool to inspect the content of a section. The -v flag makes the output verbose and symbolic when possible. For the __bundle section this means we get to see the xar's table of contents in XML format, rather than just a bunch of bytes.

otool -v -s __LLVM __bundle binary
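The same table of contents can be read without otool. The following Python sketch assumes you already have the raw bytes of the __bundle section and that they form a regular xar archive: a 28-byte big-endian header starting with the magic "xar!", followed by a zlib-compressed XML TOC. The function names read_xar_toc and file_names are invented for this example, and the assumption that every entry appears as a file element with a name child comes from the xar format in general, not from anything bitcode-specific.

```python
import struct
import xml.etree.ElementTree as ET
import zlib


def read_xar_toc(archive: bytes) -> ET.Element:
    """Parse the xar header and return the root element of the XML TOC."""
    # Header: magic, header size, version, compressed and uncompressed
    # TOC length, checksum algorithm. All fields are big-endian.
    magic, header_size, _version, toc_len, toc_len_raw, _cksum = struct.unpack_from(
        ">4sHHQQI", archive, 0
    )
    if magic != b"xar!":
        raise ValueError("not a xar archive")
    # The compressed TOC starts right after the header.
    toc = zlib.decompress(archive[header_size:header_size + toc_len])
    if len(toc) != toc_len_raw:
        raise ValueError("TOC length does not match the header")
    return ET.fromstring(toc)


def file_names(toc_root: ET.Element):
    """Return the name of every file entry listed in the TOC."""
    return [entry.findtext("name") for entry in toc_root.iter("file")]
```

The file data itself lives in the heap that follows the TOC, at the offsets recorded in each file entry, which this sketch does not extract.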

Bitcode in Xcode

Xcode 7.0 introduced the ENABLE_BITCODE option. When enabled, this passes -fembed-bitcode to clang when making an archive build, i.e. the one you submit to the App Store. Bitcode ends up in the resulting Mach-O as described above. For all other build types, -fembed-bitcode-marker is passed, which results in an empty __LLVM segment, without any real content. This speeds up compilation while maintaining enough information to validate whether all the files contain a bitcode section.
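The rule described above boils down to a small decision, restated here as a Python sketch. The function bitcode_flag and its parameters are hypothetical; this is only a simplified restatement of the behavior described in this post, not Xcode's actual build logic.

```python
from typing import Optional


def bitcode_flag(enable_bitcode: bool, archive_build: bool) -> Optional[str]:
    """Return the clang flag implied by ENABLE_BITCODE and the build type."""
    if not enable_bitcode:
        return None  # no bitcode-related flag is passed at all
    if archive_build:
        return "-fembed-bitcode"  # real bitcode, for App Store submissions
    return "-fembed-bitcode-marker"  # placeholder only, faster to build
```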

Things to Consider

Having bitcode present has some serious implications. Frederic Jacobs addresses many of them in his blog post titled "Why I’m not enabling Bitcode". Apple responded with a follow-up in June, in which they propose two mitigations:

Symbol Hiding: Symbols in the IR would be removed, comparable to the strip utility that removes symbols from machine code.

Debug Stripping: All debug info would be removed with the exception of line-tables.

Note that in order for the strip pass to be consistent, it must have knowledge of all files involved. As such, it can only be performed at link time. In most cases this is reasonable, as there is one point in time where all bitcode gets bundled. Furthermore, Apple is pushing LTO (Link Time Optimization) so their choice for this trade-off is aligned with their general vision and doesn't come as a big surprise.

Getting Bitcode from Mach-O

Alex Denisov created bitcode_retriever, a small tool for obtaining the bitcode archive from a Mach-O binary. I later improved it a little, adding the ability to extract the archive and obtain bitcode directly, to obtain the linker flags from the xar TOC, as well as making it suitable for use as a C/C++ library.

After that, I created LibEBC, a library built on top of LLVM for extracting bitcode from binaries. It works with both object files and libraries. It can be used as a library or as a standalone tool called ebcutil. It's open source and available on GitHub.

Further Reading

If you want a deeper understanding of what bitcode is and how it works, I recommend the following blog posts.