The Internal Structure of Python Eggs

STOP! This is not the first document you should read!

This document assumes you have at least some passing familiarity with
using setuptools, the pkg_resources module, and EasyInstall. It
does not attempt to explain basic concepts like inter-project
dependencies, nor does it contain detailed lexical syntax for most
file formats. Neither does it explain concepts like "namespace
packages" or "resources" in any detail, as all of these subjects are
covered at length in the setuptools developer's guide and the
pkg_resources reference manual.

Instead, this is internal documentation for how those concepts and
features are implemented in concrete terms. It is intended for people
who are working on the setuptools code base, who want to be able to
troubleshoot setuptools problems, want to write code that reads the file
formats involved, or want to otherwise tinker with setuptools-generated
files and directories.

Note, however, that these are all internal implementation details and
are therefore subject to change; stick to the published API if you don't
want to be responsible for keeping your code from breaking when
setuptools changes. You have been warned.

A "Python egg" is a logical structure embodying the release of a
specific version of a Python project, comprising its code, resources,
and metadata. There are multiple formats that can be used to physically
encode a Python egg, and others can be developed. However, a key
principle of Python eggs is that they should be discoverable and
importable. That is, it should be possible for a Python application to
easily and efficiently find out what eggs are present on a system, and
to ensure that the desired eggs' contents are importable.

There are two basic formats currently implemented for Python eggs:

.egg format: a directory or zipfile containing the project's
code and resources, along with an EGG-INFO subdirectory that
contains the project's metadata

.egg-info format: a file or directory placed adjacent to the
project's code and resources, that directly contains the project's
metadata.

Both formats can include arbitrary Python code and resources, including
static data files, package and non-package directories, Python
modules, C extension modules, and so on. But each format is optimized
for different purposes.

The .egg format is well-suited to distribution and the easy
uninstallation or upgrades of code, since the project is essentially
self-contained within a single directory or file, unmingled with any
other projects' code or resources. It also makes it possible to have
multiple versions of a project simultaneously installed, such that
individual programs can select the versions they wish to use.

The .egg-info format, on the other hand, was created to support
backward-compatibility, performance, and ease of installation for system
packaging tools that expect to install all projects' code and resources
to a single directory (e.g. site-packages). Placing the metadata
in that same directory simplifies the installation process, since it
isn't necessary to create .pth files or otherwise modify
sys.path to include each installed egg.

Its disadvantage, however, is that it provides no support for clean
uninstallation or upgrades, and of course only a single version of a
project can be installed to a given directory. Thus, support from a
package management tool is required. (This is why setuptools' "install"
command refers to this type of egg installation as "single-version,
externally managed".) Also, they lack sufficient data to allow them to
be copied from their installation source. easy_install can "ship" an
application by copying .egg files or directories to a target
location, but it cannot do this for .egg-info installs, because
there is no way to tell what code and resources belong to a particular
egg -- there may be several eggs "scrambled" together in a single
installation location, and the .egg-info format does not currently
include a way to list the files that were installed. (This may change
in a future version.)

The layout of the code and resources is dictated by Python's normal
import layout, relative to the egg's "base location".

For the .egg format, the base location is the .egg itself. That
is, adding the .egg filename or directory name to sys.path
makes its contents importable.

For the .egg-info format, however, the base location is the
directory that contains the .egg-info, and thus it is the
directory that must be added to sys.path to make the egg importable.
(Note that this means that the "normal" installation of a package to a
sys.path directory is sufficient to make it an "egg" if it has an
.egg-info file or directory installed alongside of it.)

If eggs contained only code and resources, there would of course be
no difference between them and any other directory or zip file on
sys.path. Thus, metadata must also be included, using a metadata
file or directory.

For the .egg format, the metadata is placed in an EGG-INFO
subdirectory, directly within the .egg file or directory. For the
.egg-info format, metadata is stored directly within the
.egg-info directory itself.

The minimum project metadata that all eggs must have is a standard
Python PKG-INFO file, named PKG-INFO and placed within the
metadata directory appropriate to the format. Because it's possible for
this to be the only metadata file included, .egg-info format eggs
are not required to be a directory; they can just be a .egg-info
file that directly contains the PKG-INFO metadata. This eliminates
the need to create a directory just to store one file. This option is
not available for .egg formats, since setuptools always includes
other metadata. (In fact, setuptools itself never generates
.egg-info files, either; the support for using files was added so
that the requirement could easily be satisfied by other tools, such
as the distutils in Python 2.5).

In addition to the PKG-INFO file, an egg's metadata directory may
also include files and directories representing various forms of
optional standard metadata (see the section on Standard Metadata,
below) or user-defined metadata required by the project. For example,
some projects may define a metadata format to describe their application
plugins, and metadata in this format would then be included by plugin
creators in their projects' metadata directories.

To allow introspection of installed projects and runtime resolution of
inter-project dependencies, a certain amount of information is embedded
in egg filenames. At a minimum, this includes the project name, and
ideally will also include the project version number. Optionally, it
can also include the target Python version and required runtime
platform if platform-specific C code is included. The syntax of an
egg filename is as follows:

name ["-" version ["-py" pyver ["-" required_platform]]] "." ext

The "name" and "version" should be escaped using the to_filename()
function provided by pkg_resources, after first processing them with
safe_name() and safe_version() respectively. These latter two
functions can also be used to later "unescape" these parts of the
filename. (For a detailed description of these transformations, please
see the "Parsing Utilities" section of the pkg_resources manual.)

The "pyver" string is the Python major version, as found in the first
3 characters of sys.version. "required_platform" is essentially
a distutils get_platform() string, but with enhancements to properly
distinguish Mac OS versions. (See the get_build_platform()
documentation in the "Platform Utilities" section of the
pkg_resources manual for more details.)

Finally, the "ext" is either .egg or .egg-info, as appropriate
for the egg's format.

Normally, an egg's filename should include at least the project name and
version, as this allows the runtime system to find desired project
versions without having to read the egg's PKG-INFO to determine its
version number.

Setuptools, however, only includes the version number in the filename
when an .egg file is built using the bdist_egg command, or when
an .egg-info directory is being installed by the
install_egg_info command. When generating metadata for use with the
original source tree, it only includes the project name, so that the
directory will not have to be renamed each time the project's version
changes.

This is especially important when version numbers change frequently, and
the source metadata directory is kept under version control with the
rest of the project. (As would be the case when the project's source
includes project-defined metadata that is not generated from by
setuptools from data in the setup script.)

In addition to the .egg and .egg-info formats, there is a third
egg-related extension that you may encounter on occasion: .egg-link
files.

These files are not eggs, strictly speaking. They simply provide a way
to reference an egg that is not physically installed in the desired
location. They exist primarily as a cross-platform alternative to
symbolic links, to support "installing" code that is being developed in
a different location than the desired installation location. For
example, if a user is developing an application plugin in their home
directory, but the plugin needs to be "installed" in an application
plugin directory, running "setup.py develop -md /path/to/app/plugins"
will install an .egg-link file in /path/to/app/plugins, that
tells the egg runtime system where to find the actual egg (the user's
project source directory and its .egg-info subdirectory).

.egg-link files are named following the format for .egg and
.egg-info names, but only the project name is included; no version,
Python version, or platform information is included. When the runtime
searches for available eggs, .egg-link files are opened and the
actual egg file/directory name is read from them.

Each .egg-link file should contain a single file or directory name,
with no newlines. This filename should be the base location of one or
more eggs. That is, the name must either end in .egg, or else it
should be the parent directory of one or more .egg-info format eggs.

As of setuptools 0.6c6, the path may be specified as a platform-independent
(i.e. /-separated) relative path from the directory containing the
.egg-link file, and a second line may appear in the file, specifying a
platform-independent relative path from the egg's base directory to its
setup script directory. This allows installation tools such as EasyInstall
to find the project's setup directory and build eggs or perform other setup
commands on it.

In addition to the minimum required PKG-INFO metadata, projects can
include a variety of standard metadata files or directories, as
described below. Except as otherwise noted, these files and directories
are automatically generated by setuptools, based on information supplied
in the setup script or through analysis of the project's code and
resources.

Most of these files and directories are generated via "egg-info
writers" during execution of the setuptools egg_info command, and
are listed in the egg_info.writers entry point group defined by
setuptools' own setup.py file.

Project authors can register their own metadata writers as entry points
in this group (as described in the setuptools manual under "Adding new
EGG-INFO Files") to cause setuptools to generate project-specific
metadata files or directories during execution of the egg_info
command. It is up to project authors to document these new metadata
formats, if they create any.

Files described in this section that have .txt extensions have a
simple lexical format consisting of a sequence of text lines, each line
terminated by a linefeed character (regardless of platform). Leading
and trailing whitespace on each line is ignored, as are blank lines and
lines whose first nonblank character is a # (comment symbol). (This
is the parsing format defined by the yield_lines() function of
the pkg_resources module.)

All .txt files defined by this section follow this format, but some
are also "sectioned" files, meaning that their contents are divided into
sections, using square-bracketed section headers akin to Windows
.ini format. Note that this does not imply that the lines within
the sections follow an .ini format, however. Please see an
individual metadata file's documentation for a description of what the
lines and section names mean in that particular file.

Sectioned files can be parsed using the split_sections() function;
see the "Parsing Utilities" section of the pkg_resources manual for
for details.

This is a "sectioned" text file. Each section is a sequence of
"requirements", as parsed by the parse_requirements() function;
please see the pkg_resources manual for the complete requirement
parsing syntax.

The first, unnamed section (i.e., before the first section header) in
this file is the project's core requirements, which must be installed
for the project to function. (Specified using the install_requires
keyword to setup()).

The remaining (named) sections describe the project's "extra"
requirements, as specified using the extras_require keyword to
setup(). The section name is the name of the optional feature, and
the section body lists that feature's dependencies.

Note that it is not normally necessary to inspect this file directly;
pkg_resources.Distribution objects have a requires() method
that can be used to obtain Requirement objects describing the
project's core and optional dependencies.

A list of dependency URLs, one per line, as specified using the
dependency_links keyword to setup(). These may be direct
download URLs, or the URLs of web pages containing direct download
links, and will be used by EasyInstall to find dependencies, as though
the user had manually provided them via the --find-links command
line option. Please see the setuptools manual and EasyInstall manual
for more information on specifying this option, and for information on
how EasyInstall processes --find-links URLs.

This file follows an identical format to requires.txt, but is
obsolete and should not be used. The earliest versions of setuptools
required users to manually create and maintain this file, so the runtime
still supports reading it, if it exists. The new filename was created
so that it could be automatically generated from setup() information
without overwriting an existing hand-created depends.txt, if one
was already present in the project's source .egg-info directory.

A list of namespace package names, one per line, as supplied to the
namespace_packages keyword to setup(). Please see the manuals
for setuptools and pkg_resources for more information about
namespace packages.

This is a "sectioned" text file, whose contents encode the
entry_points keyword supplied to setup(). All sections are
named, as the section names specify the entry point groups in which the
corresponding section's entry points are registered.

Each section is a sequence of "entry point" lines, each parseable using
the EntryPoint.parse classmethod; please see the pkg_resources
manual for the complete entry point parsing syntax.

Note that it is not necessary to parse this file directly; the
pkg_resources module provides a variety of APIs to locate and load
entry points automatically. Please see the setuptools and
pkg_resources manuals for details on the nature and uses of entry
points.

This directory is currently only created for .egg files built by
the setuptools bdist_egg command. It will contain copies of all
of the project's "traditional" scripts (i.e., those specified using the
scripts keyword to setup()). This is so that they can be
reconstituted when an .egg file is installed.

The scripts are placed here using the disutils' standard
install_scripts command, so any #! lines reflect the Python
installation where the egg was built. But instead of copying the
scripts to the local script installation directory, EasyInstall writes
short wrapper scripts that invoke the original scripts from inside the
egg, after ensuring that sys.path includes the egg and any eggs it
depends on. For more about script wrappers, see the section below on
Installation and Path Management Issues.

A list of C extensions and other dynamic link libraries contained in
the egg, one per line. Paths are /-separated and relative to the
egg's base location.

This file is generated as part of bdist_egg processing, and as such
only appears in .egg files (and .egg directories created by
unpacking them). It is used to ensure that all libraries are extracted
from a zipped egg at the same time, in case there is any direct linkage
between them. Please see the Zip File Issues section below for more
information on library and resource extraction from .egg files.

A list of resource files and/or directories, one per line, as specified
via the eager_resources keyword to setup(). Paths are
/-separated and relative to the egg's base location.

Resource files or directories listed here will be extracted
simultaneously, if any of the named resources are extracted, or if any
native libraries listed in native_libs.txt are extracted. Please
see the setuptools manual for details on what this feature is used for
and how it works, as well as the Zip File Issues section below.

These are zero-length files, and either one or the other should exist.
If zip-safe exists, it means that the project will work properly
when installedas an .egg zipfile, and conversely the existence of
not-zip-safe means the project should not be installed as an
.egg file. The zip_safe option to setuptools' setup()
determines which file will be written. If the option isn't provided,
setuptools attempts to make its own assessment of whether the package
can work, based on code and content analysis.

If neither file is present at installation time, EasyInstall defaults
to assuming that the project should be unzipped. (Command-line options
to EasyInstall, however, take precedence even over an existing
zip-safe or not-zip-safe file.)

Note that these flag files appear only in .egg files generated by
bdist_egg, and in .egg directories created by unpacking such an
.egg file.

This file is a list of the top-level module or package names provided
by the project, one Python identifier per line.

Subpackages are not included; a project containing both a foo.bar
and a foo.baz would include only one line, foo, in its
top_level.txt.

This data is used by pkg_resources at runtime to issue a warning if
an egg is added to sys.path when its contained packages may have
already been imported.

(It was also once used to detect conflicts with non-egg packages at
installation time, but in more recent versions, setuptools installs eggs
in such a way that they always override non-egg packages, thus
preventing a problem from arising.)

This file is roughly equivalent to the distutils' MANIFEST file.
The differences are as follows:

The filenames always use / as a path separator, which must be
converted back to a platform-specific path whenever they are read.

The file is automatically generated by setuptools whenever the
egg_info or sdist commands are run, and it is not
user-editable.

Although this metadata is included with distributed eggs, it is not
actually used at runtime for any purpose. Its function is to ensure
that setuptools-built source distributions can correctly discover
what files are part of the project's source, even if the list had been
generated using revision control metadata on the original author's
system.

In other words, SOURCES.txt has little or no runtime value for being
included in distributed eggs, and it is possible that future versions of
the bdist_egg and install_egg_info commands will strip it before
installation or distribution. Therefore, do not rely on its being
available outside of an original source directory or source
distribution.

Although zip files resemble directories, they are not fully
substitutable for them. Most platforms do not support loading dynamic
link libraries contained in zipfiles, so it is not possible to directly
import C extensions from .egg zipfiles. Similarly, there are many
existing libraries -- whether in Python or C -- that require actual
operating system filenames, and do not work with arbitrary "file-like"
objects or in-memory strings, and thus cannot operate directly on the
contents of zip files.

To address these issues, the pkg_resources module provides a
"resource API" to support obtaining either the contents of a resource,
or a true operating system filename for the resource. If the egg
containing the resource is a directory, the resource's real filename
is simply returned. However, if the egg is a zipfile, then the
resource is first extracted to a cache directory, and the filename
within the cache is returned.

The cache directory is determined by the pkg_resources API; please
see the set_cache_path() and get_default_cache() documentation
for details.

Resources are extracted to a cache subdirectory whose name is based
on the enclosing .egg filename and the path to the resource. If
there is already a file of the correct name, size, and timestamp, its
filename is returned to the requester. Otherwise, the desired file is
extracted first to a temporary name generated using
mkstemp(".$extract",target_dir), and then its timestamp is set to
match the one in the zip file, before renaming it to its final name.
(Some collision detection and resolution code is used to handle the
fact that Windows doesn't overwrite files when renaming.)

If a resource directory is requested, all of its contents are
recursively extracted in this fashion, to ensure that the directory
name can be used as if it were valid all along.

If the resource requested for extraction is listed in the
native_libs.txt or eager_resources.txt metadata files, then
all resources listed in either file will be extracted before the
requested resource's filename is returned, thus ensuring that all
C extensions and data used by them will be simultaneously available.

Since Python's built-in zip import feature does not support loading
C extension modules from zipfiles, the setuptools bdist_egg command
generates special import wrappers to make it work.

The wrappers are .py files (along with corresponding .pyc
and/or .pyo files) that have the same module name as the
corresponding C extension. These wrappers are located in the same
package directory (or top-level directory) within the zipfile, so that
say, foomodule.so will get a corresponding foo.py, while
bar/baz.pyd will get a corresponding bar/baz.py.

These wrapper files contain a short stanza of Python code that asks
pkg_resources for the filename of the corresponding C extension,
then reloads the module using the obtained filename. This will cause
pkg_resources to first ensure that all of the egg's C extensions
(and any accompanying "eager resources") are extracted to the cache
before attempting to link to the C library.

Note, by the way, that .egg directories will also contain these
wrapper files. However, Python's default import priority is such that
C extensions take precedence over same-named Python modules, so the
import wrappers are ignored unless the egg is a zipfile.

Python's initial setup of sys.path is very dependent on the Python
version and installation platform, as well as how Python was started
(i.e., script vs. -c vs. -m vs. interactive interpreter).
In fact, Python also provides only two relatively robust ways to affect
sys.path outside of direct manipulation in code: the PYTHONPATH
environment variable, and .pth files.

However, with no cross-platform way to safely and persistently change
environment variables, this leaves .pth files as EasyInstall's only
real option for persistent configuration of sys.path.

But .pth files are rather strictly limited in what they are allowed
to do normally. They add directories only to the end of sys.path,
after any locally-installed site-packages directory, and they are
only processed in the site-packages directory to start with.

This is a double whammy for users who lack write access to that
directory, because they can't create a .pth file that Python will
read, and even if a sympathetic system administrator adds one for them
that calls site.addsitedir() to allow some other directory to
contain .pth files, they won't be able to install newer versions of
anything that's installed in the systemwide site-packages, because
their paths will still be added aftersite-packages.

So EasyInstall applies two workarounds to solve these problems.

The first is that EasyInstall leverages .pth files' "import" feature
to manipulate sys.path and ensure that anything EasyInstall adds
to a .pth file will always appear before both the standard library
and the local site-packages directories. Thus, it is always
possible for a user who can write a Python-read .pth file to ensure
that their packages come first in their own environment.

Second, when installing to a PYTHONPATH directory (as opposed to
a "site" directory like site-packages) EasyInstall will also install
a special version of the site module. Because it's in a
PYTHONPATH directory, this module will get control before the
standard library version of site does. It will record the state of
sys.path before invoking the "real" site module, and then
afterwards it processes any .pth files found in PYTHONPATH
directories, including all the fixups needed to ensure that eggs always
appear before the standard library in sys.path, but are in a relative
order to one another that is defined by their PYTHONPATH and
.pth-prescribed sequence.

The net result of these changes is that sys.path order will be
as follows at runtime:

The sys.argv[0] directory, or an emtpy string if no script
is being executed.

All eggs installed by EasyInstall in any .pth file in each
PYTHONPATH directory, in order first by PYTHONPATH order,
then normal .pth processing order (which is to say alphabetical
by .pth filename, then by the order of listing within each
.pth file).

All eggs installed by EasyInstall in any .pth file in each "site"
directory (such as site-packages), following the same ordering
rules as for the ones on PYTHONPATH.

The PYTHONPATH directories themselves, in their original order

Any paths from .pth files found on PYTHONPATH that were not
eggs installed by EasyInstall, again following the same relative
ordering rules.

The standard library and "site" directories, along with the contents
of any .pth files found in the "site" directories.

Notice that sections 1, 4, and 6 comprise the "normal" Python setup for
sys.path. Sections 2 and 3 are inserted to support eggs, and
section 5 emulates what the "normal" semantics of .pth files on
PYTHONPATH would be if Python natively supported them.

For further discussion of the tradeoffs that went into this design, as
well as notes on the actual magic inserted into .pth files to make
them do these things, please see also the following messages to the
distutils-SIG mailing list:

EasyInstall never directly installs a project's original scripts to
a script installation directory. Instead, it writes short wrapper
scripts that first ensure that the project's dependencies are active
on sys.path, before invoking the original script. These wrappers
have a #! line that points to the version of Python that was used to
install them, and their second line is always a comment that indicates
the type of script wrapper, the project version required for the script
to run, and information identifying the script to be invoked.

The format of this marker line is:

"# EASY-INSTALL-" script_type ": " tuple_of_strings "\n"

The script_type is one of SCRIPT, DEV-SCRIPT, or
ENTRY-SCRIPT. The tuple_of_strings is a comma-separated
sequence of Python string constants. For SCRIPT and DEV-SCRIPT
wrappers, there are two strings: the project version requirement, and
the script name (as a filename within the scripts metadata
directory). For ENTRY-SCRIPT wrappers, there are three:
the project version requirement, the entry point group name, and the
entry point name. (See the "Automatic Script Creation" section in the
setuptools manual for more information about entry point scripts.)

In each case, the project version requirement string will be a string
parseable with the pkg_resources modules' Requirement.parse()
classmethod. The only difference between a SCRIPT wrapper and a
DEV-SCRIPT is that a DEV-SCRIPT actually executes the original
source script in the project's source tree, and is created when the
"setup.py develop" command is run. A SCRIPT wrapper, on the other
hand, uses the "installed" script written to the EGG-INFO/scripts
subdirectory of the corresponding .egg zipfile or directory.
(.egg-info eggs do not have script wrappers associated with them,
except in the "setup.py develop" case.)

The purpose of including the marker line in generated script wrappers is
to facilitate introspection of installed scripts, and their relationship
to installed eggs. For example, an uninstallation tool could use this
data to identify what scripts can safely be removed, and/or identify
what scripts would stop working if a particular egg is uninstalled.