If you have lots of "small" objects in a Python program (objects which
have few instance attributes), you may find that the object overhead
starts to become considerable. The common wisdom says that to reduce this in
CPython you need to re-define the classes to use __slots__, eliminating
the attribute dictionary. But this comes with the downsides of
limiting flexibility and eliminating the use of class defaults. Would
it surprise you to learn that PyPy can significantly, and without
any effort by the programmer, reduce that overhead automatically?

Contrary to advice, instead of starting at the very beginning,
we'll jump right to the end. The following graph shows the peak memory
usage of the example program we'll be talking about in this post
across seven different Python implementations: PyPy2 v6.0, PyPy3 v6.0,
CPython 2.7.15, 3.4.9, 3.5.6, 3.6.6, and 3.7.0 [1].

For regular objects ("Point3D"), PyPy needs less than 700 MB to create
10,000,000 of them, whereas CPython 2.7 needs almost 3.5 GB, and CPython 3.x
needs between 1.5 and 2.1 GB [6]. Moving to __slots__ ("Point3DSlot")
brings the CPython overhead closer to—but still higher than—that of PyPy. In particular, note that the PyPy memory usage is
essentially the same whether or not slots are used.

The third group of data is the same as the second group, except
instead of using small integers that should be in the CPython internal
integer object cache [7], I used larger numbers that shouldn't be
cached. This is just an interesting data point showing the allocation
of three times as many objects, and won't be discussed further.

In the script I used to produce these numbers [2], I'm using the
excellent psutil library's Process.memory_info to record the
"unique set size" ("the memory which is unique to a process and which
would be freed if the process was terminated right now") before and
then after allocating a large number of objects.

This gives us a fairly accurate idea of how much memory the processes
needed to allocate from the operating system to be able to create all
the objects we asked for. (get_memory is a helper function that
runs the garbage collector to be sure we have the most stable
numbers.)
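The helper might look something like this (a sketch, not the exact
script; get_memory is the name used above, and note that in current
psutil the "unique set size" is exposed via Process.memory_full_info):

```python
import gc
import os

def get_memory():
    """Collect garbage, then return this process's unique set size in bytes."""
    # psutil is a third-party library (pip install psutil); imported here
    # so the sketch only needs it when actually measuring.
    import psutil
    # Run the collector a few times so cyclic garbage is really gone
    # and the numbers are as stable as possible.
    for _ in range(3):
        gc.collect()
    return psutil.Process(os.getpid()).memory_full_info().uss
```

Calling this before and after allocating the objects and subtracting
gives the deltas discussed below.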

This was the first of the test runs within this particular process.
The second test run within this process reports higher absolute deltas
since the beginning of the program, although the overall deltas are
smaller. This indicates how much memory the program has allocated from
the operating system but not returned to it, even though it may
technically be free from the standpoint of the Python runtime; this
accounts for things like internal caches or, in PyPy's case, jitted
code.

Although I captured the data, this post is not about the startup or
initial memory allocation of the various interpreters, nor about how
much can easily be shared between forked processes, nor about how much
memory is returned to the operating system while the process is still
running. We're only talking about the memory size needed to allocate a
given number of objects, e.g., the Delta column.

Standard objects, like Point3D, have a special attribute,
__dict__, a normal Python dictionary object that is used
to hold all the instance attributes for the object. We previously
looked at
how __getattribute__ can be used to customize all attribute
reads for an object; likewise, __setattr__ can customize all
attribute writes. The default __getattribute__ and
__setattr__ that a class inherits from object behave
as if they were written to access the __dict__ directly:
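In rough pseudo-code, something like this (a simplified sketch that
ignores descriptors and the full lookup rules; the helper names are
invented):

```python
def default_getattribute(obj, name):
    """Roughly what object.__getattribute__ does (simplified)."""
    # First check the instance's own __dict__...
    if name in obj.__dict__:
        return obj.__dict__[name]
    # ...then fall back to the class and its bases.
    for klass in type(obj).__mro__:
        if name in klass.__dict__:
            return klass.__dict__[name]
    raise AttributeError(name)

def default_setattr(obj, name, value):
    """Roughly what object.__setattr__ does: write to the instance __dict__."""
    obj.__dict__[name] = value

class Point3D(object):
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

p = Point3D(1, 2, 3)
default_setattr(p, 'w', 4)
print(default_getattribute(p, 'w'))  # 4
```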

One advantage of having a __dict__ underlying an object is the
flexibility it provides: you don't have to pre-declare your attributes
for every object, and any object can have any attribute, so it
facilitates subclasses adding new attributes, or even other libraries
adding new, specialized, attributes to implement caching of expensive
computed properties.

One disadvantage is that a __dict__ is a generic Python
dictionary, not specialized at all [3], and as such it has overhead.

On CPython, we can ask the interpreter how much memory any given
object uses with sys.getsizeof. On my machine under a 64-bit
CPython 2.7.15, a bare object takes 16 bytes, while a trivial
subclass takes a full 64 bytes (due to the overhead of being tracked
by the garbage collector):
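For example (the exact numbers are from 64-bit CPython 2.7 as
described; other versions and platforms differ, and the subclass name
here is invented):

```python
import sys

class TrivialSubclass(object):  # a do-nothing subclass of object
    pass

# A bare object is just the fixed-size header.
print(sys.getsizeof(object()))           # 16 on 64-bit CPython 2.7
# The subclass instance also carries GC bookkeeping (and room for
# the __dict__ and __weakref__ pointers).
print(sys.getsizeof(TrivialSubclass()))  # 64 on 64-bit CPython 2.7
```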

These values can change quite a bit across Python versions,
typically improving over time. In CPython 3.4 and 3.5,
getsizeof({}) returns 288, while it returns 240 in both 3.6 and
3.7. In addition, getsizeof(pd.__dict__), where pd is a
Point3D instance, returns 96 and 112
[4]. The answer to getsizeof(pd) is 56 in all four versions.

Objects with a __slots__ declaration, like Point3DSlot do not
have a __dict__ by default. The documentation notes that this
can be a space savings. Indeed, on CPython 2.7, a Point3DSlot has
a size of only 72 bytes, only one full pointer larger than a trivial
subclass (when we do not factor in the __dict__):

>>> pds = Point3DSlot(1, 2, 3)
>>> sys.getsizeof(pds)
72

If they don't have an instance dictionary, where do they store their
attributes? And why, if Point3DSlot has three defined
attributes, is it only one pointer larger than Point3D?

Slots, like @property, @classmethod and @staticmethod, are
implemented using descriptors. For our purpose, descriptors are a
way to extend the workings of __getattribute__ and friends. A
descriptor is an object whose type implements a __get__ method,
and when that object is found in a type's dictionary, it is called
instead of checking the __dict__. Something like this [5]:
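A sketch of that lookup (the function name is invented, and real
CPython additionally distinguishes data from non-data descriptors and
orders the checks accordingly):

```python
def getattr_with_descriptors(obj, name):
    """Simplified __getattribute__ that honors the descriptor protocol."""
    # Look through the type and its bases first.
    for klass in type(obj).__mro__:
        if name in klass.__dict__:
            attr = klass.__dict__[name]
            if hasattr(type(attr), '__get__'):
                # Found a descriptor: call it instead of checking __dict__.
                return type(attr).__get__(attr, obj, type(obj))
            return attr
    # No descriptor found: fall back to the instance __dict__, if any.
    instance_dict = getattr(obj, '__dict__', {})
    if name in instance_dict:
        return instance_dict[name]
    raise AttributeError(name)

class C(object):
    @property
    def answer(self):
        return 42

c = C()
c.extra = 7
print(getattr_with_descriptors(c, 'answer'))  # 42, via the property descriptor
print(getattr_with_descriptors(c, 'extra'))   # 7, via the instance __dict__
```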

When the class statement (indeed, when the type metaclass)
finds __slots__ in the class body (the class dictionary), it
takes special steps. Most importantly, it creates a descriptor for
each mentioned slot and places it in the class's __dict__. So our
Point3DSlot class gets three such descriptors:
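We can see them by inspecting the class (assuming the straightforward
Point3DSlot definition):

```python
class Point3DSlot(object):
    __slots__ = ('x', 'y', 'z')

for name in ('x', 'y', 'z'):
    print(Point3DSlot.__dict__[name])
# On CPython this prints member descriptors:
#   <member 'x' of 'Point3DSlot' objects>
#   <member 'y' of 'Point3DSlot' objects>
#   <member 'z' of 'Point3DSlot' objects>
```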

We've established how we can access these magic, hidden slotted
attributes (through the descriptor protocol). (We've also established
why we can't have defaults for slotted attributes in the class.) But
we still haven't found out where they are stored. If they're not in
a dictionary, where are they?

The answer is that they're stored directly in the object itself. Every
type has a member called tp_basicsize, exposed to Python as
__basicsize__. When the interpreter allocates an object, it
allocates __basicsize__ bytes for it (every object has a minimum
basic size, the size of object). The type metaclass arranges
for __basicsize__ to be big enough to hold (a pointer to) each of
the slotted attributes, which are kept in memory immediately after the
data for the basic object. The descriptor for each attribute,
then, just does some pointer arithmetic off of self to read and
write the value. In a way, it's very similar to how
collections.namedtuple works, except using pointers instead of
indices.

That may be hard to follow, so here's an example.

The basic size of object exactly matches the reported size of its
instances:

>>> object.__basicsize__
16
>>> sys.getsizeof(object())
16

We get the same when we create an object that cannot have any instance
variables, and hence does not need to be tracked by the garbage
collector:
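An empty __slots__ gives such a class (the class name is invented;
note that on Python 3, classes defined at runtime always participate
in GC, so getsizeof reports extra overhead there):

```python
import sys

class NoAttrs(object):
    __slots__ = ()  # no instance attributes possible

print(NoAttrs.__basicsize__)     # 16, the same as object
print(sys.getsizeof(NoAttrs()))  # 16 on CPython 2.7; Python 3 adds a GC header
```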

When we add one slot to an object, its basic size increases by one
pointer (8 bytes). And because such an object can now hold a reference
to another object, it needs to be tracked by the garbage collector, so
getsizeof reports some extra overhead:
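For example (class name invented):

```python
import struct
import sys

class OneSlot(object):
    __slots__ = ('a',)

POINTER = struct.calcsize('P')   # 8 bytes on a 64-bit build

print(OneSlot.__basicsize__)     # object's 16 bytes + one pointer = 24
print(sys.getsizeof(OneSlot()))  # more than 24, due to GC tracking
```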

The basic size for an object with 3 slots is 16 (the size of object) + 3 pointers, or 40.
What's the basic size for an object that has a __dict__?

>>> Point3DSlot.__basicsize__
40
>>> Point3D.__basicsize__
32

Hmm, it's 16 + 2 pointers. What could those two pointers be?
Documentation to the rescue:

__slots__ allow us to explicitly declare data members (like
properties) and deny the creation of __dict__ and
__weakref__ (unless explicitly declared in __slots__...)

So those two pointers are for __dict__ and __weakref__, things
that standard objects get automatically, but which we have to opt-in
to if we want them with __slots__. Thus, an object with three
slots is one pointer size bigger than a standard object.

By now we should understand why the memory usage dropped significantly
when we added __slots__ to our objects on CPython (although that
comes with a cost). That leaves the question: how does PyPy get such
good memory performance with a __dict__ that __slots__ doesn't
even matter?

Earlier I wrote that the __dict__ of an instance is just a
standard dictionary, not specialized at all. That's basically true on
CPython, but it's not at all true on PyPy. PyPy basically fakes
__dict__ by using __slots__ for all objects.

A given set of attributes (such as our "x", "y", "z" attributes for
Point3DSlot) is called a "map". Each instance refers to its map,
which tells PyPy how to efficiently access a given attribute. When an
attribute is added or deleted, a new map is created (or re-used from
an existing object; objects of completely unrelated types, but with
common attributes, can share the same maps) and assigned to the object,
re-arranging things as needed. It's as if __slots__ was assigned
to each instance, with descriptors added and removed for the instance
on the fly.
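A heavily simplified sketch of the idea (all of these names are
invented for illustration; the real machinery lives inside the PyPy
interpreter, not in Python code):

```python
class Map(object):
    """Maps attribute names to storage indices; shared between instances."""
    def __init__(self, attrs=()):
        self.attrs = attrs
        self.index = {name: i for i, name in enumerate(attrs)}
        self.transitions = {}  # name -> the map extended with that name

    def with_attr(self, name):
        # Adding the same attribute always yields the same shared map.
        if name not in self.transitions:
            self.transitions[name] = Map(self.attrs + (name,))
        return self.transitions[name]

EMPTY_MAP = Map()

class MapObject(object):
    """Stores attributes in a compact list; the map says which index is which."""
    __slots__ = ('map', 'storage')

    def __init__(self):
        object.__setattr__(self, 'map', EMPTY_MAP)
        object.__setattr__(self, 'storage', [])

    def __setattr__(self, name, value):
        idx = self.map.index.get(name)
        if idx is not None:
            self.storage[idx] = value
        else:
            # New attribute: transition to the (shared) extended map.
            object.__setattr__(self, 'map', self.map.with_attr(name))
            self.storage.append(value)

    def __getattr__(self, name):
        try:
            return self.storage[self.map.index[name]]
        except KeyError:
            raise AttributeError(name)

p, q = MapObject(), MapObject()
p.x, p.y = 1, 2
q.x, q.y = 3, 4
print(p.map is q.map)  # True: both instances share one map for ('x', 'y')
```

Each instance pays for one map pointer and a compact storage area,
while the name-to-index table is shared by every object with the same
attribute layout.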

If the program ever directly accesses an instance's __dict__, PyPy
creates a thin wrapper object that operates on the object's map.

So for a program that has many similar-looking objects, even if
unrelated, PyPy's approach can save a lot of memory. On the other
hand, if the program creates objects with very diverse sets of
attributes, and frequently accesses their __dict__ directly,
it's theoretically possible that PyPy could use more
memory than CPython.

The CPython dict implementation was completely overhauled in
CPython 3.6.
And based on the sizes of {} versus pd.__dict__ we
can see some sort of specialization for instance
dictionaries, at least in terms of their fill factor.