Exact garbage collection for the strongly-typed Java language may
seem straightforward. Unfortunately, a single pair of bytecodes in
the Java Virtual Machine instruction set presents an obstacle that
has thus far not been discussed in the literature. We explain the
problem, outline the space of possible solutions, and present a
solution utilizing bytecode-preprocessing to enable exact garbage
collection while maintaining compatibility with existing compiled
Java class files.

Full precision in garbage collection implies retaining only those
heap allocated objects that will actually be used in the future.
Since full precision is not computable in general, garbage
collectors use safe (i.e., conservative) approximations such as
reachability from a set of root references. Ambiguous roots
collectors (commonly called “conservative”) can be overly
conservative because they overestimate the root set, and thereby
retain unexpectedly large amounts of garbage. We consider two more
precise collection schemes for Java virtual machines (JVMs). One
uses a type analysis to obtain a type-precise root set (only those
variables that contain references); the other adds a live variable
analysis to reduce the root set to only the live reference
variables. Even with the Java programming language’s strong
typing, it turns out that the JVM specification has a feature that
makes type-precise root sets difficult to compute. We explain the
problem and ways in which it can be solved.

Our experimental results include measurements of the costs of the
type and liveness analyses at load time, of the incremental
benefits at run time of the liveness analysis over the
type-analysis alone, and of various map sixes and counts. We find
that the liveness analysis often produces little or no improvement
in heap size, sometimes modest improvements, and occasionally the
improvement is dramatic. While further study is in order, we
conclude that the main benefit of the liveness analysis is
preventing bad surprises.

We’ve designed and implemented a copying garbage-collection
algorithm that is efficient, real-time, concurrent, runs on
commercial uniprocessors and shared-memory multiprocessors, and
requires no change to compilers. The algorithm uses standard
virtual-memory hardware to detect references to “from space”
objects and to synchronize the collector and mutator threads.
We’ve implemented and measured a prototype running on SRC’s
5-processor Firefly. It will be straightforward to merge our
techniques with generational collection. An incremental,
non-concurrent version could be implemented easily on many
versions of Unix.

Inside Macintosh: Memory describes the parts of the Macintosh®
Operating System that allow you to directly allocate, release, or
otherwise manipulate memory. Everyone who programs Macintosh
computers should read this book.

Inside Macintosh: Memory shows in detail how your application can
manage the memory partition it is allocated and perform other
memory-related operations. It also provides a complete technical
reference for the Memory Manager, the Virtual Memory Manager, and
other memory-related utilities provided by the system software.

Memory management is a critical issue for many large
object-oriented applications, but in C++ only explicit memory
reclamation through the delete operator is generally available. We
analyse different possibilities for memory management in C++ and
present a dynamic memory management framework which can be
customised to the need of specific applications. The framework
allows full integration and coexistence of different memory
management techniques. The Customisable Memory Management (CMM) is
based on a primary collector which exploits an evolution of
Bartlett’s mostly copying garbage collector. Specialised
collectors can be built for separate memory heaps. A Heap class
encapsulates the allocation strategy for each heap. We show how to
emulate different garbage collection styles or user-specific
memory management techniques. The CMM is implemented in C++
without any special support in the language or the compiler. The
techniques used in the CMM are general enough to be applicable
also to other languages.

Automatic garbage collection relieves programmers from the burden
of managing memory themselves and several techniques have been
developed that make garbage collection feasible in many
situations, including real time applications or within traditional
programming languages. However optimal performance cannot always
be achieved by a uniform general purpose solution. Sometimes an
algorithm exhibits a predictable pattern of memory usage that
could be better handled specifically, delaying as much as possible
the intervention of the general purpose collector. This leads to
the requirement for algorithm specific customisation of the
collector strategies. We present a dynamic memory management
framework which can be customised to the needs of an algorithm,
while preserving the convenience of automatic collection in the
normal case. The Customisable Memory Manager (CMM) organises
memory in multiple heaps. Each heap is an instance of a C++ class
which abstracts and encapsulates a particular storage discipline.
The default heap for collectable objects uses the technique of
mostly copying garbage collection, providing good performance and
memory compaction. Customisation of the collector is achieved
exploiting object orientation by defining specialised versions of
the collector methods for each heap class. The object oriented
interface to the collector enables coexistence and coordination
among the various collectors as well as integration with
traditional code unaware of garbage collection. The CMM is
implemented in C++ without any special support in the language or
the compiler. The techniques used in the CMM are general enough to
be applicable also to other languages. The performance of the CMM
is analysed and compared to other conservative collectors for
C/C++ in various configurations.

We consider the combination of card marking with remembered sets
for generational garbage collection as suggested by Hosking and
Moss. When more than two generations are used, a naive
implementation may cause excessive and wasteful scanning of the
cards and thus increase the collection time. We offer a simple
data structure and a corresponding algorithm to keep track of
which cards need be scanned for which generation. We then extend
these ideas for the Train Algorithm of Hudson and Moss. Here, the
solution is more involved, and allows tracking of which card
should be scanned for which car-collection in the train.

This paper investigates some problems associated with an argument
evaluation order that we call “future” order, which is different
from both call-by-name and call-by-value. In call-by-future, each
formal parameter of a function is bound to a separate process
(called a “future”) dedicated to the evaluation of the
corresponding argument. This mechanism allows the fully parallel
evaluation of arguments to a function, and has been shown to
augment the expressive power of a language.

We discuss an approach to a problem that arises in this context:
futures which were thought to be relevant when they were created
become irrelevant through being ignored in the body of the
expression where they were bound. The problem of irrelevant
processes also appears in multiprocessing problem-solving systems
which start several processors working on the same problem but
with different methods, and return with the solution which
finishes first. This “parallel method strategy” has the drawback
that the processes which are investigating the losing methods must
be identified, stopped, and reassigned to more useful tasks.

The solution we propose is that of garbage collection. We propose
that the goal structure of the solution plan be explicitly
represented in memory as part of the graph memory (like Lisp’s
heap) so that a garbage collection algorithm can discover which
processes are performing useful work, and which can be recycled
for a new task. An incremental algorithm for the unified garbage
collection of storage and processes is described.

A real-time list processing system is one in which the time
required by the elementary list operations (e.g. CONS, CAR, CDR,
RPLACA, RPLACD, EQ, and ATOM in LISP) is bounded by a (small)
constant. Classical implementations of list processing systems
lack this property because allocating a list cell from the heap
may cause a garbage collection, which process requires time
proportional to the heap size to finish. A real-time list
processing system is presented which continuously reclaims
garbage, including directed cycles, while linearizing and
compacting the accessible cells into contiguous locations to avoid
fragmenting the free storage pool. The program is small and
requires no time-sharing interrupts, making it suitable for
microcode. Finally, the system requires the same average time, and
not more than twice the space, of a classical implementation, and
those space requirements can be reduced to approximately classical
proportions by compact list representation. Arrays of different
sizes, a program stack, and hash linking are simple extensions to
our system, and reference counting is found to be inferior for
many applications.

MACLISP, unlike some other implementations of LISP, allocates
storage for different types of objects in noncontiguous areas
called “spaces”. These spaces partition the active storage into
disjoint areas, each of which holds a different type of object.
For example, “list cells” are stored in one space, “full-word
integers” reside in another space, “full-word floating point
numbers” in another, and so on.

Allocating space in this manner has several advantages. An
object’s type can easily be computed from a pointer to it, without
any memory references to the object itself. Thus, the LISP
primitive ATOM(x) can easily compute its result without even
paging in x. Another advantage is that the type of an object does
not require any storage within the object, so that arithmetic with
hardware data types such as full-word integers can use hardware
instructions directly.

There are problems associated with this method of storage and type
management, however. When all data types are allocated from the
same heap, there is no problem with varying demand for the
different data types; all data types require storage from the same
pool, so that only the total amount of storage is important. Once
different data types must be allocated from different spaces,
however, the relative sizes of the spaces becomes important.

Garbage collectors must minimize the scarce resources of cache
space and off-chip communications bandwidth to optimize
performance on modern single-chip computer architectures.
Strategies for achieving these goals in the context of copying
garbage collection are discussed. A multi-processor
mutator/collector system is analyzed. Finally, the Intel 80860XP
architecture is studied.

Linear logic has been proposed as one solution to the problem of
garbage collection and providing efficient “update-in-place”
capabilities within a more functional language. Linear logic
conserves accessibility, and hence provides a “mechanical
metaphor” which is more appropriate for a distributed-memory
parallel processor in which copying is explicit. However, linear
logic’s lack of sharing may introduce significant inefficiencies
of its own.

We show an efficient implementation of linear logic called “Linear
Lisp” that runs within a constant factor of non-linear logic. This
Linear Lisp allows RPLACX operations, and manages storage as
safely as a non-linear Lisp, but does not need a garbage
collector. Since it offers assignments but no sharing, it occupies
a twilight zone between functional languages and imperative
languages. Our Linear Lisp Machine offers many of the same
capabilities as combinator/graph reduction machines, but without
their copying and garbage collection problems.

A simple real-time garbage collection algorithm is presented which
does not copy, thereby avoiding some of the problems caused by the
asynchronous motion of objects. This in-place “treadmill” garbage
collection scheme has approximately the same complexity as other
non-moving garbage collectors, thus making it usable in a
high-level language implementation where some pointers cannot be
traced. The treadmill is currently being used in a Lisp system
built in Ada.

“Lazy allocation” is a model for allocating objects on the
execution stack of a high-level language which does not create
dangling references. Our model provides safe transportation into
the heap for objects that may survive the deallocation of the
surrounding stack frame. Space for objects that do not survive the
deallocation of the surrounding stack frame is reclaimed without
additional effort when the stack is popped. Lazy allocation thus
performs a first-level garbage collection, and if the language
supports garbage collection of the heap, then our model can reduce
the amortized cost of allocation in such a heap by filtering out
the short-lived objects that can be more efficiently managed in
LIFO order. A run-time mechanism called “result expectation”
further filters out unneeded results from functions called only
for their effects. In a shared-memory multi-processor environment,
this filtering reduces contention for the allocation and
management of global memory.

Our model performs simple local operations, and is therefore
suitable for an interpreter or a hardware implementation. Its
overheads for functional data are associated only with
assignments, making lazy allocation attractive for “mostly
functional” programming styles. Many existing stack allocation
optimizations can be seen as instances of this generic model, in
which some portion of these local operations have been optimized
away through static analysis techniques.

Important applications of our model include the efficient
allocation of temporary data structures that are passed as
arguments to anonymous procedures which may or may not use these
data structures in a stack-like fashion. The most important of
these objects are functional arguments (funargs), which require
some run-time allocation to preserve the local environment. Since
a funarg is sometimes returned as a first-class value, its
lifetime can survive the stack frame in which it was created.
Arguments which are evaluated in a lazy fashion (Scheme “delays”
or “suspensions”) are similarly handled. Variable-length argument
“lists” themselves can be allocated in this fashion, allowing
these objects to become “first-class”. Finally, lazy allocation
correctly handles the allocation of a Scheme control stack,
allowing Scheme continuations to become first-class values.

We propose an abstract computer model and a programming language
– Psi-Lisp – whose primitive operations are injective and hence
reversible, thus allowing arbitrary undoing without the overheads
of checkpointing. Such a computer can be built from reversible
conservative logic circuits, with the serendipitous advantage of
dissipating far less heat than traditional Boolean AND/OR/NOT
circuits. Unlike functional languages, which have one “state” for
all times, Psi-Lisp has at all times one “state”, with unique
predecessor and successor states.

Compiling into a reversible pseudocode can have benefits even when
targeting a traditional computer. Certain optimizations, e.g.,
update-in-place, and compile-time garbage collection may be more
easily performed, because the information may be elicited without
the difficult and time-consuming iterative abstract interpretation
required for most non-reversible models.

In a reversible machine, garbage collection for recycling storage
can always be performed by a reversed (sub)computation. While this
“collection is reversed mutation” insight does not reduce space
requirements when used for the computation as a whole, it does
save space when used to recycle at finer scales. This insight also
provides an explanation for the fundamental importance of the
push-down stack both for recognizing palindromes and for managing
storage.

Generation-based garbage collection has been advocated by
appealing to the intuitive but vague notion that “young objects
are more likely to die than old objects”. The intuition is, that
if a generation-based garbage collection scheme focuses its effort
on scanning recently created objects, then its scanning efforts
will pay off more in the form of more recovered garbage, than if
it scanned older objects. In this note, we show a counterexample
of a system in which “infant mortality” is as high as you please,
but for which generational garbage collection is ineffective for
improving the average mark/cons ratio. Other benefits, such as
better locality and a smaller number of large delays, may still
make generational garbage collection attractive for such a system,
however.

We argue that intensional object identity in object-oriented
programming languages and databases is best defined operationally
by side-effect semantics. A corollary is that “functional” objects
have extensional semantics. This model of object identity, which
is analogous to the normal forms of relational algebra, provides
cleaner semantics for the value-transmission operations and
built-in primitive equality predicate of a programming language,
and eliminates the confusion surrounding “call-by-value” and
“call-by-reference” as well as the confusion of multiple equality
predicates.

Implementation issues are discussed, and this model is shown to
have significant performance advantages in persistent, parallel,
distributed and multilingual processing environments. This model
also provides insight into the “type equivalence” problem of
Algol-68, Pascal and Ada.

“Reference counting” can be an attractive form of dynamic storage
management. It recovers storage promptly and (with a garbage stack
instead of a free list) it can be made “real-time” – i.e., all
accesses can be performed in constant time. Its major drawbacks
are its inability to reclaim cycles, its count storage, and its
count update overhead. Update overhead is especially irritating
for functional (read-only) data where updates may dirty pristine
cache lines and pages.

We show how reference count updating can be largely eliminated for
functional data structures by using the “linear style” of
programming that is inspired by Girard’s linear logic, and by
distinguishing normal pointers from “anchored pointers”, which
indicate not only the object itself, but also the depth of the
stack frame that anchors the object. An “anchor” for a pointer is
essentially an enclosing data structure that is temporarily locked
from being collected for the duration of the anchored pointer’s
existence by a deferred reference count. An “anchored pointer”
thus implies a reference count increment that has been deferred
until it is either cancelled or performed.

Anchored pointers are generalizations of “borrowed” pointers and
“phantom” pointers. Anchored pointers can provide a solution to
the “derived pointer problem” in garbage collection.

We discuss the principles of statistical thermodynamics and their
application to storage management problems. We point out problems
which result from imprecise usage of the terms “information”,
“state”, “reversible”, “conservative”, etc.

Programming languages should have ‘use-once’ variables in addition
to the usual ‘multiple-use’ variables. ‘Use-once’ variables are
bound to linear (unshared, unaliased, or singly-referenced)
objects. Linear objects are cheap to access and manage, because
they require no synchronization or tracing garbage collection.
Linear objects can elegantly and efficiently solve otherwise
difficult problems of functional/mostly-functional systems –
e.g., in-place updating and the efficient initialization of
functional objects. Use-once variables are ideal for directly
manipulating resources which are inherently linear such as
freelists and ‘engine ticks’ in reflective languages.

A ‘use-once’ variable must be dynamically referenced exactly once
within its scope. Unreferenced use-once variables must be
explicitly killed, and multiply-referenced use-once variables must
be explicitly copied; this duplication and deletion is subject to
the constraint that some linear datatypes do not support
duplication and deletion methods. Use-once variables are bound
only to linear objects, which may reference other linear or
non-linear objects. Non-linear objects can reference other
non-linear objects, but can reference a linear object only in a
way that ensures mutual exclusion.

Although implementations have long had implicit use-once variables
and linear objects, most languages do not provide the programmer
any help for their utilization. For example, use-once variables
allow for the safe/controlled use of reified language
implementation objects like single-use continuations.

The International Workshop on Memory Management 1995 (IWMM‘95) is
a continuation of the excellent series started by Yves Bekkers and
Jacques Cohen with IWMM‘92. The present volume assembles the
refereed and invited technical papers which were presented during
this year’s workshop.

There is no well-defined memory-management library API which would
allow programmers to easily choose the best memory management
implementation for their application.

Some languages allow replacement of their memory management
functions, but usually only the program API is specified, hence
replacement of the entire program interface is required.

Few languages support multiple memory management policies within a
single program. Those that do use proprietary memory management
policies.

We believe that the design of an abstract program API is a
prerequisite to the design of a “server” API and eventually an API
that would permit multiple cooperating memory “servers”. If the
interface is simple yet powerful enough to encompass most memory
management systems, it stands a good chance of being widely
adopted.

Dynamic storage allocation is used heavily in many application
areas including interpreters, simulators, optimizers, and
translators. We describe research that can improve all aspects of
the performance of dynamic storage allocation by predicting the
lifetimes of short-lived objects when they are allocated. Using
five significant, allocation-intensive C programs, we show that a
great fraction of all bytes allocated are short-lived (> 90% in
all cases). Furthermore, we describe an algorithm for lifetime
prediction that accurately predicts the lifetimes of 42–99% of all
objects allocated. We describe and simulate a storage allocator
that takes advantage of lifetime prediction of short-lived objects
and show that it can significantly improve a program’s memory
overhead and reference locality, and even, at times, improve CPU
performance as well.

Generational techniques have been very successful in reducing the
impact of garbage collection algorithms upon the performance of
programs. However, it is impossible for designers of collection
algorithms to anticipate the memory allocation behavior of all
applications in advance. Existing generational collectors rely
upon the applications programmer to tune the behavior of the
collector to achieve maximum performance for each application.
Unfortunately, because the many tuning parameters require detailed
knowledge of both the collection algorithm and the program
allocation behavior in order to be used effectively, such tuning
is difficult and error prone. We propose a new garbage collection
algorithm that uses just two easily understood tuning parameters
that directly reflect the maximum memory and pause time
constraints familiar to application programmers and users.

Like generational collectors, ours divides memory into two spaces,
one for short-lived, and another for long-lived objects. Unlike
previous work, our collector dynamically adjusts the boundary
between these two spaces in order to directly meet the resource
constraints specified by the user. We describe two methods for
adjusting this boundary, compare them with several existing
algorithms, and show how effectively ours meets the specified
constraints. Our pause time collector saved memory by holding
median pause times closer to the constraint than the other pause
time constrained algorithm and, when not over-constrained, our
memory constrained collector exhibited the lowest CPU overhead of
the algorithms we measured yet was capable of maintaining a
maximum memory constraint.

This paper introduces a copying garbage collection algorithm which
is able to compact most of the accessible storage in the heap
without having an explicitly defined set of pointers that contain
all the roots of all accessible storage. Using “hints” found in
the processor’s registers and stack, the algorithm is able to
divide heap allocated objects into two groups: those that might be
referenced by a pointer in the stack or registers, and those that
are not. The objects which might be referenced are left in place,
and the other objects are copied into a more compact
representation.

A Lisp compiler and runtime system which uses such a collector
need not have complete control of the processor in order to force
a certain discipline on the stack and registers. A Scheme
implementation has been done for the Digital WRL Titan processor
which uses a garbage collector based on this “mostly copying”
algorithm. Like other languages for the Titan, it uses the Mahler
intermediate language as its target. This simplifies the compiler
and allows it to take advantage of the significant machine
dependent optimizations provided by Mahler. The common
intermediate language also simplifies call-outs from Scheme
programs to functions written in other languages and call-backs
from functions in other languages.

Measurements of the Scheme implementation show that the algorithm
is efficient, as little unneeded storage is retained and only a
very small fraction of the heap is left in place.

Simple pointer manipulation protocols also mean that compiler
support is not needed in order to correctly handle pointers. Thus
it is reasonable to provide garbage collected storage in languages
such as C. A collector written in C which uses this algorithm is
included in the Appendix.

The “mostly-copying” garbage collection algorithm provides a way
to perform compacting garbage collection in spite of the presence
of ambiguous pointers in the root set. As originally defined, each
collection required almost all accessible objects to be moved.
While adequate for many applications, programs that retained a
large amount of storage spent a significant amount of time garbage
collecting. To improve performance of these applications, a
generational version of the algorithm has been designed. This note
reports on this extension of the algorithm, and its application in
collectors for Scheme and C++.

In this paper, we present Hoard, a memory allocator for
shared-memory multiprocessors. We prove that its worst-case memory
fragmentation is asymptotically equivalent to that of an optimal
uniprocessor allocator. We present experiments that demonstrate
its speed and scalability.

Current general-purpose memory allocators do not provide
sufficient speed or flexibility for modern high-performance
applications. Highly-tuned general purpose allocators have
per-operation costs around one hundred cycles, while the cost of
an operation in a custom memory allocator can be just a handful of
cycles. To achieve high performance, programmers often write
custom memory allocators from scratch – a difficult and
error-prone process.

In this paper, we present a flexible and efficient infrastructure
for building memory allocators that is based on C++ templates and
inheritance. This novel approach allows programmers to build
custom and general-purpose allocators as “heap layers” that can be
composed without incurring any additional runtime overhead or
additional programming cost. We show that this infrastructure
simplifies allocator construction and results in allocators that
either match or improve the performance of heavily-tuned
allocators written in C, including the Kingsley allocator and the
GNU obstack library. We further show this infrastructure can be
used to rapidly build a general-purpose allocator that has
performance comparable to the Lea allocator, one of the best
uniprocessor allocators available. We thus demonstrate a clean,
easy-to-use allocator interface that seamlessly combines the power
and efficiency of any number of general and custom allocators
within a single application.

We describe a technique for storage allocation and garbage
collection in the absence of significant co-operation from the
code using the allocator. This limits garbage collection overhead
to the time actually required for garbage collection. In
particular, application programs that rarely or never make use of
the collector no longer encounter a substantial performance
penalty. This approach greatly simplifies the implementation of
languages supporting garbage collection. It further allows
conventional compilers to be used with a garbage collector, either
as the primary means of storage reclamation, or as a debugging
tool.

We present a method for adapting garbage collectors designed to
run sequentially with the client, so that they may run
concurrently with it. We rely on virtual memory hardware to
provide information about pages that have been updated or
“dirtied” during a given period of time. This method has been used
to construct a mostly parallel trace-and-sweep collector that
exhibits very short pause times. Performance measurements are
given.

Conservative garbage collectors are commonly used in combination
with conventional C programs. Empirically, this usually works
well. However, there are no guarantees that this is safe in the
presence of “improved” compiler optimization. We propose that C
compilers provide a facility to suppress optimizations that are
unsafe in the presence of conservative garbage collection. Such a
facility can be added to an existing compiler at very minimal
cost, provided the additional analysis is done in a
machine-independent source-to-source prepass. Such a prepass may
also check the source code for garbage-collector-safety.

We call a garbage collector conservative if it has only partial
information about the location of pointers, and is thus forced to
treat arbitrary bit patterns as though they might be pointers, in
at least some cases. We show that some very inexpensive, but
previously unused techniques can have dramatic impact on the
effectiveness of conservative garbage collectors in reclaiming
memory. Our most significant observation is that static data that
appears to point to the heap should not result in misidentified
reference to the heap. The garbage collector has enough
information to allocate around such references. We also observe
that programming style has a significantly impact on the amount of
spuriously retained storage, typically even if the collector is
not terribly conservative. Some fairly common C and C++
programming styles significantly decrease the effectiveness of any
garbage collector. These observations suffice to explain some of
the different assessments of conservative collection that have
appeared in the literature.

Cache misses are currently a major factor in the cost of garbage
collection, and we expect them to dominate in the future.
Traditional garbage collection algorithms exhibit relatively litle
temporal locality; each live object in the heap is likely to be
touched exactly once during each garbage collection. We measure
two techniques for dealing with this issue: prefetch-on-grey, and
lazy sweeping. The first of these is new in this context. Lazy
sweeping has been in common use for a decade. It was introduced as
a mechanism for reducing paging and pause times; we argue that it
is also crucial for eliminating cache misses during the sweep
phase.

Our measurements are obtained in the context of a non-moving
garbage collector. Fully copying garbage collection inherently
requires more traffic through the cache, and thus probably also
stands to benefit substantially from something like the
prefetch-on-grey technique. Generational garbage collection may
reduce the benefit of these techniques for some applications, but
experiments with a non-moving generational collector suggest that
they remain quite useful.

Conservative garbage collectors can automatically reclaim unused
memory in the absence of precise pointer location information. If
a location can possibly contain a pointer, it is treated by the
collector as though it contained a pointer. Although it is
commonly assumed that this can lead to unbounded space use due to
misidentified pointers, such extreme space use is rarely observed
in practice, and then generally only if the number of
misidentified pointers is itself unbounded. We show that if the
program manipulates only data structures satisfying a simple
GC-robustness criterion, then a bounded number of misidentified
pointers can result at most in increasing space usage by a
constant factor. We argue that nearly all common data structures
are already GC- robust, and it is typically easy to identify and
replace those that are not. Thus it becomes feasible to prove
space bounds on programs collected by mildly conservative garbage
collectors, such as the one in Barabash et al. (2001). The
worst-case space overhead introduced by such mild conservatism is
comparable to the worst-case fragmentation overhead for inherent
in any non-moving storage allocator. The same GC-robustness
criterion also ensures the absence of temporary space leaks of the
kind discussed in Rojemo (1995) for generational garbage
collectors.

We compare two different facilities for running cleanup actions
for objects that are about to reach the end of their life.
Destructors, such as we find in C++, are invoked synchronously
when an object goes out of scope. They make it easier to implement
cleanup actions for objects of well-known lifetime, especially in
the presence of exceptions. Languages like Java, Modula-3, and C#
provide a different kind of “finalization” facility: Cleanup
methods may be run when the garbage collector discovers a heap
object to be otherwise inaccessible. Unlike C++ destructors, such
methods run in a separate thread at some much less well-defined
time. We argue that these are fundamentally different, and
potentially complementary, language facilities. We also try to
resolve some common misunderstandings about finalization in the
process. In particular: 1. The asynchronous nature of finalizers
is not just an accident of implementation or a shortcoming of
tracing collectors; it is necessary for correctness of client
code, fundamentally affects how finalizers must be written, and
how finalization facilities should be presented to the user. 2. An
object may legitimately be finalized while one of its methods are
still running. This should and can be addressed by the language
specification and client code.

An algorithm is presented that searches for the location, “i,”
of the first occurrence of a character string, “pat,” in another
string, “string.” During the search operation, the characters of
pat are matched starting with the last character of pat. The
information gained by starting the match at the end of the pattern
often allows the algorithm to proceed in large jumps through the
text being searched. Thus the algorithm has the unusual property
that, in most cases, not all of the first i characters of
string are inspected. The number of characters actually
inspected (on the average) decreases as a function of the length
of pat. For a random English pattern of length 5, the algorithm
will typically inspect i/4 characters of string before finding a
match at i. Furthermore, the algorithm has been implemented so
that (on the average) fewer than i + patlen machine
instructions are executed. These conclusions are supported with
empirical evidence and a theoretical analysis of the average
behavior of the algorithm. The worst case behavior of the
algorithm is linear in i + patlen, assuming the availability
of array space for tables linear in patlen plus the size of the
alphabet.

Between 1994 and 2001, Harlequin (now part of Global Graphics)
invested about thirty person-years of effort developing the MPS.
The system contained many innovative techniques and abstractions
which were kept secret. In 1997 Richard Brooksby, the manager and
chief architect of the project, and Nicholas Barnes, a senior
developer, left Harlequin to form their own consultancy company,
Ravenbrook, and in 2001, Ravenbrook acquired the MPS technology
from Global Graphics. We are happy to announce that we are
publishing the source code and documentation under an open source
licence. This paper gives an overview of the system.

Improving the performance of C programs has been a topic of great
interest for many years. Both hardware technology and compiler
optimization research has been applied in an effort to make C
programs execute faster. In many application domains, the C++
language is replacing C as the programming language of choice. In
this paper, we measure the empirical behavior of a group of
significant C and C++ programs and attempt to identify and
quantify behavioral differences between them. Our goal is to
determine whether optimization technology that has been successful
for C programs will also be successful in C++ programs. We
furthermore identify behavioral characteristics of C++ programs
that suggest optimizations that should be applied in those
programs. Our results show that C++ programs exhibit behavior that
is significantly different than C programs. These results should
be of interest to compiler writers and architecture designers who
are designing systems to execute object-oriented programs.

We describe a new method for determining when an object can be
garbage collected. The method does not require marking live
objects. Instead, each object X is dynamically associated with
a stack frame M, such that X is collectable when M pops.
Because X could have been dead earlier, our method is
conservative. Our results demonstrate that the methos nonetheless
idenitifies a large percentage of collectable objects. The method
has been implemented in Sun’s Java™ Virtual Machine interpreter,
and results are presented based on this implementation.

A new, high performance Smalltalk-80™ implementation is described
which builds directly upon two previous implementation efforts.
This implementation supports a large object space while retaining
compatibility with previous Smalltalk-80™ images. The
implementation utilizes a interpreter which incorporates a
generation based garbage collector and which does not have an
object table. This paper describes the design decisions which lead
to this implementation and reports preliminary performance
results.

A simple nonrecursive list structure compacting scheme or garbage
collector suitable for both compact and LISP-like list structures
is presented. The algorithm avoids the need for recursion by using
the partial structure as it is built up to keep track of those
lists that have been copied.

This paper presents two techniques for improving garbage
collection performance: generational stack collection and
profile-driven pretenuring. The first is applicable to stack-based
implementations of functional languages while the second is useful
for any generational collector. We have implemented both
techniques in a generational collector used by the TIL compiler,
and have observed decreases in garbage collection times of as much
as 70% and 30%, respectively.

Functional languages encourage the use of recursion which can lead
to a long chain of activation records. When a collection occurs,
these activation records must be scanned for roots. We show that
scanning many activation records can take so long as to become the
dominant cost of garbage collection. However, most deep stacks
unwind very infrequently, so most of the root information obtained
from the stack remains unchanged across successive garbage
collections. Generational stack collection greatly reduces the
stack scan cost by reusing information from previous scans.

Generational techniques have been successful in reducing the cost
of garbage collection. Various complex heap arrangements and
tenuring policies have been proposed to increase the effectiveness
of generational techniques by reducing the cost and frequency of
scanning and copying. In contrast, we show that by using profile
information to make lifetime predictions, pretenuring can avoid
copying data altogether. In essence, this technique uses a
refinement of the generational hypothesis (most data die young)
with a locality principle concerning the age of data: most
allocations sites produce data that immediately dies, while a few
allocation sites consistently produce data that survives many
collections.

Processor and memory technology trends show a continual increase
in the cost of accessing main memory. Machine designers have tried
to mitigate the effect of this trend through a variety of
techniques that attempt to reduce or tolerate memory latency.
These techniques, unfortunately, have only been partially
successful for pointer-manipulating programs. Recent research has
demonstrated that these programs can benefit greatly from the
complementary approach of reorganizing pointer data structures to
improve cache locality. This paper describes how a generational
garbage collector can be used to achieve a cache-conscious data
layout, in which objects with high temporal affinity are placed
next to each other, so they are likely to reside in the same cache
block. The paper demonstrates the feasibility of collecting low
overhead, real-time profiling information about data access
patterns for object-oriented languages, and describes a new
copying algorithm that utilizes this information to produce a
cache-conscious object layout. Preliminary results indicate that
this technique reduces cache miss rates by 21-42%, and improves
program performance by 14-37%.

If a fixed exponentially decreasing probability distribution
function is used to model every object’s lifetime, then the age of
an object gives no information about its future life expectancy.
This radioactive decay model implies that there can be no
rational basis for deciding which live objects should be promoted
to another generation. Yet there remains a rational basis for
deciding how many objects to promote, when to collect garbage, and
which generations to collect.

Analysis of the model leads to a new kind of generational garbage
collector whose effectiveness does not depend upon heuristics that
predict which objects will live longer than others.

This result provides insight into the computational advantages of
generational garbage collection, with implications for the
management of objects whose life expectancies are difficult to
predict.

A concise and unified view of the numerous existing algorithms for
performing garbage collection of linked data structures is
presented. The emphasis is on garbage collection proper, rather
than on storage allocation.

First, the classical garbage collection algorithms and their
marking and collecting phases, with and without compacting, are
discussed.

Algorithms describing these phases are classified according to the
type of cells to be collected: those for collecting single-sized
cells are simpler than those for varisized cells. Recently
proposed algorithms are presented and compared with the classical
ones. Special topics in garbage collection are also covered. A
bibliography with topical annotations is included.

Mark and sweep garbage collectors (GC) are classical but still
very efficient automatic memory management systems. Although
challenged by other kinds of systems, such as copying collectors,
mark and sweep collectors remain among the best in terms of
performance.

This paper describes our implementation of an efficient mark and
sweep garbage collector tailored to each program. Compiler support
provides the type information required to statically and
automatically generate this customized garbage collector. The
segregation of object by type allows the production of a more
efficient GC code. This technique, implemented in SmallEiffel, our
compiler for the object-oriented language Eiffel, is applicable to
other languages and other garbage collection algorithms, be they
distributed or not.

We present the results obtained on programs featuring a variety of
programming styles and compare our results to a well-known and
high-quality garbage collector.

The automatic reclamation of storage for unreferenced objects is
very important in object databases. Existing language system
algorithms for automatic storage reclamation have been shown to be
inappropriate. In this paper, we investigate methods to improve
the performance of algorithms for automatic storage reclamation of
object databases. These algorithms are based on a technique called
partitioned garbage collection, in which a subset of the entire
database is collected independently of the rest. Specifically, we
investigate the policy that is used to select what partition in
the database should be collected. The new partition selection
policies that we propose and investigate are based on the
intuition that the values of overwritten pointers provide good
hints about where to find garbage. Using trace-driven simulation,
we show that one of our policies requires less I/O to collect more
garbage than any existing implementable policy and performs close
to an impractical-to-implement but near-optimal policy over a wide
range of database sizes and connectivities.

A fundamental problem in automating object database storage
reclamation is determining how often to perform garbage
collection. We show that the choice of collection rate can have a
significant impact on application performance and that the “best”
rate depends on the dynamic behavior of the application, tempered
by the particular performance goals of the user. We describe two
semi-automatic, self-adaptive policies for controlling collection
rate that we have developed to address the problem. Using
trace-driven simulations, we evaluate the performance of the
policies on a test database application that demonstrates two
distinct reclustering behaviors. Our results show that the
policies are effective at achieving user-specified levels of I/O
operations and database garbage percentage. We also investigate
the sensitivity of the policies over a range of object
connectivities. The evaluation demonstrates that semi-automatic,
self-adaptive policies are a practical means for flexibly
controlling garbage collection rate.

We improved the performance of garbage collection in the Standard ML of
New Jersey system by using the virtual memory facilities provided by
the Mach kernel. We took advantage of Mach’s support for large sparse
address spaces and user-defined paging servers. We decreased the
elapsed time for realistic applications by as much as a factor of 4.

Using techniques developed in the classroom at America Online’s
Programmer’s University, Michael Daconta deftly pilots programmers
through the intricacies of the two most difficult aspects of C++
programming: pointers and dynamic memory management. Written by a
programmer for programmers, this no-nonsense, nuts-and-bolts guide
shows you how to fully exploit advanced C++ programming features,
such as creating class-specific allocators, understanding
references versus pointers, manipulating multidimensional arrays
with pointers, and how pointers and dynamic memory are the core of
object-oriented constructs like inheritance, name-mangling, and
virtual functions.

A particularly troublesome phenomenon, thrashing, may seriously
interfere with the performance of paged memory systems, reducing
computing giants (Multics, IBM System 360, and others not
necessarily excepted) to computing dwarfs. The term thrashing
denotes excessive overhead and severe performance degradation or
collapse caused by too much paging. Thrashing inevitably turns a
shortage of memory space into a surplus of processor time.

The need for automatic storage allocation arises from desires for
program modularity, machine independence, and resource sharing.
Virtual memory is an elegant way of achieving these objectives. In
a virtual memory, the addresses a program may use to identify
information are distinguished from the addresses the memory system
uses to identify physical storage sites, and program-generated
addresses are translated automatically to the corresponding
machine addresses. Two principal methods for implementing virtual
memory, segmentation and paging, are compared and contrasted. Many
contemporary implementations have experienced one or more of these
problems: poor utilization of storage, thrashing, and high costs
associated with loading information into memory. These and
subsidiary problems are studied from a theoretic view, and are
shown to be controllable by a proper combination of hardware and
memory management policies.

A program’s working set W(t, T) at time t is the set of
distinct pages among the T most recently referenced pages.
Relations between the average working-set size, the missing-page
rate, and the interreference-interval distribution may be derived
both from time-average definitions and from ensemble-average
(statistical) definitions. An efficient algorithm for estimating
these quantities is given. The relation to LRU (least recently
used) paging is characterized. The independent-reference model, in
which page references are statistically independent, is used to
assess the effects of interpage dependencies on working-set size
observations. Under general assumptions, working-set size is shown
to be normally distributed.

Automatic storage management, or garbage collection, is a
feature that can ease program development and enhance program
reliability. Many high-level languages other than C++ provide
garbage collection. This paper proposes the use of “smart pointer”
template classes as an interface for the use of garbage collection
in C++. Template classes and operator overloading are techniques
allowing language extension at the level of user code; I claim
that using these techniques to create smart pointer classes
provdes a syntax for manipulating garbage-collected storage safely
and conveniently. Further, the use of a smart-pointer template
class offers the possibility of implementing the collector at the
user-level, without requiring support from the compiler. If such a
compiler-independent implementation is possible with adequate
performance, then programmers can start to write code using
garbage collection without waiting for language and compiler
modifications. If the use of such a garbage collection interface
becomes widespread, then C++ compilation systems can be built to
specially support tht garbage collection interface, thereby
allowing the use of collection algorithms with enhanced
performance.

Dynamic storage allocation is an important part of a large class
of computer programs written in C and C++. High-performance
algorithms for dynamic storage allocation have been, and will
continue to be, of considerable interest. This paper presents
detailed measurements of the cost of dynamic storage allocation in
11 diverse C and C++ programs using five very different dynamic
storage allocation implementations, including a conservative
garbage collection algorithm. Four of the allocator
implementations measured are publicly-available on the Internet. A
number of the programs used in these measurements are also
available on the Internet to facilitate further research in
dynamic storage allocation. Finally, the data presented in this
paper is an abbreviated version of more extensive statistics that
are also publicly-available on the Internet.

This paper describes a new way of solving the storage reclamation
problem for a system such as Lisp that allocates storage
automatically from a heap, and does not require the programmer to
give any indication that particular items are no longer useful or
accessible. A reference count scheme for reclaiming
non-self-referential structures, and a linearizing, compacting,
copying scheme to reorganize all storage at the users discretion
are proposed. The algorithms are designed to work well in systems
which use multiple levels of storage, and large virtual address
space. They depend on the fact that most cells are referenced
exactly once, and that reference counts need only be accurate when
storage is about to be reclaimed. A transaction file stores
changes to reference counts, and a multiple reference table stores
the count for items which are referenced more than once.

As an example of cooperation between sequential processes with
very little mutual interference despite frequent manipulations of
a large shared data space, a technique is developed which allows
nearly all of the activity needed for garbage detection and
collection to be performed by an additional processor operating
con- currently with the processor devoted to the computation
proper. Exclusion and synchronization constraints have been kept
as weak as could be achieved; the severe complexities engendered
by doing so are illustrated.

We consider the problem of supporting compacting garbage
collection in the presence of modern compiler optimizations. Since
our collector may move any heap object, it must accurately locate,
follow, and update all pointers and values derived from pointers.
To assist the collector, we extend the compiler to emit tables
describing live pointers, and values derived from pointers, at
each program location where collection may occur. Significant
results include identification of a number of problems posed by
optimizations, solutions to those problems, a working compiler,
and experimental data concerning table sizes, table compression,
and time overhead of decoding tables during collection. While gc
support can affect the code produced, our sample programs show no
significant changes, the table sizes are a modest fraction of the
size of the optimized code, and stack tracing is a small fraction
of total gc time. Since the compiler enhancements are also modest,
we conclude that the approach is practical.

Heap allocation with copying garbage collection is a general
storage management technique for modern programming languages. It
is believed to have poor memory subsystem performance. To
investigate this, we conducted an in-depth study of the memory
subsystem performance of heap allocation for memory subsystems
found on many machines. We studied the performance of
mostly-functional Standard ML programs which made heavy use of
heap allocation. We found that most machines support heap
allocation poorly. However, with the appropriate memory subsystem
organization, heap allocation can have good performance. The
memory subsystem property crucial for achieving good performance
was the ability to allocate and initialize a new object into the
cache without a penalty. This can be achieved by having subblock
placement with a subblock size of one word with a write allocate
policy, along with fast page-mode writes or a write buffer. For
caches with subblock placement, the data cache overhead was under
9% for a 64k or larger data cache; without subblock placement the
overhead was often higher than 50%.

Heap allocation with copying garbage collection is believed to
have poor memory subsystem performance. We conducted a study of
the memory subsystem performance of heap allocation for memory
subsystems found on many machines. We found that many machines
support heap allocation poorly. However, with the appropriate
memory subsystem organization, heap allocation can have good
memory subsystem performance.

This paper presents the design and implementation of a “quasi
real-time” garbage collector for Concurrent Caml Light, an
implementation of ML with threads. This two-generation system
combines a fast, asynchronous copying collector on the young
generation with a non-disruptive concurrent marking collector on
the old generation. This design crucially relies on the ML
compile-time distinction between mutable and immutable objects.

We describe and prove the correctness of a new concurrent
mark-and-sweep garbage collection algorithm. This algorithm
derives from the classical on-the-fly algorithm from Dijkstra et
al. A distinguishing feature of our algorithm is that it supports
multiprocessor environments where the registers of running
processes are not readily accessible, without imposing any
overhead on the elementary operations of loading a register or
reading or initializing a field. Furthermore our collector never
blocks running mutator processes except possibly on requests for
free memory; in particular, updating a field or creating or
marking or sweeping a heap object does not involve
system-dependent synchronization primitives such as locks. We also
provide support for process creation and deletion, and for
managing an extensible heap of variable-sized objects.

This paper describes a new language feature that allows
dynamically allocated objects to be saved from deallocation by an
automatic storage management system so that clean-up or other
actions can be performed using the data stored within the objects.
The program has full control over the timing of clean-up actions,
which eliminates several potential problems and often eliminates
the need for critical sections in code that interacts with
clean-up actions. Our implementation is “generation-friendly” in
the sense that the additional overhead within the mutator is
proportional to the number of clean-up actions actually performed.

This paper shows hhow the behaviour of smart pointers diverges
from that of pointers in certain common C++ constructs. Given
this, we conclude that the C++ programming language does not
support seamless smart pointers: smart pointers cannot
transparently replace raw pointers in all ways except declaration
syntax. We show that this conclusion also applies to accessors.

Our research is concerned with compiler- independent, tag-free
garbage collection for the C++ programming language. This paper
presents a mark-and-sweep collector, and explains how it
ameliorates shortcomings of a previous copy collector. The new
collector, like the old, uses C++’s facilities for creating
abstract data types to define a tracked reference type, called
roots, at the level of the application program. A programmer
wishing to utilize the garbage collection service uses these roots
in place of normal, raw pointers. We present a detailed study of
the cost of using roots, as compared to both normal pointers and
reference counted pointers, in terms of instruction counts. We
examine the efficiency of a small C++ application using roots,
reference counting, manual reclamation, and conservative
collection. Coding the application to use garbage collection, and
analyzing the resulting efficiency, helped us identify a number of
memory leaks and inefficiencies in the original, manually
reclaimed version. We find that for this program, garbage
collection using roots is much more efficient than reference
counting, though less efficient than manual reclamation. It is
hard to directly compare our collector to the conservative
collector because of the differing efficiencies of their
respective memory allocators.

(This short memo doesn’t have an abstract. Basically, it describes
the plan for the LISP II Relocating Garbage Collector. It has four
phases: marking, collection, relocation and moving. Marking is by
recursive descent using a bit table. The remaining phases are
linear sweeps through the bit table. The collection phase
calculates how much everything needs to move, storing this
information in the free blocks. The relocation phase updates all
relocatable addresses. The moving phase moves the surviving
objects into one contiguous block.)

We propose adding safe, efficient garbage collection to C++,
eliminating the possibility of storage-management bugs and making
the design of complex, object-oriented systems much easier. This
can be accomplished with almost no change to the language itself
and only small changes to existing implementations, while
retaining compatibility with existing class libraries.

The model of Larchant is that of a Shared Address Space
(spanning every site in a network including secondary storage)
with Persistence By Reachability. To provide the illusion of a
shared address space across the network, despite the fact that
site memories are disjoint, Larchant implements a distributed
shared memory mechanism. Reachability is accessed by tracing the
pointer graph, starting from the persistent root, and reclaiming
unreachable objects. This is the task of Garbage Collection
(GC).

GC was until recently thought to be intractable in a large-scale
system, due to problems of scale, incoherence, asynchrony, and
performance. This thesis presents the solutions that Larchant
proposes to these problems.

The GC algorithm in Larchant combines tracing and
reference-listing. It traces whenever economically feasible, i.e.,
as long as the memory subset being collected remains local to a
site, and counts references that would cost I/O traffic to trace.
GC is orthogonal to coherence, i.e., makes progress even if only
incoherent replicas are locally available. The garbage collector
runs concurrently and asynchronously to applications. The
reference-listing boundary changes dynamically and seamlessly, and
independently at each site, in order to collect cycles of
unreachable objects.

We prove formally that our GC algorithm is correct, i.e., it is
safe and live. The performance results from our Larchant prototype
show that our design goals (scalability, coherence orthogonality,
and good performance) are fulfilled.

Caching and persistence support efficient, convenient and
transparent distributed data sharing. The most natural model of
persistence is persistence by reachability, managed automatically
by a garbage collector (GC). We propose a very general model of
such a system (based on distributed shared memory) and a scalable,
asynchronous distributed GC algorithm. Within this model, we show
sufficient and widely applicable correctness conditions for the
interactions between applications, store, memory, coherence, and
GC.

The GC runs as a set of processes (local to each participating
machine) communicating by asynchronous messages. Collection does
not interfere with applications by setting locks, polluting
caches, or causing I/O; this requirement raised some novel and
interesting challenges which we address in this article. The
algorithm is safe and live; it is not complete, i.e. it collects
some distributed cycles of garbage but not necessarily all.

A new algorithm is introduced for garbage collecting a heap which
contains shared data structures accessed from a scatter table. The
scheme provides for the purging of useless entries from the
scatter table with no traverslas beyond the two required by
classic collection schemes. For languages which use scatter tables
to sustain unique existence of complex structures, like natural
variables of SNOBOL, it indirectly allows liberal use of a single
scatter table by ensuring efficient deletion of useless entries.
Since the scatter table is completely restructured during the
course of execution, the hashing scheme itself is easily altered
during garbage collection whenever skewed loading of the scatter
table warrants abandonment of the old hashing. This procedure is
applicable to the maintenance of dynamic structures such as those
in information retrieval schemes or in languages like LISP and
SNOBOL.

Deutsch and Bobrow propose a storage reclamation scheme for a heap
which is a hybrid of garbage collection and reference counting.
The point of the hybrid scheme is to keep track of very low
reference counts between necessary invocation of garbage
collection so that nodes which are allocated and rather quickly
abandoned can be returned to available space, delaying necessity
for garbage collection. We show how such a scheme may be
implemented using the mark bit already required in every node by
the garbage collector. Between garbage collections that bit is
used to distinguish nodes with a reference count known to be one.
A significant feature of our scheme is a small cache of references
to nodes whose implemented counts “ought to be higher” which
prevents the loss of logical count information in simple
manipulations of uniquely referenced structures.

In this note we advance reference counting as a storage management
technique viable for implementing recursive languages like ISWIM
or pure LISP with the labels construct for implementing mutual
recursion from SCHEME. Labels is derived from letrec and
displaces the label operator, a version of the paradoxical
Y-combinator. The key observation is that the requisite circular
structure (which ordinarily cripples reference counts) occurs only
within the language–rather than the user–structure, and that the
references into this structure are well-controlled.

The allocation and disposal of memory is a ubiquitous operation in
most programs. Rarely do programmers concern themselves with
details of memory allocators; most assume that memory allocators
provided by the system perform well. This paper presents a
performance evaluation of the reference locality of dynamic
storage allocation algorithms based on trace-driven simulation of
five large allocation-intensive C programs. In this paper, we show
how the design of a memory allocator can significantly affect the
reference locality for various applications. Our measurements show
that poor locality in sequential-fit algorithms reduces program
performance, both by increasing paging and cache miss rates. While
increased paging can be debilitating on any architecture, cache
misses rates are also important for modern computer architectures.
We show that algorithms attempting to be space-efficient, by
coalescing adjacent free objects show poor reference locality,
possibly negating the benefits of space efficiency. At the other
extreme, algorithms can expend considerable effort to increase
reference locality yet gain little in total execution performance.
Our measurements suggest an allocator design that is both very
fast and has good locality of reference.

The allocation and disposal of memory is a ubiquitous operation in
most programs. Rarely do programmers concern themselves with
details of memory allocators; most assume that memory allocators
provided by the system perform well. Yet, in some applications,
programmers use domain-specific knowledge in an attempt to improve
the speed or memory utilization of memory allocators. In this
paper, we describe a program (CustoMalloc) that synthesizes a
memory allocator customized for a specific application. Our
experiments show that the synthesized allocators are uniformly
faster than the common binary-buddy (BSD) allocator, and are more
space efficient. Constructing a custom allocator requires little
programmer effort. The process can usually be accomplished in a
few minutes, and yields results superior even to domain-specific
allocators designed by programmers. Our measurements show the
synthesized allocators are from two to ten times faster than
widely used allocators.

This report is a discussion of various techniques for representing
type information in dynamically typed languages, as implemented on
general-purpose machines (and costs are discussed in terms of
modern RISC machines). It is intended to make readily available a
large body of knowledge that currently has to be absorbed
piecemeal from the literature or re-invented by each language
implementor. This discussion covers not only tagging schemes but
other forms of representation as well, although the discussion is
strictly limited to the representation of type information. It
should also be noted that this report does not purport to contain
a survey of the relevant literature. Instead, this report gathers
together a body of folklore, organizes it into a logical
structure, makes some generalizations, and then discusses the
results in terms of modern hardware.

This article presents a technique for allowing the early recovery
of storage space occupied by garbage data. The idea is similar to
that of generational garbage collection, except that the heap is
partitioned based on a static analysis of data type definitions
rather than on the approximate age of allocated objects. A
prototype implementation is presented, along with initial results
and ideas for future work.

This paper presents a new model for scheduling the work of an
incremental garbage collector in a system with hard real time
requirements. The method utilizes the fact that just some of the
processes in the system have to meet hard real time requirements
and that these processes typically run periodically, a fact that
we can make use of when scheduling the garbage collection. The
work of the collector is scheduled to be performed in the pauses
between the critical processes and is suspended when the processes
with hard real time requirements run. It is shown that this
approach is feasible for many real time systems and that it leaves
the time-critical parts of the system undisturbed from garbage
collection induced delays.

Incremental algorithms are often used to interleave the work of a
garbage collector with the execution of an application program,
the intention being to avoid long pauses. However, overestimating
the worst-case storage needs of the program often causes all the
garbage collection work to be performed in the beginning of the
garbage collection cycles, slowing down the application program to
an unwanted degree. This paper explores an approach to
distributing the work more evenly over the garbage collection
cycle.

The complexity of systems for automatic control and other
safety-critical applications grows rapidly. Computer software
represents an increasing part of the complexity. As larger systems
are developed, we need to find scalable techniques to manage the
complexity in order to guarantee high product quality. Memory
management is a key quality factor for these systems. Automatic
memory management, or garbage collection, is a technique that
significantly reduces the complex problem of correct memory
management. The risk of software errors decreases and development
time is reduced.

Garbage collection techniques suitable for interactive and soft
real-time systems exist, but few approaches are suitable for
systems with hard real-time requirements, such as control systems
(embedded systems). One part of the problem is solved by
incremental garbage collection algorithms, which have been
presented before. We focus on the scheduling problem which forms
the second part of the problem, i.e. how the work of a garbage
collector should be scheduled in order to disturb the application
program as little as possible. It is studied how a priori
scheduling analysis of systems with automatic memory management
can be made. The field of garbage collection research is thus
joined with the field of scheduling analysis in order to produce a
practical synthesis of the two fields.

A scheduling strategy is presented that employs the properties of
control systems to ensure that no garbage collection work is
performed during the execution of critical processes. The hard
real-time part of the system is thus never disturbed by garbage
collection work. Existing incremental garbage collection
algorithms are adapted to the presented strategy. Necessary
modifications of the algorithms and the real-time kernel are
discussed. A standard scheduling analysis technique, rate
monotonic analysis, is extended in order to make a priori analysis
of the schedulability of the garbage collector possible.

The scheduling algorithm has been implemented in an industrially
relevant real-time environment in order to show that the strategy
is feasible in practice. The experimental evaluation shows that
predictable behaviour and sub-millisecond worst-case delays can be
achieved on standard hardware even by a non-optimized prototype
garbage collector.

Reachability-based persistence imposes new requirements for main
memory management in general, and garbage collection in
particular. After a brief introduction to the characteristics and
requirements of reachability-based persistence, we present the
design of a run-time storage manager for Persistent Smalltalk and
Persistent Modula-3, which allows the reclamation of storage from
both temporary objects and buffered persistent objects.

Generational garbage collectors are able to achieve very small
pause times by concentrating on the youngest (most recently
allocated) objects when collecting, since objects have been
observed to die young in many systems. Generational collectors
must keep track of all pointers from older to younger generations,
by “monitoring” all stores into the heap. This write barrier has
been implemented in a number of ways, varying essentially in the
granularity of the information observed and stored. Here we
examine a range of write barrier implementations and evaluate
their relative performance within a generation scavenging garbage
collector for Smalltalk.

Remembered sets and dirty bits have been proposed as alternative
implementations of the write barrier for garbage collection. There
are advantages to both approaches. Dirty bits can be efficiently
maintained with minimal, bounded overhead per store operation,
while remembered sets concisely, and accurately record the
necessary information. Here we present evidence to show that
hybrids can combine the virtues of both schemes and offer
competitive performance. Moreover, we argue that a hybrid can
better avoid the devils that are the downfall of the separate
alternatives.

Many operating systems allow user programs to specify the
protection level (inaccessible, read-only, read-write) of pages in
their virtual memory address space, and to handle any protection
violations that may occur. Such page-protection techniques have
been exploited by several user-level algorithms for applications
including generational garbage collection and persistent stores.
Unfortunately, modern hardware has made efficient handling of page
protection faults more difficult. Moreover, page-sized granularity
may not match the natural granularity of a given application. In
light of these problems, we reevaluate the usefulness of
page-protection primitives in such applications, by comparing the
performance of implementations that make use of the primitives
with others that do not. Our results show that for certain
applications software solutions outperform solutions that rely on
page-protection or other related virtual memory primitives.

We describe a memory management toolkit for language implementors.
It offers efficient and flexible generation scavenging garbage
collection. In addition to providing a core of
language-independent algorithms and data structures, the toolkit
includes auxiliary components that ease implementation of garbage
collection for programming languages. We have detailed designs for
Smalltalk and Modula-3 and are confident the toolkit can be used
with a wide variety of languages. The toolkit approach is itself
novel, and our design includes a number of additional innovations
in flexibility, efficiency, accuracy, and cooperation between the
compiler and the collector.

We present a garbage collection algorithm that extends
generational scavenging to collect large older generations (mature
objects) non-disruptively. The algorithm’s approach is to process
bounded-size pieces of mature object space at each collection; the
subtleties lie in guaranteeing that it eventually collects any and
all garbage. The algorithm does not assume any special hardware or
operating system support, e.g., for forwarding pointers or
protection traps. The algorithm copies objects, so it naturally
supports compaction and reclustering.

A new garbage collection algorithm for distributed object systems,
called DMOS (Distributed Mature Object Space), is presented. It is
derived from two previous algorithms, MOS (Mature Object Space),
sometimes called the train algorithm, and PMOS (Persistent Mature
Object Space). The contribution of DMOS is that it provides the
following unique combination of properties for a distributed
collector: safety, completeness, non-disruptiveness,
incrementality, and scalability. Furthermore, the DMOS collector
is non-blocking and does not use global tracing.

Dynamic memory use has been widely recognized to have profound
effects on program performance, and has been the topic of many
research studies over the last forty years. In spite of years of
research, there is considerable confusion about the effects of
dynamic memory allocation. Worse, this confusion is often
unrecognized, and memory allocators are widely thought to be
fairly well understood.

In this research, we attempt to clarify many issues for both
manual and automatic non-moving memory management. We show that
the traditional approaches to studying dynamic memory allocation
are unsound, and develop a sound methodology for studying this
problem. We present experimental evidence that fragmentation costs
are much lower than previously recognized for most programs, and
develop a framework for understanding these results and enabling
further research in this area. For a large class of programs using
well-known allocation policies, we show that fragmentation costs
are near zero. We also study the locality effects of memory
allocation on programs, a research area that has been almost
completely ignored. We show that these effects can be quite
dramatic, and that the best allocation policies in terms of
fragmentation are also among the best in terms of locality at both
the cache and virtual memory levels of the memory hierarchy.

We extend these fragmentation and locality results to real-time
garbage collection. We have developed a hard real-time,
non-copying generational garbage collector which uses a
write-barrier to coordinate collection work only with
modifications of pointers, therefore making coordination costs
cheaper and more predictable than previous approaches. We combine
this write-barrier approach with implicit non-copying reclamation,
which has most of the advantages of copying collection (notably
avoiding both the sweep phase required by mark-sweep collectors,
and the referencing of garbage objects when reclaiming their
space), without the disadvantage of having to actually copy the
objects. In addition, we present a model for non-copying
implicit-reclamation garbage collection. We use this model to
compare and contrast our work with that of others, and to discuss
the tradeoffs that must be made when developing such a garbage
collector.

We show that for 8 real and varied C and C++ programs, several
conventional dynamic storage allocators provide near-zero
fragmentation, once overheads due to implementation details
(headers, alignment, etc.) are properly accounted for. This
substantially strengthens our previous results showing that the
memory fragmentation problem has generally been misunderstood, and
that good allocator policies can provide good memory usage for
most programs. The new results indicate that for most programs,
excellent allocator policies are readily available, and efficiency
of implementation is the major challenge. While we believe that
our experimental results are state-of-the-art and our methodology
is superior to most previous work, more work should be done to
identify and study unusual problematic program behaviors not
represented in our sample.

The G-machine is a compiled graph reduction machine for lazy
functional languages. The G-machine compiler contains many
optimisations to improve performance. One set of such
optimisations is designed to improve the performance of tail
recursive functions. Unfortunately the abstract machine is subject
to a space leak–objects are unnecessarily preserved by the
garbage collector.

This paper analyses why a particular form of space leak occurs in
the G-machine, and presents some ideas for fixing this problem.
This phenomena in other abstract machines is also examined
briefly.

Weighted Reference Counting is a low-communication distributed
storage reclamation scheme for loosely-coupled multiprocessors.
The algorithm we present herein extends weighted reference
counting to allow the collection of cyclic data structures. To do
so, the algorithm identifies candidate objects that may be part of
cycles and performs a tricolour mark-scan on their subgraph in a
lazy manner to discover whether the subgraph is still in use. The
algorithm is concurrent in the sense that multiple useful
computation processes and garbage collection processes can be
performed simultaneously.

The memory storage requirements of complex programs are extremely
difficult to manage correctly by hand. A single error may lead to
indeterminate and inexplicable program crashes. Worse still,
failures are often unrepeatable and may surface only long after
the program has been delivered to the customer. The eradication of
memory errors typically consumes a substantial amount of
development time. And yet the answer is relatively easy – garbage
collection; removing the clutter of memory management from module
interfaces, which then frees the programmer to concentrate on the
problem at hand rather than low-level book-keeping details. For
this reason, most modern object-oriented languages such as
Smalltalk, Eiffel, Java and Dylan, are supported by garbage
collection. Garbage collecting libraries are even available for
such uncooperative languages as C and C++.

This book considers how dynamic memory can be recycled
automatically to guarantee error-free memory management. There is
an abundant but disparate literature on the subject, largely
confined to research papers. This book sets out to pool this
experience in a single accessible and unified framework.

Each of the important algorithms is explained in detail, often
with illustrations of its characteristic features and animations
of its use. Techniques are described and compared for declarative
and imperative programming styles, for sequential, concurrent and
distributed architectures.

For professionals developing programs from simple software tools
to complex systems, as well as for researchers and students
working in compiler construction, functional, logic and
object-oriented programming design, this book will provide not
only a clear introduction but also a convenient reference source
for modern garbage collection techniques.

The International Symposium on Memory Management is a forum for
research in several related areas of memory management, especially
garbage collectors and dynamic storage allocators. [...] The
nineteen papers selected for publication in this volume cover a
remarkably broad range of memory management topics from explicit
malloc-style allocation to automatic memory management, from
cache-conscious data layout to efficient management of distributed
references, from conservative to type-accurate garbage collection,
for applications ranging from user application to long-running
servers, supporting languages as different as C, C++, Modula-3,
Java, Eiffel, Erlang, Scheme, ML, Haskell and Prolog.

The C++?? Critique is an analysis of some of the flaws of C++. It
is by no means exhaustive, nor does it attempt to document every
little niggle with C++, rather concentrating on main themes. The
critique uses Java and Eiffel as comparisons to C++ to give a more
concrete feel to the criticisms, viewing conceptual differences
rather than syntactic ones as being more important. Some C++
authors realising there are glaring deficiencies in C++ have
chosen to defend C++ by also being critical within their own work.
Most notable are Bjarne Stroustup’s “Design and Evolution of C++,”
and Scott Meyers’ “Effective” and “More Effective C++.” These warn
of many traps and pitfalls, but reach the curious conclusion that
since “good” C++ programmers are aware of these problems and know
how to avoid them, C++ is alright.

The C++ critique makes many of the same criticisms, but comes to
the different conclusion that these pitfalls are not acceptable,
and should not be in a language used for modern large scale
software engineering. Clean design is more important than after
the fact warnings, and it is inconceivable that purchasers of end
user software would tolerate this tactic on the part of vendors.
The critique also takes a look at C, and concludes that many of
the features of C should be left out of modern languages, and that
C is a flawed base for a language.

This paper studies a representative of an important class of
emerging applications, a parallel data mining workload. The
application, extracted from the IBM Intelligent Miner, identifies
groups of records that are mathematically similar based on a
neural network model called self-organizing map. We examine and
compare in details two implementations of the application:
(1) temporal locality or working set sizes; (2) spatial locality
and memory block utilization; (3) communication characteristics
and scalability; and (4) TLB performance.

First, we find that the working set hierarchy of the application
is governed by two parameters, namely the size of an input record
and the size of prototype array; it is independent of the number
of input records. Second, the application shows good spatial
locality, with the implementation optimized for sparse data sets
having slightly worse spatial locality. Third, due to the batch
update scheme, the application bears very low communication.
Finally, a 2-way set associative TLB may result in severely skewed
TLB performance in a multiprocessor environment caused by the
large discrepancy in the amount of conflict misses. Increasing the
set associativity is more effective in mitigating the problem than
increasing the TLB size.

This paper studies the memory system behavior of Java programs by
analyzing memory reference traces of several SPECjvm98
applications running with a Just-In-Time (JIT) compiler. Trace
information is collected by an exception-based tracing tool called
JTRACE, without any instrumentation to the Java programs or the
JIT compiler.First, we find that the overall cache miss ratio is
increased due to garbage collection, which suffers from higher
cache misses compared to the application. We also note that going
beyond 2-way cache associativity improves the cache miss ratio
marginally. Second, we observe that Java programs generate a
substantial amount of short-lived objects. However, the size of
frequently-referenced long-lived objects is more important to the
cache performance, because it tends to determine the application’s
working set size. Finally, we note that the default heap
configuration which starts from a small initial heap size is very
inefficient since it invokes a garbage collector frequently.
Although the direct costs of garbage collection decrease as we
increase the available heap size, there exists an optimal heap
size which minimizes the total execution time due to the
interaction with the virtual memory performance.

A stable heap is a storage that is managed automatically using
garbage collection, manipulated using atomic transactions, and
accessed using a uniform storage model. These features enhance
reliability and simplify programming by preventing errors due to
explicit deallocation, by masking failures and concurrency using
transactions, and by eliminating the distinction between accessing
temporary storage and permanent storage. Stable heap management is
useful for programming language for reliable distributed
computing, programming languages with persistent storage, and
object-oriented database systems. Many applications that could
benefit from a stable heap (e.g., computer-aided design,
computer-aided software engineering, and office information
systems) require large amounts of storage, timely responses for
transactions, and high availability. We present garbage collection
and recovery algorithms for a stable heap implementation that meet
these goals and are appropriate for stock hardware. The collector
is incremental: it does not attempt to collect the whole heap at
once. The collector is also atomic: it is coordinated with the
recovery system to prevent problems when it moves and modifies
objects . The time for recovery is independent of heap size, and
can be shortened using checkpoints.

Prior work on dynamic memory allocation has largely neglected
long-running server applications, for example, web servers and
mail servers. Their requirements differ from those of one-shot
applications like compilers or text editors. We investigated how
to build an allocator that is not only fast and memory efficient
but also scales well on SMP machines. We found that it is not
sufficient to focus on reducing lock contention. Only limited
improvement can be achieved this way; higher speedups require a
reduction in cache misses and cache invalidation traffic. We then
designed and prototyped a new allocator, called Lkmalloc, targeted
for both traditional applications and server applications.
LKmalloc uses several subheaps, each one with a separate set of
free lists and memory arena. A thread always allocates from the
same subheap but can free a block belonging to any subheap. A
thread is assigned to a subheap by hashing on its thread ID. We
compared its performance with several other allocators on a
server-like, simulated workload and found that it indeed scales
well and is quite fast but could use memory more efficiently.

In previous heap storage systems, the cost of creating objects and
garbage collection is independent of the lifetime of the object.
Since objects with short lifetimes account for a large portion of
storage use, it is worth optimizing a garbage collector to reclaim
storage for these objects more quickly. The garbage collector
should spend proportionately less effort reclaiming objects with
longer lifetimes. We present a garbage collection algorithm that
(1) makes storage for short-lived objects cheaper than storage for
long-lived objects, (2) that operates in real-time–object
creation and access times are bounded, (3) increases locality of
reference, for better virtual memory performance, (4) works well
with multiple processors and a large address space.

A programming system called LISP (for LISt Processor) has been
developed for the IBM 704 computer by the Artificial Intelligence
group at M.I.T. The system was designed to facilitate experiments
with a proposed system called the Advice Taker, whereby a machine
could be instructed to handle declarative as well as imperative
sentences and could exhibit “common sense” in carrying out its
instructions. The original proposal for the Advice Taker was made
in November 1958. The main requirement was a programming system
for manipulating expressions representing formalized declarative
and imperative sentences so that the Advice Taker could make
deductions.

In the course of its development the LISP system went through
several stages of simplification and eventually came to be based
on a scheme for representing the partial recursive functions of a
certain class of symbolic expressions. This representation is
independent of the IBM 704 computer, or of any other electronic
computer, and it now seems expedient to expound the system by
starting with the class of expressions called S-expressions and
the functions called S-functions.

John McCarthy. 1979. “History of Lisp”. In History of programming languages I, pp. 173–185. ACM.

Presents a survey of both distributed shared memory (DSM) efforts
and commercial DSM systems. The book discusses relevant issues
that make the concept of DSM one of the most attractive approaches
for building large-scale, high-performance multiprocessor systems.
Its text provides a general introduction to the DSM field as well
as a broad survey of the basic DSM concepts, mechanisms, design
issues, and systems.

Distributed Shared Memory concentrates on basic DSM algorithms,
their enhancements, and their performance evaluation. In addition,
it details implementations that employ DSM solutions at the
software and the hardware level. The book is a research and
development reference that provides state-of-the art information
that will be useful to architects, designers, and programmers of
DSM systems.

This paper presents an algorithm for reclaiming unused free
storage memory cells is LISP. It depends on availability of a fast
secondary storage device, or a large block of available temporary
storage. For this price, we get 1. Packing of free-storage into a
solidly packed block. 2. Smooth packing of arbitrary linear blocks
and arrays. 3. The collector will handle arbitrarily complex
re-entrant list structure with no introduction of spurious copies.
4. The algorithm is quite efficient; the marking pass visits words
at most twice and usually once, and the loading pass is linear.
5. The system is easily modified to allow for increase in size of
already fixed consecutive blocks, provide one can afford to
initiate a collection pass or use a modified array while waiting
for such a pass to occur.

This paper discusses garbage collection techniques used in a
high-performance Lisp implementation with a large virtual memory,
the Symbolics 3600. Particular attention is paid to practical
issues and experience. In a large system problems of scale appear
and the most straightforward garbage-collection techniques do not
work well. Many of these problems involve the interaction of the
garbage collector with demand-paged virtual memory. Some of the
solutions adopted in the 3600 are presented, including incremental
copying garbage collection, approximately depth-first copying,
ephemeral objects, tagged architecture, and hardware assists. We
discuss techniques for improving the efficiency of garbage
collection by recognizing that objects in the Lisp world have a
variety of lifetimes. The importance of designing the architecture
and the hardware to facilitate garbage collection is stressed.

Massively distributed computing is a challenging problem for
garbage collection algorithm designers as it raises the issue of
scalability. The high number of hosts involved in a computation
can require large tables for reference listing, whereas the lack
of information sharing between hosts in a same locality can entail
redundant GC traffic. In this paper, we argue that a conceptual
hierarchical organisation of massive distributed computations can
solve this problem. By conceptual hierarchical organisation, we
mean that processors are still able to communicate in a peer to
peer manner using their usual communication mechanism, but GC
messages will be routed as if processors were organised in
hierarchy. We present an extension of a distributed reference
counting algorithm that uses such a hierarchical organisation. It
allows us to bound table sizes by the number of hosts in a domain,
and it allows us to share GC information between hosts in a same
locality in order to reduce cross-network GC traffic.

Most specifications of garbage collectors concentrate on the
low-level algorithmic details of how to find and preserve
accessible objects. Often, they focus on bit-level manipulations
such as “scanning stack frames,” “marking objects,” “tagging
data,” etc. While these details are important in some contexts,
they often obscure the more fundamental aspects of memory
management: what objects are garbage and why?

We develop a series of calculi that are just low-level enough that
we can express allocation and garbage collection, yet are
sufficiently abstract that we may formally prove the correctness
of various memory management strategies. By making the heap of a
program syntactically apparent, we can specify memory actions as
rewriting rules that allocate values on the heap and automatically
dereference pointers to such objects when needed. This formulation
permits the specification of garbage collection as a relation that
removes portions of the heap without affecting the outcome of
evaluation.

Our high-level approach allows us to specify in a compact manner a
wide variety of memory management techniques, including standard
trace-based garbage collection (i.e., the family of copying and
mark/sweep collection algorithms), generational collection, and
type-based, tag-free collection. Furthermore, since the definition
of garbage is based on the semantics of the underlying language
instead of the conservative approximation of inaccessibility, we
are able to specify and prove the idea that type inference can be
used to collect some objects that are accessible but never used.

PMOS is an incremental garbage collector designed specifically to
reclaim space in a persistent object store. It is complete in that
it will, after a finite number of invocations, reclaim all
unreachable storage. PMOS imposes minimum constraints on the order
of collection and offers techniques to reduce the I/O traffic
induced by the collector. Here we present the first implementation
of the PMOS collector called PMOS#1. The collector has been
incorporated into the stable heap layer of the generic persistent
object store used to support a number of languages including
Napier88. Our main design goals are to maintain the independence
of the language from the store and to retain the existing store
interface. The implementation has been completed and tested using
a Napier88 system. The main results of this work show that the
PMOS collector is implementable in a persistent store and that it
can be built without requiring changes to the language
interpreter. Initial performance measurements are reported. These
results suggest however, that effective use of PMOS requires
greater co-operation between language and store.

We introduce a new replication-based copying garbage collection
technique. We have implemented one simple variation of this method
to provide incremental garbage collection on stock hardware with
no special operating system or virtual memory support. The
performance of the prototype implementation is excellent: major
garbage collection pauses are completely eliminated with only a
slight increase in minor collection pause times.

Unlike the standard copying algorithm, the replication-based
method does not destroy the original replica when a copy is
created. Instead, multiple copies may exist, and various standard
strategies for maintaining consistency may be applied. In our
implementation for Standard ML of New Jersey, the mutator
continues to use the from-space replicas until the collector has
achieved a consistent replica of all live data in to-space.

We present a design for a concurrent garbage collector using the
replication-based technique. We also expect replication-based GC
methods to be useful in providing services for persistence and
distribution, and briefly discuss these possibilities.

Garbage collection (GC) is an important part of many language
implementations. One of the most important garbage collection
techniques is copying GC. This paper consists of an informal but
abstract description of copying collection, a formal specification
of copying collection written in the Larch Shared Language and the
Larch/C Interface Language, a simple implementation of a copying
collector written in C, an informal proof that the implementation
satisfies the specification, and a discussion of how the
specification applies to other types of copying GC such as
generational copying collectors. Limited familiarity with copying
GC or Larch is needed to read the specification.

Orthogonal persistence provides a safe and convenient model of
object persistence. We have implemented a transaction system which
supports orthogonal persistence in a garbage-collected heap. In
our system, replicating collection provides efficient concurrent
garbage collection of the heap. In this paper, we show how
replicating garbage collection can also be used to reduce commit
operation latencies in our implementation.

We describe how our system implements transaction commit. We
explain why the presence of non-persistent objects can add to the
cost of this operation. We show how to eliminate these additional
costs by using replicating garbage collection. The resulting
implementation of orthogonal persistence should provide
transaction performance that is independent of the quantity of
non-persistent data in use. We expect efficient support for
orthogonal persistence to be valuable in operating systems
applications which use persistent data.

We have implemented the first copying garbage collector that
permits continuous unimpeded mutator access to the original
objects during copying. The garbage collector incrementally
replicates all accessible objects and uses a mutation log to bring
the replicas up-to-date with changes made by the mutator. An
experimental implementation demonstrates that the costs of using
our algorithm are small and that bounded pause times of 50
milliseconds can be readily achieved.

Generational garbage collection is a simple technique for
automatic partial memory reclamation. In this paper, I present the
basic mechanics of generational collection and discuss its
characteristics. I compare several published algorithms and argue
that fundamental considerations of locality, as reflected in the
changing relative speeds of processors, memories, and disks,
strongly favor a focus on explicit optimization of I/O
requirements during garbage collection. I show that this focus on
I/O costs due to memory hierarchy debunks a well-known claim about
the relative costs of garbage collection and stack allocation. I
suggest two directions for future research in this area and
discuss some simple architectural changes in virtual memory
interfaces which may enable efficient garbage collector
utilization of standard virtual memory hardware.

We have implemented a concurrent copying garbage collector that
uses replicating garbage collection. In our design, the client can
continuously access the heap during garbage collection. No
low-level synchronization between the client and the garbage
collector is required on individual object operations. The garbage
collector replicates live heap objects and periodically
synchronizes with the client to obtain the client’s current root
set and mutation log. An experimental implementation using the
Standard ML of New Jersey system on a shared-memory multiprocessor
demonstrates excellent pause time performance and moderate
execution time speedups.

For a compiler writer, generating good machine code for a variety
of platforms is hard work. One might try to reuse a retargetable
code generator, but code generators are complex and difficult to
use, and they limit one’s choice of implementation language. One
might try to use C as a portable assembly language, but C limits
the compiler writer’s flexibility and the performance of the
resulting code. The wide use of C, despite these drawbacks, argues
for a portable assembly language. C– is a new language designed
expressly for this purpose. The use of a portable assembly
language introduces new problems in the support of such high-level
run-time services as garbage collection, exception handling,
concurrency, profiling, and debugging. We address these problems
by combining the C– language with a C– run-time interface. The
combination is designed to allow the compiler writer a choice of
source-language semantics and implementation techniques, while
still providing good performance.

Software caching, automatic algorithm blocking, and data overlays
are different names for the same problem: compiler management of
data movement throughout the memory hierarchy. Modern
high-performance architectures often omit hardware support for
moving data between levels of the memory hierarchy: iWarp does not
include a data cache, and Cray supercomputers do not have virtual
memory. These systems have effectively traded a more complicated
programming model for performance by replacing a
hardware-controlled memory hierarchy with a simple fast memory.
The simpler memories have less logic in the critical path, so the
cycle time of the memories is improved.

For programs which fit in the resulting memory, the extra
performance is great. Unfortunately, the driving force behind
supercomputing today is a class of very large scientific problems,
both in terms of computation time and in terms of the amount of
data used. Many of these programs do not fit in the memory of the
machines available. When architects trade hardware support for
data migration to gain performance, control of the memory
hierarchy is left to the programmer. Either the program size must
be cut down to fit into the machine, or every loop which accesses
more data than will fit into memory must be restructured by hand.
This thesis describes how a compiler can relieve the programmer of
this burden, and automate data motion throughout the memory
hierarchy without direct hardware support.

This works develops a model of how data is accessed within a
nested loop by typical scientific programs. It describes
techniques which can be used by compilers faced with the task of
managing data motion. The concentration is on nested loops which
process large data arrays using linear array subscripts. Because
the array subscripts are linear functions of the loop indices and
the loop indices form an integer lattice, linear algebra can be
applied to solve many compilation problems.

The approach it to tile the iteration space of the loop nest.
Tiling allows the compiler to improve locality of reference. The
tiling basis matrix is chosen from a set of candidate vectors
which neatly divide the data set. The execution order of the tiles
is selected to maximize locality between tiles. Finally, the tile
sizes are chosen to minimize execution time.

The approach has been applied to several common scientific loop
nests: matrix-matrix multiplication, QR-decomposition, and
LU-decomposition. In addition, an illustrative example from the
Livermore Loop benchmark set is examined. Although more compiler
time can be required in some cases, this technique produces better
code at no cost for most programs.

This paper presents a classification of barrier techniques for
interleaving tracing with mutator operation during an incremental
garbage collection. The two useful tricolour invariants are
derived from more elementary considerations of graph traversal.
Barrier techniques for maintaining these invariants are classified
according to the action taken at the barrier (such as scanning an
object or changing its colour), and it is shown that the
algorithms described in the literature cover all the possibilities
except one. Unfortunately, the new technique is impractical. Ways
of combining barrier techniques are also discussed.

Tony Printezis. 1996. “Disk Garbage Collection Strategies for Persistent Java”. Proceedings of the First International Workshop on Persistence and Java.

Abstract

This paper presents work currently in progress on Disk Garbage
Collection issues for PJava, an orthogonally persistent version of
Java. In particular, it concentrates on the initial Prototype of
the Disk Garbage Collector of PJava0 which has already been
implemented. This Prototype was designed to be very simple and
modular in order to be easily changed, evolved, improved, and
allow experimentation. Several experiments were performed in order
to test possible optimisations; these experiments concentrated on
the following four areas: a) efficient access to the store; b)
page-replacement algorithms; c) efficient discovery of live
objects during compaction; and d) dealing with forward references.
The paper presents a description of the Prototype’s architecture,
the results of these experiments and related discussion, and some
future directions based on the experience gained from this work.

As processor speeds continue to improve relative to main-memory
access times, cache performance is becoming an increasingly
important component of program performance. Prior work on the
cache performance of garbage-collected programming languages has
either assumed or argued that conventional garbage-collection
methods will yield poor performance, and has therefore
concentrated on new collection algorithms designed specifically to
improve cache-level reference locality. This dissertation argues
to the contrary: Many programs written in garbage-collected
languages are naturally well-suited to the direct-mapped caches
typically found in modern computer systems.

Using a trace-driven cache simulator and other analysis tools,
five nontrivial, long-running Scheme programs are studied. A
control experiment shows that the programs have excellent cache
performance without any garbage collection at all. A second
experiment indicates that the programs will perform well with a
simple and infrequently-run generational compacting collector.

An analysis of the test programs’ memory usage patterns reveals
that the mostly-functional programming style typically used in
Scheme programs, in combination with simple linear storage
allocation, causes most data objects to be dispersed in time and
space so that references to them cause little cache interference.
From this it follows that other Scheme programs, and programs
written in similar styles in different languages, should perform
well with a simple generational compacting collector;
sophisticated collectors intended to improve cache performance are
unlikely to be effective. The analysis also suggests that, as
locality becomes ever more important to program performance,
programs written in garbage-collected languages may turn out to
have significant performance advantage over programs written in
more conventional languages.

It is well accepted that automatic garbage collection simplifies
programming, promotes modularity, and reduces development effort.
However it is commonly believed that these advantages do not
counteract the perceived price: excessive overheads, possible long
pause times while garbage collections occur, and the need to
modify existing code. Even though there are publically available
garbage collector implementations that can be used in existing
programs, they do not guarantee short pauses, and some
modification of the application using them is still required. In
this paper we describe a snapshot-at-beginning concurrent garbage
collector algorithm and its implementation. This algorithm
guarantees short pauses, and can be easily implemented on stock
UNIX-like operating systems. Our results show that our collector
performs comparable to other garbage collection implementations on
uniprocessor machines and outperforms similar collectors on
multiprocessor machines. We also show our collector to be
competitive in performance with explicit deallocation. Our
collector has the added advantage of being non-intrusive. Using a
dynamic linking technique and effective root set inferencing, we
have been able to successfully run our collector even in
commercial programs where only the binary executable and no source
code is available. In this paper we describe our algorithm, its
implementation, and provide both an algorithmic and a performance
comparison between our collector and other similar garbage
collectors.

Self-compiling implementations of Haskell, i.e., those written in
Haskell, have been and, except one, are still space consuming
monsters. Object code size for the compilers themselves are 3-8Mb,
and they need 12-20Mb to recompile themselves. One reason for the
huge demands for memory is that the main goal for these compilers
is to produce fast code. However, the compiler described in this
paper, called “nhc” for “Nearly a Haskell Compiler”, is the one
above mentioned exception. This compiler concentrates on keeping
memory usage down, even at a cost in time. The code produced is
not fast, but nhc is usable, and the resulting programs can be run
on computers with small memory.

This paper describes some of the implementation choices done, in
the Haskell part of the source code, to reduce memory consumption
in nhc. It is possible to use these also in other Haskell
compilers with no, or very small, changes to their run-time
systems.

Time is neither the main focus of nhc nor of this paper, but there
is nevertheless a small section about the speed of nhc. The most
notable observation concerning speed is that nhc spends
approximately half the time processing interface files, which is
much more than needed in the type checker. Processing interface
files is also the most space consuming part of nhc in most cases.
It is only when compiling source files with large sets of mutually
recursive functions that more memory is needed to type check than
to process interface files.

Generational garbage collection is an established method for
creating efficient garbage collectors. Even a simple
implementation where all nodes that survive one garbage collection
are tenured, i.e., moved to an old generation, works well in
strict languages. In lazy languages, however, such an
implementation can create severe temporary space leaks. The
temporary space leaks appear in programs that traverse large
lazily built data structures, e.g., a lazy list representing a
large file, where only a small part is needed at any time. A
simple generational garbage collector cannot reclaim the memory,
used by the lazily built list, at minor collections. The reason is
that at least one of the nodes in the list belongs to the old
generation, after the first minor collection, and will hold on to
the rest of the nodes in the list until the next major collection.

The context for this paper is functional computation by graph
reduction. Our overall aim is more efficient use of memory. The
specific topic is the detection of dormant cells in the live graph
– those retained in heap memory though not actually playing a
useful role in computation. We describe a profiler that can
identify heap consumption by such ‘useless’ cells. Unlike heap
profilers based on traversals of the live heap, this profiler
works by examining cells post-mortem. The new profiler has
revealed a surprisingly large proportion of ‘useless’ cells, even
in some programs that previously seemed space-efficient such as
the bootstrapping Haskell compiler “nhc”.

Stoye’s one-bit reference tagging scheme can be extended to local
counts of two or more via two strategies. The first, suited to
pure register transactions, is a cache of referents to two shared
references. The analog of Deutch’s and Bobrow’s multiple-reference
table, this cache is sufficient to manage small counts across
successive assignment statements. Thus, accurate reference counts
above one can be tracked for short intervals, like that bridging
one function’s environment to its successor’s.

The second, motivated by runtime stacks that duplicate references,
avoids counting any references from the stack. It requires a local
pointer-inversion protocol in the mutator, but one still local to
the referent and the stack frame. Thus, an accurate reference
count of one can be maintained regardless of references from the
recursion stack.

Enough is known now about garbage collection, runtime types,
strong-typing, static-checking and concurrency that it is possible
to explore what happens when they are combined in a real
programming system.

Storage management is one of a few central issues through which
one can get a good view of the design of an entire system.
Tensions between ease of integration and the need for protection;
between generality, simplicity, flexibility, extensibility and
efficiency are all manifest when assumptions and attitudes about
managing storage are studied. And deep understanding follows best
from the analysis of systems that people use to get real work
done.

This paper is not for those who seek arguments pro or con about
the need for these features in programming systems; such issues
are for other papers. This one assumes these features to be good
and describes how they combine and interact in Cedar, a
programming language and environment designed to help programmers
build moderate-sized experimental systems for moderate numbers of
people to test and use.

We describe the design, implementation, and use of a new kind of
profiling tool that yields valuable information about the memory
use of lazy functional programs. The tool has two parts: a
modified functional language implementation which generated
profiling implementation during the execution of programs, and a
separate program which converts this information to graphical
form. With the aid of profile graphs, one can make alterations to
a functional program which dramatically reduce its space
consumption. We demonstrate that this is the case of a genuine
example – the first to which the tool has been applied – for
which the results are strikingly successful.

First-generation heap profilers for lazy functional languages have
proved to be effective tools for locating some kinds of space
faults, but in other cases they cannot provide sufficient
information to solve the problem. This paper describes the design,
implementation and use of a new profiler that goes beyond the
two-dimensional “who produces what” view of heap cells to provide
information about their more dynamic and structural attributes.
Specifically, the new profiler can distinguish between cells
according to their eventual lifetime, or on the basis of the
closure retainers by virtue of which they remain part of the
live heap. A bootstrapping Haskell compiler (nhc) hosts the
implementation: among examples of the profiler’s use we include
self-application to nhc. Another example is the original
heap-profiling case study “clausify”, which now consumes even less
memory and is much faster.

Colin Runciman & Niklas Röjemo. 1996. “Two-pass heap profiling: a matter of life and death”. Department of Computer Science, University of York.

Abstract

A heap profile is a chart showing the contents of heap memory
throughout a computation. Contents are depicted abstractly by
showing how much space is occupied by memory cells in each of
several classes. A good heap profiler can use a variety of
attributes of memory cells to de-fine a classification. Effective
profiling usually involves a combination of attributes. The ideal
profiler gives full support for combination in two ways. First, a
section of the heap of interest to the programmer can be specified
by constraining the values of any combination of cell attributes.
Secondly, no matter what attributes are used to specify such a
section, a heap profile can be obtained for that section only, and
any other attribute can be used to define the classification.

Achieving this ideal is not simple For some combinations of
attributes. A heap profile is derived by interpolation of a series
of censuses of heap contents at different stages. The obvious way
to obtain census data is to traverse the live heap at intervals
throughout the computation. This is fine for static attributes
(e.g. What type of value does this memory cell represent?), and
for dynamic attributes that can be determined for each cell by
examining the heap at any given moment (e.g. From which function
closures can this cell be reached?). But some attributes of cells
can only be determined retrospectively by post-mortem inspection
asa cell is overwritten or garbage-collected (e.g. Is this cell
ever used again?). Now we see the problem: if a profiler supports
both live and pose-mortem attributes, how can we implement the
ideal of unrestricted combinations? That is the problem me solve
in this paper. We give techniques for profiling a. heap section
specified in terms of both live and post-mortem attributes. We
show how to generate live-attribute profiles of a section of the
heal, specified using post-mortem attributes, and vice versa.

We present an implementation of the Train Algorithm, an
incremental collection scheme for reclamation of mature garbage in
generation-based memory management systems. To the best of our
knowledge, this is the first Train Algorithm implementation ever.
Using the algorithm, the traditional mark-sweep garbage collector
employed by the Mj&oslash;lner run-time system for the
object-oriented BETA programming language was replaced by a
non-disruptive one, with only negligible time and storage
overheads.

Memory is the performance bottleneck of modern architectures.
Keeping memory consumption as low as possible enables fast and
unobtrusive applications. But it is not easy to estimate the
memory use of programs implemented in functional languages, due to
both the complex translations of some high level constructs, and
the use of automatic memory managers. To help understand memory
allocation behavior of Scheme programs, we have designed two
complementary tools. The first one reports on frequency of
allocation, heap configurations and on memory reclamation. The
second tracks down memory leaks. We have applied these tools to
our Scheme compiler, the largest Scheme program we have been
developing. This has allowed us to drastically reduce the amount
of memory consumed during its bootstrap process, without requiring
much development time. Development tools will be neglected unless
they are both conveniently accessible and easy to use. In order to
avoid this pitfall, we have carefully designed the user interface
of these two tools. Their integration into a real programming
environment for Scheme is detailed in the paper.

Larchant-RDOSS is a distributed shared memory that persists on
reliable storage across process lifetimes. Memory management is
automatic: including consistent caching of data and of locks,
collecting objects unreachable from the persistent root, writing
reachable objects to disk, and reducing store fragmentation.
Memory management is based on a novel garbage collection
algorithm, that approximates a global trace by a series of local
traces, with no induced I/O or locking traffic, and no
synchronization between the collector and the application
processes. This results in a simple programming model, and
expected minimal added application latency. The algorithm is
designed for the most unfavorable environment (uncontrolled
programming language, reference by pointers, distributed system,
non-coherent shared memory) and should work well also in more
favorable settings.

Texas is a persistent storage system for C++, providing high
performance while emphasizing simplicity, modularity and
portability. A key component of the design is the use of pointer
swizzling at page fault time, which exploits existing virtual
memory features to implement large address spaces efficiently on
stock hardware, with little or no change to existing compilers.
Long pointers are used to implement an enormous address space, but
are transparently converted to the hardware-supported pointer
format when pages are loaded into virtual memory.

Runtime type descriptors and slightly modified heap allocation
routines support pagewise pointer swizzling by allowing objects
and their pointer fields to be identified within pages. If
compiler support for runtime type identification is not available,
a simple preprocessor can be used to generate type descriptors.

This address translation is largely independent of issues of data
caching, sharing, and checkpointing; it employs operating systems’
existing virtual memories for caching, and a simple and flexible
log-structured storage manager to improve checkpointing
performance.

Pagewise virtual memory protections are also used to detect writes
for logging purposes, without requiring any changes to compiled
code. This may degrade checkpointing performance for small
transactions with poor locality of writes, but page diffing and
sub-page logging promise to keep performance competitive with
finer-grained checkpointing schemes.

Texas presents a simple programming interface; an application
creates persistent objects by simply allocating them on the
persistent heap. In addition, the implementation is relatively
small, and is easy to incorporate into existing applications. The
log-structured storage module easily supports advanced extensions
such as compressed storage, versioning, and adaptive
reorganization.

Garbage collector performance in LISP systems on custom hardware has been substantially improved by the adoption of lifetime-based garbage collection techniques. To date, however, successful lifetime-based garbage collectors have required special-purpose hardware, or at least privileged access to data structures maintained by the virtual memory system. I present here a lifetime-based garbage collector requiring no special-purpose hardware or virtual memory system support, and discuss its performance.

Algorithms for a multiprocessing compactifying garbage collector
are presented and discussed. The simple case of two processors,
one performing LISP-like list operations and the other performing
garbage collection continuously, is thoroughly examined. The
necessary capabilities of each processor are defined, as well as
interprocessor communication and interlocks. Complete procedures
for garbage collection and for standard list processing primitives
are presented and thoroughly explained. Particular attention is
given to the problems of marking and relocating list cells while
another processor may be operating on them. The primary aim
throughout is to allow the list processor to run unimpeded while
the other processor reclaims list storage The more complex case
involving several list processors and one or more garbage
collection processors are also briefly discussed.

The internal representations of the various MacLISP data types are
presented and discussed. Certain implementation tradeoffs are
considered. The ultimate decisions on these tradeoffs are
discussed in the light of MacLISP’s prime objective of being an
efficient high-level language for the implementation of large
systems such as MACSYMA. The basic strategy of garbage collection
is outlined, with reference to the specific representations
involved. Certain “clever tricks” are explained and justified. The
“address space crunch” is explained and some alternative solutions
explored.

A high-performance implementation of a Java Virtual Machine
requires a compiler to translate Java bytecodes into native
instructions, as well as an advanced garbage collector (e.g.,
copying or generational). When the Java heap is exhausted and the
garbage collector executes, the compiler must report to the
garbage collector all live object references contained in physical
registers and stack locations. Typical compilers only allow
certain instructions (e.g., call instructions and backward
branches) to be GC-safe; if GC happens at some other instruction,
the compiler may need to advance execution to the next GC-safe
point. Until now, no one has ever attempted to make every
compiler-generated instruction GC-safe, due to the perception that
recording this information would require too much space. This kind
of support could improve the GC performance in multithreaded
applications. We show how to use simple compression techniques to
reduce the size of the GC map to about 20% of the generated code
size, a result that is competitive with the best previously
published results. In addition, we extend the work of Agesen,
Detlefs, and Moss, regarding the so-called “JSR Problem” (the
single exception to Java’s type safety property), in a way that
eliminates the need for extra runtime overhead in the generated
code.

The SKIM II processor is a microcoded hardware machine for the
rapid evaluation of functional languages. This paper gives details
of some of the more novel methods employed by SKIM II, and
resulting performance measurements. The authors conclude that
combinator reduction can still form the basis for the efficient
implementation of a functional language.

We study the cost of storage management for garbage-collected
programs compiled with the Standard ML of New Jersey compiler. We
show that the cost of storage management is not the same as the
time spent garbage collecting. For many of the programs, the time
spent garbage collecting is less than the time spent doing other
storage-management tasks.

Shared environment closure reducers such as Fairbairn and Wray’s
TIM incur a comparatively low cost when creating a suspension, and
so provide an elegant method for implementing lazy functional
evaluation. However, comparatively little attention has been given
to the problems involved in identifying which portions of a shared
environment are needed (and ignoring those which are not) during a
garbage collection. Proper consideration of this issue has subtle
consequences when implementing a storage manager in a TIM-like
system. We describe the problem and illustrate the negative
consequences of ignoring it.

We go on to describe a solution in which the compiler determines
statically which portions of that code’s environment are required
for each piece of code it generates, and emits information to
assist the run-time storage manager to scavenge environments
selectively. We also describe a technique for expressing this
information directly as executable code, and demonstrate that a
garbage collector implemented in this way can perform
significantly better than an equivalent, table-driven interpretive
collector.

Implementations of abstract machines such as the OP-TIM and the
PG-TIM need to use a tailored garbage collector which seems to
require an auxiliary stack,with a potential maximum size that is
directly proportional to the amount of live data in the heap.
However, it turns out that it is possible to build a recursive
copying collector that does not require additional space by
reusing already-scavenged space. This paper is a description of
this technique.

This paper describes a memory management discipline for programs
that perform dynamic memory allocation and de-allocation. At
runtime, all values are put into regions. The store consists of a
stack of regions. All points of region allocation and
de-allocation are inferred automatically, using a type and effect
based program analysis. The scheme does not assume the presence of
a garbage collector. The scheme was first presented in 1994 (M.
Tofte and J.-P. Talpin, in Proceedings of the 21st ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages,
pp. 188–201); subsequently, it has been tested in the ML Kit with
Regions, a region-based, garbage-collection free implementation of
the Standard ML Core Language, which includes recursive datatypes,
higher-order functions and updatable references (L. Birkedal, M.
Tofte, and M. Vejlstrup, (1996), in Proceedings of the 23rd ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages,
pp. 171–183). This paper defines a region-based dynamic semantics
for a skeletal programming language extracted from Standard ML. We
present the inference system which specifies where regions can be
allocated and de-allocated and a detailed proof that the system is
sound with respect to a standard semantics. We conclude by giving
some advice on how to write programs that run well on a stack of
regions, based on practical experience with the ML Kit.

Many interactive computing environments provide automatic storage
reclamation and virtual memory to ease the burden of managing
storage. Unfortunately, many storage reclamation algorithms impede
interaction with distracting pauses. Generation Scavenging is a
reclamation algorithm that has no noticeable pauses, eliminates
page faults for transient objects, compacts objects without
resorting to indirection, and reclaims circular structures, in one
third the time of traditional approaches.

One of the most promising automatic storage reclamation
techniques, generation-based storage reclamation, suffers poor
performance if many objects live for a fairly long time and then
die. We have investigated the severity of the problem by
simulating Generation Scavenging automatic storage reclamation
from traces of actual four-hour sessions. There was a wide
variation in the sample runs, with garbage-collection overhead
ranging from insignificant, during interactive runs, to sever,
during a single non-interactive run. All runs demonstrated that
performance could be improved with two techniques: segregating
large bitmaps and strings, and mediating tenuring with demographic
feedback. These two improvements deserve consideration for any
generation-based storage reclamation strategy.

On C/Unix systems, the malloc interface is standard for dynamic
memory allocation. Despite its popularity, malloc’s shortcomings
frequently cause programmers to code around it. The new library
Vmalloc generalizes malloc to give programmers more control over
memory allocation. Vmalloc introduces the idea of organizing
memory into separate regions, each with a discipline to get raw
memory and a method to manage allocation. Applications can write
their own disciplines to manipulate arbitrary type of memory or
just to better organize memory in a region by creating new regions
out of its memory. The provided set of allocation methods include
general purpose allocations, fast special cases and aids for
memory debugging or profiling. A compatible malloc interface
enables current applications to select allocation methods using
environment variables so they can tune for performance or perform
other tasks such as profiling memory usage, generating traces of
allocation calls or debugging memory errors. A performance study
comparing Vmalloc and currently popular malloc implementations
shows that Vmalloc is competitive to the best of these allocators.
Applications can gain further performance improvement by using the
right mixture of regions with different Vmalloc methods.

Garwick’s algorithm, for repacking LIFO lists stored in a
contiguous block of memory, bases the allocation of remaining
space upon both sharing and previous stack growth. A system
whereby the weight applied to each method can be adjusted
according to the current behaviour of the stacks is discussed.

We also investigate the problem of determining during memory
repacking that the memory is used to saturation and the driving
program should therefore be aborted. The tuning parameters studied
here seem to offer no new grasp on this problem.

GC systems allocate and reuse memory cyclically; this imposes a
cyclic pattern on memory accesses that has its own distinctive
locality characteristics. The cyclic reuse of memory tends to
defeat caching strategies if the reuse cycle is too large to fit
in fast memory. Generational GCs allow a smaller amount of memory
to be reused more often. This improves VM performance, because the
frequently-reused area stays in main memory. The same principle
can be applied at the level of high-speed cache memories, if the
cache is larger than the youngest generation. Because of the
repeated cycling through a fixed amount of memory, however,
generational GC interacts with cache design in unusual ways, and
modestly set-associative caches can significantly outperform
direct-mapped caches.

While our measurements do not show very high miss rates for GCed
systems, they indicate that performance problems are likely in
faster next-generation systems, where second-level cache misses
may cost scores of cycles. Software techniques can improve cache
performance of garbage-collected systems, by decreasing the cache
“footprint” of the youngest generation; compiler techniques that
reduce the amount of heap allocation also improve locality. Still,
garbage-collected systems with a high rate of heap allocation
require somewhat more cache capacity and/or main memory bandwidth
than conventional systems.

Pointer swizzling at page fault time is a novel address
translation mechanism that exploits conventional address
translation hardware. It can support huge address spaces
efficiently without long hardware addresses; such large address
spaces are attractive for persistent object stores, distributed
shared memories, and shared address space operating systems. This
swizzling scheme can be used to provide data compatibility across
machines with different word sizes, and even to provide binary
code compatibility across machines with different hardware address
sizes.

Pointers are translated (“swizzled”) from a long format to a
shorter hardware-supported format at page fault time. No extra
hardware is required, and no continual software overhead is
incurred by presence checks of indirection of pointers. This
pagewise technique exploits temporal and spatial locality in much
the same way as normal virtual memory; this gives it many
desirable performance characteristics, especially given the trend
toward larger main memories. It is easy to implement using common
compilers and operating systems.

We survey basic garbage collection algorithms, and variations such
as incremental and generational collection; we then discuss
low-level implementation considerations and the relationships
between storage management systems, languages, and compilers.
Throughout, we attempt to present a unified view based on abstract
traversal strategies, addressing issues of conservatism,
opportunism, and immediacy of reclamation; we also point out a
variety of implementation details that are likely to have a
significant impact on performance.

Dynamic memory allocation has been a fundamental part of most
computer systems since roughly 1960, and memory allocation is
widely considered to be either a solved problem or an insoluble
one. In this survey, we describe a variety of memory allocator
designs and point out issues relevant to their design and
evaluation. We then chronologically survey most of the literature
on allocators between 1961 and 1995. (Scores of papers are
discussed, in varying detail, and over 150 references are given.)

We argue that allocator designs have been unduly restricted by an
emphasis on mechanism, rather than policy, while the latter is
more important; higher-level strategic issues are still more
important, but have not been given much attention.

Most theoretical analyses and empirical allocator evaluations to
date have relied on very strong assumptions of randomness and
independence, but real program behavior exhibits important
regularities that must be exploited if allocators are to perform
well in practice.

A new buddy system is described in which the region of storage
being managed is partitioned into two sub-regions, each managed by
a fairly standard “binary” buddy system. Like the weighted buddy
systems of Shen and Peterson, the block sizes are of sizes 2n+1 or 3·2n, but unlike theirs
there is no extra overhead for typing information or for buddy
calculation, and an allocation which requires splitting an extant
available block only rarely creates a block smaller than the one
being allocated. Such smaller blocks are carved out only when the
boundary between the two subregions floats; the most interesting
property of this system is that the procedures for allocation and
deallocation are designed to keep blocks immediately adjacent to
the subregion boundary free, so that the boundary may be moved
within a range of unused space without disturbing blocks in use.
This option is attained with a minimum of extra computation beyond
that of a binary buddy system, and provides this scheme with a new
approach to the problem of external fragmentation.

The two-pass compaction algorithm of F.L. Morris, which follows
upon the mark phase in a garbage collector, may be modified to
recover reference counts for a hybrid storage management system.
By counting the executions of two loops in that algorithm where
upward and downward references, respectively, are forwarded to the
relocation address of one node, we can initialize a count of
active references and then update it but once. The reference count
may share space with the mark bit in each node, but it may not
share the additional space required in each pointer by Morris’s
algorithm, space which remains unused outside the garbage
collector.

A project to design a pair of memory chips with a modicum of
intelligence is described. Together, the two allow simple
fabrication of a small memory bank, a heap of binary (LISP-like)
nodes that offers the following features: 64-bit nodes; two
pointer fields per node up to 29 bits each; reference counts
implicitly maintained on writes; 2 bits per node for marking
(uncounted) circular references; 4 bits per node for
conditional-store testing at the memory; provision for
processor-driven, recounting garbage collection.

A stop-and-copy garbage collector updates one-bit reference
counting with essentially no extra space and minimal memory cycles
beyond the conventional collection algorithm. Any object that is
uniquely referenced during a collection becomes a candidate for
cheap recovery before the next one, or faster recopying then if it
remains uniquely referenced. Since most objects stay uniquely
referenced, subsequent collections run faster even if none are
recycled between garbage collections. This algorithm extends to
generation scavenging, it admits uncounted references from roots,
and it corrects conservatively stuck counters, that result from
earlier uncertainty whether references were unique.

Identifying some pointers as invisible threads, for the purposes
of storage management, is a generalization from several widely
used programming conventions, like threaded trees. The necessary
invariant is that nodes that are accessible (without threads) emit
threads only to other accessible nodes. Dynamic tagging or static
typing of threads ameliorates storage recycling both in functional
and imperative languages.

We have seen the distinction between threads and links sharpen
both hardware- and software-supported storage management in
SCHEME, and also in C. Certainly, therefore, implementations of
languages that already have abstract management and concrete
typing, should detect and use this as a new static type.

A hardware self-managing heap memory (RCM) for languages like
LISP, SMALLTALK, and JAVA has been designed, built, tested and
benchmarked. On every pointer write from the processor,
reference-counting transactions are performed in real time within
this memory, and garbage cells are reused without processor
cycles. A processor allocates new nodes simply by reading from a
distinguished location in its address space. The memory hardware
also incorporates support for off-line, multiprocessing,
mark-sweep garbage collection.

Performance statistics are presented from a partial implementation
of SCHEME over five different memory models and two garbage
collection strategies, from main memory (no access to RCM) to a
fully operational RCM installed on an external bus. The
performance of the RCM memory is more than competitive with main
memory.

A group at Symbolics is developing a Lisp runtime kernel, derived
from its Genera operating system, to support real-time control
applications. The first candidate application has strict
response-time requirements (so strict that it does not permit the
use of paged virtual memory). Traditionally, Lisp’s automatic
storage-management mechanism has made it unsuitable to real-time
systems of this nature. A number of garbage collector designs and
implementations exist (including the Genera garbage collector)
that purport to be “real-time”, but which actually have only
mitigated the impact of garbage collection sufficiently that it
usually goes unnoticed by humans. Unfortunately,
electro-mechanical systems are not so forgiving. This paper
examines the limitations of existing real-time garbage collectors
and describes the avenues that we are exploring in our work to
develop a CLOS-based garbage collector that can meet the real-time
requirements of real real-time systems.

The thesis of this project is that incremental collection can be
done feasibly and efficiently in an architecture and compiler
independent manner. The design and implementation of an
incremental, generational mostly-copying garbage collector for C++
is presented. The collector achieves, simultaneously, real-time
performance (from incremental collection), low total garbage
collection delay (from generational collection), and the ability
to function without hardware and compiler support (from
mostly-copying collection).

The incremental collector runs on commercially-available
uniprocessors, such as the DECStation 3100, without any special
hardware support. It uses UNIX’s user controllable page protection
facility (mprotect) to synchronize between the scanner (of the
collector) and the mutator (of the application program). Its
implementation does not require any modification to the C++
compiler. The maximum garbage collection pause is well within the
100-millisecond limit imposed by real-time applications executing
on interactive workstations. Compared to its non-incremental
version, the total execution time of the incremental collector is
not adversely affected.

An algorithm for real-time garbage collection is presented, proved
correct, and evaluated. This algorithm is intended for
list-processing systems on general-purpose machines, i.e., Von
Neumann style serial computers with a single processor. On these
machines, real-time garbage collection inevitably causes some
overhead on the overall execution of the list-processing system,
because some of the primitive list-processing operations must
check the status of garbage collection. By removing such overhead
from frequently used primitives such as pointer references (e.g.,
Lisp car and cdr) and stack manipulations, the presented algorithm
reduces the execution overhead to a great extent. Although the
algorithm does not support compaction of the whole data space, it
efficiently supports partial compaction such as array relocation.

This paper describes inprof, a tool used to study the memory
allocation behavior of programs. mprof records the amount of
memory each function allocates, breaks down allocation information
by type and size, and displays a program’s dynamic cal graph so
that functions indirectly responsible for memory allocation are
easy to identify. mprof is a two-phase tool. The monitor phase is
linked into executing programs and records information each time
memory is allocated. The display phase reduces the data generated
by the monitor and displays the information to the user in several
tables. mprof has been implemented for C and Kyoto Common Lisp.
Measurements of these implementations are presented.

This thesis shows that object-level, trace-driven simulation can
facilitate evaluation of language runtime systems and reaches new
conclusions about the relative performance of important garbage
collection algorithms. In particular, I reach the unexpected
conclusion that mark-and-sweep garbage collection, when augmented
with generations, shows comparable CPU performance and much better
reference locality than the more widely used copying algorithms.
In the past, evaluation of garbage collection algorithms has been
limited by the high cost of implementing the algorithms.
Substantially different algorithms have rarely been compared in a
systematic way.

With the availability of high-performance, low-cost workstations,
trace-driven performance evaluation of these algorithms is now
economical. This thesis describes MARS, a runtime system simulator
that is driven by operations on program objects, and not memory
addresses. MARS has been attached to a commercial Common Lisp
system and eight large Lisp applications are used in the thesis as
test programs. To illustrate the advantages of the object-level
tracing technique used by MARS, this thesis compares the relative
performance of stop-and-copy, incremental, and mark-and-sweep
collection algorithms, all organized with multiple generations.
The comparative evaluation is based on several metrics: CPU
overhead, reference locality, and interactive availability.

Mark-and-sweep collection shows slightly higher CPU overhead than
stop-and-copy ability (5 percent), but requires significantly less
physical memory to achieve the same page fault rate (30-40
percent). Incremental collection has very good interactive
availability, but implementing the read barrier on stock hardware
incurs a substantial CPU overhead (30-60 percent). In the future,
I will use MARS to investigate other performance aspects of
sophisticated runtime systems.

Stop-and-copy garbage collection has been preferred to
mark-and-sweep collection in the last decade because its
collection time is proportional to the size of reachable data and
not to the memory size. This paper compares the CPU overhead and
the memory requirements of the two collection algorithms extended
with generations, and finds that mark-and-sweep collection
requires at most a small amount of additional CPU overhead (3-6%)
but requires an average of 20% (and up to 40%) less memory to
achieve the same page fault rate. The comparison is based on
results obtained using trace-driven simulation with large Common
Lisp programs.

Garbage collection algorithms have been enhanced in recent years
with two methods: generation-based collection and Baker
incremental copying collection. Generation-based collection
requires special actions during certain store operations to
implement the “write barrier”. Incremental collection requires
special actions on certain load operations to implement the “read
barrier”. This paper evaluates the performance of different
implementations of the read and write barriers and reaches several
important conclusions. First, the inlining of barrier checks
results in surprisingly low overheads, both for the write barrier
(2%-6%) and the read barrier (&lt; 20%). Contrary to previous
belief, these results suggest that a Baker-style read barrier can
be implemented efficiently without hardware support. Second, the
use of operating system traps to implement garbage collection
methods results in extremely high overheads because the cost of
trap handling is so high. Since this large overhead is completely
unnecessary, operating system memory protection traps should be
reimplemented to be as fast as possible. Finally, the performance
of these approaches on several machine architectures is compared
to show that the results are generally applicable.

Cache performance is an important part of total performance in
modern computer systems. This paper describes the use of
trace-driven simulation to estimate the effect of garbage
collection algorithms on cache performance. Traces from four large
Common Lisp programs have been collected and analyzed with an
all-associativity cache simulator. While previous work has focused
on the effect of garbage collection on page reference locality,
this evaluation unambiguously shows that garbage collection
algorithms can have a profound effect on cache performance as
well. On processors with a direct-mapped cache, a generation
stop-and-copy algorithm exhibits a miss rate up to four times
higher than a comparable generation mark-and-sweep algorithm.
Furthermore, two-way set-associative caches are shown to reduce
the miss rate in stop-and-copy algorithms often by a factor of two
and sometimes by a factor of almost five over direct-mapped
caches. As processor speeds increase, cache performance will play
an increasing role in total performance. These results suggest
that garbage collection algorithms will play an important part in
improving that performance.

Dynamic memory management is an important part of a large class of
computer programs and high-performance algorithms for dynamic
memory management have been, and will continue to be, of
considerable interest. This paper presents empirical data from a
collection of six allocation-intensive C programs. Extensive
statistics about the allocation behavior of the programs measured,
including the distributions of object sizes, lifetimes, and
interarrival times, are presented. This data is valuable for the
following reasons: first, the data from these programs can be used
to design high-performance algorithms for dynamic memory
management. Second, these programs can be used as a benchmark test
suite for evaluating and comparing the performance of different
dynamic memory management algorithms. Finally, the data presented
gives readers greater insight into the storage allocation patterns
of a broad range of programs. The data presented in this paper is
an abbreviated version of more extensive statistics that are
publicly available on the internet.

Because dynamic memory management is an important part of a large
class of computer programs, high-performance algorithms for
dynamic memory management have been, and will continue to be, of
considerable interest. Experience indicates that for many
programs, dynamic storage allocation is so important that
programmers feel compelled to write and use their own
domain-specific allocators to avoid the overhead of system
libraries. Conservative garbage collection has been suggested as
an important algorithm for dynamic storage management in C
programs. In this paper, I evaluate the costs of different dynamic
storage management algorithms, including domain-specific
allocators; widely-used general-purpose allocators; and a publicly
available conservative garbage collection algorithm. Surprisingly,
I find that programmer enhancements often have little effect on
program performance. I also find that the true cost of
conservative garbage collection is not the CPU overhead, but the
memory system overhead of the algorithm. I conclude that
conservative garbage collection is a promising alternative to
explicit storage management and that the performance of
conservative collection is likely to be improved in the future. C
programmers should now seriously consider using conservative
garbage collection instead of malloc/free in programs they write.

Because dynamic memory management is an important part of a large
class of computer programs, high-performance algorithms for
dynamic memory management have been, and will continue to be, of
considerable interest. We evaluate and compare models of the
memory allocation behavior in actual programs and investigate how
these models can be used to explore the performance of memory
management algorithms. These models, if accurate enough, provide
an attractive alternative to algorithm evaluation based on
trace-driven simulation using actual traces. We explore a range of
models of increasing complexity including models that have been
used by other researchers. Based on our analysis, we draw three
important conclusions. First, a very simple model, which generates
a uniform distribution around the mean of observed values, is
often quite accurate. Second, two new models we propose show
greater accuracy than those previously described in the
literature. Finally, none of the models investigated appear
adequate for generating an operating system workload.