CXXR: Refactoring History

This page describes the phases so far completed within the CXXR project
to refactor the R engine into C++. Each phase is placed within the
Subversion tags directory, with a name of the form 0.00-2.5.0,
where 0.00 indicates the phase, and 2.5.0
indicates the R release to which that phase is intended to correspond.

Phase 0: 0.00-2.5.0

In this phase all .c files within src/main
are renamed to .cpp, with the following exceptions:

complex.c: This file uses the C99 complex types, which
are not (under the current C++ standard) understood by a C++ compiler;

gram.c: This file is automatically generated by
yacc/bison;

regex.c: The source of this file is very insistent that
it is C, not C++: it gives a #warning if you attempt to
compile it with a C++ compiler.

(Subsequently, RNG.c was also reverted to C, to respect
Knuth's copyright statement.)

The result of this phase does not build correctly; however, it is useful
as a baseline for seeing the subsequent changes.

Phase 1: 0.01-2.5.0

Make such changes to the result of Phase 0 as to enable the .cpp
files to compile without warning using -Wall with gcc-4.1.3,
retaining C linkage conventions for everything defined in .h
files. Ensure that the whole of R will build correctly and pass make
check.

A desirable side effect of enforcing C linkage was that the linkage
editor picked up several instances where the source file implementing a
function failed to #include the appropriate header file, and
consequently generated a function with C++ linkage: see below.

This phase needed to address the following issues:

Rboolean is different from C++ bool. Rboolean
is an enumeration with elements FALSE=0 and TRUE=1;
bool is a primitive type, with values false
and true. (Also, there are #defines of FALSE
to 0 and TRUE to 1 lurking around in the R code, just to confuse
matters.) In particular an Rboolean is a different size
from a bool. It was necessary to introduce many explicit
conversions from bool (resulting in C++ from evaluating
Boolean expressions) or integer types to Rboolean.

In connection with this, a macro RBOOL(x) was defined
within Rinlinedfuns.h; it expands to x in C
and Rboolean(x) in C++.
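
A minimal sketch of how such a macro might look. The enumeration below is a stand-in for the one in R's headers, the helper logicalNot is hypothetical, and the exact text in Rinlinedfuns.h may well differ:

```cpp
#include <cassert>

// Stand-in for R's Rboolean enumeration (assumed here for illustration).
typedef enum { FALSE = 0, TRUE = 1 } Rboolean;

// In C the macro leaves its argument alone; in C++ it performs the
// explicit conversion that the language demands from bool or int.
#ifdef __cplusplus
#define RBOOL(x) Rboolean(x)
#else
#define RBOOL(x) (x)
#endif

// Hypothetical helper: !b has type bool in C++, so it cannot be returned
// as an Rboolean without an explicit conversion.
inline Rboolean logicalNot(Rboolean b)
{
    assert(b == TRUE || b == FALSE);
    return RBOOL(!b);
}
```

The point of routing the conversion through a macro is that the same source line remains valid whether the file is compiled as C or as C++.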

The C++ keywords class, new, private
and this were used as identifiers; these had to be
renamed, e.g. class changed to connclass.

In various places, particularly connections.cpp, a void*
was implicitly converted to another type of pointer. These conversions
were made explicit, and flagged /*CCAST*/.

datetime.cpp and memory.cpp used statements
of the form i -= d; where i is of integer
type and d is an expression evaluating to a floating point
type. This was converted to the form i = int(i - (d)); to
avoid a compiler warning. This interpretation complies with
sec. 6.5.12.2 of the C99 standard ISO/IEC
9899:1999.
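
The equivalence of the two spellings can be illustrated as follows. This is only a sketch: the function name is hypothetical, and the values are assumed to stay within the range of int.

```cpp
#include <cassert>

// Hypothetical helper contrasting the two spellings. Values are assumed
// to stay within the range of int.
inline int subtractTruncating(int i, double d)
{
    int viaCompound = i;
    viaCompound -= d;          // original form: implicit double-to-int conversion
    i = int(i - (d));          // rewritten form: the same conversion, made explicit
    assert(i == viaCompound);  // both spellings give the same result
    return i;
}
```

In both forms the subtraction is carried out in floating point and the result is then truncated towards zero on conversion back to int.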

The structure type NewDevDesc defined in GraphicsDevice.h
contains a number of pointers to functions as members, and the types of
these functions were specified without giving the number and types of
the function arguments. This was rectified. It was also necessary to
give this structure a tag (_NewDevDesc) because most of
these functions included a pointer to a NewDevDesc among
their arguments.

It was necessary to shift some of the material in R_ext/GraphicsEngine.h,
in particular the definition of R_GE_context, into a new
header file R_ext/GraphicsContext.h, to avoid reciprocal
dependencies between GraphicsEngine.h and GraphicsDevice.h.

The pointer to function type CCODE, defined in Defn.h,
was redefined to make the number and type of its arguments explicit, as
follows:

typedef SEXP (*CCODE)(SEXP, SEXP, SEXP, SEXP);

If __MAIN__ is defined, libextern.h #defined
extern to the empty string, which could play havoc with the
extern "C" used in C++ to enforce C-style linkage. This #define
was commented out, and instead a new macro extern1 was #defined
within Defn.h.

In numerous places it was necessary to make conversions from
floating-point types to integer types explicit. In other places it was
clear that the same effect could be achieved without deleterious side
effect by changing the type of a variable.

It was necessary to introduce reinterpret_casts in
various places in memory.cpp, scan.cpp, serialize.cpp
and vfonts.cpp. (In future it is the intention to get rid
of as many of these as possible, as well as getting rid of all C-style
casts.)

In Defn.h, the whole declaration extern FUNTAB
R_FunTab[]; was made #ifndef __R_Names__, not
just the word extern.

In some places I couldn't resist changing the type of a function
argument from a plain pointer to a const pointer. We can
expect much more of this later, but this may have been premature.

sysutils.cpp (conditionally) contained an extern
declaration of environ; the compiler considered this to
have C++ linkage, conflicting with the C-linkage definition in unistd.h
(subsequently #included into sysutils.cpp).
This extern declaration has itself been replaced by a
(conditional) #include of unistd.h.

Sorted out problems where a file implementing a function failed to
include the relevant header file. In some cases this was because the
prototype didn't appear in any header file, and clients of the
function were instead relying on a prototype within the client source
file itself! Such misplaced prototypes were found in eval.cpp,
format.cpp, memory.cpp, platform.cpp,
printutils.cpp, and library/methods/src/methods_list_dispatch.c;
they were commented out, and flagged with the comment "Use header
files!". Needed prototypes that didn't appear in any header file were
generally placed at the end of Defn.h.

A particularly obscure example of this kind concerns R_CHAR.
This is declared as a pointer to a function in Rinternals.h,
and implemented in memory.cpp. Now memory.cpp
does #include Rinternals.h, but it does so
with USE_RINTERNALS defined, as a result of which the R_CHAR
declaration in the header file isn't seen by the compiler, and so the
implemented function got C++ linkage. I modified the header file by
moving the R_CHAR declaration outside the #ifndef
USE_RINTERNALS.

The definitions in print.cpp of functions intended to be
called from FORTRAN needed to be surrounded by extern "C"{
... }.
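
As an illustration of the pattern, here is a sketch of a C++ definition wrapped so as to be callable from FORTRAN. The routine dblsum_ is hypothetical and does not come from print.cpp; the trailing underscore and the pass-by-reference arguments follow a common but compiler-dependent FORTRAN calling convention.

```cpp
#include <cassert>

extern "C" {

// Hypothetical routine callable from FORTRAN. Without the extern "C"
// wrapper the C++ compiler would mangle the name and the FORTRAN
// caller would fail to link against it.
void dblsum_(const double* x, const int* n, double* result)
{
    assert(x && n && result);
    double total = 0.0;
    for (int i = 0; i < *n; ++i)
        total += x[i];
    *result = total;
}

}  // extern "C"
```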

deparse.cpp:1191 used & where &&
was surely intended; character.cpp:738 similarly used |
instead of ||.

-Wall complains about attempts to compare signed with
unsigned. This required explicit conversions in numerous places.
Generally (but not always) I did this by converting unsigned to signed.
In other places it was clear that the same effect could be achieved
without deleterious side effect by changing the type of a variable.

In connection with this, the macro AGE_NODE in memory.cpp
had to be changed to make an__g__ unsigned.

Phase 2: 0.02-2.5.0

In subsequent phases (possibly starting in Phase 3) it is our objective
to replace the SEXPREC union by a hierarchy of C++ classes. This phase
prepares for that by reorganising the material in the header files in src/include.
This involves creating a new subdirectory src/include/CXXR,
and within that creating a new header file RObject.h
(ultimately to include a base class RObject for the new
hierarchy), and further header files RClosure.h, REnvironment.h,
RInternalFunction.h, RPairList.h, RPromise.h,
RSymbol.h and RVector.h, corresponding
respectively to closxp_struct, envsxp_struct,
primsxp_struct, listsxp_struct, promsxp_struct,
symsxp_struct and vecsxp_struct, which will
eventually be derived classes. The material in these new headers comes
predominantly from Rinternals.h, but to some extent (in the
case of RInternalFunction.h) from Defn.h. All
of the new header files, with the exception of RInternalFunction.h,
are also installed in $(rincludedir)/CXXR.

Function prototypes moved into the new header files are documented using
doxygen. Where it was clearly
consistent with the semantics, some of the argument types of the functions
were changed, either by adding const, or by converting int
into Rboolean (however, see the issues below regarding the
latter).

The following are implementational details and issues that arose:

The implementation of SEXPREC (though still the
unchanged C code) was made visible only to C++ programs. This is to get
advance warning of potential problems when the implementation is
changed to C++.

In many places CR defined a name as a macro when USE_RINTERNALS
was defined, and otherwise as a function. It has been the intention in
this phase to replace the macros with C++ inline functions: these would
automatically also generate a non-inlined form, so the separate
definition (usually in memory.cpp) could be dispensed
with.

This was all very well where the function form was implemented in CR
simply by invoking the macro; however in some cases the function form
carried out some error checking before invoking the macro. Trying to
convert the macro to an inline function would then result in two
distinct functions with the same name, which the compiler and/or
linker would certainly reject.

In the end it was decided to leave the macros in place for the time
being: they'll have to be changed when the C++ implementation rolls
out anyway.

I considered getting rid of the USE_RINTERNALS
compilation condition, but decided to retain it to mark out material
(usually currently in the form of macro definitions) that will in the
future need privileged access to a C++ class. Only memory.cpp
now #defines USE_RINTERNALS.

Rinternals.h contained many #defines of
function names to the same name prefixed by Rf_: this
appears to correspond in C++ terms to putting these functions in a
namespace. I split these #defines out into a separate
header file Rf_namespace.h, which is #included
by RObject.h (which is in turn included by the other new
headers). There are various similar #defines scattered
around other CR header files, which may need to be moved into Rf_namespace.h
in due course.

I dithered about whether to name the file in question RInternalFunction.h
or RPrimitiveFunction.h. Usage in the CR code (e.g. primsxp_struct)
suggests the latter, and the R Internals document speaks of internal and
primitive functions as being mutually exclusive, but fails to give a
more general name covering any function handled via R_FunTab.
But it seems to be reasonable to regard primitive functions as a special
case of an internal function, hence the eventual choice of RInternalFunction.h.

It is noted that Rdynload.cpp and dotcode.cpp
each give compiler warnings under -pedantic because they
attempt to cast function pointers to void*. The source
code of the former already contains a comment saying that it's illegal
even in C. Not easy to fix, so leave for now.

It seems logical (!) that a logical vector (LGLSXP)
should contain items of type Rboolean rather than of type
int, and consequently that the macro/function LOGICAL(SEXP)
should return Rboolean* rather than int*. I
made some attempt to do this, but backed out of it for the following
reasons:

The .C interface expects these vectors to contain ints;

ISO14882:1998 says that in C++, subject to certain constraints, it
is implementation-defined which integral type is used as the
underlying type for an enumeration (though gcc happens
to use int for Rboolean).

ISO9899:1999 says much the same for C, but with differently worded
constraints.

In any case, despite the commented-out MAYBE value
in the enumeration, perhaps Rboolean is best thought
of as 'bool for C', rather than having any capability
to handle NAs.

Possible new policy: within functions visible from C, use Rboolean
as a substitute for C++ bool, possibly constrained to be
32 bits long to avoid the enum implementation
dependencies noted above. However, R logical vectors will continue to
be represented using ints. (One day we might define an Rlogical
class - a wrapper round an int - to handle logical
vectors within C++, while C programs simply see typedef int
Rlogical;.)

Phase 3: 0.03-2.5.0

The primary objective of this phase was to redefine R_NilValue
as a null (i.e. zero) pointer of type SEXP. R_NilValue
is widely used within CR as a stub, i.e. to signify that something that
might be present is absent, in much the same way that a null pointer is
used within C or C++. However, in CR it is actually implemented in effect
as an element of a pairlist (i.e. struct listsxp), whose
CAR, CDR, TAG and attributes all point to itself. This would cause
difficulties in CXXR when we reimplement the SEXPREC union
as a type hierarchy, because pairlist elements will need to be of a
specific type within the hierarchy. If R_NilValue were given
this type, it would preclude its use as a general-purpose stub. But zero
is a possible value for a pointer of any type, so if we equate R_NilValue
to zero this will sidestep the problem.

Another disadvantage of the CR definition of R_NilValue is
that it needlessly introduces a cyclic data structure.

The following are implementational details and issues that arose in
carrying out this change:

The existing code in many places invokes functions/macros CAR,
CDR, TAG and ATTRIB on a SEXP
that may in fact be R_NilValue, expecting each of these
functions to return R_NilValue in that case. These
functions were reimplemented to preserve this behaviour: i.e. each of
them returns a null pointer if passed a null pointer. At the same time
the macro forms were abolished: they are now implemented as inline
functions for C++, and ordinary functions if called from C.

In the same spirit, OBJECT and IS_S4_OBJECT
have been reimplemented to return FALSE if passed a zero
pointer. They too are now implemented as inline functions for C++, and
ordinary functions if called from C.

No such modification was made to NAMED: the policy here
is that the calling code should be modified as necessary to prevent it
being invoked for a null pointer. Deal similarly with invocations of SET_NAMED,
PRINTNAME, NODE_IS_MARKED, SET_ATTRIB,
SET_OBJECT, and LENGTH. (This last case is
interesting because LENGTH is meant to be applied to
vector objects, i.e. components of the SEXPREC union
different from struct listsxp.) The calling sites
concerned were determined by running make check at
top-level: doubtless many have slipped through the net!

Incidental to the above changes, some of the macros in memory.cpp
were replaced by inline functions.
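
The two policies described above (accessors that map null to null, versus accessors that leave the check to the caller) can be sketched as follows. The Node structure is a drastically simplified stand-in for a pairlist element, and listLength is a hypothetical helper; none of this is the actual CXXR code.

```cpp
#include <cassert>

// Simplified stand-in for a pairlist node; the real SEXPREC is far richer.
struct Node {
    Node* car;
    Node* cdr;
    Node* tag;
    int named;
};
typedef Node* SEXP;

// With R_NilValue a null pointer, these accessors map null to null,
// mimicking CR's self-referential nil node.
inline SEXP CAR(SEXP e) { return e ? e->car : 0; }
inline SEXP CDR(SEXP e) { return e ? e->cdr : 0; }
inline SEXP TAG(SEXP e) { return e ? e->tag : 0; }

// By contrast, NAMED places the burden of the null check on the caller.
inline int NAMED(SEXP e)
{
    assert(e);
    return e->named;
}

// Hypothetical helper: walks a pairlist until it reaches the null terminator.
inline int listLength(SEXP s)
{
    int n = 0;
    for (; s; s = CDR(s))
        ++n;
    return n;
}
```

Because CDR tolerates a null argument, idioms such as CDR(CDR(x)) remain safe even when the list is shorter than the caller expects.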

A secondary objective of this phase was to get rid of C-style casts
within the C++ code, wherever the appropriate remedy was reasonably
obvious and straightforward. The following kinds of C-style casts were
left in place pending further work:

Casts from one function pointer type to another (often involving DL_FUNC);

Casts from one struct pointer type to another (often involving DevDesc
and GEDevDesc);

Use of the construct (void*)(-1);

Casts to/from R_varloc_t;

Other puzzling casts.

Addendum 2007/08/06: although make check works with this
release, make check-devel doesn't.

Phase 4: 0.04-2.5.1

The primary objective of this phase was to update the program to parallel
release 2.5.1 of R. This proved to be straightforward, except that it was
necessary to install a later version of svn_load_dirs.pl to
cope with filenames containing @ signs. (However, I was
surprised to discover that svn merge doesn't track renames.)

Other changes were as follows:

Bugs revealed by make check-devel were fixed. In general
this was done by modifying certain functions to behave reasonably if
passed a null pointer, namely LENGTH (returns 0), NAMED
(returns 0) and SET_NAMED (does nothing). These changes
obviated some of the changes made leading up to svn revision 49 (see
Phase 3 above), and these changes were accordingly reversed. make
check-all also now works, but it was time-consuming to run and
revealed no bugs.

I managed to get autoconf working properly, and
accordingly backed out of some configuration kludges I had made
previously.

Phase 5: 0.05-2.5.1

The aim of this phase was to create a branch entitled const,
to explore to what extent the R code is amenable to 'constifying': i.e.
converting pointers and C++ references wherever possible to const
pointers. Two preliminary steps, carried out in the trunk, were as
follows:

In the C++ source files in main, macros were replaced by inline
functions wherever it was reasonably straightforward to do so. (The
reason for doing this now was that during the constification process, it
was usually extremely difficult to see what the compiler was complaining
about if a multiline macro was involved.)

Similar changes were made to the header files under src/include:
however, the pattern here was to convert a macro to an inline function
if the header file was #included into a C++ file, and
to an out-of-line call to the same function if the header file was #included
into a C file.

This macro conversion was contraindicated in the following
circumstances:

The body of the macro was not syntactically equivalent to a
function call;

The macro used ##;

The macro modified its arguments, e.g. something like

#define INC(x) ++(x)

(Using C++ reference arguments to get round this is not as
straightforward as it might seem.)

The macro referred to local variables at the point of call
(although in some cases such macros were converted to inline
functions with additional arguments);

In some cases macros were left in place if they expanded to a
single C/C++ expression or to a single macro invocation: it is the
multiline macros that are particularly opaque.
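
To illustrate the case of a macro that modifies its argument, here is a sketch of one difficulty with the reference-based approach (not necessarily the one I had in mind at the time): a non-const reference cannot bind to a bit-field, whereas the macro form works on any modifiable lvalue. The names inc, Flags and bumpTwice are hypothetical.

```cpp
#include <cassert>

// The macro form works on any modifiable lvalue, including bit-fields.
#define INC(x) ++(x)

// A reference-taking replacement. It cannot be applied to a bit-field,
// because a non-const reference cannot bind to one.
template <typename T>
inline T& inc(T& x)
{
    return ++x;
}

struct Flags {
    unsigned count : 4;   // bit-field member
};

// Hypothetical helper exercising both forms.
inline unsigned bumpTwice(Flags& f, int& plain)
{
    assert(f.count <= 13);  // two increments must not overflow the 4-bit field
    INC(f.count);           // fine: plain textual substitution
    // inc(f.count);        // would not compile: cannot bind a bit-field to T&
    inc(plain);             // fine: ordinary lvalue
    INC(f.count);
    return f.count;
}
```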

An incidental change was this: Until now, the type SEXPREC
was defined along the following lines:

typedef struct SEXPREC { ... } SEXPREC;

with the first occurrence of SEXPREC being what in C
would have been a structure tag. This has now been changed to:

typedef struct RObject { ... } SEXPREC;

exploiting the fact that in C++ RObject is a
fully-fledged class name. The header files in src/include/CXXR
now generally refer to RObject rather than SEXPREC.

Having established the const branch, constification was set
in train by the brute force measure of redefining SEXP to
mean const RObject* rather than simply RObject*;
a new typedef mapped vSEXP onto plain RObject*.
In the same spirit 'v' variants of many of the accessor
functions were introduced: for example now CAR takes a SEXP
argument and returns a SEXP, while vCAR takes
and returns a vSEXP. (Since these accessor functions are
required to be callable from C, we can't simply overload CAR.)
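
A minimal sketch of the arrangement, with RObject cut down to a single field. The setCar helper is hypothetical; the typedefs and the CAR/vCAR pairing follow the description above.

```cpp
#include <cassert>

struct RObject {
    RObject* car;
};

typedef const RObject* SEXP;   // read-only view (as redefined in the const branch)
typedef RObject* vSEXP;        // 'v' for variable: permits modification

extern "C" {

// C linkage rules out overloading, hence two distinct names.
inline SEXP CAR(SEXP e)
{
    assert(e);
    return e->car;
}

inline vSEXP vCAR(vSEXP e)
{
    assert(e);
    return e->car;
}

}  // extern "C"

// Hypothetical helper: modifying a node requires the 'v' flavour throughout.
inline void setCar(vSEXP cell, vSEXP value)
{
    assert(cell);
    cell->car = value;
}
```

Anything obtained through CAR can never be passed to setCar without a cast, which is exactly how the need for a 'v' spreads from one function to its callers.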

I then attempted to recompile various files, inserting 'v's
wherever the compiler demanded it. It quickly became apparent that these 'v's
were highly contagious: for example, both NA_STRING and R_EmptyEnv
had to be declared as vSEXPs rather than SEXPs.
This led me to the conclusion that it was premature to attempt
constification until I understand the evaluation process better.

At the time of tagging this release, the following files compile without
warnings in the const branch: memory.cpp, envir.cpp
and names.cpp. eval.cpp gives one compilation
error, when do_function attempts a non-const operation on
its op argument: fixing this would mean changing the
signature of all the do_ functions.

Phase 6: 0.06-2.5.1

In CR, each SEXPREC has a node class in the range 0 to 7.
Nodes of non-vector SEXPTYPE (i.e. not of types CHARSXP,
LGLSXP, INTSXP, REALSXP, CPLXSXP,
STRSXP, VECSXP, EXPRSXP, WEAKREFSXP
or RAWSXP) are all in class 0, and are 28 bytes long. Class
7 is used for vector nodes whose vector data amount to more than
128 bytes; the remaining classes are used for smaller vectors, classified
according to their size. Nodes of class 7 are allocated directly using malloc;
nodes of the remaining classes are allocated from 'pages' about 2 kB in
size, with each node class having its own pages. In CXXR it is intended to
replace SEXPRECs with an extensible class hierarchy (rooted
at RObject), so it will not be feasible to put a tight upper
bound on the size of non-vector nodes.

Another feature of CR is that in vector nodes, a single block of memory
contains the data of the vector preceded by a SEXPREC and
information about the length of the vector. This is quite incompatible
with the design philosophy of C++, which is that the size of an object
must be deducible from its (C++) type: in particular ::operator
delete relies on this.

The purpose of Phase 6 was to circumvent these problems, and at the same
time to endeavour to decouple the code for allocating memory from the code
managing garbage collection. This comprised the following changes:

A new class CXXR::Heap was created to handle allocation
and deallocation of blocks of memory. This parallels CR to the extent
that requests for large blocks are passed on directly to ::operator
new, while requests for small blocks are satisfied by
allocating fixed-sized cells carved out of 'superblocks'. However, this
is an implementational detail and is not visible to the remainder of
CXXR: only the total number of bytes and the total number of blocks
allocated via CXXR::Heap are visible (using
static member functions).

It is intended that CXXR::Heap will serve as a back-end
to implementations of operator new and to an STL-compatible Allocator
class. Note in particular that the blocks allocated from CXXR::Heap
are not exclusively used to create RObjects, but may be
used for any purpose where rapid allocation/deallocation of small
blocks is required.
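
The small-block side of this scheme can be sketched as a pool of fixed-size cells carved out of superblocks and chained on a free list. The class CellPool below is hypothetical and is not the CXXR::Heap implementation: it handles a single cell size, does not forward large requests, and guarantees only pointer alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size cell pool, not the CXXR::Heap implementation.
class CellPool {
public:
    CellPool(std::size_t cellSize, std::size_t cellsPerSuperblock)
        : m_cellSize(roundUp(cellSize)),
          m_cellsPerSuperblock(cellsPerSuperblock),
          m_free(0), m_allocated(0)
    {
        assert(cellsPerSuperblock > 0);
    }

    ~CellPool()
    {
        for (std::size_t i = 0; i < m_superblocks.size(); ++i)
            ::operator delete(m_superblocks[i]);
    }

    void* allocate()
    {
        if (!m_free)
            addSuperblock();      // free list exhausted: carve a new superblock
        Cell* c = m_free;
        m_free = c->next;
        ++m_allocated;
        return c;
    }

    void deallocate(void* p)
    {
        assert(p && m_allocated > 0);
        Cell* c = new (p) Cell;   // reuse the cell's storage as a free-list link
        c->next = m_free;
        m_free = c;
        --m_allocated;
    }

    std::size_t cellsAllocated() const { return m_allocated; }

private:
    struct Cell { Cell* next; };

    std::size_t m_cellSize;
    std::size_t m_cellsPerSuperblock;
    Cell* m_free;
    std::size_t m_allocated;
    std::vector<void*> m_superblocks;

    // Round up to a multiple of the link size so every cell can hold a link.
    static std::size_t roundUp(std::size_t n)
    {
        std::size_t unit = sizeof(Cell);
        return n == 0 ? unit : ((n + unit - 1) / unit) * unit;
    }

    void addSuperblock()
    {
        char* block = static_cast<char*>(
            ::operator new(m_cellSize * m_cellsPerSuperblock));
        m_superblocks.push_back(block);
        for (std::size_t i = 0; i < m_cellsPerSuperblock; ++i) {
            Cell* c = new (block + i * m_cellSize) Cell;
            c->next = m_free;
            m_free = c;
        }
    }

    CellPool(const CellPool&);             // not copyable
    CellPool& operator=(const CellPool&);
};
```

Allocation and deallocation are each a handful of pointer operations, which is what makes a pool of this kind attractive for large numbers of short-lived small blocks.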

Node classes have been abolished, and the garbage collector now treats
all nodes in the same way. In particular, following garbage collection,
all unused nodes are deallocated back to CXXR::Heap. (CR
deallocates only large vector nodes.)

The data of a vector object now resides in a separate block allocated
from CXXR::Heap; a data member m_data of RObject
(in due course to be factored out into a derived class) points to this
block. For non-vector objects, and vectors of size zero, m_data
is a null pointer. (CR appears to allocate at least 8 bytes of vector
data even when the nominal size of the vector is zero.)

In CR, decisions about when to garbage collect, and how many
generations to collect, are based (a) on the total number of nodes of
classes 0-6, and (b) the total size of the vector data in nodes of class
7 (reckoned in units of 8 bytes). In CXXR the same logic is used, but
based (a) on the total number of nodes, and (b) the total number of
bytes currently allocated from CXXR::Heap, divided by 8.

I was strongly tempted to base GC exclusively on (b), and to ignore
the number of nodes - after all, we're talking about a single resource
here: memory. I'd welcome opinions about this.

Phase 7: 0.07-2.5.1

The purpose of this phase was to encapsulate all the garbage-collection
logic within C++ classes. Five such classes were introduced, namely GCManager,
GCNode, GCEdge, GCRoot and WeakRef,
as now described.

Class GCManager, as the name implies, carries out
high-level management of garbage collection. It has no non-static data
or methods. When CXXR::Heap indicates (via a
callback) that it is on the point of requesting additional memory from
the operating system, method GCManager::gc() decides
whether to carry out a garbage collection, and if so how many
generations to collect. As contemplated at tag 0.06-2.5.1, this decision
is now based only on the total memory allocated via CXXR::Heap,
and not on the number of nodes allocated. If GCManager
decides to carry out a garbage collection, this is carried out by
calling GCNode::gc(), specifying the number of generations
to be collected.

Class GCNode is intended to be the base class for all
objects subject to garbage collection; RObject is now
derived from GCNode. All GCNodes are
threaded on circular doubly-linked lists according to their generation,
managed via the static private vector s_genpeg.
Element 0 of this vector represents the 'new' generation of nodes that
have not yet been exposed to the garbage collector; nodes that survive
garbage collection are moved into successively higher generations.

Templated class GCEdge<T>, where T
(defaulting to RObject*) is a pointer to a class type
derived from GCNode, represents a directed edge within the
directed graph whose nodes are the GCNodes. Whenever an
object of a type derived from GCNode wishes to refer to
another such object, it should do so by incorporating a GCEdge
encapsulating an appropriate pointer, rather than by incorporating the
pointer directly. The class provides for GCEdge<T>
to be implicitly converted to T in contexts which require
this.

GCEdge contains the logic for ensuring that a node in an
older generation never includes a reference to a node in a younger
generation. If any attempt is made to direct a GCEdge
from an older node to a younger node, that younger node is immediately
promoted to the generation of the older node, and this change is
propagated through the outgoing GCEdges of the younger
node, and so on recursively. In other words, it implements the EXPEL_OLD_TO_NEW
logic that can be configured into CR, but is not the default there.
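
The promotion rule can be sketched as below. NodeSketch, promote and retarget are hypothetical names: the sketch represents edges as plain pointers in a vector rather than as GCEdge objects, and leaves out the generation lists entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GCNode: a generation number plus outgoing edges.
struct NodeSketch {
    unsigned gen;                   // 0 is the youngest generation
    std::vector<NodeSketch*> out;   // targets of outgoing edges (may be null)
    NodeSketch() : gen(0) {}
};

// Raise node, and everything reachable from it, to at least generation gen.
inline void promote(NodeSketch* node, unsigned gen)
{
    if (!node || node->gen >= gen)
        return;                     // already old enough; this also stops cycles
    node->gen = gen;
    for (std::size_t i = 0; i < node->out.size(); ++i)
        promote(node->out[i], gen);
}

// Write barrier: point edge number slot of from at to, promoting to first
// so that no older node ever refers to a younger one.
inline void retarget(NodeSketch* from, std::size_t slot, NodeSketch* to)
{
    assert(from && slot < from->out.size());
    promote(to, from->gen);
    from->out[slot] = to;
}
```

Setting the generation before recursing is what guarantees termination on cyclic structures: a node reached a second time is already old enough and is skipped.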

Templated class GCRoot<T>, where T
(defaulting to RObject*) is a pointer to a class type
derived from GCNode, is intended to protect GCNodes
from the garbage-collector. A GCNode pointed to by a GCRoot
will not be garbage collected for as long as the GCRoot
object exists. The constructor and destructor of this class therefore
perform similar functions to the PROTECT/UNPROTECT
macros of CR, but within a C++ idiom, in which the programmer is spared
the need to check that PROTECTs are balanced by UNPROTECTs.
(However, PROTECT and UNPROTECT continue,
and will continue, to be available within CXXR.) The class provides for
GCRoot<T> to be implicitly converted to T
in contexts which require this.

The implementation of GCRoot uses an internal stack,
and consequently requires (and checks) that GCRoots are
destroyed in the reverse order of their creation. This should cause no
problem as long as only variables with automatic or static storage
duration are declared as GCRoots.

Despite successful experiments, the deployment of this class has been
deferred, pending the replacement of setjmp/longjmp
within CXXR by C++ exceptions. This is because destructors of C++
automatic variables are not called when the stack is unwound by longjmp
(see ISO14882:2003 sec. 18.7); they are when the stack is unwound by a
C++ exception.
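
The idea behind GCRoot can be sketched as an RAII wrapper over a shared stack of protected pointers. The class names below are hypothetical and the sketch omits any interaction with a real collector; it shows only the constructor/destructor pairing and the reverse-order check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct GCNodeStub { int payload; };   // stand-in for GCNode

// Untemplated base holding the shared stack of protected pointers.
class RootBaseSketch {
public:
    static std::size_t protectedCount() { return stack().size(); }

protected:
    explicit RootBaseSketch(GCNodeStub* node) : m_index(stack().size())
    {
        stack().push_back(node);
    }

    ~RootBaseSketch()
    {
        // Roots must be destroyed in the reverse order of their creation.
        assert(m_index + 1 == stack().size());
        stack().pop_back();
    }

    GCNodeStub* node() const { return stack()[m_index]; }

private:
    std::size_t m_index;

    static std::vector<GCNodeStub*>& stack()
    {
        static std::vector<GCNodeStub*> s;
        return s;
    }

    RootBaseSketch(const RootBaseSketch&);             // not copyable
    RootBaseSketch& operator=(const RootBaseSketch&);
};

// Typed wrapper: converts implicitly to the pointer type it protects.
template <class T = GCNodeStub*>
class RootSketch : public RootBaseSketch {
public:
    explicit RootSketch(T node) : RootBaseSketch(node) {}
    operator T() const { return static_cast<T>(node()); }
};
```

A declaration such as RootSketch<> r(&n); protects n until r goes out of scope, provided the scope is left normally or by a C++ exception; a longjmp past it would skip the destructor, which is precisely the problem noted above.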

Class WeakRef implements weak references (SEXPTYPE
WEAKREFSXP) in a way intended to be functionally identical to
CR. Each weak reference has a key and, optionally, a value and/or a
finalizer. The finalizer may either be a C/C++ function or an R object.

The garbage collector will consider the value and finalizer to be
reachable provided the key is reachable. If, during a garbage
collection, the key is found not to be reachable then the finalizer
(if any) will be run, and the weak reference object will be
'tombstoned', so that subsequent calls to key() and value() will
return null pointers. A weak reference object with a reachable key
will not be garbage collected even if the weak reference object is not
itself reachable.

Note that, in CXXR, weak references are not implemented as
four-element vectors, and the class has separate, appropriately typed
fields for R and C/C++ finalizers (though at most one of these fields
may be used in any particular WeakRef object).

Phase 8: 0.08-2.5.1

All uses of setjmp and longjmp (and sigsetjmp
and siglongjmp) within directory main have
been removed, and replaced by using JMPException, a C++
exception class designed as far as possible to be a drop-in replacement
for setjmp/longjmp. This is to ensure that
the destructors of C++ objects are invoked as the stack is unwound
following an exceptional condition.

Use of JMPException should be regarded as an interim
measure. Normal C++ coding practice is for throw simply
to report the exceptional condition that has arisen, rather than - as
with JMPException - in effect requesting a specific
subsequent flow of control.
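
The general shape of such a replacement can be sketched as follows. This is not the JMPException class itself: JumpSketch, ContextStub and the other names are hypothetical, and the real class may carry different data. The sketch shows an exception that names its destination, a catch site that rethrows jumps aimed elsewhere, and a destructor that runs during unwinding.

```cpp
#include <cassert>

struct ContextStub { int id; };   // hypothetical stand-in for RCNTXT

// Hypothetical exception carrying what longjmp would have needed:
// the context to resume in, and a value to deliver there.
class JumpSketch {
public:
    JumpSketch(ContextStub* target, int value)
        : m_target(target), m_value(value)
    {
        assert(target);
    }
    ContextStub* target() const { return m_target; }
    int value() const { return m_value; }

private:
    ContextStub* m_target;
    int m_value;
};

// Counts destructor calls, to show that unwinding runs them.
struct Tracker {
    int& count;
    explicit Tracker(int& c) : count(c) {}
    ~Tracker() { ++count; }
};

// Plays the role of code that would formerly have called longjmp.
inline void deepCall(ContextStub* target, int& destroyed)
{
    Tracker t(destroyed);
    throw JumpSketch(target, 1);
}

// Plays the role of a former setjmp site belonging to context self.
inline int runInContext(ContextStub* self, ContextStub* target, int& destroyed)
{
    try {
        deepCall(target, destroyed);
    } catch (const JumpSketch& e) {
        if (e.target() != self)
            throw;              // aimed at an outer context: keep unwinding
        return e.value();       // corresponds to setjmp returning non-zero
    }
    return 0;                   // normal completion
}
```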

The preferred way for C++ code to protect GCNodes from
the garbage collector is now to use the templated class GCRoot.
GCRoot's constructor will protect the GCNode
in question, and its destructor will unprotect it; there is therefore no
need for the programmer to remember to balance out the use of PROTECT
and UNPROTECT as in CR.

The facilities of CR's pointer protection stack (using e.g. PROTECT
and UNPROTECT) remain available, but the underlying
implementation has been rewritten in C++ as part of the GCRootBase
class. CXXR makes the additional requirement that when UNPROTECT
or REPROTECT is applied to a pointer, this is carried
out in the same context (RCNTXT) as that in which the
pointer was PROTECTed. This is to help pick up
mispairing between PROTECT and UNPROTECT.

Various CR header files, particularly Rinternals.h and Defn.h,
contain macro definitions of the form

#define func Rf_func

These serve to avoid name clashes (at least at the linker level) with
third-party packages; a similar purpose would be achieved in C++ by
placing the function func in a namespace Rf.
(In Phase 2 these macros were generally shifted into a separate
header file Rf_namespace.h, but this change has now been
reversed.) Using the preprocessor to modify program tokens in this way
is something that many C++ programmers will shun, especially since some
of the tokens concerned (e.g. length) are likely to be
widely used. However abolishing these macros altogether would break
much existing code. Nevertheless, reliance on them is now deprecated
within CXXR, and in particular all header files within src/include
have been modified as necessary to include the Rf_
prefix explicitly where it is needed.

Phase 9: 0.09-2.6.1

The primary objective of this phase was to update the program to parallel
release 2.6.1 of R.

Other changes were as follows:

In previous work, the tendency has been progressively to move
function prototypes from CR's header files into the relevant
class-oriented (or at least data-type-oriented) header files within include/CXXR,
and at the same time to add doxygen documentation. This has now been
modified into a policy of copying the prototypes into the
relevant CXXR header file, and adding documentation there, but leaving
the prototype also in the CR header file. This will make it easier to
track changes in function signatures when we upgrade to future releases
of R. To this end a script allincludes.pl
has been produced. This generates an (otherwise trivial) C++ source file
that #includes all the header files under src/main
and src/include; compiling this file checks that the
prototypes in the CXXR header files are consistent with those in the CR
headers.

In the light of this change, the policy regarding the Rf_
prefix described under Phase 8 has been modified. Whilst all header
files in the CXXR directory should use the Rf_
prefix explicitly, header files derived from CR (e.g. Rinternals.h
and Defn.h) should normally omit the prefix if the
corresponding CR file does so.

All macros with arguments have been removed from the header files in
the CXXR directory.

All C-style casts have been removed from the C++ code. (Unfortunately,
under some Linuxen at least, standard signal constants such as SIG_DFL
are defined as macros in terms of C-style casts, so main.cpp
still gives warnings if compiled using gcc with -Wold-style-cast.)

Phase 10: 0.10-2.6.1

The primary objective of this phase was to reimplement all vector data
types as C++ classes derived (directly or indirectly) from RObject,
rather than using vecsxp_struct within the RObject::u
union. vecsxp_struct has not yet been eliminated entirely,
however, because of some straggling uses of truelength.

Other changes were as follows:

The memory blocks allocated by R_alloc and kindred
functions are no longer implemented as objects inheriting from RObject.
Instead these blocks are managed separately via a new class RAllocStack.
When the stack size is reduced using vmaxset, the memory
blocks are released immediately, rather than being left to the garbage
collector.

The levels of valgrind instrumentation have been modified somewhat, as
explained in the porting page.

A concept of 'infant immunity' was introduced into garbage collection:
see the GCNode
documentation. Roughly speaking, this means that an object of a
class derived from GCNode is immune from garbage
collection while it is being constructed, leading to considerable
simplification.

The templated class GCEdge was abolished: it was felt
that the advantage of encapsulating the write barrier within a single
class was outweighed by various knock-on obscurities.

Functions HASHASH, SET_HASHASH and SET_HASHVALUE
have been abolished: the new class CXXR::String will compute and
cache hash values automatically on demand.

CXXR generally prepared for its first public release, particularly by
improving documentation.

Phase 11: 0.11-2.6.2

The primary objective of this phase was to update the program to parallel
release 2.6.2 of R. Errors and warnings given by make check-devel
were also corrected.

Phase 12: 0.12-2.6.2

The primary objective of this phase was to eliminate the RObject::u
union completely, replacing its remaining elements with classes derived
from RObject. This entailed the creation of the following
classes: BuiltInFunction, ByteCode, Closure,
DottedArgs, Environment, Expression,
ExternalPointer, PairList, Promise,
SpecialSymbol and Symbol. Several loose ends
remain to be tied up, however; in particular, the remaining data members
of RObject ought all to be private.

Other changes were as follows:

Class CXXR::Heap has been renamed CXXR::MemoryBank
to avoid confusion with standard data structures called heaps.

Classes GCNode and GCRootBase are now
initialized using a Schwarz counter, thus enabling certain standard
objects (e.g. the 'not available' string, and the global environment) to
be declared as static class members: it is no longer necessary to wait
until InitMemory() has been called before creating them.
This in turn simplifies the implementation of the garbage collection
algorithm, which no longer has to treat these objects specially.
Concomitant with this change, the R interpreter now terminates by
throwing an exception of class ExitException, which
ensures that all GCRoot objects are destroyed in the
reverse order of their creation.
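
For readers unfamiliar with the Schwarz counter (also known as the 'nifty counter') idiom, the following single-file sketch shows the general technique. It is not taken from the CXXR sources: the namespace sketch and the names Registry, RegistryInit and Client are hypothetical, and in real use the parts marked as belonging to the header and to the source file would be separated.

```cpp
#include <cassert>
#include <new>

namespace sketch {
    // Hypothetical class whose shared state must exist before any other
    // static object uses it.
    struct Registry {
        int clients = 0;
    };

    Registry& registry();

    // In real use this struct and the static object below live in the
    // header, so every translation unit including it gets its own
    // RegistryInit object, constructed before that unit's other statics.
    struct RegistryInit {
        RegistryInit();
        ~RegistryInit();
    };
    static RegistryInit registry_init;

    // In real use the remainder lives in a single source file.
    static int init_count;  // zero-initialised before any constructor runs
    alignas(Registry) static unsigned char storage[sizeof(Registry)];
    static Registry* instance;

    Registry& registry() { return *instance; }

    // The first RegistryInit to be constructed builds the Registry ...
    RegistryInit::RegistryInit() {
        if (init_count++ == 0) instance = new (storage) Registry();
    }

    // ... and the last one to be destroyed tears it down.
    RegistryInit::~RegistryInit() {
        if (--init_count == 0) instance->~Registry();
    }

    // A static object that safely uses the registry in its constructor.
    struct Client {
        Client() { ++registry().clients; }
    };
    static Client first_client;
}
```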

String objects now belong to one of two subclasses, CachedString
and UncachedString, with the former being the preferred
implementation. At any time, at most one CachedString with
given text and encoding will exist; to enforce this, the class
constructor is private, and instead clients use the static method obtain()
(accessible from C via the function mkChar()) to get a
pointer to a CachedString object with specified text and
encoding. The implementation of the cache is different from that used in
CR, and is based on the C++ standard library; it has the advantage that
cached strings do not need any special handling by the garbage
collector. There are no facilities for modifying the text or encoding of
a CachedString once it has been created; in particular the
function CHAR_RW() can be used only on UncachedString
objects.
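
The 'private constructor plus static obtain()' arrangement can be sketched as below. This is a minimal illustration under stated assumptions, not the CXXR implementation: the names InternedString and Encoding are hypothetical, the cache here is a std::map keyed on text and encoding, and the cache simply owns its strings, whereas CXXR's strings are managed by the garbage collector.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

enum class Encoding { Native, Latin1, UTF8 };  // hypothetical set of encodings

// Hypothetical sketch: at most one object exists per (text, encoding) pair.
class InternedString {
public:
    // The only way for clients to get an object: look it up, creating it
    // on first request.
    static const InternedString* obtain(const std::string& text,
                                        Encoding enc = Encoding::Native) {
        Key key(text, enc);
        auto& c = cache();
        auto it = c.find(key);
        if (it == c.end()) {
            std::unique_ptr<InternedString> fresh(new InternedString(text, enc));
            it = c.emplace(key, std::move(fresh)).first;
        }
        return it->second.get();
    }

    // No mutators are provided: text and encoding are fixed at creation.
    const std::string& text() const { return m_text; }
    Encoding encoding() const { return m_encoding; }

private:
    using Key = std::pair<std::string, Encoding>;

    InternedString(const std::string& text, Encoding enc)
        : m_text(text), m_encoding(enc) {}

    static std::map<Key, std::unique_ptr<InternedString>>& cache() {
        static std::map<Key, std::unique_ptr<InternedString>> the_cache;
        return the_cache;
    }

    std::string m_text;
    Encoding m_encoding;
};
```

Because equal strings share one object, clients of such a class can compare strings by comparing pointers.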

Phase 13: 0.13-2.6.2

This phase was an attempt - less successful than was hoped! - to close
the gap in speed between CR and CXXR. Principal changes were:

Small blocks of memory are now allocated from preallocated pools
controlled by a new class CellHeap. CellHeap
differs from CellPool (used previously for this purpose)
in that whenever a memory block is requested from a CellHeap,
the allocated block will always be the one with the lowest address among
the available blocks. This is achieved using a skew heap data structure,
and is intended to increase the spatial localisation of successively
allocated blocks. Where the underlying OS provides posix_memalign(),
the superblocks from which memory blocks are allocated are aligned with
memory pages.
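
The lowest-address-first policy can be illustrated with a small skew heap whose nodes are stored in the free cells themselves. This is a sketch of the general technique only: the names FreeCell, meld and CellHeapSketch are hypothetical, and superblock management, page alignment and block sizes are all omitted.

```cpp
#include <cassert>
#include <functional>
#include <new>
#include <utility>

// A free cell doubles as a skew heap node, so the heap needs no extra memory.
struct FreeCell {
    FreeCell* left;
    FreeCell* right;
};

// Merge two skew heaps ordered by address, lowest address at the root.
inline FreeCell* meld(FreeCell* a, FreeCell* b) {
    if (!a) return b;
    if (!b) return a;
    if (std::less<FreeCell*>()(b, a)) std::swap(a, b);  // a now has the lower address
    FreeCell* merged = meld(a->right, b);
    a->right = a->left;  // swapping children keeps the heap balanced on average
    a->left = merged;
    return a;
}

class CellHeapSketch {
public:
    // Return a cell (at least sizeof(FreeCell) bytes, suitably aligned)
    // to the heap.
    void release(void* cell) {
        FreeCell* node = new (cell) FreeCell{nullptr, nullptr};
        m_root = meld(m_root, node);
    }

    // Hand out the free cell with the lowest address, or a null pointer
    // if no cell is free.
    void* allocate() {
        if (!m_root) return nullptr;
        FreeCell* cell = m_root;
        m_root = meld(cell->left, cell->right);
        return cell;
    }

private:
    FreeCell* m_root = nullptr;
};
```

Whatever the order in which cells are released, successive calls to allocate() return them in increasing address order.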

MemoryBank now uses CellHeaps with more
closely spaced block sizes than were used previously, to avoid wasting
space in cache lines.

When a GCNode object has its generation changed as a
result of write barrier enforcement or by being exposed to the garbage
collector, it is no longer immediately shifted to the list appropriate
to its new generation. Instead this is deferred until the sweep phase of
a garbage collection visits the node. This avoids pulling nodes into the
processor cache unnecessarily, and paves the way for the following
change.

The lists via which GCNode manages garbage
collection are now singly-linked rather than doubly-linked. This and
other changes mean that the size of a PairList node (cons
cell) has been reduced (on 32-bit architecture) from 40 bytes to
32 bytes. DumbVector nodes have been reduced in size by
12 bytes.

The garbage collection algorithm now endeavours as far as possible to
deallocate nodes in the reverse order of allocation. This is because
class CellHeap works particularly efficiently if memory
blocks are released in decreasing address order.

The protocol by which newly created GCNode objects are
exposed to the garbage collector has been simplified and streamlined to
avoid pulling nodes into the cache unnecessarily. First, GCNode::expose()
exposes only the node for which it is invoked; it does not look for
unexposed descendants of this node. Secondly, protecting a node from the
garbage collector (e.g. using GCRoot<T> or PROTECT())
no longer automatically exposes the node. (However, write barrier
enforcement will continue to expose nodes if an exposed node is modified
to refer to an unexposed node, and this exposure will propagate to
descendants: this falls out automatically from the write barrier
enforcement algorithm.)

A bug whereby certain nodes were never exposed to GC has been
corrected.

GCNode::operator new no longer zeroes the memory it
allocates.

Phase 14: 0.14-2.7.1

The objective of this phase was to update CXXR to parallel release 2.7.1
of R. Other changes were as follows:

We have eliminated several uses of dynamic_cast from the
'glue layer' between code inherited from CR and new CXXR code. (dynamic_cast
can be surprisingly slow.)

SET_TYPEOF() has been abolished.

R_NilValue is now defined as a macro expanding to NULL
(which will in turn typically expand to (void*)0 in C and
simply to 0 in C++). Previously it was defined as

SEXP R_NilValue = 0;

which necessitated unnecessary memory fetches.

Phase 15: 0.15-2.7.1

The objective of this phase was to tidy up the class hierarchy rooted at
RObject, and in particular to give RObject
itself a more distinctive class identity, i.e. for it to be less of a
ragbag for things that hadn't yet been accommodated elsewhere. Principal
changes were:

Class RObject now controls attributes more closely. The
attributes (if present) must now be a PairList, each of
whose elements must have a distinct symbol as its tag. No attribute may
have a null value. The m_has_class field is automatically
set according to whether or not there is a class attribute; consequently
SET_OBJECT() has been abolished. However, the class
interface does not yet enforce all necessary consistency conditions on
attributes; these are still applied by the code in attrib.cpp.

The m_debug field of RObject has been
abolished. Instead the Closure and Environment
classes each contain a field controlling debugging.

The m_trace field of RObject has been
moved to a new class FunctionBase, from which the Closure
and BuiltInFunction classes are now derived.

The m_flags field of RObject, which
replaced the gp ('general purpose') field within sxpinfo_struct,
has been abolished. It has been replaced by various special-purpose
fields, placed as far down the class hierarchy as is practical at
present. A virtual function packGPBits() is used to
reconstitute the old gp ('levels') word for the sole purpose of
serialization; virtual function unpackGPBits() is
correspondingly used during deserialization. (However, not all of the
fields that have replaced m_flags need to be
serialized/deserialized.)

A new class HandlerEntry, defined locally within errors.cpp,
is used to handle error handler entries, rather than using a ListVector
for this purpose. This avoids the former use of the m_flags
field here.

Code inherited from CR is apt to hand out non-const
pointers to objects that really ought to be immutable, R_UnboundValue
for example. To counter this, RObject now has a Boolean
field m_frozen: non-const member functions in
the RObject hierarchy can now apply a run-time check that
their object has not been frozen. In particular, attempting to change
the attributes of a frozen object gives rise to an error.
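
A run-time check of this kind might look like the following sketch. The class FreezableSketch, its members and the choice of std::logic_error are all assumptions made for illustration; in CXXR the check would result in an R error rather than this exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical sketch of an object that can be frozen at run time.
class FreezableSketch {
public:
    void freeze() { m_frozen = true; }
    bool frozen() const { return m_frozen; }

    // const member functions need no check ...
    const std::string& label() const { return m_label; }

    // ... but every non-const one verifies that the object is not frozen.
    void setLabel(const std::string& label) {
        errorIfFrozen();
        m_label = label;
    }

private:
    void errorIfFrozen() const {
        if (m_frozen)
            throw std::logic_error("attempt to modify a frozen object");
    }

    std::string m_label;
    bool m_frozen = false;
};
```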

String is now an abstract class. CachedString
objects are now frozen by the constructor. R_NaString is
also frozen.

Class SpecialSymbol has now been merged into Symbol.
Entities such as R_UnboundValue, which were formerly
implemented as SpecialSymbol objects, are now implemented
as frozen Symbols.

Phase 16: 0.16-2.7.2

The objective of this phase was to update CXXR to parallel release 2.7.2
of R.

An innovation in carrying out this phase was the introduction of a
Perl script uncxxr.pl. Where a source file inherited from
CR - foo.c, say - has been adapted for CXXR (and changed
into a C++ file foo.cpp in the process), this script
endeavours as far as possible to reverse systematic changes (e.g. the
conversion of C-style casts into C++ casts) to generate a quasi-C file foo.bakc.
(We say 'quasi-C' file because the resulting file may not be
syntactically correct C: it is intended for human eyes only.) Updating
to a new release of R is facilitated by using a 3-way visual diff
between the release of foo.c currently shadowed by CXXR,
the new release of foo.c, and foo.bakc. This
helps to highlight where the significant changes are in the new release
of foo.c, and where they might conflict with changes made
in CXXR. (A similar 3-way comparison using foo.cpp instead
of foo.bakc throws up too much 'noise'.)

Some changes have been made to the CXXR files, particularly in the use
of whitespace, to improve the effectiveness of uncxxr.pl.
However, this has so far only been done for C++ source files that needed
to be changed in any case as part of the upgrade to 2.7.2.

Phase 17: 0.17-2.7.2

The primary purpose of this phase was to reimplement the functionality of
duplicate1() in duplicate.cpp using class copy
constructors and a virtual function RObject::clone(),
reimplemented as necessary in derived classes. The following changes were
associated with this:

GCNode::expose() is once again recursive in effect, thus
reversing a change made in Phase 13. Cloning a node often requires
cloning an entire subgraph of the node graph, via recursive
calls of clone() to copy subobjects. The approach taken is
that while the copy subgraph is under construction, none of its
constituent nodes is exposed to the garbage collector: in particular clone()
itself does not expose the objects it creates to the collector. Only
when the copy subgraph is complete is the whole subgraph exposed, and to
do this the code that called the 'topmost' clone() must
then apply the newly-recursive expose() function to the
pointer that clone() returned. (Trying to expose nodes
individually as the construction proceeded meant that they were at risk
of being snatched away by the garbage collector before the subgraph was
complete: it is difficult to work around this in a way that sits easily
with C++ programming idioms.)

GCNode::devolveAge(), used in enforcing the write
barrier, has been renamed propagateAge(), and this
function remains recursive in effect. However, at the time of call, propagateAge(const
GCNode* node) changes the generation number only of node
(if necessary); the recursive propagation of this change is deferred
until the start of the next garbage collection. (Unfortunately the same
technique cannot be applied to expose() for a reason
explained in its documentation.)

Not all classes derived from RObject are clonable, and
for unclonable types, clone() returns a null pointer. When
a copy constructor copies a pattern object containing a subobject of an
unclonable type, the object constructed will at the appropriate point
simply contain a pointer to the subobject of the pattern object, rather
than to a clone of that subobject. This copying logic is encapsulated in
a templated 'smart pointer' type RObject::Handle<T>,
and for example the 'car' pointer of a PairList object is
now a Handle<RObject>. Similarly, the former
templated class EdgeVector<T> has been replaced by HandleVector<T>
which - as the name suggests - is implemented using a std::vector<CXXR::RObject::Handle<T> >.
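
The 'clone if possible, otherwise share' copying logic can be sketched as follows. This is not the CXXR code: NodeSketch, CellSketch, UniqueSketch and HandleSketch are hypothetical names, and std::shared_ptr is used here purely so that the sketch manages memory by itself, whereas CXXR relies on its garbage collector and plain pointers.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical base class: clone() returns a null pointer by default,
// meaning 'this type is unclonable'.
struct NodeSketch {
    virtual ~NodeSketch() = default;
    virtual std::shared_ptr<NodeSketch> clone() const { return nullptr; }
};

// A clonable type implements clone() via its copy constructor.
struct CellSketch : NodeSketch {
    explicit CellSketch(int v) : value(v) {}
    std::shared_ptr<NodeSketch> clone() const override {
        return std::make_shared<CellSketch>(*this);
    }
    int value;
};

// An unclonable type inherits the default clone().
struct UniqueSketch : NodeSketch {};

// Smart pointer whose copy constructor encapsulates the copying logic.
template <class T>
class HandleSketch {
public:
    explicit HandleSketch(std::shared_ptr<T> target = nullptr)
        : m_target(std::move(target)) {}

    HandleSketch(const HandleSketch& pattern)
        : m_target(cloneOrShare(pattern.m_target)) {}

    HandleSketch& operator=(const HandleSketch&) = delete;

    const std::shared_ptr<T>& get() const { return m_target; }

private:
    // Point to a clone of the pattern's target if there is one; otherwise
    // point to the very same object as the pattern does.
    static std::shared_ptr<T> cloneOrShare(const std::shared_ptr<T>& target) {
        if (!target) return nullptr;
        std::shared_ptr<NodeSketch> copy = target->clone();
        if (copy) return std::static_pointer_cast<T>(copy);
        return target;
    }

    std::shared_ptr<T> m_target;
};
```

With handles as data members, the compiler-generated copy constructor of a containing class then does the right thing for each subobject.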

Phase 18: 0.18-2.8.1

The objective of this phase was to update CXXR to parallel release 2.8.1
of R.

The uncxxr.pl script (see Phase 16) has been somewhat
further developed, and a larger number of C++ files derived directly
from CR have been tweaked so that uncxxr.pl can
back-convert them more accurately to their CR form.

Within C++ files derived directly from CR, reinterpret_cast
has been replaced by static_cast wherever this is possible
without artifice. This has been facilitated by the introduction of a
function CXXR_alloc, which does the same job as R_alloc,
but - like malloc but unlike R_alloc -
returns void* rather than char*. (uncxxr.pl
converts CXXR_alloc back to R_alloc.)
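
The difference the return type makes is shown in the sketch below. The functions allocAsChar and allocAsVoid are hypothetical stand-ins that merely wrap std::malloc (so the caller must free the result); they are not R_alloc and CXXR_alloc themselves, and only mimic their return types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for the two return conventions.
inline char* allocAsChar(std::size_t n, int size) {
    return static_cast<char*>(std::malloc(n * size));
}

inline void* allocAsVoid(std::size_t n, int size) {
    return std::malloc(n * size);
}

inline double* doublesViaChar(std::size_t n) {
    // char* is unrelated to double*, so only reinterpret_cast will compile.
    return reinterpret_cast<double*>(allocAsChar(n, sizeof(double)));
}

inline double* doublesViaVoid(std::size_t n) {
    // void* converts to any object pointer type with static_cast.
    return static_cast<double*>(allocAsVoid(n, sizeof(double)));
}
```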

Phase 19: 0.19-2.8.1

The primary purpose of this phase was to refactor environments, to pave
the way for introducing provenance-tracking features into R. The following
changes were associated with this:

The C++ Symbol class now enforces the requirement that
(except for certain special Symbols), there is at most one
Symbol with a given name. (CR enforces a similar
requirement, but less comprehensively, using the install()
function.) To facilitate this, it is now a requirement that a Symbol's
name be a CachedString object, rather than any String
object.

In CR (and formerly in CXXR), SYMSXP objects contained a
pointer to an arbitrary object, which was considered to be the Symbol's
value within R's base environment and base namespace. Objects of the C++
Symbol class no longer contain such a pointer, and the base
environment and base namespace are implemented in exactly the same way
as other Environment objects.

Similarly, in CR (and formerly in CXXR), SYMSXP objects
contained a pointer to an R object of a function type, which was used
when the Symbol was used as the name of a function invoked
via R's .Internal() interface. Objects of the C++ Symbol
class no longer contain such a pointer; instead the relevant mapping is
defined by the C++ class DotInternalTable.

The 'global cache' of Environments on the search path
has been abolished, at least for the time being.

A new C++ class Frame has been introduced, inheriting
from GCNode but not from RObject. A Frame
defines a mapping from Symbol objects to arbitrary RObjects.

Each Environment object now contains a pointer to a Frame
object, which defines its 'local frame'. The base environment and the
base namespace have the same Frame.

Frame itself is an abstract class, allowing different
implementations along the lines provided by the RObjectTables
package to be achieved simply by class inheritance. In most cases,
however, the concrete class StdFrame is used, in which the
mapping from Symbols to RObjects is provided
by a hash table, implemented using class unordered_map
from the TR1 extensions to the C++ standard library. This
implementational detail is not made visible to R code.
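
The relationship between the abstract and concrete frame classes can be sketched as below. The names ending in Sketch and the three member functions are hypothetical, chosen for illustration rather than copied from CXXR, and std::unordered_map from C++11 is used in place of the TR1 class. Keying the table on the Symbol's address relies on the rule, stated above, that there is at most one Symbol with a given name.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct SymbolSketch { std::string name; };  // stand-in for Symbol
struct ValueSketch { double number; };      // stand-in for RObject

// Abstract interface: a mapping from symbols to values.
class FrameSketch {
public:
    virtual ~FrameSketch() = default;
    // Returns the bound value, or a null pointer if the symbol is unbound.
    virtual ValueSketch* lookup(const SymbolSketch* symbol) const = 0;
    virtual void bind(const SymbolSketch* symbol, ValueSketch* value) = 0;
    // Returns true if a binding was removed.
    virtual bool erase(const SymbolSketch* symbol) = 0;
};

// Concrete implementation backed by a hash table.
class StdFrameSketch : public FrameSketch {
public:
    ValueSketch* lookup(const SymbolSketch* symbol) const override {
        auto it = m_map.find(symbol);
        return it == m_map.end() ? nullptr : it->second;
    }
    void bind(const SymbolSketch* symbol, ValueSketch* value) override {
        m_map[symbol] = value;
    }
    bool erase(const SymbolSketch* symbol) override {
        return m_map.erase(symbol) != 0;
    }

private:
    std::unordered_map<const SymbolSketch*, ValueSketch*> m_map;
};
```

Client code written against FrameSketch is unaffected if a different concrete class is substituted.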

The interface to MemoryBank::allocate() has been changed
to allow the caller to specify that the call shall not result in a
garbage collection. Class CXXR::Allocator uses this to
ensure that manipulations of standard containers using CXXR::Allocator
do not result in reentrant calls to the standard library code, which
might otherwise happen if the garbage collector attempted to delete
objects handled by the container.

Phase 20: 0.20-2.8.1

The purpose of this phase was extensively to reengineer garbage
collection. This was to pave the way to experimentation with
reference-counting approaches to garbage collection; however, release 0.20-2.8.1
itself still uses generational mark-sweep. A major change has been in the
way of implementing 'infant immunity', whereby nodes that are under
construction are not liable to garbage collection; the following is a
summary of the way in which this has evolved. The phrase 'infant nodes'
means nodes that are either under construction, or whose construction is
complete but which have not yet been exposed to garbage collection by
calling GCNode::expose().

In previous releases, infant nodes were simply ignored during the
sweep phase of a mark-sweep collection, and so left in place. This had
the disadvantage that the infant immunity did not automatically extend
to subobjects of an infant node. In the PairList copy
constructor, for example, the copied list was created working forwards
along the pattern list, but the whole structure of the copied list
would then need to be traversed again to expose its nodes to garbage
collection. (This was achieved by having GCNode::expose()
automatically recurse to subobjects.)

An alternative approach explored in the development of release 0.20-2.8.1
was to regard infant nodes as reachable during mark-sweep. So, during a
mark-sweep garbage collection, all the infant nodes and their
descendants would automatically be marked. So the PairList
copy constructor can expose the second and subsequent nodes of the
copied list immediately it has created them, leaving only the head of
the list unexposed, and thus conferring immunity from garbage collection
on the whole structure. There is no longer any need for expose()
to recurse to subobjects. The snag with this approach was that during
the mark phase, the Marker visitor could invoke the visitReferents()
method of objects whose construction is not yet complete, and which may
therefore contain junk pointers. Obviously, if a visitor was directed to
a junk address, that would probably crash the interpreter. The
workaround for this was to have GCNode::operator new zero
out the memory it allocated for new GCNode objects, so
that instead of junk pointers, an object under construction would
contain null pointers, which visitReferents() could
readily detect. However, this zeroing of memory was time consuming (and
wouldn't immediately be portable to some strange hardware architectures
in which null pointers are not represented by binary zero).

The approach finally adopted is simply for class GCNode
to keep a count of the number of infant nodes, and not to initiate a
mark-sweep garbage collection while any infant nodes exist. This has the
advantages of the second approach, but without the disadvantage: visitReferents()
will never be called for a node whose construction is incomplete, and
there is consequently no need for zeroing memory. It also simplifies the
handling of the case where an exception is thrown within the constructor
of an object derived from GCNode.
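
The counting approach can be sketched as follows. The class names and collectionPermitted() are hypothetical, and the treatment of a throwing constructor is our reading of why the exception case becomes simpler: the base-class destructor runs during unwinding and removes the node from the count.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical sketch: every node starts life as an infant, and mark-sweep
// collection is permitted only when no infants exist.
class NodeSketch {
public:
    NodeSketch() { ++infantCount(); }
    NodeSketch(const NodeSketch&) = delete;
    NodeSketch& operator=(const NodeSketch&) = delete;

    // A node destroyed while still an infant (for example because a
    // derived-class constructor threw) leaves the count consistent.
    virtual ~NodeSketch() {
        if (m_infant) --infantCount();
    }

    // Called once construction is complete.
    void expose() {
        if (m_infant) {
            m_infant = false;
            --infantCount();
        }
    }

    static bool collectionPermitted() { return infantCount() == 0; }

private:
    static unsigned int& infantCount() {
        static unsigned int count = 0;
        return count;
    }

    bool m_infant = true;
};

// A derived class whose constructor fails after the base is constructed.
struct FailingNodeSketch : NodeSketch {
    FailingNodeSketch() { throw std::runtime_error("construction failed"); }
};
```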

Other changes are as follows:

Templated class GCEdge<T> (which was abolished at
Phase 10) has been reinstated, and encapsulates the write barrier. RObject::Handle<T>
now inherits from GCEdge<T>.

The templated class GCRoot<T> has been renamed GCStackRoot<T>,
and its implementation simplified. These objects remain subject to the
restriction that they must be destroyed in the reverse order of their
creation, and are therefore best suited to declaration as automatic
variables (i.e. variables on the processor stack). A new templated class
GCRoot<T> has been introduced: this does a similar
job to GCStackRoot (i.e. it is a smart pointer providing
protection from garbage collection), but is not subject to
creation/destruction order restrictions. However, construction and
destruction of GCRoots is more time consuming than for GCStackRoots,
so the latter should be preferred where possible. CR's 'precious list'
has been reimplemented as part of the base class of GCRoot.
The ExitException class has been abolished, since the new
GCRoots make it unnecessary.

Class MemoryBank no longer contains any logic related to
garbage collection, and in particular there are no callbacks from MemoryBank
into the garbage-collection code. The decision about whether to initiate
a mark-sweep collection is now taken in GCNode::operator new.

Phase 21: 0.21-2.8.1

This phase changes the approach used for garbage collection. Previous
phases used a generational mark-sweep collector, like CR itself. As of
Phase 21, the principal method of garbage collection is reference
counting. The principal motivation for this is to make better use of the
processor caches: with reference counting, the memory occupied by objects
that become garbage is quickly recycled into productive use, very likely
while this memory is still mapped in cache.

To implement reference counting, each GCNode object
contains a one-byte reference count, which is automatically adjusted by
the GCEdge<T>, GCRoot<T> and GCStackRoot<T>
smart pointers, and by the traditional CR PROTECT/UNPROTECT
mechanism. (If a node's reference count ever reaches 255, it sticks at
that value, and that node can only be garbage-collected by the mark-sweep
mechanism.) When a GCNode's reference count falls to zero,
it is declared 'moribund'. When GCNode::operator new is
called upon to allocate memory for a new GCNode object, it
first looks through class GCNode's internal list of moribund
nodes. Any nodes on the list which still have a reference count of zero
are deleted; nodes whose reference count has risen back above zero -
accounting for about one in four of the nodes on the moribund list - are
returned to the 'live' list.
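
The saturating count and the moribund list can be sketched as below. CountedNodeSketch and its members are hypothetical names; the smart pointers that would call incRef() and decRef(), the 'live' list, and the release of a deleted node's own referents are all omitted, and reapMoribund() stands for the scan that the text places in GCNode::operator new.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class CountedNodeSketch {
public:
    unsigned int refCount() const { return m_count; }

    void incRef() {
        if (m_count != 255) ++m_count;  // 255 is sticky
    }

    void decRef() {
        if (m_count == 255) return;     // saturated: left to mark-sweep
        assert(m_count > 0);
        if (--m_count == 0 && !m_listed) {
            m_listed = true;            // the node is now 'moribund'
            moribund().push_back(this);
        }
    }

    // Delete the moribund nodes whose count is still zero, and return the
    // number deleted. Nodes whose count has risen again stay alive.
    static std::size_t reapMoribund() {
        std::vector<CountedNodeSketch*> pending;
        pending.swap(moribund());
        std::size_t deleted = 0;
        for (CountedNodeSketch* node : pending) {
            if (node->m_count == 0) {
                delete node;
                ++deleted;
            } else {
                node->m_listed = false;
            }
        }
        return deleted;
    }

private:
    static std::vector<CountedNodeSketch*>& moribund() {
        static std::vector<CountedNodeSketch*> list;
        return list;
    }

    unsigned char m_count = 0;  // one-byte reference count
    bool m_listed = false;      // true while on the moribund list
};
```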

To cope with cycles in the node graph (i.e. the directed graph whose
nodes are GCNodes and whose edges are GCEdges),
this reference counting scheme is backed up by a simple (i.e.
non-generational) mark-sweep scheme. However, this runs much more rarely
than CR's garbage collections, and uses a simpler logic to manipulate the
threshold at which mark-sweep collection takes place. Not having node
generations means that there is no longer a need to implement the 'write
barrier'; this in turn means that the GCEdge<T>
templated class can have a C++ assignment operator defined, which enables
it to be more freely used in connection with the container types in the
C++ standard library.

Weak reference (WeakRef) objects need special handling
during garbage collection, and consequently each WeakRef
object now includes a pointer to itself, to stop it being deleted by the
reference counting mechanism.

Phase 22: 0.22-2.9.1

The purpose of this phase was to update CXXR to parallel release 2.9.1 of
CR. (Unfortunately, it was overtaken by release 2.9.2 of CR.)

uncxxr.h now defines a macro CXXRconvert(type,
expr), which expands to type(expr), but which uncxxr.pl
replaces simply by expr. This macro is now widely used in
code inherited from CR in cases where C++ requires an explicit type
conversion but C does not.
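
A typical use is sketched below. The macro body follows the description above (the actual text of uncxxr.h is not reproduced here), while the enumeration FlagSketch and the function flagFromInt are hypothetical. Note that the functional-cast form type(expr) requires type to be a simple type name.

```cpp
#include <cassert>

// As described in the text: expands to a functional-style conversion.
#define CXXRconvert(type, expr) type(expr)

enum FlagSketch { OFF = 0, ON = 1 };  // hypothetical C-style enumeration

// In C, 'return i != 0;' would compile as it stands; C++ refuses the
// implicit conversion to an enumeration, so the conversion is spelled out.
// Back-conversion by uncxxr.pl would leave just 'i != 0'.
inline FlagSketch flagFromInt(int i) {
    return CXXRconvert(FlagSketch, i != 0);
}
```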

Phase 23: 0.23-2.9.2

The purpose of this phase was to update CXXR to parallel release 2.9.2 of
CR. This proved straightforward.

Phase 24: 0.24-2.9.2

This phase represented the first stage of refactoring the interpreter's
evaluation logic into C++, and included the following principal changes:

A class CXXR::Evaluator has been introduced to carry out
general services and housekeeping in support of evaluation. Rf_eval()
is now simply a wrapper round Evaluator::evaluate().

Class RObject now defines a virtual function evaluate(),
which Evaluator::evaluate() uses to evaluate a particular
object. By default this simply returns a pointer to the RObject
for which it was invoked, but this behaviour is overridden in various
classes (e.g. Expression, Symbol and Promise)
to provide substantive functionality.
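
The dispatch just described can be sketched as follows. All names ending in Sketch are hypothetical stand-ins, the environment is reduced to a std::map, and an unbound symbol simply yields a null pointer here rather than raising an R error.

```cpp
#include <cassert>
#include <map>
#include <string>

struct EnvSketch;

// Stand-in for RObject: by default an object evaluates to itself.
struct ObjSketch {
    virtual ~ObjSketch() = default;
    virtual ObjSketch* evaluate(EnvSketch&) { return this; }
};

// A self-evaluating type needs no override.
struct NumberSketch : ObjSketch {
    explicit NumberSketch(double v) : value(v) {}
    double value;
};

struct EnvSketch {
    std::map<std::string, ObjSketch*> bindings;
};

// Stand-in for Symbol: evaluation looks the name up in the environment.
struct SymbolSketch : ObjSketch {
    explicit SymbolSketch(const std::string& n) : name(n) {}
    ObjSketch* evaluate(EnvSketch& env) override {
        auto it = env.bindings.find(name);
        return it == env.bindings.end() ? nullptr : it->second;
    }
    std::string name;
};

// Stand-in for the wrapper that dispatches to the virtual function;
// a null pointer evaluates to itself.
inline ObjSketch* evaluate(ObjSketch* object, EnvSketch& env) {
    return object ? object->evaluate(env) : nullptr;
}
```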

The abstract class FunctionBase now defines an abstract
virtual function apply(), which is invoked by Expression::evaluate()
to apply a function to a specific set of actual arguments.

Class BuiltInFunction now has subclasses OrdinaryBuiltInFunction
(corresponding to SEXPTYPE BUILTINSXP) and SpecialBuiltInFunction
(SPECIALSXP). (It is possible that these classes will be
abolished in the future, with their respective functionalities - which
differ only slightly - being moved into BuiltInFunction.)

The functionality of BuiltInFunction::apply(), through
to the invocation of the appropriate do_ function, is now
fully handled within the CXXR core. do_internal() has also
been absorbed into the CXXR core. For the time being, however, Closure::apply()
is simply a wrapper round CR's Rf_applyClosure().

The function table, R_FunTab in CR, is now a private
static data member of class BuiltInFunction. This class
now uses a Schwarz counter, which automatically initialises the function
table on program start-up.

Phase 25: 0.25-2.9.2

This phase continued with refactoring the interpreter's evaluation logic
into C++, and comprised the following principal changes:

Closure::apply() has now been reimplemented within the
CXXR core, making use of a new class ArgMatcher to carry
out argument matching. For the time being the function Rf_applyClosure()
remains in existence, but it is now used only in connection with method
dispatch.

As presaged in the description of the preceding phase, classes OrdinaryBuiltInFunction
and SpecialBuiltInFunction have been abolished, and their
functionalities absorbed into BuiltInFunction.

A policy, described in the documentation of class RObject,
has been defined and put into practice regarding the use of const
T*, where T is RObject or a class
inheriting from it. This policy aims to resolve as far as possible an
inherent tension between the way CR is implemented and the
'const-correctness' that forms part of C++ programming style.

The code relating to weak reference (WeakRef) objects has
been improved and tidied up in various ways. In particular, when the key
object of a WeakRef is found to be unreachable, it is now
guaranteed that the weak reference's finalizer (if any) will be run as
part of the same mark-sweep garbage collection that collects the key.

Phase 26: 0.26-2.10.1

The purpose of this phase was to update CXXR to parallel release 2.10.1
of CR.

Phase 27: 0.27-2.10.1

This phase comprised the following principal changes:

SET_ENCLOS() has been superseded by new mechanisms for
manipulating the enclosing relationships of Environments,
which ensure that acyclicity is preserved.

A 'global cache' for Symbol bindings found along the
search list has been introduced, similar to that used in CR.

R_isMissing() has been reimplemented as CXXR::isMissingArgument();
unlike the previous CXXR implementation, it no longer requires any
memory allocations.

The GCNode class can now optionally include diagnostic
code to identify cycles within the GCNode/GCEdge
graph.

Phase 28: 0.28-2.10.1

This phase was concerned with refactoring contexts (CR's RCNTXT),
and involved teasing apart the numerous distinct functions that this struct
plays in CR:

Maintaining an 'Ariadne's thread' recording information about the
stack of R function calls currently in progress. This function is now
encapsulated in the CXXR class Evaluator::Context.

Conveying information about possible longjmp targets
from the destination to the point where longjmp is called.
C's setjmp and longjmp are incompatible with
C++ exception handling, and were removed from CXXR at Phase 8. At that
stage, however, they were simply replaced by an exception class JMPException,
which was designed simply to ape the behaviour previously achieved with
longjmp. JMPException has now itself been
abolished, and replaced with three exception classes LoopException
(servicing R functions break and next), ReturnException
(which services the R function return and various other
indirect flows of control) and CommandTerminated (raised
in response to unhandled errors or user interrupts). These new exception
classes are used in a way consistent as far as possible with C++
programming idioms; in particular, the class Evaluator::Context
plays no direct role in controlling their propagation, and the CR
function findcontext() no longer exists.

Saving information about the state of evaluation prior to an R
function call, and then restoring the state as the function exits
(whether via the normal flow of control or via longjmp).
For the time being, this save/restore functionality has been retained
within the Evaluator::Context class, though in some cases
the functionality is achieved by incorporating an object of some other
class, such as ProtectStack::Scope or RAllocStack::Scope,
within an Evaluator::Context object.

In all cases this save/restore functionality is now achieved,
following a standard C++ idiom, by the constructor of a stack-based
object saving state, and then its destructor restoring it. This
automatically copes both with the normal flow of control and with
exceptions, so there is now no need for CR's R_restore_globals()
function.
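
The idiom is illustrated by the sketch below, in which a hypothetical stack (protectStack) stands in for a piece of interpreter state and the hypothetical class StackScope saves and restores its size. These are not the actual ProtectStack::Scope or RAllocStack::Scope classes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical piece of interpreter state that must be restored on exit.
inline std::vector<int>& protectStack() {
    static std::vector<int> stack;
    return stack;
}

// The constructor saves the state; the destructor restores it.
class StackScope {
public:
    StackScope() : m_saved_size(protectStack().size()) {}
    ~StackScope() { protectStack().resize(m_saved_size); }
    StackScope(const StackScope&) = delete;
    StackScope& operator=(const StackScope&) = delete;

private:
    std::size_t m_saved_size;
};

// A function call that grows the stack and may exit by an exception.
inline void callFunction(bool fail) {
    StackScope scope;
    protectStack().push_back(1);
    protectStack().push_back(2);
    if (fail) throw std::runtime_error("error in called function");
    // On either exit path, scope's destructor pops what was pushed.
}
```

No explicit restore call is needed at any exit point: stack unwinding runs the destructor whichever way the function is left.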

In the future, it is likely that some of the save/restore functions
now carried out by the Evaluator::Context class will be
factored out into new classes with more specific responsibilities.

Saving information about R on.exit expressions. This
function is now also encapsulated within the Evaluator::Context
class. Any on.exit expressions attached to a Context
object are evaluated automatically by the object's destructor. This
automatically copes both with the normal flow of control and with
exceptions, so there is now no need for CR's R_run_onexits()
function.

Verifying that R functions effecting an indirect flow of control (e.g.
break, next and return) are
used only in circumstances where there is an appropriate destination. In
CXXR this is now accomplished using the classes Environment::LoopScope
and Environment::ReturnScope.

Determining whether execution is currently within an R browser, and if
so what the browsing depth is. In CXXR this is now accomplished using
the class Browser.

Other changes in this phase were:

The pending Promise stack has been abolished, the
necessary functionality now being achieved with C++ try-catch logic.

Several CR global variables have been abolished: R_RestartToken,
R_ReturnedValue and R_Toplevel. (CR's TOPLEVEL
contexts have been replaced by Evaluator objects.)

Phase 29: 0.29-2.10.1

The primary purpose of this release was to define the baseline for the
results on add-on packages reported at useR! 2010. The changes are mainly
bugfixes, but with the following more substantive changes:

The code now allows for the possibility that the destructor of a class
in the RObject hierarchy may evaluate R expressions. This
has entailed a change to the implementation of PairList::construct(),
which was previously not reentrant; in the new implementation, this
function never gives rise to garbage collection.

Methods of class RObject concerned with setting and
examining attributes are all now either virtual or implemented via
calls to virtual functions. This means that classes within the RObject
hierarchy can apply their own consistency checks to attribute settings,
and also override or augment the way in which attribute values are
stored within the C++ object.

Phase 30: 0.30-2.11.1

The primary purpose of this phase was to update CXXR to parallel release
2.11.1 of CR. This included the following corrections to significant
preexisting bugs:

Each of the functions COMPLEX(), INTEGER(),
LOGICAL(), RAW(), REAL(), R_CHAR(),
STRING_ELT(), SET_STRING_ELT(), VECTOR_ELT(),
SET_VECTOR_ELT(), XVECTOR_ELT() and SET_XVECTOR_ELT()
now verifies not only that its vector argument is a pointer to an RObject
of the correct type, but also that this argument is not a null pointer.
SET_STRING_ELT() also now verifies that the pointer to the
new String value is not null. These changes bring the
behaviour of these functions back into line with CR. These non-null
checks are applied even if CXXR is built with the preprocessor variable
UNCHECKED_SEXP_DOWNCAST defined (which causes the type
checks to be elided).
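
The interplay between the two kinds of check can be pictured with a minimal sketch. The names RObjectSketch, IntVectorSketch, StringSketch, downcastSketch() and integerSketch() are hypothetical stand-ins invented for illustration, and the use of std::invalid_argument is an assumption rather than CXXR's actual error reporting. Only the preprocessor variable UNCHECKED_SEXP_DOWNCAST is taken from the text above.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical miniature of the RObject hierarchy.
struct RObjectSketch { virtual ~RObjectSketch() {} };
struct IntVectorSketch : RObjectSketch { std::vector<int> data; };
struct StringSketch : RObjectSketch {};

// Downcast helper: the null check is unconditional, whereas the type
// check disappears when UNCHECKED_SEXP_DOWNCAST is defined.
template <typename T>
T* downcastSketch(RObjectSketch* s) {
    if (!s)
        throw std::invalid_argument("null pointer passed to accessor");
#ifdef UNCHECKED_SEXP_DOWNCAST
    return static_cast<T*>(s);
#else
    T* ans = dynamic_cast<T*>(s);
    if (!ans)
        throw std::invalid_argument("object is not of the expected type");
    return ans;
#endif
}

// Analogue of INTEGER(): yields a pointer to the vector's elements.
int* integerSketch(RObjectSketch* x) {
    return downcastSketch<IntVectorSketch>(x)->data.data();
}
```

In this sketch a call such as integerSketch(nullptr) is rejected in every build, while passing a StringSketch is rejected only in builds that retain the type check.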

Changes have been made to ensure that do_browser()
correctly saves and restores the restart handler stack, and to ensure
that the browser can be invoked at top-level. (There is however still a
problem that typing Q into the browser does not work as
described in the manual page: it simply returns to the browser prompt.)

Phase 31: 0.31-2.11.1

This phase included extensive changes:

The process, started in Phase 28, of unbundling the various functions
of CR contexts continues. The Evaluator::Context class is
now the root of a hierarchy of classes. A Context object of some kind is
now created for every R function invocation (this no longer depends on
whether profiling is in progress), but the intention is that these
Context objects are lightweight, and contain only information relevant
to the particular function invocation.

In CR, indirect flows of control such as arise from the R return
and break functions are handled by C setjmp/longjmp.
Since these are incompatible with the orderly stack unwinding that C++
requires, at Phase 8 CXXR everywhere replaced invocations of longjmp
by throwing C++ exceptions. Unfortunately the propagation of C++
exceptions incurs a considerable overhead.

An R function such as return is now implemented so that
it creates an object of a class inheriting from Bailout.
The basic idea is that this object is then passed as a return value up
the chain from called function to caller, until it reaches the
intended destination of the indirect flow of control. However, this
passing up the call chain happens only if the caller has indicated, by
wrapping its call in a BailoutContext, that it is able
to propagate the Bailout object correctly. If that is
not the case, then the called function will invoke the throwException()
method of the Bailout object, which - as the name
suggests - will complete the indirect flow of control by throwing a
C++ exception.

This change has greatly reduced the number of C++ exceptions that are
thrown, with corresponding benefits for performance.
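
The mechanism just described can be illustrated by a much simplified sketch. The class names Bailout and BailoutContext and the method throwException() come from the text above. Everything else is an assumption made for illustration: the single global flag, the Result struct, ReturnException, and the functions doReturn(), callWithContext() and callWithoutContext(). The real classes are tied into the Context hierarchy and carry R objects rather than an int.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical exception type, used only on the slow path.
struct ReturnException : std::runtime_error {
    int value;
    explicit ReturnException(int v) : std::runtime_error("return"), value(v) {}
};

// Simplified Bailout: records the value carried by the indirect flow of control.
class Bailout {
public:
    explicit Bailout(int value) : m_value(value) {}
    int value() const { return m_value; }
    // Fallback: complete the transfer of control by throwing.
    [[noreturn]] void throwException() const { throw ReturnException(m_value); }
private:
    int m_value;
};

// RAII marker: while one is in scope, the caller undertakes to inspect
// results and pass any Bailout further up the call chain.
class BailoutContext {
public:
    BailoutContext() : m_previous(s_active) { s_active = true; }
    ~BailoutContext() { s_active = m_previous; }
    BailoutContext(const BailoutContext&) = delete;
    BailoutContext& operator=(const BailoutContext&) = delete;
    static bool active() { return s_active; }
private:
    bool m_previous;
    static bool s_active;
};
bool BailoutContext::s_active = false;

// Assumed result type: either an ordinary value or a propagating Bailout.
struct Result { int value; bool isBailout; };

// Analogue of R's return(): cheap path if the caller can cope, else throw.
Result doReturn(int v) {
    Bailout b(v);
    if (BailoutContext::active())
        return Result{b.value(), true};
    b.throwException();
}

// Caller that has opted in: no exception is thrown.
int callWithContext(int v) {
    BailoutContext ctx;
    Result r = doReturn(v);
    return r.isBailout ? r.value : -1;
}

// Caller that has not opted in: the exception path is taken.
int callWithoutContext(int v) {
    try {
        doReturn(v);
    } catch (const ReturnException& e) {
        return e.value;
    }
    return -1;
}
```

Both callers end up with the same value. The difference is that only callWithoutContext() pays for exception propagation.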

There has been continued refactoring of the central evaluation logic,
mainly with a view to making it clearer. This includes particularly the
dispatching of S3 methods. There has been some progress towards
concentrating all manipulations of argument lists in a new class ArgList.
Rf_applyClosure() and R_execClosure() have
been abolished, their functionality now being incorporated into the Closure
class. However, much remains to be done.

The approach to running CXXR under Valgrind (with the memcheck tool)
has changed. Previously, CXXR optionally instrumented its own internal
memory allocation scheme (based on classes MemoryBank and
CellPool) using Valgrind client requests. This
instrumentation was controlled by the preprocessor variable VALGRIND_LEVEL.
Unfortunately the instrumented CXXR ran under Valgrind with glacial
slowness, making it useless for practical purposes. Under the new
approach, VALGRIND_LEVEL has been abolished. Instead, when
Valgrind (+memcheck) is to be used, the file MemoryBank.cpp
should be recompiled with the preprocessor variable NO_CELLPOOLS
defined, and CXXR rebuilt. (Only this one file needs to be recompiled.)
When NO_CELLPOOLS is defined, class MemoryBank
routes all requests for memory blocks directly to ::operator new
(which no doubt in turn calls malloc()). This means that
Valgrind's internal malloc() substitute comes into play,
and the result runs at an entirely usable speed.
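
As a rough illustration of the switch, the following sketch shows an allocator whose pooling is compiled out when NO_CELLPOOLS is defined. The class MemoryBankSketch, its 64-byte cell size and its single free list are assumptions made for illustration only and do not reflect the real MemoryBank and CellPool interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for MemoryBank: only the routing decision is modelled.
class MemoryBankSketch {
public:
    static void* allocate(std::size_t bytes) {
#ifdef NO_CELLPOOLS
        // Every request goes straight to ::operator new, so that a tool
        // such as Valgrind sees each block individually.
        return ::operator new(bytes);
#else
        if (bytes > kCellSize)
            return ::operator new(bytes);
        if (s_free) {               // reuse a previously freed cell
            Cell* c = s_free;
            s_free = c->next;
            return c;
        }
        return ::operator new(kCellSize);
#endif
    }

    static void deallocate(void* p, std::size_t bytes) {
#ifdef NO_CELLPOOLS
        (void)bytes;
        ::operator delete(p);
#else
        if (bytes > kCellSize) {
            ::operator delete(p);
            return;
        }
        // Push the cell onto the free list instead of releasing it.
        s_free = new (p) Cell{s_free};
#endif
    }

private:
    struct Cell { Cell* next; };
    static const std::size_t kCellSize = 64;  // assumed cell size
    static Cell* s_free;
};
MemoryBankSketch::Cell* MemoryBankSketch::s_free = nullptr;
```

Because the choice is confined to the two member functions, only the one translation unit containing them would need recompiling to change mode, which mirrors the remark above about MemoryBank.cpp.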

CXXR has also been changed to carry out a more thorough clean-up at
program exit; in particular all objects of a class derived from GCNode
are deleted, and the tables of Symbols and CachedStrings
are deleted. This suppresses many of the 'possibly lost' reports that
Valgrind's leak check would otherwise produce.

Phase 32: 0.32-2.11.1

This phase consisted of changes to improve the speed of CXXR. The
principal changes were as follows:

When the reference count of a GCNode falls to zero, it
is designated as 'moribund'. Previously moribund nodes were moved onto a
separate doubly-linked list of nodes (and moved back again if the
reference count was found subsequently to have risen). Now instead the GCNode
class maintains a vector of pointers to moribund nodes. Also, the
moribund flag within a GCNode object is now incorporated
into the same byte as the saturating reference count.
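
A minimal sketch of this arrangement follows. The class RefCountedSketch, the bit layout (bit 0 for the moribund flag, bits 1 to 7 for the count) and the sweep() function are assumptions chosen for illustration. The text above states only that the flag and the saturating count share a byte, and that moribund nodes are tracked in a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class RefCountedSketch {
public:
    RefCountedSketch() : m_bits(0) {}

    unsigned refCount() const { return m_bits >> 1; }
    bool isMoribund() const { return (m_bits & 1u) != 0; }

    void incRef() {
        // Saturating: once the maximum is reached the count sticks there.
        if (refCount() < kMaxCount)
            m_bits = static_cast<std::uint8_t>(m_bits + 2);
    }

    void decRef() {
        unsigned c = refCount();
        if (c == kMaxCount)
            return;  // saturated: the true count is no longer known
        assert(c > 0);
        m_bits = static_cast<std::uint8_t>(m_bits - 2);
        // The flag ensures a node is recorded in the vector at most once,
        // even if its count rises and falls to zero again before a sweep.
        if (refCount() == 0 && !isMoribund()) {
            m_bits = static_cast<std::uint8_t>(m_bits | 1u);
            s_moribund.push_back(this);
        }
    }

    static std::size_t moribundCount() { return s_moribund.size(); }

    // Delete those recorded nodes whose count is still zero. Nodes whose
    // count has risen again are simply dropped from the record.
    static std::size_t sweep() {
        std::vector<RefCountedSketch*> work;
        work.swap(s_moribund);
        std::size_t deleted = 0;
        for (RefCountedSketch* node : work) {
            node->m_bits = static_cast<std::uint8_t>(node->m_bits & ~1u);
            if (node->refCount() == 0) {
                delete node;
                ++deleted;
            }
        }
        return deleted;
    }

private:
    static const unsigned kMaxCount = 127;  // largest value of a 7-bit count
    std::uint8_t m_bits;                    // bit 0: moribund; bits 1-7: count
    static std::vector<RefCountedSketch*> s_moribund;
};
std::vector<RefCountedSketch*> RefCountedSketch::s_moribund;
```

In this sketch a node whose count rises again is not removed from the vector at once. It is simply skipped at the next sweep, so no list surgery is needed when a count changes.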

PairList objects have now been squeezed into 32 bytes (on
32-bit architecture) - with some resulting inelegances in encapsulation
- and Frame::Binding objects have been reduced to 16 bytes
(again on 32-bit architecture). Class CellPool now
allocates its 'superblocks' on 4096-byte boundaries. These changes make
for better utilisation of the processor caches.

A new class VectorFrame has been introduced, and used to
implement the local Environments of Closure
calls instead of the StdFrames used previously. As the
name suggests, VectorFrame is an implementation of the Frame
abstract type which holds its constituent Frame::Bindings
as a vector. Although look-up time is asymptotically linear in the number
of Bindings, as compared with the logarithmic performance of StdFrame,
it has a shorter construction and destruction time than StdFrame,
and is better localised in memory. These factors make VectorFrame
more efficient in implementing small Frames with a short
lifetime.
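
The trade-off can be seen in a toy version of the idea. VectorFrameSketch below is a hypothetical miniature, keyed by std::string and holding int values purely for illustration. The real VectorFrame implements the Frame interface and holds Frame::Binding objects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class VectorFrameSketch {
public:
    struct Binding {
        std::string symbol;
        int value;
    };

    // Linear scan: cheap for the handful of bindings typical of a
    // function call, though O(n) in general.
    Binding* find(const std::string& symbol) {
        for (Binding& b : m_bindings)
            if (b.symbol == symbol)
                return &b;
        return nullptr;
    }

    // Update an existing binding or append a new one.
    void assign(const std::string& symbol, int value) {
        if (Binding* b = find(symbol))
            b->value = value;
        else
            m_bindings.push_back(Binding{symbol, value});
    }

    std::size_t size() const { return m_bindings.size(); }

private:
    std::vector<Binding> m_bindings;  // contiguous, hence cache-friendly
};
```

Constructing and destroying such a frame involves little more than one vector, which is the property that matters for short-lived frames.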

Phase 33: 0.33-2.12.1

The purpose of this phase was to update CXXR to parallel release 2.12.1
of CR. In the course of this, the use of UncachedString
objects was largely replaced by the use of CachedString
objects, a change that has lagged behind the corresponding change in CR.

Phase 34: 0.34-2.12.1

This phase was marked by a wider use of C++ generic programming
techniques, both to simplify the internal code, and to make this code
available in a flexible form to add-on packages. In particular:

All the built-in vector types are now specialisations of the class
template FixedVector.

Subscripting operations (subsetting and subassignment) are now carried
out by algorithms implemented as C++ templates, so that they are
applicable to generalised vectors of arbitrary element types, not just
the R built-in vector types. (Class Subscripting and
associated functions.)

Similarly unary functions and binary functions are now handled
generically, using algorithms within the namespace VectorOps.

To support the generic algorithms, various function objects were
introduced in the new ElementTraits namespace.
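
The flavour of this generic approach can be conveyed by a small sketch. The names FixedVectorSketch, ElementTraitsSketch and subsetSketch() are hypothetical and far simpler than the real FixedVector, ElementTraits and Subscripting facilities. In particular, representing a missing double by a quiet NaN is a simplification assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Toy fixed-length vector parameterised by element type.
template <typename T>
class FixedVectorSketch {
public:
    typedef T value_type;
    explicit FixedVectorSketch(std::size_t n, const T& init = T())
        : m_data(n, init) {}
    std::size_t size() const { return m_data.size(); }
    T& operator[](std::size_t i) { return m_data[i]; }
    const T& operator[](std::size_t i) const { return m_data[i]; }
private:
    std::vector<T> m_data;
};

// Per-element-type knowledge that the generic algorithm needs.
template <typename T> struct ElementTraitsSketch;
template <> struct ElementTraitsSketch<int> {
    static int NA() { return std::numeric_limits<int>::min(); }
};
template <> struct ElementTraitsSketch<double> {
    static double NA() { return std::numeric_limits<double>::quiet_NaN(); }
};

// Generic subsetting with 1-based indices: an out-of-range index yields
// the element type's NA value, as in v[c(3, 1, 5)] at the R level.
template <typename V>
V subsetSketch(const V& v, const std::vector<std::size_t>& indices) {
    typedef typename V::value_type T;
    V ans(indices.size(), ElementTraitsSketch<T>::NA());
    for (std::size_t k = 0; k < indices.size(); ++k) {
        std::size_t i = indices[k];
        if (i >= 1 && i <= v.size())
            ans[k] = v[i - 1];
    }
    return ans;
}
```

The same subsetSketch() template serves both FixedVectorSketch<int> and FixedVectorSketch<double>. Supporting a further element type requires only another specialisation of the traits class.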

Phase 35: 0.35-2.12.1

This release is intended to clear the decks prior to an upgrade to
R 2.13.1, and includes only small changes in the development trunk:

The class Subscripting has now been extended to cover
subassignment to matrices and arrays.

The implementation of class GCNode has been modified,
reducing its administrative data to a single byte.

(The main activity in the period leading up to this release has been the
introduction of the lazycopy branch, which is exploring
methods for managing object duplication automatically via the RHandle
smart pointer, and eliminating the need for NAMED() and SET_NAMED().
Verdict so far is mixed: it basically works, but has performance issues,
and breaks somewhat more existing code than I'd like. A plus point is that
it better achieves C++ 'const correctness' than the development trunk.)

Phase 36: 0.36-2.13.1

The purpose of this phase was to upgrade CXXR to parallel release 2.13.1
of CR. This includes making bytecode interpretation available in CXXR for
the first time, though not yet in the 'threaded code' implementation
(which is the CR default when using gcc).

The code also now builds correctly when configured with --enable-memory-profiling.
(Thanks to Doug Bates for pointing out that previously it didn't.)
However, the functionality of tracemem and kindred R
functions (untracemem and retracemem) is
currently unavailable in CXXR even when it is configured with memory
profiling enabled.

Phase 37: 0.37-2.13.1

This release contains only minor changes:

The functionality of tracemem and kindred functions has
been reinstated.

The 'threaded code' implementation of the bytecode interpreter is
available, and is the default under gcc (as in CR).

Various efficiency improvements, particularly regarding bytecode,
though much remains to be done here.

Phase 38: 0.38-2.13.1

This release clears the decks prior to an upgrade of CXXR to R 2.14.1.

The principal change regards garbage collection. The reference-counted
approach to garbage collection primarily used by CXXR can bring speed
advantages when dealing with large datasets, but the housekeeping involved
in diddling reference counts up and down as required is surprisingly
time-consuming, and this is a major contributor to the speed penalty of
CXXR compared with CR when dealing with small datasets, a penalty that has
grown greater with the advent of the bytecode interpreter. This release
incorporates the following changes:

Formerly CXXR would initiate a reference-count garbage collection (GCNode::gclite())
on every call to GCNode::operator new. This is still the
case if CXXR is built with the preprocessor variable AGGRESSIVE_GC
defined (as is the case in the default configuration), but otherwise gclite()
is invoked only when the number of bytes allocated has risen by a
certain margin (currently 10,000) since the previous call of gclite().

Smart pointers from the GCStackRoot class template are
now in either a non-protecting or protecting state, with newly created GCStackRoots
being non-protecting. Only if a GCStackRoot is in the
protecting state does it increment the reference count of its target. GCNode::gclite()
switches all GCStackRoots into the protecting state before
starting garbage collection. Taken in conjunction with the first change,
this means that many GCStackRoot pointers will complete
their lifecycle without ever being switched into the protecting state.
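
The combined effect of these two changes can be sketched as follows. NodeSketch, StackRootSketch and GCSketch are hypothetical simplifications invented for illustration. Only the names gclite() and AGGRESSIVE_GC and the margin of 10,000 bytes are taken from the text above, and the running byte total is an assumed reading of 'the number of bytes allocated'.

```cpp
#include <cassert>
#include <cstddef>

struct NodeSketch { unsigned refCount = 0; };

// Stack-allocated root pointer that defers touching the reference count.
class StackRootSketch {
public:
    explicit StackRootSketch(NodeSketch* target)
        : m_target(target), m_protecting(false), m_next(s_top) { s_top = this; }
    ~StackRootSketch() {
        assert(s_top == this);  // roots are destroyed in reverse order of creation
        s_top = m_next;
        if (m_protecting && m_target)
            --m_target->refCount;
    }
    StackRootSketch(const StackRootSketch&) = delete;
    StackRootSketch& operator=(const StackRootSketch&) = delete;

    bool protecting() const { return m_protecting; }

    // Called before a collection: every live root now counts as a reference.
    static void protectAll() {
        for (StackRootSketch* r = s_top; r; r = r->m_next) {
            if (r->m_protecting)
                continue;
            r->m_protecting = true;
            if (r->m_target)
                ++r->m_target->refCount;
        }
    }
private:
    NodeSketch* m_target;
    bool m_protecting;
    StackRootSketch* m_next;
    static StackRootSketch* s_top;
};
StackRootSketch* StackRootSketch::s_top = nullptr;

class GCSketch {
public:
    // To be called for each allocation of the given size.
    static void noteAllocation(std::size_t bytes) {
        s_bytes += bytes;
#ifdef AGGRESSIVE_GC
        gclite();
#else
        if (s_bytes - s_bytesAtLastGC >= kMargin)
            gclite();
#endif
    }
    static void gclite() {
        StackRootSketch::protectAll();
        s_bytesAtLastGC = s_bytes;
        ++s_calls;
        // A real collector would now delete nodes whose count is zero.
    }
    static unsigned calls() { return s_calls; }
private:
    static const std::size_t kMargin = 10000;
    static std::size_t s_bytes;
    static std::size_t s_bytesAtLastGC;
    static unsigned s_calls;
};
std::size_t GCSketch::s_bytes = 0;
std::size_t GCSketch::s_bytesAtLastGC = 0;
unsigned GCSketch::s_calls = 0;
```

In a build without AGGRESSIVE_GC, a root that is created and destroyed between two collections never touches its target's reference count at all.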

Changes in a similar spirit have been made to the CR-style 'pointer
protection stack' (class ProtectStack) and the bytecode
interpreter's node stack, both of which are now implemented using the new
class NodeStack.

A side effect of the above changes is that when AGGRESSIVE_GC is defined,
CXXR's garbage collection is even more aggressive
than it was in previous releases, and this has revealed a number of
GC-protection gaps (e.g. in code inherited from CR) that had previously
'slipped through the net'.

Another significant change is that the CXXR distribution no longer holds
the 'Recommended' packages in compressed tar form (.tar.gz),
but instead contains the untarred package directories themselves. This
will make it easier to carry forward any CXXR-specific tweaks to these
packages from one R release to the next. (Such tweaks are rare, and often
due to a latent GC-protection bug in the CR package code.)

Phase 39: 0.39-2.14.1

The purpose of this phase was to upgrade CXXR to parallel release 2.14.1
of CR. This entailed substantial changes to the bytecode interpreter, both
to track changes in CR and to correct errors in the previous CXXR
implementation. In the course of preparing this release, numerous
GC-protection gaps were discovered in the CR code (including the
Recommended packages) and corrected within CXXR.

CXXR's bytecode interpreter does not yet implement the cache of symbol
bindings used in CR.

Phase 40: 0.40-2.15.1

The purpose of this phase was to upgrade CXXR to parallel release 2.15.1
of CR. In the course of this upgrade, the class UncachedString
was abolished, and the functionality of class CachedString
was merged into its parent class CXXR::String.

Phase 41: 0.41-2.15.1

In this phase, the experimental provenance-tracking facilities and the
experimental XML-based serialization facilities, both formerly in the provenance
branch, have been merged into the development trunk. Beware that
the documentation and, in particular, the testing of these features are still not
up to standard, and there are known gaps in the serialization capability.
Moreover the interfaces of both are likely to change. To enable
provenance-tracking it is necessary to define PROVENANCE_TRACKING
within src/include/CXXR/config.hpp before building the
program, as the documentation of this file explains.

Phase 42: 0.42-2.15.1

This phase saw various extensions and corrections to the XML-based
serialization facilities, including the introduction of automated tests,
but beware that these are still subject to change. The release
incorporates work by Chris Silles on adapting the autoconf-based
configuration facilities to CXXR: this addresses particularly locating a
suitable installation of Boost, and enabling or disabling provenance
tracking. Previously there were some difficulties in building CXXR
otherwise than in its source directory: these have now, it is hoped, been
removed.