Genetic programming (GP) is a systematic, domain-independent method for
getting computers to solve problems automatically starting from a high-level
statement of what needs to be done. Using ideas from natural evolution, GP
starts from an ooze of random computer programs, and progressively refines them
through processes of mutation and sexual recombination, until solutions emerge.
All this without the user having to know or specify the form or structure of
solutions in advance. GP has generated a plethora of human-competitive results
and applications, including novel scientific discoveries and patentable
inventions.

TinyGP is a highly optimised GP system that was originally
developed by Riccardo Poli to meet the specifications set out in the
TinyGP competition of the Genetic and Evolutionary Computation
Conference (GECCO) 2004. We include it as a working example of a real
GP system, to show that GP software tools are not necessarily big,
complex and difficult to understand. The system can be used as is or
can be modified or extended for a user's specific
applications. Furthermore, TinyGP may serve as a guide to other
implementations of genetic programming.

The source code of TinyGP is available here. The following section provides a description of the main
characteristics of TinyGP. Section 2
describes the format for the input files for TinyGP. Section 3 provides further details on the
implementation and the source code for a Java version of
TinyGP. Finally, Section 4 describes a
sample run of the system.

TinyGP is a symbolic regression system with the following characteristics:

The terminal set includes a user-definable number of floating point
variables (named X1 to XN).

The function set includes multiplication, protected division, subtraction and addition.

The fitness cases are read from a file (the format is given below).

The system is steady state. A "generation" is considered concluded when POPSIZE (see below) crossover/mutation events have been performed.

Selection is performed using tournament selection.

Negative tournaments are used for the selection of the individuals to
be replaced at each steady-state-GP iteration.

Subtree crossover is used. The selection of crossover points is uniform,
so every node is equally likely to be chosen.

Point mutation is used. That is, points (nodes) in the tree are randomly chosen. If a point is a terminal, then it is replaced by another randomly chosen terminal. If it is a function, then it is replaced by another randomly chosen function with the same number of inputs.

The following parameters are implemented as static class variables:

The maximum length any GP program can take: MAX_LEN.

The size of the population: POPSIZE.

The maximum depth initial programs can have: DEPTH. Note that 0
represents the depth of programs containing just one terminal.

The maximum number of generations allowed for a run:
GENERATIONS.

The probability of creating new individuals via crossover:
CROSSOVER_PROB. The mutation probability is 1-CROSSOVER_PROB.

The mutation probability (per node) when point mutation is chosen
as the variation operator: PMUT_PER_NODE.

The tournament size: TSIZE.

The parameters and the random seed are printed when each run starts.

The fitness function is minus the sum of the absolute differences between the actual program output and the desired output for each fitness case. TinyGP maximises it.

The grow initialisation method is used to create the initial population.

At each generation the following statistics are calculated and printed:

The generation number.

The average fitness of the individuals in the population.

The fitness of the best individual in the population.

The average size of the programs in the current generation.

The best individual in the population.

The random number
generator can be seeded via the command line. If this command line
parameter is absent, the system uses the current time to seed the
random number generator.

The name of the file containing the
fitness cases can be passed to the system via the command line. If
the command line parameter is absent, the system assumes the data are
stored in the current directory in a file called "problem.dat".

If the total error made by the best program goes below 0.00001,
TinyGP prints a message indicating success and stops. If the problem
has not been solved after the maximum number of generations, it prints
a message indicating failure and stops.
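
The fitness measure and the two kinds of tournament described above can be sketched as follows. This is a minimal illustration rather than TinyGP's actual code: the class name, the method names and the choice to sample competitors with replacement are assumptions made for this example.

```java
import java.util.Random;

// Illustrative sketch only; names and structure are assumptions, not TinyGP's source.
class SelectionSketch {

    // Fitness = minus the sum of absolute errors over all fitness cases.
    // outputs[i] is the program's output on case i, targets[i] the desired output.
    static double fitness(double[] outputs, double[] targets) {
        double error = 0.0;
        for (int i = 0; i < targets.length; i++) {
            error += Math.abs(outputs[i] - targets[i]);
        }
        return -error; // values closer to 0 are better, so fitness is maximised
    }

    // Tournament selection: draw tsize individuals uniformly at random
    // (with replacement, an assumption) and return the index of the fittest.
    static int tournament(double[] fitness, int tsize, Random rng) {
        int best = rng.nextInt(fitness.length);
        for (int i = 1; i < tsize; i++) {
            int competitor = rng.nextInt(fitness.length);
            if (fitness[competitor] > fitness[best]) best = competitor;
        }
        return best;
    }

    // Negative tournament: same sampling, but return the index of the least fit
    // individual, which a steady-state system can then overwrite with offspring.
    static int negativeTournament(double[] fitness, int tsize, Random rng) {
        int worst = rng.nextInt(fitness.length);
        for (int i = 1; i < tsize; i++) {
            int competitor = rng.nextInt(fitness.length);
            if (fitness[competitor] < fitness[worst]) worst = competitor;
        }
        return worst;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double[] popFitness = {-5.0, -1.0, -9.0}; // hypothetical fitness values
        int parent = tournament(popFitness, 2, rng);
        int victim = negativeTournament(popFitness, 2, rng);
        System.out.println("parent index: " + parent + ", replaced index: " + victim);
    }
}
```

In a steady-state loop of this kind, each iteration would pick parents with tournament, create one offspring, and write it over the slot returned by negativeTournament.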

The input files for TinyGP have the following plain ASCII format:
HEADER // See below
FITNESSCASE1 // The fitness cases (one per line)
FITNESSCASE2
FITNESSCASE3
....

Each fitness case is of the form
X1 ... XN TARGET
where X1 to XN represent a set of input
values for a program, while TARGET represents the desired
output for the given inputs.

The header has the following entries
NVAR NRAND MIN_RAND MAX_RAND NFITCASES
where NVAR is an integer representing the number of
variables the system should use, NRAND is an integer
representing the number of random constants to be provided in the
primitive set, MIN_RAND is a float representing the lower
limit of the range used to generate random constants,
MAX_RAND is the corresponding upper limit, and
NFITCASES is an integer representing the number of
fitness cases. NRAND can be set to 0, in which case
MIN_RAND and MAX_RAND are ignored.
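
As an illustration of the format just described, the following sketch reads a header and its fitness cases from a string. It is not TinyGP's own reader: the class and field names are hypothetical, the sample data are invented for the example (three cases of TARGET = X1 * X1), and the parser is token-based, so it does not check that each fitness case sits on its own line.

```java
import java.util.Locale;
import java.util.Scanner;

// Illustrative sketch; class and field names are assumptions, not TinyGP's source.
class FitnessCaseReader {
    int nvar;          // NVAR: number of input variables
    int nrand;         // NRAND: number of random constants
    double minRand;    // MIN_RAND: lower limit for random constants
    double maxRand;    // MAX_RAND: upper limit for random constants
    int nfitcases;     // NFITCASES: number of fitness cases
    double[][] cases;  // cases[i][0..nvar-1] are inputs, cases[i][nvar] is TARGET

    static FitnessCaseReader parse(String text) {
        // Locale.ROOT makes "." the decimal separator regardless of system settings.
        Scanner in = new Scanner(text).useLocale(Locale.ROOT);
        FitnessCaseReader r = new FitnessCaseReader();
        r.nvar = in.nextInt();
        r.nrand = in.nextInt();
        r.minRand = in.nextDouble();
        r.maxRand = in.nextDouble();
        r.nfitcases = in.nextInt();
        r.cases = new double[r.nfitcases][r.nvar + 1];
        for (int i = 0; i < r.nfitcases; i++) {
            for (int j = 0; j <= r.nvar; j++) {
                r.cases[i][j] = in.nextDouble();
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical input: 1 variable, 100 constants in [-5, 5], 3 fitness cases.
        String sample = "1 100 -5 5 3\n0 0\n1 1\n2 4\n";
        FitnessCaseReader r = parse(sample);
        System.out.println(r.nfitcases + " cases, last target = " + r.cases[2][r.nvar]);
    }
}
```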

The original TinyGP system was implemented, in the C programming
language, to maximise efficiency and minimise the size of the
executable. The version presented here is a Java re-implementation of
TinyGP. The original version did not allow the use of random numerical
constants.

How does TinyGP work? The system is based on the standard flattened
(linear) representation for trees, which effectively corresponds to
listing the primitives in prefix notation but without any
brackets. Each primitive occupies one byte. A program is simply a
vector of characters. The parameters of the system are as specified in
Section 1. They are fixed at compile
time through a series of static class variable assignments. The
operators used are subtree crossover and point mutation. The selection
of the crossover points is performed at random with uniform
probability. The primitive set and fitness function are as indicated
above. The code uses recursion for the creation of the initial
population (grow), for the identification of the subtree rooted at a
particular crossover point (traverse), for program execution (run),
and for printing programs (print_indiv). A small number of global
variables have been used. For example, the variable program is a
program counter used during the recursive interpretation of programs,
which is automatically incremented every time a primitive is
evaluated. Although using global variables is normally considered bad
programming practice, this was done purposely, after extensive
experimentation, to reduce the executable's size. The code reads
command line arguments using the standard args array.
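
To make the flattened prefix representation more concrete, here is a minimal sketch of how a program stored as a vector of characters can be traversed, interpreted recursively with a global program counter, and recombined by subtree crossover. It is written under stated assumptions rather than copied from TinyGP: the opcode values, the 0.001 threshold in protected division, the counter name pc, the evaluate helper and the convention that each array holds exactly one program are all choices made for this example.

```java
import java.util.Arrays;

// Illustrative sketch; opcode values and names are assumptions for this example.
class PrefixSketch {
    static final char ADD = 110, SUB = 111, MUL = 112, DIV = 113;
    static final char FSET_START = ADD; // codes below this are terminals (indices into x)

    static double[] x;      // terminal values for the current fitness case
    static char[] program;  // program currently being interpreted
    static int pc;          // program counter, advanced as each primitive is evaluated

    // Returns the index just past the subtree rooted at position pos.
    static int traverse(char[] buffer, int pos) {
        if (buffer[pos] < FSET_START) return pos + 1;        // terminal: a single node
        return traverse(buffer, traverse(buffer, pos + 1));  // binary function: skip both arguments
    }

    // Recursively evaluates the subtree starting at pc, leaving pc just past it.
    static double run() {
        char primitive = program[pc++];
        if (primitive < FSET_START) return x[primitive];
        double a = run(); // first argument
        double b = run(); // second argument
        switch (primitive) {
            case ADD: return a + b;
            case SUB: return a - b;
            case MUL: return a * b;
            default:  // DIV, protected: return the numerator if the denominator is tiny
                return Math.abs(b) <= 0.001 ? a : a / b;
        }
    }

    // Convenience wrapper (hypothetical): evaluate prog on the given inputs.
    static double evaluate(char[] prog, double[] inputs) {
        program = prog;
        x = inputs;
        pc = 0;
        return run();
    }

    // Subtree crossover: copy parent1, replacing its subtree rooted at p1
    // with the subtree of parent2 rooted at p2.
    static char[] crossover(char[] parent1, int p1, char[] parent2, int p2) {
        int end1 = traverse(parent1, p1);
        int end2 = traverse(parent2, p2);
        int inserted = end2 - p2;
        char[] child = new char[p1 + inserted + (parent1.length - end1)];
        System.arraycopy(parent1, 0, child, 0, p1);
        System.arraycopy(parent2, p2, child, p1, inserted);
        System.arraycopy(parent1, end1, child, p1 + inserted, parent1.length - end1);
        return child;
    }

    public static void main(String[] args) {
        char[] p1 = {MUL, ADD, 0, 1, 0}; // (X1 + X2) * X1 in prefix form
        char[] p2 = {SUB, 1, 0};         // X2 - X1
        char[] child = crossover(p1, 1, p2, 0);
        System.out.println(evaluate(p1, new double[]{3, 4}));
        System.out.println(Arrays.toString(new int[]{child[0], child[1], child[2], child[3], child[4]}));
    }
}
```

Because no brackets are stored, the extent of a subtree is only known by walking it, which is why crossover calls traverse before copying.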

Generally the code is quite standard and should be self-explanatory for
anyone who can program in Java, whether or not they have implemented a GP system before. Therefore very few comments have been provided in the source code.
The source is provided below. The program should be compiled with the
command javac -O tiny_gp.java