What Are They?

In an evolutionary algorithm (EA), we search the solution space (the set of all possible inputs) of a difficult problem for the best solutions, but not naïvely like in a random or brute-force search. We use some biological principles to help guide the search:

Reproduction

New solutions don't just appear out of thin air (or our random number generator) -- we combine existing good solutions to come up with new (hopefully even better) solutions.

Mutation

Just like in real life, mutations can help or hurt. But they keep the gene pool from becoming stagnant and homogenous -- preventing EA "inbreeding".

Survival of the fittest

The "good" solutions in the population are the ones we pick to pass down their genes. We eliminate the poor solutions from consideration.

EA is an umbrella term encompassing ideas used in the more specific fields of genetic programming, genetic algorithms, and evolutionary computing, so the concepts covered here apply to those fields as well.

Armed with just these principles, you could implement your own rudimentary (and working) EAs. You may already have implemented something like this before. However, as they say, the devil's in the details. It's important to understand how the implementation details affect your EA.

Why/When Use EAs?

You only need to specify the goal of the computation, and need not have a complete mathematical understanding of the problem.

Works well with incomplete and noisy data.

Easy and cheap to implement.

Easily adaptable by simply modifying parameters.

Efficiently parallelized.

On top of all that, EAs are refreshingly fun to use compared to other forms of analysis, and they leave plenty of room for creativity. Here are a few examples of real-world applications that I know about:

Finding circuit layouts that minimize production costs, using graph representations.

Creating classifiers and predictors for data sets, often with neural nets.

Generating music/art that satisfies certain aesthetic criteria.

The Basics:

In EAs, we work with a population of individuals, data structures that represent elements of the problem's solution space. When we talk about representation, we mean the type of internal data structure the EA uses to store the individuals (array, string, tree, neural net, etc). Representation does matter, but for the scope of this document's examples, strings and arrays will suffice. As they are easily available as native data types in any sane language, they are much easier to implement and conceptualize. The actual encoding of a solution into a data structure is called the individual's gene. We'll talk a little more about representation later.

Fitness:

If our goal is to find individuals which represent "good" solutions, we should probably be a little more specific about what we mean by "good." We must have a way of scoring an individual's effectiveness for the problem. We call this measurement the individual's fitness (as in survival of the fittest). The fitness measure should reflect the characteristics you desire in a "good" solution, so the higher an individual's fitness, the better it demonstrates the traits you want. The fitness measure is always dependent on the representation you use. It sets your EA's goal.

Commonly, fitness is just a function of the individual's gene data structure.
However, the fitness measure need not be a true function in the mathematical sense. It might be probablistic, or it might depend also on other members of the population. It also often involves a model or simulation of the problem, executed with the individuals of the population.

The Process:

The most basic evolutionary algorithm psuedocode is rather simple:

Create an initial population (usually at random)

Until "done": (exit criteria)

Select some pairs to be parents (selection)

Combine pairs of parents to create offspring (recombination)

Perform some mutation(s) on the offspring (mutation)

Select some population members to be replaced by the new offspring (replacement)

Repeat

This is extremely general, so let's look at each step in a little more detail. As you can imagine, there are a zillion ways to fill in the implementation details of each step, so I'll only list the most common ones.

The exit criteria sets the target for the fitness measure, but also usually includes an upper limit on the number of iterations, in case the evolution gets "stuck." A typical exit criteria might be: "stop when some individual achieves a fitness of 100, or when we have iterated 10,000 times." We'll talk more about evolution getting "stuck" later. Sticking with the biology jargon, each iteration of the loop is called a generation.

Selection and replacement grant breeding rights and cause extinction within the population, respectively. They are independent of the representation scheme, and should only rely on your choice of fitness measure. Usually a small fraction of the population are chosen for breeding or replacement each generation. For simplicity, often the same number of individuals are chosen for breeding and replacement, although this is not required (causing the population to change in size). Here are a few of the most common selection and replacement methods:

Random: Choose completely at random, with uniform probability given to each individual (regardless of fitness).

Roulette: Pick randomly, but with relative weights proportional to fitness. Higher-fit individuals have a better chance of getting chosen for breeding, and less-fit individuals have a better chance of getting chosen for replacement

Rank: Same as roulette, but make the relative weights proportional to an individual's rank within the population, not fitness. The least-fit individual has rank 1, while the most-fit has rank N (the size of the population).

Selection and replacement methods are independent of each other. You could use absolute replacement with rank selection, for example.

Recombination (or breeding) is the process of using existing pairs of "parent" genes to produce new "offspring" genes. The details of this operation depend on your representation scheme, but by far the most common recombination operation is called crossover. Crossover can be used with string and array representations. It involves making copies of the parents and then swapping a chunk between the copies. Here's a visual example on two string genes:

parent 1: "aaaaaaaaaaaaaa" (these are only copies -- the parents a+re
parent 2: "AAAAAAAAAAAAAA" not modified in this operation)
cut here: ^ ^ (these two points chosen at random)
and then swap sections..
result 1: "aaaaaaAAAAAAaa"
result 2: "AAAAAAaaaaaaAA"

The concept of crossover can be extended and used in other representations as well. For instance, a crossover operation on two tree structures might involve the exchange of two subtrees. Common variations on crossover include swapping chunks from different parts of the two genes or exchanging more than one chunk.

Mutation is a random process which slightly modifies the gene of an individual. With string genes, a mutation usually consists of changing a fixed number of characters, or changing each character with a very low probability (e.g, a 5% chance of changing each character). Other interesting mutations include lengthening, shortening, or modifying the gene, each with a respective probability.

A "Hello World" Example:

It's time for an example! The "Hello World" of EAs is called ONE-MAX. The problem is to produce a string of all 1s, starting from a population of random binary strings. This may seem silly since we already know the final answer, but the same could be said for "Hello World" programs in programming languages. In this case, it's the EA process and the concepts, not the final answer, that are most important. Using a string representation for the genes, the code is as follows:

Can you pick out which parts of this code correspond to the parts of the pseudocode?
What type of mutation was used (N-point or probabalistic)? What type of selection and replacement scheme were used? What percentage of the population gets breeding rights at each generation? What is the exit criteria? How could this code be made more efficient (there are many ways)? How could the EA process be modularized? How much harder would this have been to write in C or Java? ;)

Now What? How Do I Choose?

You now probably have a feeling for the wide range of EA building blocks. But there are so many, how will you choose what's best for a particular problem? What makes them different? It's time for a little theory...

Fitness Landscapes & Diversity:

One way to think of how EAs solve problems is through hill-climbing. Think of breeding as a process of exploring the solution space: starting with high-fitness individuals, recombination and mutation bring new individuals into the population, whose genes are "nearby" the genes of the parents. Selection and replacement fuel the up-hill part: the new individuals who have a higher fitness will in turn be explored while the lower ones will eventually be discarded and so on, until you discover individuals that have the highest fitness of all nearby individuals -- they are at the top of that particular "hill" of fitness. Notice that "nearness" of other individuals is measured in the number of mutations and/or recombinations needed to get from here to there. So your choice of mutation and recombination operators determines the fitness landscape.

On one hand, hill-climbing casues EA populations is to slowly cluster near the tops of these hills as they try to achieve maximum fitness. When most of the population's members are very close to one another (very few mutations or crossovers apart), their genes are very similar, they have much genetic material in common, and we say the population is not diverse. Hill-climbing is desired (we do want to maximize fitness after all), but only in moderation. If it happens too fast, it's easy for the whole population may become "stuck" on a small number of fitness hills that are not the highest in the solution space. Mathematically speaking, these are local optima.

On the other hand, when the population is diverse and spread out in the landscape, you may combine two "distant" parents to get a child somewhere in the middle, maybe on a new fitness hill. This allows for more fitness hills to be discovered, reducing the chance of getting stuck on a local optima.

(You may have noticed that in the ONE-MAX example, there are none of these. There's only one fitness hill, with the string of all 1s at the top. Its fitness landscape is a 20-dimensional hypercube. Mutation moves along one or more edges of the cube, and crossover moves to any vertex along the subcube induced by the parents. Non-trivial problems generally have fitness landscapes that are too complex to characterize.)

Here is how diversity is affected in general by the different operations:

The recombination-replacement process decreases diversity

When we reuse the good genes and throw out the bad genes, the population tends to approach a stale and homogenous state. Since we only reuse genetic material already existing in the population throw out some information, we can never create new genetic material.

Mutation is the only process that increases diversity

It's the only way new genetic material is introduced throughout the EA process. Without mutation, there would be no way to achieve gene configurations that aren't merely rearrangements of the initial population's genes. However, mutations that are too large or too frequent can make individuals "jump" between fitness hills. We want mutation to move individuals a small enough distance that they usually stay on the same fitness hill.

If you select and replace individuals completely at random (no bias towards fitness), it will take a long time for the population to become homogenous (but it will still happen eventually). On the other hand, if you always breed the most-fit individuals and always replace the least-fit (no randomness, 100% fitness bias), then only the best-fit individuals (and their presumably high-fitness children) keep producing children generation after generation. The high-fitness genetic material gets reused for breeding over and over again, while the lower-fitness genetic material never has a chance to be reused.

Population diversity is needed to ensure a many fitness hills are encountered. But eventually diversity must be sacrificed so that good solutions can "climb the hill" to maximize fitness.

Representation Matters, Too!

Mutation and recombination (and therefore the fitness landscape) rely on your choice of representation scheme. The representation should therefore make mutation and recombination behave like the biological concepts they represent. For instance, a small change in an individual's gene should make only a small to moderate change in its fitness characterstics. Likewise, combining parts of the genes of two individuals should produce an individual that shares some of its parents' characterstics. However, the result need not be merely an average of the parents; there may be synergy between different parts of the genes.

In solving difficult problems with EAs, finding a good representation scheme with good recombination and mutation operations can often be the hardest piece of the puzzle. There is no magic advice for choosing the "right" representation, and in addition to adhering to these guidelines, the choice must be feasible to implement.

Some Final Notes

EAs are a lot of fun, especially when you are modelling some real-world situation. Because at times EAs seem to work like magic, it's easy to forget that they have the same limitations (in terms of complexity) as any computer model. Do expect interesting results, but don't expect sentient digital lifeforms to crawl out of the primordial ooze of your 1s and 0s.

I think you'll enjoy working with evolutionary algorithms, as they're a bit of a departure from classical computation/analysis methods. Hopefully this guide will give you the background needed to have a lot of fun tinkering with EAs. Be creative and inventive, and happy evolving!