Shortcut to this page: ntrllog.netlify.app/ai2
Notes provided by Professor Armando Beltran (CSULA) and Introduction to Evolutionary Computing
This is an extension of the artificial intelligence and machine learning resources. Well, we’ll see where this goes — I’m writing this as the class goes.
As computer scientists, we aim to develop and use AI to solve problems optimally. But before we look at how AI works, we should start with understanding what types of problems exist and how to define them.
Some problems can be reduced to a "black box" model. The "black box" model has `3` components: an input, a model (algorithm or function), and an output. If one of the three components is unknown, then that is a problem we can define and solve.
If the input is unknown, then we have an optimization problem. As the name implies, optimization problems seek to maximize or minimize a result.
For example, we’re traveling and we want to visit a bunch of destinations while minimizing time traveled, i.e., we want an efficient route so that we don’t waste time. The model is a function that calculates the time it takes to go from one destination to another. The output is a number representing the total time traveled (we want to minimize this). The input — which is what we want to solve for — is a route.
Another example is the `8`-queens problem. We want to place `8` queens such that no queens are attacking each other. The model is a function that calculates how many queens are attacking each other. The output is the number of queens attacking each other (we want to minimize this). The input — which is what we want to solve for — is a configuration of `8` queens.
As mentioned earlier, optimization problems aim to maximize or minimize a result. That result is calculated by a function, namely an objective function. An objective function is a math expression that quantifies the quality of a potential solution to an optimization problem. (The quality of a solution is how well it solves the problem.)
In other words, an objective function assigns a numerical value to an input, and that numerical value is a measure of how good that input is as a solution to the problem.
If the model is unknown, then we have a modeling problem. In these types of problems, we have the inputs and outputs, and we want to know what type of model fits the data.
Let’s say we have these inputs and outputs:
x | y |
---|---|
`1` | `2` |
`2` | `4` |
`3` | `6` |
`4` | `8` |
`5` | `10` |
In this case, we can figure out that the model is the function `y = 2x`.
A real-world example is voice recognition. We have the inputs (a set of spoken text) and the outputs (a set of written text), and we want to build a model that can match the spoken text to the written text.
Modeling problems can be transformed into optimization problems by picking a model that works and then optimizing that model.
If the output is unknown, then we have a simulation problem. These types of problems are like exploring "what if" scenarios.
For example, predicting the weather is a simulation problem. The inputs are data points for things like humidity and temperature. The model is a function that takes in those data points and returns a weather prediction. "What if it is humid and hot?"
Honestly, simulation problems don’t really sound like "problems" at all. All we need to do is run the model on the inputs and get our answer.
On the other hand, for optimization and modeling problems, we have to search for something. For optimization problems, we have to find the input that gives us the desired output. For modeling problems, we have to find the model that gives us the desired output for the given inputs.
Formally, there is a search space, which is a set of objects of interest, including the desired solution. It can typically be represented as a search tree.
For the time-efficient traveling problem, the search space is a bunch of possible routes and we have to find one that is the most time-efficient. For the voice-recognition problem, the search space is a bunch of models and we have to find one that is the most accurate.
Typically, the search space for any problem will be very large, so we need an efficient way to explore the search space for the desired solution. A problem solver does this for us. A good problem solver goes through the search space and finds the optimal path through the search tree to the desired solution.
Some problems are inherently hard for computers to solve. This could be because it takes a really long time and/or a lot of memory is needed to find the solution. The difficulty of a problem can be classified into four categories:
These classifications (except for NP-hard) apply to decision problems: problems that have a "yes" or "no" answer. For example, "Is `A rarr B rarr C rarr D` the optimal route?"
The `n`-queens problem is an optimization problem that involves placing `n` queens on a chessboard such that none of the queens are attacking each other. For example, for `n=4`:
To formalize things, we can frame the problem statement as, "Given `n` queens, find a configuration for the queens on an `n xx n` chessboard that maximizes the number of pairs of non-attacking queens."
If we label the top-left corner as row `1`, column `1`, then we can represent a board configuration using coordinates. So for the example above, we have
`Q = {(2,1), (4,2), (1,3), (3,4)}`
The objective function `f` can be represented as
`max_Q f(Q) = (n(n-1))/2-A(Q)`
where `A(Q)` is the number of pairs of attacking queens for a board configuration `Q`.
`A(Q)` is the number of pairs of attacking queens (as opposed to the number of queens being attacked). In this example, there are `4` pairs of attacking queens.
`(n(n-1))/2` is the number of total possible pairs of queens. If we subtract the number of pairs of attacking queens from that expression, we get the number of pairs of non-attacking queens (which is what we're trying to maximize).
Now that we have formally defined the problem, we can start to look at how hard this problem is to solve. For optimization problems, the difficulty depends on the size of the search space. The more inputs there are to search through, the harder the problem is.
So how many possible board configurations are there? We'll first consider the `n=4` case.
For the first queen, there are `16` spaces for where it can be placed. For the second queen, there are `15` spaces. For the third queen, there are `14` spaces. For the last queen, there are `13` spaces. So there are a total of `16*15*14*13=43,680` possible board configurations.
In general, there are
`(n^2!)/((n^2-n)!)`
possible board configurations.
For a standard `8xx8` chessboard, there are `1.78 xx 10^(14)` possible board configurations. Yeah, that's a lot to search through.
Is it possible to make the search space smaller? Well, notice that if more than `1` queen is on the same row, then they're automatically attacking each other.
So any board configurations where there is more than `1` queen in the same row can be removed from the search space. We can achieve this by reframing the problem statement as, "Given `n` queens, find a configuration for the queens on an `n xx n` chessboard such that every row contains exactly one queen while maximizing the number of pairs of non-attacking queens." We're introducing a constraint on the original problem.
Also, instead of using coordinates to represent a board configuration, we can use a vector representation.
`Q = [Q_1, Q_2, ..., Q_n]^T`
where index `i` represents the row number and `Q_i` is the column number.
`Q = [3, 1, 4, 2]^T`
By using a vector, we force all possible board configurations to have exactly `1` queen in every row.
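To make the objective function concrete, here's a quick Python sketch (my own, not from the notes) that computes `A(Q)` and `f(Q)` for the vector representation:

```python
from itertools import combinations

def attacking_pairs(Q):
    """A(Q): the number of pairs of queens that attack each other.
    Q is a vector where index i is the row and Q[i] is the column."""
    pairs = 0
    for (r1, c1), (r2, c2) in combinations(enumerate(Q, start=1), 2):
        same_column = c1 == c2                         # rows are distinct by construction
        same_diagonal = abs(r1 - r2) == abs(c1 - c2)
        if same_column or same_diagonal:
            pairs += 1
    return pairs

def fitness(Q):
    """f(Q) = n(n-1)/2 - A(Q): the number of non-attacking pairs (we maximize this)."""
    n = len(Q)
    return n * (n - 1) // 2 - attacking_pairs(Q)

print(fitness([3, 1, 4, 2]))  # 6: all 6 pairs are non-attacking, a valid 4-queens solution
print(fitness([1, 2, 3, 4]))  # 0: every queen is on the same diagonal
```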
Let's see how much the search space has been reduced with this new constraint. Again, we'll first consider the `n=4` case.
The first queen can only be placed in the first row, and in the first row, there are only `4` spaces for it. The second queen can only be placed in the second row, and in the second row, there are only `4` spaces for it. The third queen can only be placed in the third row, and in the third row, there are only `4` spaces for it. The last queen can only be placed in the fourth row, and in the fourth row, there are only `4` spaces for it. So there are a total of `4*4*4*4=256` possible board configurations.
In general, there are
`n^n`
possible board configurations.
So we went from `43,680` to `256`, which is pretty good. For the `n=8` case, there are `16,777,216` possible board configurations, which is much less than `1.78 xx 10^(14)`, but still a lot.
We can apply the same row-constraint reasoning to the columns as well: if there is more than `1` queen on the same column, then they're automatically attacking each other.
So any board configurations where there is more than `1` queen in the same column can also be removed from the search space. We can achieve this by introducing another constraint: "Given `n` queens, find a configuration for the queens on an `n xx n` chessboard such that every row and column contains exactly one queen while maximizing the number of pairs of non-attacking queens."
Mathematically, for all `1 le i,j le n` with `i != j`, `Q_i != Q_j`.
Let's see how much the search space has been reduced with this new constraint. Again, we'll first consider the `n=4` case.
The first queen can only be placed in the first row, and in the first row, there are only `4` spaces for it. The second queen can only be placed in the second row, and in the second row, there are only `3` spaces for it. The third queen can only be placed in the third row, and in the third row, there are only `2` spaces for it. The last queen can only be placed in the fourth row, and in the fourth row, there is only `1` space for it. So there are a total of `4*3*2*1=24` possible board configurations.
In general, there are
`n!`
possible board configurations.
For the `n=8` case, there are now only `40,320` board configurations in the search space.
Introducing constraints on the problem made the `n`-queens problem much easier to solve. However, even with the constraints, it is still a hard problem to solve, especially when `n` gets larger. Here's a table with the size of the search space for each version of the problem.
`n` | free optimization | `1` queen per row | `1` queen per row and column |
---|---|---|---|
`4` | `43680` | `256` | `24` |
`8` | `1.78463 xx 10^14` | `16777216` | `40320` |
`16` | `2.10876 xx 10^38` | `1.84467 xx 10^19` | `2.09228 xx 10^13` |
`32` | `1.30932 xx 10^96` | `1.4615 xx 10^48` | `2.63131 xx 10^35` |
`64` | `9.4661 xx 10^230` | `3.9402 xx 10^115` | `1.26887 xx 10^89` |
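These numbers are easy to reproduce. Here's a quick sketch (assuming Python 3.8+, which has `math.perm`):

```python
import math

def free(n):                  # n ordered placements on n^2 squares: (n^2)! / (n^2 - n)!
    return math.perm(n * n, n)

def one_per_row(n):           # n column choices for each of the n rows
    return n ** n

def one_per_row_and_col(n):   # a permutation of the columns
    return math.factorial(n)

for n in (4, 8, 16):
    print(n, free(n), one_per_row(n), one_per_row_and_col(n))
```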
So why are we even looking at the `n`-queens problem in the first place? It looks like a trivial puzzle.
Well, it turns out that the `n`-queens problem has practical applications in things like task scheduling and computer resource management.
Going through each possible board configuration to find a valid one is not practical. It would be better to programmatically generate solutions.
One way of doing this is through evolutionary computing. Evolutionary computing is a research area within computer science that draws inspiration from evolution and natural selection.
Nature is apparently a good source of inspiration. The best problem solvers known in nature are the human brain (the basis of neurocomputing) and the evolutionary mechanism that created the human brain (the basis of evolutionary computing).
The metaphor here is that the problem is like the environment; a candidate solution is like an individual; and the quality of the candidate solution is like the individual's "fitness" level.
The population consists of candidate solutions and the best ones are selected for "reproduction". This produces new solutions (containing the traits from both of the parents), and those new solutions are modified slightly to maintain diversity. And this process repeats.
(image borrowed from my professor)
The population is initialized by selecting a random subset of the search space.
There are two inherent, competing forces in evolutionary algorithms. The process of recombination (reproduction) and mutation promotes diversity and represents a push towards novelty. The process of selecting the fittest parents or survivors decreases diversity and represents a push towards quality. By going back and forth between these two forces, evolutionary algorithms balance exploration and exploitation to find optimal results.
Balancing exploration and exploitation is crucial to avoid inefficiency or premature convergence. Premature convergence is the effect of losing population diversity too quickly and getting trapped in a local optimum.
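Putting it all together, here's a rough Python skeleton of the generic loop. It's only a sketch of the structure; the function arguments (`init_population`, `select_parents`, `recombine`, `mutate`, `select_survivors`) are placeholders for whichever operators get plugged in, not names from any particular library:

```python
def evolutionary_algorithm(init_population, fitness, select_parents,
                           recombine, mutate, select_survivors, generations=100):
    """Generic EA loop: selection pushes towards quality,
    recombination and mutation push towards novelty."""
    population = init_population()
    for _ in range(generations):
        parents = select_parents(population, fitness)             # push towards quality
        offspring = []
        for mom, dad in zip(parents[0::2], parents[1::2]):
            for child in recombine(mom, dad):                     # push towards novelty
                offspring.append(mutate(child))
        population = select_survivors(population, offspring, fitness)
    return max(population, key=fitness)
```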
Recall that earlier we were representing board configurations using coordinates and vectors (permutation encoding). In biological terms, what we were doing was mapping phenotypes (real-world solutions) to genotypes (encoded solutions).
A phenotype is the set of observable characteristics of an organism. A genotype is the complete set of genetic material of an organism.
Continuing with the biology theme, a solution is called a chromosome. A chromosome contains genes, which contain the values of the chromosome. The actual values are called alleles.
Loosely, genes are like the slots in a chromosome and the alleles are the values in those slots.
Once we have encoded representations of the inputs, we can programmatically assess their quality. This is where the objective function we saw earlier comes in.
In the phenotype space, the function is called an objective function. In the genotype space, the function is called a fitness function.
The fitness function assigns a numerical value (referred to as fitness value) to each phenotype so that the algorithm can know which ones to select for recombination. Typically, the "fittest" individuals have high fitness values, so we seek to maximize fitness.
Formally, a population is a multiset of candidate solutions. This means that repeated elements are allowed to exist in the population.
A population is considered to be the basic unit of evolution. So a population evolves, not single individuals.
Diversity of a population can refer to either fitness diversity or phenotype/genotype diversity.
The selection mechanism compares the fitness values of the solutions to decide which ones become parents.
Even though the general goal is to push the population towards higher fitness, the selection mechanism is usually probabilistic. Higher-quality solutions are more likely to be selected, but sub-optimal (or even the worst) solutions can still be picked.
Even though it might seem like it makes the most sense to just pick the solutions with the highest fitness values, it's sometimes necessary to not do that. Evolutionary algorithms can get stuck in a local maximum, and the only way to get out would be to pick sub-optimal solutions. Biologically, this is referred to as "genetic drift".
An example of a selection mechanism is a roulette wheel. Suppose we have three solutions `A, B, C`. `A` has a fitness of `3`; `B` has a fitness of `1`; `C` has a fitness of `2`. We can put them on a roulette wheel like so:
This way, `A`, which has the highest fitness value, has the highest chance of being selected.
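Here's a minimal sketch (mine) of roulette wheel selection using the fitness values from this example:

```python
import random

def roulette_wheel(individuals, fitnesses):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    spin = random.uniform(0, total)
    running = 0.0
    for individual, fit in zip(individuals, fitnesses):
        running += fit
        if spin <= running:
            return individual
    return individuals[-1]  # guard against floating-point edge cases

# A (fitness 3) gets picked about half the time, B (1) about 1/6, C (2) about 1/3.
print(roulette_wheel(["A", "B", "C"], [3, 1, 2]))
```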
The selection mechanism is also responsible for replacing the elements in the population with survivors.
While parent selection is usually probabilistic, survivor selection is usually deterministic. The process can be fitness based (compare the fitness values of the parents + offspring) or age based (replace as many parents as possible with offspring).
Once the parents are selected, the offspring need to be produced. Recombination is the process of merging information from the parents into the offspring. The choice of what information to merge is random, so it's possible that the offspring can be worse or the same as the parents.
The algorithm has to stop eventually. There are several conditions we can use to determine when to stop:
Variation operators are genetic information manipulators that generate new candidate solutions. Mutation and recombination are the typical variation operators that are used in evolutionary algorithms. Variation operators are necessary for introducing diversity into the population.
Suppose that we are encoding our solutions using strings of `1`s and `0`s (the genotype space is the set of strings of binary digits).
One problem we would do this for is the knapsack problem, in which we are deciding which items to put in a knapsack to maximize the total value of the items without exceeding the knapsack's capacity. A `1` would mean put the item in the knapsack and a `0` would mean don't.
One simple mutation we can do is to randomly flip bits. For example, `10100 rarr 00010`.
Formally, each bit has a probability `p_m` of flipping. `p_m` is referred to as the mutation rate. It's typically between `1/text(population size)` and `1/text(solution length)`.
Using a lower probability preserves high-fitness individuals while using a higher probability encourages exploration (but risks disrupting good solutions). Generally, if we want to optimize overall population fitness, then we use a lower probability. If we want to find a single high-fitness individual, then using a higher probability can be better.
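A minimal sketch of bit-flip mutation with mutation rate `p_m`:

```python
import random

def bit_flip_mutation(chromosome, p_m):
    """Flip each bit independently with probability p_m."""
    return [1 - bit if random.random() < p_m else bit for bit in chromosome]

print(bit_flip_mutation([1, 0, 1, 0, 0], p_m=1/5))  # e.g. [1, 0, 0, 0, 0]
```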
There's an interesting problem with using binary. Consider the numbers `6`, `7`, and `8`, which are `0110`, `0111`, and `1000` in binary respectively.
It takes only `1` bit flip to go from `0110` to `0111`, but it takes `4` bit flips to go from `0111` to `1000`. As a result, the probability of changing `7` to `8` is much less than the probability of changing `6` to `7`.
Gray coding, which is a variation on the binary representation such that consecutive numbers differ in only one bit, helps with this problem.
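The usual way to compute binary-reflected Gray code is to XOR `n` with `n` shifted right by one bit. A quick sketch:

```python
def to_gray(n):
    """Binary-reflected Gray code: consecutive integers differ in exactly one bit."""
    return n ^ (n >> 1)

for n in (6, 7, 8):
    print(n, format(to_gray(n), '04b'))  # 6 -> 0101, 7 -> 0100, 8 -> 1100
```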
One simple crossover technique we can do is to choose a random point and swap the parents' info at that point to produce the children.
`0` | `0` | `0` | `0` | `0` | `0` | `0` |
`1` | `1` | `1` | `1` | `1` | `1` | `1` |
Using the red line as the crossover point, we get
`0` | `0` | `0` | `0` | `1` | `1` | `1` |
`1` | `1` | `1` | `1` | `0` | `0` | `0` |
There is a positional bias when using a `1`-point crossover. Genes that are near each other are more likely to stay near each other and genes from opposite ends of the string can never be kept together.
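A sketch of `1`-point crossover, with the cut point chosen uniformly at random:

```python
import random

def one_point_crossover(p1, p2):
    """Cut both parents at the same random point and swap the tails."""
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

c1, c2 = one_point_crossover([0] * 7, [1] * 7)
print(c1, c2)  # e.g. [0, 0, 0, 0, 1, 1, 1] and [1, 1, 1, 1, 0, 0, 0]
```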
Instead of picking just `1` split point, we can pick `n` split points.
`0` | `0` | `0` | `0` | `0` | `0` | `0` | `0` | `0` | `0` | `0` | `0` | `0` |
`1` | `1` | `1` | `1` | `1` | `1` | `1` | `1` | `1` | `1` | `1` | `1` | `1` |
produces
`0` | `0` | `0` | `0` | `1` | `1` | `1` | `0` | `0` | `0` | `0` | `0` | `1` |
`1` | `1` | `1` | `1` | `0` | `0` | `0` | `1` | `1` | `1` | `1` | `1` | `0` |
Another type of crossover we can do is to randomly pick genes from each parent. For the first child, each gene has a `50%` chance to come from one of the parents. Then the second child is created using the inverse mapping, i.e., each of its genes comes from the other parent.
`0` | `1` | `0` | `0` | `1` | `0` | `1` | `1` | `0` | `0` | `0` | `1` | `0` | `1` | `1` | `0` | `0` | `1` |
`1` | `0` | `1` | `1` | `0` | `1` | `0` | `0` | `1` | `1` | `1` | `0` | `1` | `0` | `0` | `1` | `1` | `0` |
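A sketch of uniform crossover as described above:

```python
import random

def uniform_crossover(p1, p2):
    """Each gene of the first child comes from either parent with probability 0.5;
    the second child takes the corresponding gene from the other parent."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if random.random() < 0.5:
            c1.append(g1); c2.append(g2)
        else:
            c1.append(g2); c2.append(g1)
    return c1, c2

print(uniform_crossover([0] * 8, [1] * 8))
```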
Now let's look at variation operators for solutions that are encoded using more numbers than just `1` and `0`.
An example of a problem where we would use this type of encoding is graph coloring, where the goal is to color each vertex so that no two vertices share the same color. We would use numbers to represent each color.
For each gene, we could add a small (positive or negative) value with some probability `p`. We could also replace the gene with a random value with some probability `p`.
The same recombination techniques we saw for the binary representation can be used for the integer representation as well.
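Sketches of the two integer mutations mentioned above (creep mutation and random resetting); the value range `0` to `9` is just for illustration:

```python
import random

def creep_mutation(chromosome, p, step=1, low=0, high=9):
    """Add a small positive or negative value to each gene with probability p."""
    return [min(high, max(low, g + random.choice([-step, step])))
            if random.random() < p else g
            for g in chromosome]

def random_resetting(chromosome, p, values=range(10)):
    """Replace each gene with a random permissible value with probability p."""
    return [random.choice(list(values)) if random.random() < p else g
            for g in chromosome]

print(creep_mutation([3, 3, 3, 3], p=0.5))
print(random_resetting([3, 3, 3, 3], p=0.5))
```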
Some problems, like the `n`-queens problem, require that solutions do not contain repeated values. For these types of problems, the order in which events occur matters a lot, like planning a route or scheduling jobs.
We can choose `2` alleles and swap them.
`1` | `3` | `5` | `4` | `2` | `8` | `7` | `6` |
`darr`
`1` | `3` | `7` | `4` | `2` | `8` | `5` | `6` |
We can choose `2` alleles and move the second one so that it immediately follows the first one, shifting the rest along.
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` |
`darr`
`1` | `2` | `5` | `3` | `4` | `6` | `7` | `8` |
We can pick a subset of genes and randomly rearrange the alleles in those positions.
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` |
`darr`
`1` | `3` | `5` | `4` | `2` | `6` | `7` | `8` |
These types of mutations (swap, insert, scramble) are preferred for order-based problems, i.e., problems where the order of events is important.
We can pick `2` alleles and invert the substring between them.
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` |
`darr`
`1` | `5` | `4` | `3` | `2` | `6` | `7` | `8` |
This preserves most of the adjacency information (since it only breaks `2` links) but it is disruptive for the order information.
When performing mutations on permutations, we can't consider each gene independently since it might mess up the permutation. So the mutation parameter is the probability that a chromosome undergoes mutation, instead of the probability that each gene undergoes mutation, like in the binary and integer case.
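Here are quick sketches of the four permutation mutations (swap, insert, scramble, inversion). Each one returns a new permutation rather than modifying the original:

```python
import random

def swap_mutation(perm):
    p = perm[:]
    i, j = random.sample(range(len(p)), 2)
    p[i], p[j] = p[j], p[i]                       # swap two alleles
    return p

def insert_mutation(perm):
    p = perm[:]
    i, j = sorted(random.sample(range(len(p)), 2))
    p.insert(i + 1, p.pop(j))                     # move the second allele next to the first
    return p

def scramble_mutation(perm):
    p = perm[:]
    i, j = sorted(random.sample(range(len(p)), 2))
    segment = p[i:j + 1]
    random.shuffle(segment)                       # randomly rearrange the chosen segment
    return p[:i] + segment + p[j + 1:]

def inversion_mutation(perm):
    p = perm[:]
    i, j = sorted(random.sample(range(len(p)), 2))
    return p[:i] + p[i:j + 1][::-1] + p[j + 1:]   # reverse the chosen segment

print(swap_mutation([1, 2, 3, 4, 5, 6, 7, 8]))
```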
Doing a simple crossover for permutations isn't straightforward. A simple substring swap (like the crossover implied in the `8`-queens example) will most likely not preserve the permutation property.
Suppose we have two parents
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` | `9` |
`9` | `3` | `7` | `8` | `2` | `6` | `5` | `1` | `4` |
The first step of PMX is to choose `2` crossover points and copy the segment between them from the first parent into the first child.
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` | `9` |
`9` | `3` | `7` | `8` | `2` | `6` | `5` | `1` | `4` |
First child:
- | - | - | `4` | `5` | `6` | `7` | - | - |
The second step is to look at the segment in the second parent and see which values weren't copied over to the first child. In this example, `8` and `2`.
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` | `9` |
`9` | `3` | `7` | `8` | `2` | `6` | `5` | `1` | `4` |
What we do with these uncopied values is best explained with a visual.
The last step is to copy the remaining values from the second parent into the same positions in the first child.
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` | `9` |
`9` | `3` | `7` | `8` | `2` | `6` | `5` | `1` | `4` |
First child:
`9` | `3` | `2` | `4` | `5` | `6` | `7` | `1` | `8` |
The second child is produced the same way, just with the parent roles reversed (so the second parent's segment is copied into the second child as the first step).
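Here's a sketch of PMX that reproduces the example above. The cut points are passed in explicitly so the output matches; normally they'd be chosen at random:

```python
def pmx(parent1, parent2, cut1, cut2):
    """Partially mapped crossover; returns one child (swap the parents for the other child)."""
    n = len(parent1)
    child = [None] * n
    child[cut1:cut2] = parent1[cut1:cut2]          # step 1: copy the segment from parent 1

    for i in range(cut1, cut2):                    # step 2: place parent 2's leftover segment values
        value = parent2[i]
        if value in child[cut1:cut2]:
            continue
        pos = i
        while child[pos] is not None:              # follow the mapping until a free position is found
            pos = parent2.index(parent1[pos])
        child[pos] = value

    for i in range(n):                             # step 3: fill the rest straight from parent 2
        if child[i] is None:
            child[i] = parent2[i]
    return child

p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
print(pmx(p1, p2, 3, 7))  # [9, 3, 2, 4, 5, 6, 7, 1, 8]
```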
The first step of the order `1` crossover is the same as the first step of PMX: choose `2` crossover points and copy the segment between them from the first parent into the first child.
`1` | `2` | `3` | `4` | `5` | `6` | `7` | `8` | `9` |
`9` | `3` | `7` | `8` | `2` | `6` | `5` | `1` | `4` |
First child:
- | - | - | `4` | `5` | `6` | `7` | - | - |
Then we copy the values from the second parent that weren't copied over yet, starting from the right of the rightmost crossover point and wrapping around to the beginning when we reach the end.
First child:
`3` | `8` | `2` | `4` | `5` | `6` | `7` | `1` | `9` |
The second child is produced the same way, just with the parent roles reversed (so the second parent's segment is copied into the second child as the first step).
The goal of order crossover is to transmit information about relative order from the second parent.
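A sketch of order `1` crossover, again with the cut points fixed so it reproduces the example above:

```python
def order_crossover(parent1, parent2, cut1, cut2):
    """Copy a segment from parent 1, then fill the remaining positions with parent 2's
    leftover values in the order they appear, starting after the segment and wrapping around."""
    n = len(parent1)
    child = [None] * n
    child[cut1:cut2] = parent1[cut1:cut2]
    segment = set(parent1[cut1:cut2])
    remaining = [v for v in parent2[cut2:] + parent2[:cut2] if v not in segment]
    positions = list(range(cut2, n)) + list(range(0, cut1))
    for pos, value in zip(positions, remaining):
        child[pos] = value
    return child

p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
print(order_crossover(p1, p2, 3, 7))  # [3, 8, 2, 4, 5, 6, 7, 1, 9]
```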
For this crossover technique, we copy cycles from the parents into the children. Cycles are found using the same visual method as in PMX.
Cycle 1:
I (obviously) reused the same image and forgot to remove the "Child `1`".
Cycle 2:
I (obviously) reused the same image and forgot to remove the "Child `1`".
So we take the values involved in the first cycle and copy them into the children in their respective positions.
First child:
`1` | - | - | `4` | - | - | - | `8` | `9` |
Second child:
`9` | - | - | `8` | - | - | - | `1` | `4` |
Then we take the values involved in the second cycle and copy them into the children, but in the opposite positions (so the values from the first parent go into the second child and the values from the second parent go into the first child).
First child:
`1` | `3` | `7` | `4` | `2` | - | `5` | `8` | `9` |
Second child:
`9` | `2` | `3` | `8` | `5` | - | `7` | `1` | `4` |
Then we take the values involved in the third cycle and copy them into the children in their respective positions.
First child:
`1` | `3` | `7` | `4` | `2` | `6` | `5` | `8` | `9` |
Second child:
`9` | `2` | `3` | `8` | `5` | `6` | `7` | `1` | `4` |
And we keep repeating this alternating pattern until we're done.
The goal of cycle crossover is to transmit information about the absolute position in which elements occur.
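A sketch of cycle crossover that reproduces the example above:

```python
def cycle_crossover(parent1, parent2):
    """Find the cycles, then copy alternate cycles from alternate parents so that every
    value keeps the absolute position it had in one of the parents."""
    n = len(parent1)
    child1, child2 = [None] * n, [None] * n
    cycle = 0
    for start in range(n):
        if child1[start] is not None:
            continue
        positions, pos = [], start
        while pos not in positions:                 # trace one cycle of positions
            positions.append(pos)
            pos = parent1.index(parent2[pos])
        # even-numbered cycles keep the parents' values in place, odd-numbered cycles swap them
        src1, src2 = (parent1, parent2) if cycle % 2 == 0 else (parent2, parent1)
        for pos in positions:
            child1[pos] = src1[pos]
            child2[pos] = src2[pos]
        cycle += 1
    return child1, child2

p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
print(cycle_crossover(p1, p2))
# ([1, 3, 7, 4, 2, 6, 5, 8, 9], [9, 2, 3, 8, 5, 6, 7, 1, 4])
```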
A solution is represented as a vector `[x_1, x_2, ..., x_k]^T` where each `x_i in RR`.
We could replace each element with a randomly-selected value from some predefined range of values. (This is similar to bit flipping in the binary representation and random resetting in the integer representation.)
Uniform mutation is most suitable when the genes encode cardinal attributes, since all values are equally likely to be chosen.
The most common method is to add a random delta to each gene, where the delta is taken from a Gaussian distribution with mean `0` and standard deviation `sigma`.
`x_i' = x_i + N(0, sigma)`
`sigma` is the standard deviation (also referred to here as the mutation step size) and it controls how big the deltas are.
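A minimal sketch of this Gaussian mutation:

```python
import random

def gaussian_mutation(x, sigma):
    """Add N(0, sigma) noise to each gene of a real-valued vector."""
    return [xi + random.gauss(0, sigma) for xi in x]

print(gaussian_mutation([1.0, 2.0, 3.0], sigma=0.1))  # e.g. [1.03, 1.92, 3.11]
```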
A simple crossover technique is to take an allele from one of the parents with equal probability.
Since we're dealing with real numbers, we can create children that are "between" the parents. For two parents `x,y`, we can create a child `z` using
`z_i = alpha x_i + (1-alpha) y_i`
for `0 le alpha le 1`
where `alpha` can be constant, variable, or picked randomly every time.
We can do this for only one gene:
`[x_1, ..., x_(k-1), alpha*y_k+(1-alpha)*x_k, x_(k+1), ..., x_n]`
(exact copy of parent `x` except for the `k^(th)` gene)
For several genes:
`[x_1, ..., x_k, alpha*y_(k+1)+(1-alpha)*x_(k+1),...,alpha*y_n+(1-alpha)*x_n]`
(exact copy of parent `x` except for genes `k+1` to `n`)
Or for all the genes:
`z=alpha*x+(1-alpha)*y`
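A sketch of the whole arithmetic case (with `alpha = 0.5`, the child is just the average of the parents):

```python
def whole_arithmetic_crossover(x, y, alpha=0.5):
    """z_i = alpha*x_i + (1-alpha)*y_i for every gene: the child lies 'between' the parents."""
    return [alpha * xi + (1 - alpha) * yi for xi, yi in zip(x, y)]

print(whole_arithmetic_crossover([0.1, 0.5, 0.9], [0.3, 0.7, 0.1]))  # roughly [0.2, 0.6, 0.5]
```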
Crossover is explorative and mutation is exploitative.
Only crossovers can combine information from two parents.
Only mutation can introduce new information.
Hitting the optimum usually requires a lucky mutation.
Selection operators are used for population management. They decide which individuals in the population are selected for mating (parent selection) and which individuals in the population survive (survivor selection).
Selection pressure is the intensity with which fitter individuals are favored over worse individuals. A higher selection pressure means that fitter individuals are more favored, and a lower selection pressure means that worse individuals have a higher chance of being selected.
For parent selection, it's natural to want to use higher-fitness individuals for mating. But, as mentioned before, using only high-fitness individuals can have problems with getting stuck at local optima.
Fitness proportional selection is a probabilistic method that selects individuals for mating by their absolute fitness values. Individuals with a higher fitness have a higher chance of being selected. And since we're looking at absolute fitness values, individuals with much higher fitness have a much higher chance of being selected.
`P_(FPS)(i)=f_i/(sum_(j=1)^(mu)f_j)`
For example, suppose we have some individuals `x=1, x=2, x=3` with their fitness values given below.
`x` | fitness |
---|---|
`1` | `1` |
`2` | `4` |
`3` | `9` |
We could simulate fitness proportional selection by making copies of each individual based on their fitness values. This way, if we select one copy, the probability that the copy is going to have a high fitness value is high. Individual `x=3` has a `9/14` chance of being selected versus individual `x=1`, which only has a `1/14` chance of being selected.
`1` | `2` | `2` | `2` | `2` | `3` | `3` | `3` | `3` | `3` | `3` | `3` | `3` | `3` |
One of the disadvantages of using fitness proportional selection is that it could lead to premature convergence. Since higher-fitness individuals have a significantly higher chance of being selected, the population could quickly lose diversity and converge to a single solution.
Another disadvantage of fitness proportional selection is that it is highly susceptible to function transposition. Consider the following fitness values and selection probabilities.
Individual | Fitness for `f` | Selection probability for `f` | Fitness for `f+10` | Selection probability for `f+10` | Fitness for `f+100` | Selection probability for `f+100` |
---|---|---|---|---|---|---|
`A` | `1` | `0.1` | `11` | `0.275` | `101` | `0.326` |
`B` | `4` | `0.4` | `14` | `0.35` | `104` | `0.335` |
`C` | `5` | `0.5` | `15` | `0.375` | `105` | `0.339` |
Sum | `10` | `1.0` | `40` | `1.0` | `310` | `1.0` |
We can see the selection probabilities don't stay proportional to the fitness values as we change the fitness values. `C` is `4` points higher than `A`, and in the simple case of `f`, has a much higher chance of being selected. In the case of `f+100`, `C` is still `4` points higher than `A`, but `C` has virtually the same chance of being selected as `A`.
One way of dealing with this is to use windowing. Windowing is a method that adjusts the fitness values dynamically to maintain the fitness differentials.
Let `beta^t` be the worst fitness in the `t^(th)` generation. We adjust the fitness values every generation by using
`f'(i)=f(i)-beta^t`
This essentially makes it so that the worst individual has a zero chance of being picked while still allowing high-fitness individuals to maintain their high probability.
Sigma scaling adjusts the fitness values by incorporating the population variance. Let `bar f` be the mean of the fitness values, `sigma_f` be the standard deviation, and `c` be some constant value (usually `2`). We adjust the fitness values every generation by using
`f'(i)=max(0, f(i)-(bar f - c*sigma_f))`
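Quick sketches of both adjustments, using Python's `statistics` module:

```python
from statistics import mean, pstdev

def windowing(fitnesses):
    """f'(i) = f(i) - (worst fitness in the current generation)."""
    worst = min(fitnesses)
    return [f - worst for f in fitnesses]

def sigma_scaling(fitnesses, c=2):
    """f'(i) = max(0, f(i) - (mean - c * standard deviation))."""
    baseline = mean(fitnesses) - c * pstdev(fitnesses)
    return [max(0, f - baseline) for f in fitnesses]

print(windowing([1, 4, 9]))      # [0, 3, 8]
print(sigma_scaling([1, 4, 9]))  # roughly [2.93, 5.93, 10.93]
```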
Of course, we could avoid the problems of fitness proportional selection by using something else entirely. Instead of looking at absolute fitness, we could look at relative fitness.
For example, in the earlier example, `x=1` had a `1/14` chance of being selected and `x=3` had a `9/14` chance of being selected. Looking at absolute fitness, `x=3` is `9` times more likely to be selected than `x=1`.
If we want to look at relative fitness, let's assign some rank to these individuals, say rank `2` to `x=3`, rank `1` to `x=2`, and rank `0` to `x=1` (so the higher the rank, the higher the fitness). Assigning a rank allows us to know which individual is the fittest without being biased by how much fitter an individual is.
`x` | fitness | rank |
---|---|---|
`1` | `1` | `0` |
`2` | `4` | `1` |
`3` | `9` | `2` |
Ranking introduces a sorting overhead, but this is usually negligible compared to the overhead of computing fitness values.
`P(i)=(2-s)/mu+(2i(s-1))/(mu(mu-1))`
`s` is the selection pressure, and it ranges from `1 lt s le 2`. Using a higher selection pressure favors fitter individuals.
Individual | Fitness | Rank | `P_(FPS)` | `P_(LR)` for `s=2` | `P_(LR)` for `s=1.5` |
---|---|---|---|---|---|
`A` | `1` | `0` | `0.1` | `0` | `0.167` |
`B` | `4` | `1` | `0.4` | `0.33` | `0.33` |
`C` | `5` | `2` | `0.5` | `0.67` | `0.5` |
Sum | `10` | - | `1.0` | `1.0` | `1.0` |
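The `P_(LR)` columns in the table come straight from the formula; a quick sketch that reproduces them:

```python
def linear_ranking_probability(i, mu, s):
    """Selection probability for an individual with rank i (0 = worst) in a population of size mu."""
    return (2 - s) / mu + (2 * i * (s - 1)) / (mu * (mu - 1))

for s in (2, 1.5):
    print(s, [round(linear_ranking_probability(i, mu=3, s=s), 3) for i in range(3)])
# 2   -> [0.0, 0.333, 0.667]
# 1.5 -> [0.167, 0.333, 0.5]
```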
One limitation with linear ranking is that the selection pressure is limited. It can only take on values between `1` and `2`*. So the fittest individual only has at most a `2` times higher chance of being selected.
*Having `s gt 2` would lead to negative probabilities for less fit individuals since all the probabilities have to add up to `1`.
`P(i)=(1-e^(-i))/c`
where `c` is a normalization constant that ensures the sum of all the probabilities add up to `1`.
Exponential ranking is preferred over linear ranking when we want the fittest individuals to have more chances of being selected. This is helpful in situations where it is really costly to compute fitness values, so we only want to spend resources calculating fitness values for individuals that are worth calculating.
If the population is very large, then it can be very expensive to keep computing the fitness of the whole population. Instead, we can pick `k` members at random and then select the best individuals from that sample.
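A minimal sketch of tournament selection; this picks one parent, so it gets repeated as many times as parents are needed:

```python
import random

def tournament_selection(population, fitness, k=2):
    """Pick k individuals uniformly at random and return the best of them."""
    contestants = random.sample(population, k)
    return max(contestants, key=fitness)

print(tournament_selection([1, 2, 3, 4, 5], fitness=lambda x: x ** 2, k=3))
```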
The probability of selecting an individual `i` is based on several factors:
When the selected parents produce offspring, they have to replace individuals in the population to maintain the population size.
We'll use `mu` for the population size and `lambda` for the number of offspring.
In the generational model, the whole population is replaced (assuming two parents produce two children). This results in a faster exploration of the search space. But it is also more likely to remove good traits from the population that are beneficial for producing good solutions.
For each generation, `lambda=mu`.
In the steady-state model, only some of the population is replaced. The proportion of the population that is replaced is called the generational gap (represented by `lambda/mu`).
For each generation, `lambda lt mu`.
The selection methods for selecting individuals to be replaced can be age-based or fitness-based. Age-based methods are not interesting, so we'll only look at fitness-based ones.
As the name suggests, the fittest individuals remain in the population while less fit individuals are replaced. This ensures that there is always at least one copy of the fittest solution in the population.
As the name suggests, the worst members of the population are selected for replacement. To be honest, I still don't understand the difference between elitism and GENITOR. (As far as I can tell, elitism only guarantees that the current fittest individual is never dropped, while GENITOR always deletes the worst members each step, whether or not the best one is in danger.)
Here, the offspring is merged with the population and they duke it out with each other to see who wins. Each individual is evaluated against `q` other randomly chosen individuals. The individuals with the most wins stay in the population.
When `q` is large, the selection pressure is higher since stronger individuals are more likely to survive. `q` is typically `10`.
From what I understand, `mu` parents generate `lambda` offspring, and only the best `mu` children are selected to replace the whole population. (This requires `lambda gt mu`.)
Here, the offspring is merged with the population and the best `mu` individuals survive.
As mentioned earlier, the idea behind selection pressure is that fitter individuals are more likely to survive or be chosen as parents.
Selection pressure can be quantified by a measure called takeover time (denoted as `tau^(ast)`). Takeover time is the number of generations it takes for the fittest individual to take over the entire population.
The general formula for takeover time is
`tau^(ast)=ln(lambda)/ln(lambda/mu)`
For example, for `mu=15` and `lambda=100`, `tau^(ast)~~2`, which means that in `2` generations, every individual will be the same (the best individual).
For fitness proportional selection, the takeover time is
`tau^(ast)=lambda ln(lambda)`
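A quick check of the `mu=15`, `lambda=100` example with the general formula:

```python
import math

def takeover_time(mu, lam):
    """tau* = ln(lambda) / ln(lambda / mu)."""
    return math.log(lam) / math.log(lam / mu)

print(round(takeover_time(15, 100), 2))  # about 2.43, i.e. roughly 2 generations
```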
As we've seen, there are many choices for which representation, selection operators, and variation operators to use. The set of choices that are used defines an "instance" of an evolutionary algorithm.
Historically, many different types of evolutionary algorithms have been developed.
The genetic algorithm (GA) was developed in the 1960s by Holland in his study of adaptive behavior.
Representation | Bit-strings |
Recombination | 1-Point crossover |
Mutation | Bit flip |
Parent selection | Fitness proportional - implemented by Roulette Wheel |
Survivor selection | Generational |
The mutation probability is typically low since there is more emphasis on recombination as the preferred way of generating offspring.
GA is typically used for straightforward binary representation problems, function optimization, and teaching evolutionary algorithms.
Evolution strategies (ES) were developed in the 1960s by Rechenberg and Schwefel in their study of shape optimization.
Representation | Real-valued vectors |
Recombination | Discrete or intermediary |
Mutation | Gaussian perturbation |
Parent selection | Uniform random |
Survivor selection | Deterministic elitist replacement by `(mu, lambda)` or `(mu+lambda)` |
Specialty | Self-adaptation of mutation step sizes |
Self-adaptation means that the mutation step size changes from generation to generation. This is implemented by Rechenberg's `1//5` success rule: if more than `1//5` of recent mutations were "successful" (i.e., produced an offspring fitter than its parent), `sigma` is increased so that bigger steps are taken; if fewer than `1//5` were successful, `sigma` is decreased.
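A sketch of the rule itself; the constant `c` here is an assumption on my part (values somewhere around `0.8` to `0.9` are typical):

```python
def one_fifth_rule(sigma, success_rate, c=0.85):
    """Rechenberg's 1/5 success rule for adapting the mutation step size."""
    if success_rate > 1 / 5:
        return sigma / c    # mutations are succeeding often: take bigger steps
    if success_rate < 1 / 5:
        return sigma * c    # mutations are mostly failing: take smaller steps
    return sigma
```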
Here are some different types of recombination methods. `i` is the index of the gene, `x` and `y` are the parents, and `z` is the child.
 | Two fixed parents | Two parents selected for each `i` |
---|---|---|
`z_i=(x_i+y_i)/2` | local intermediary | global intermediary |
`z_i` is `x_i` or `y_i` chosen randomly | local discrete | global discrete |
ESs typically use global recombination.
`(mu,lambda)` selection is generally preferred over `(mu+lambda)` selection because `(mu+lambda)` has a tendency to keep bad solutions (and the traits that produce the bad solutions) around for a large number of generations.
ESs are typically used for numerical optimization and shape optimization.
Evolutionary programming (EP) was developed in the 1960s by Fogel et al. in their study of generating artificial intelligence (by attempting to simulate evolution as a learning process).
Representation | Real-valued vectors |
Recombination | None |
Mutation | Gaussian perturbation |
Parent selection | Deterministic (each parent creates one offspring via mutation) |
Survivor selection | Probabilistic `(mu+mu)` |
Specialty | Self-adaptation of mutation step sizes |
The notable thing about EP is that there is no recombination. This is because each individual is seen as a distinct species, so it doesn't make sense theoretically to perform recombination. This also means that only `1` parent is needed to produce a child, so there's not really a parent selection mechanism.
This has, of course, led to much debate and study on the performance of algorithms with and without recombination.
Survivor selection is implemented using stochastic round-robin tournaments.
EP is typically used for numerical optimization.
Genetic programming (GP) was developed in the 1990s by Koza.
Representation | Tree structures |
Recombination | Exchange of subtrees |
Mutation | Random change in trees |
Parent selection | Fitness proportional |
Survivor selection | Generational replacement |
Similarly to GA, the mutation probability is low for GP. In fact, Koza suggested using a mutation rate of `0%`. The idea behind not needing mutation is that recombination already has a large shuffling effect, so it's kind of a mutation operator already.
For large population sizes, over-selection is a method that is used for parent selection.
The population is ranked by fitness, then split into two groups. One group contains the top `x%` and the other contains the remaining `(100-x)%`. `80%` of the selection operations come from the first group, and the other `20%` come from the other group.
GP is typically used for machine learning tasks, like prediction and classification.
Differential evolution (DE) was developed in 1995 by Storn and Price.
Representation | Real-valued vectors |
Recombination | Uniform crossover (with crossover probability `C_r`) |
Mutation | Differential mutation |
Parent selection | Uniform random selection of the `3` necessary vectors |
Survivor selection | Deterministic elitist replacement (parent vs. child) |
The notable thing about DE is that `3` parents are needed to create a new child. We can see why when we look at how differential mutation works. Basically, the vector difference of two parents is added to another parent to produce the child.
Let `x,y,z` be the parents. The scaled vector difference of `y` and `z` is denoted as `p`, the perturbation vector. The child produced is denoted `x'`. `F` is the scaling factor that controls the rate at which the population evolves.
`p = F*(y-z)`
`x' = x + p`
For recombination, uniform crossover is used, meaning that each gene either comes from the first parent or the second parent. But there is a slight difference from the usual uniform crossover; there is a crossover probability `C_r in [0,1]` that decides the probability of picking the gene from the first parent.
DE is typically applied to nonlinear and nondifferentiable continuous space functions.
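A stripped-down sketch of building one DE trial vector (it skips the usual detail that at least one gene is forced to come from the mutant):

```python
import random

def de_mutant(x, y, z, F=0.8):
    """Differential mutation: x' = x + F * (y - z)."""
    return [xi + F * (yi - zi) for xi, yi, zi in zip(x, y, z)]

def de_crossover(target, mutant, Cr=0.9):
    """Uniform (binomial) crossover: each gene comes from the mutant with probability Cr."""
    return [m if random.random() < Cr else t for t, m in zip(target, mutant)]

x, y, z = [1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [1.0, 1.0, 1.0]
print(de_crossover(x, de_mutant(x, y, z)))
```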
Each evolutionary algorithm variant differs in their choices for which operators to use. These choices are the parameters of the algorithm. There are two categories of parameters: symbolic parameters and numeric parameters.
Symbolic parameters are the categorical aspects of the algorithm. For example, what type of parent selection should be used? Or what type of recombination should be used?
Numeric parameters are the, well, numeric aspects of the algorithm. For example, what should the population size be? Or what should be the mutation probability?
Here's an example with three evolutionary algorithm instances `A_1`, `A_2`, and `A_3`.
 | `A_1` | `A_2` | `A_3` |
---|---|---|---|
symbolic parameters | |||
representation | bitstring | bitstring | real-valued |
recombination | `1`-point | `1`-point | averaging |
mutation | bit-flip | bit-flip | Gaussian `N(0,sigma)` |
parent selection | tournament | tournament | uniform random |
survivor selection | generational | generational | `(mu,lambda)` |
numeric parameters | |||
`p_m` | `0.01` | `0.1` | `0.05` |
`sigma` | n.a. | n.a. | `0.1` |
`p_c` | `0.5` | `0.7` | `0.7` |
`mu` | `100` | `100` | `10` |
`lambda` | equal to `mu` | equal to `mu` | `70` |
`k` | `2` | `4` | n.a. |
More formally, an evolutionary algorithm instance is a set of symbolic and numeric parameters.
`A_i=(S,N)`
`S={text(representation), text(recombination), text(mutation), text(parent selection), text(survivor selection)}`
`N={p_m, sigma, p_c, mu, lambda, k}`
where
For two instances `A_i=(S_i,N_i)` and `A_j=(S_j,N_j)`, they are considered distinct evolutionary algorithms if at least one of the symbolic parameters is different.
`A_i ne A_j` if `S_i ne S_j`
They are considered variants of each other if they have the same symbolic parameters but different numeric parameters.
`A_i ~ A_j` if `S_i = S_j` and `N_i ne N_j`
In the above example, `A_1` and `A_2` are variants of each other and `A_1` and `A_3` are distinct algorithms.
Abstractly, there are three layers for an evolutionary algorithm: the design layer, the algorithm layer, and the application layer. The design layer involves choosing the parameters to create the algorithm instance. I think the algorithm layer involves defining what type of evolutionary algorithm it is, like identifying it as a genetic algorithm for example. The application layer involves running the algorithm on a real-world problem.
With this, there are essentially two problems to solve: finding a solution to a real-world problem (optimization problem) and finding a set of parameters to design an algorithm to find those solutions (design problem). As a result, there are two types of quality assessments: one at the algorithm layer and one at the application layer.
At the algorithm layer, we have utility and testing. Utility is the measure of an EA's performance (e.g., number of fitness evaluations, algorithm runtime). Testing is the process of calculating and comparing utilities.
At the application layer, we have fitness and evaluation. Fitness is the measure of a solution's quality. Evaluation is the process of calculating and comparing fitness values.
Parameter tuning refers to deciding what the parameters are before the algorithm is run.
Parameter control refers to adjusting the parameters while the algorithm is running. It can be deterministic (e.g., depending on how much time has passed), adaptive (changing based on feedback from the search results), or self-adaptive (the parameters are encoded into the chromosomes themselves and evolve along with the solutions).
A tuner is a search algorithm that looks for the best set of parameters. In order to test a tuner's performance, we need a utility function that assigns a performance score to the tuner. Some measures that could be used for utility are mean best fitness and average number of evaluations to a solution.
The general flow for tuning is to generate sets of parameters and test each set. Since EAs are stochastic, the testing needs to be repeated several times. (Depending on the type of tuner -- iterative or non-iterative -- the generation of the parameters can happen once.)
Utility is defined formally as
`U(A_i)=bbb E(M(A_i,P))`
where
Utility is estimated by the testing process.
`hat U(A_i)=1/n sum_(j=1)^n M(A_i^((j)),P)`
where `n` is the number of runs and `A_i^((j))` is the algorithm used on the `j^(th)` run.
So far, we've been dealing with problems where the goal is to optimize only one objective function. But in the real world, most problems have multiple objectives that need to be minimized or maximized. For example, when buying a car, we want to find a nice balance between price and reliability.
Formally, the general multiobjective optimization problem (MOP) is defined as finding the decision variable vector
`bb x = [x_1, x_2, ..., x_n]^T`
that satisfies a set of constraints and optimizes a set of objective functions
`bb f(bb x) = [f_1(bb x), f_2(bb x), ..., f_k(bb x)]^T`
Suppose we have two objective functions `f_1` and `f_2`, and we want to minimize both of them. `a,b,c,d,e` are solutions.
From this we can see that
This introduces the concept of dominance. We say that `a` dominates `b` and `c`, and `e` dominates `a`.
Formally, a solution `x` dominates a solution `y` if
`x` dominates `y` is written as
`x preceq y` for minimization
`x succeq y` for maximization
This is generally what it looks like for the minimization case:
It's rare (at least I imagine it is) for there to exist one solution that optimizes all objective functions. More likely, there will be a set of solutions that are all fairly good, but they each have their own benefits and tradeoffs. These are referred to as Pareto optimal solutions.
In the example below, the green solutions are the Pareto optimal solutions. They're better than all the other solutions, but each Pareto optimal solution is not definitively better than any other Pareto optimal solution.
Formally, a solution `x` is non-dominated if no solution dominates `x`. A set of non-dominated solutions is the Pareto optimal set. The image (in the context of functions) of the Pareto optimal set is the Pareto optimal front.
The set of Pareto optimal solutions is defined as
`cc P^(ast) = {x in Omega | not EE x' in Omega text( ) f(x') preceq f(x)}`
(there does not exist a solution that dominates a Pareto optimal solution)
The Pareto front is defined as
`cc P cc F^(ast) = {(f_1(x), ..., f_k(x)) | x in cc P^(ast)}`
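A sketch of Pareto dominance and non-dominated filtering for the minimization case:

```python
def dominates(fx, fy):
    """fx dominates fy (minimization): no worse in every objective, strictly better in at least one."""
    return all(a <= b for a, b in zip(fx, fy)) and any(a < b for a, b in zip(fx, fy))

def pareto_front(points):
    """Keep only the objective vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

print(pareto_front([(1, 5), (2, 2), (3, 4), (4, 1), (5, 5)]))  # [(1, 5), (2, 2), (4, 1)]
```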
NSGA stands for Nondominated Sorting Genetic Algorithm. It is an algorithm that can find multiple Pareto optimal solutions in a single simulation run. It works by classifying individuals into layers based on dominance. For individuals that are in the same layer, the solution that is in the lesser crowded region is considered better.
MOEA/D stands for Multiobjective Evolutionary Algorithm Based on Decomposition. It decomposes the optimization problem into `n` scalar optimization subproblems, which are solved in parallel.