How to Escape Local Optima in Black Box Optimisation: When Non-elitism Outperforms Elitism

Escaping local optima is one of the major obstacles to function optimisation. Using the metaphor of a fitness landscape, local optima correspond to hills separated by fitness valleys that have to be overcome. We define a class of fitness valleys of tunable difficulty by considering their length, representing the Hamming path between the two optima, and their depth, the drop in fitness. For this function class we present a runtime comparison between stochastic search algorithms using different search strategies. The (1+1) EA is a simple and well-studied evolutionary algorithm that has to jump across the valley to a point of higher fitness because it does not accept worsening moves (elitism). In contrast, the Metropolis algorithm and the Strong Selection Weak Mutation (SSWM) algorithm, a famous process in population genetics, are both able to cross the fitness valley by accepting worsening moves. We show that the runtime of the (1+1) EA depends critically on the length of the valley while the runtimes of the non-elitist algorithms depend crucially on the depth of the valley. Moreover, we show that both SSWM and Metropolis can also efficiently optimise a rugged function consisting of consecutive valleys.


Introduction
Black box algorithms are general purpose optimisation tools typically used when no good problem specific algorithm is known for the problem at hand. No particular knowledge is required for their application and they have been reported to be surprisingly effective. Popular classes are evolutionary algorithms, ant colony optimisation and artificial immune systems. These examples fall into the family of bio-inspired heuristics, but there are many other black box algorithms, including Simulated Annealing or Tabu Search. While many successful applications of these algorithms have been described, it is still hard to decide in advance which algorithm is preferable for a given problem. An initial natural research topic towards understanding the capabilities of a given algorithm is to identify classes of problems that are easy or hard for it [2,6,9,13,38]. However, the easiest and hardest classes of problems often are not closely related to real world applications. A more general question that applies to virtually any multimodal optimisation problem is to understand how efficient a given algorithm is in escaping from local optima.
Families of black box algorithms mainly differ in the way new solutions are generated (i.e. variation operators), how solutions are chosen for the next iterations (i.e. selection) and how many solutions are used by the heuristic in each iteration (i.e. population). Different variation operators, selection operators, population sizes and combinations of these lead to different algorithmic behaviours. In this paper we analyse the effects of mutation and selection in overcoming local optima.
Two different approaches are commonly used by most black box algorithms. One strategy is to rely on variation operators such as mutation to produce new solutions of high fitness outside the basin of attraction of the local optimum. These are unary operators that construct a new candidate solution typically by flipping bits of an existing solution. Elitist algorithms (i.e. those that never discard the best found solution), mainly rely on such strategies when stuck on a local optimum. In a population-based algorithm different individuals may use different mutation rates to help escape local optima faster [23]. Other variation operators may escape even faster than mutation. Population-based algorithms can recombine different solutions through the crossover operator to reach points outside the area of attraction of the local optima [14]. This operation requires that sufficient diversity is available in the population which may be introduced by using some diversity-enforcing mechanism [4]. Recently it has been shown that the interplay between the two variation operators, mutation and crossover, may efficiently give rise to the necessary burst of diversity without the need of any artificial diversity mechanism [3]. Another combination that has been proven to be effective for elitist algorithms to overcome local optima is to alternate mutations with variable depth search [35]. A common approach used in practice is to restart the algorithm or perform several runs in parallel with the hope that the algorithm does not get stuck on the same local optima every time.
A very different approach is to attempt to escape by accepting solutions of lower fitness in the hope of eventually leaving the basin of attraction of the local optimum. This approach is the main driving force behind non-elitist algorithms. Compared to the amount of work on elitist black box algorithms, there are few theoretical works analysing the performance of non-elitism (see, e.g. [5,15,16,20,21,25,26,31,33,36]). While both approaches may clearly be promising, it is still unclear when one should be preferred to the other. In this paper we investigate this topic by considering the areas between consecutive local optima, which we call fitness valleys. These valleys can have arbitrary length $\ell$, i.e. the distance between the local optima, and arbitrary depth $d$, i.e. the difference in function values between the optima and the point of minimum fitness between them. More precisely, we define a valley on a Hamming path (a path of Hamming neighbours) to ensure that mutation has the same probability of going forward on the path as going backwards. The valley is composed of a slope of length $\ell_1$ descending towards a local minimum, from which a slope of increasing fitness of length $\ell_2$ can be taken to reach the end of the valley. The steepness of each slope is controlled by parameters $d_1$ and $d_2$, respectively indicating the fitness of the two local optima at the extreme left and extreme right of the valley. A sketch of a fitness valley is shown in Fig. 1. Our aim is to analyse how the characteristics of the valley impact the performance of elitist versus non-elitist strategies.
We point out that understanding how to cross fitness valleys efficiently is also a very important problem in biology [37]. From a biological perspective, crossing fitness valleys represents one of the major obstacles to the evolution of complex traits. Many of these traits require accumulation of multiple mutations that are individually harmful for their bearers; a fitness advantage is achieved only when all mutations have been acquired, that is, once a fitness valley has been crossed. We refer the interested reader to [27] for an attempt to unify evolutionary processes in computer science and population genetics.
We consider the simple elitist (1 + 1) EA, the most-studied elitist evolutionary algorithm, and compare its ability to cross fitness valleys with the recently introduced non-elitist Strong Selection Weak Mutation (SSWM) algorithm inspired by a model of biological evolution in the 'strong selection, weak mutation regime' [28,29]. This regime applies when mutations are rare enough and selection is strong enough that the time between occurrences of new mutations is long compared to the time a new genotype takes to replace its parent genotype, or to be lost entirely [8]. Mutations occur rarely, therefore only one genotype is present in the population most of the time, and the relevant dynamics can be characterized by a stochastic process on one genotype. The significant difference between the SSWM algorithm and the (1 + 1) EA is that the former may accept solutions of lower quality than the current solution and even reject solutions of higher quality.
Recently, Paixão et al. investigated SSWM on $\text{Cliff}_d$ [28], a function defined such that non-elitist algorithms have a chance to jump down a "cliff" of height roughly $d$ and to traverse a fitness valley of Hamming distance $d$ to the optimum. The function is a generalised construction of the unitation function (a function that only depends on the number of 1-bits in the bit string) introduced by Jägersküpper and Storch to give an example class of functions where a (1, λ) EA outperforms a (1 + λ) EA [12]. This analysis revealed that SSWM can cross the fitness valley. However, upon comparison with the (1+1) EA, SSWM achieved only a small speed-up: the expected time (number of function evaluations) of SSWM is at most $n^d/e^{\Omega(d)}$, while the (1 + 1) EA requires $\Theta(n^d)$ [28].
In this manuscript, we show that greater speed-ups can be achieved by SSWM on fitness valleys. Differently to the work in [28] where global mutations were used, here we only allow SSWM to use local mutations because we are interested in comparing the benefits of escaping local optima by using non-elitism to cross valleys against the benefits of jumping to the other side by large mutations. Additionally, local mutations are a more natural variation operator for SSWM because they resemble more closely the biological processes from which the algorithm is inspired.
After presenting some Preliminaries, we build upon Gambler's Ruin theory [7] in Sect. 3 to devise a general mathematical framework for the analysis of non-elitist algorithms using local mutations for crossing fitness valleys. We use it to rigorously show that SSWM is able to efficiently perform a random walk across the valley using only local mutations by accepting worse solutions, provided that the valley is not too deep. On the other hand, the (1 + 1) EA cannot accept worse solutions and therefore relies on global mutations to reach the other side of the valley in a single jump. More precisely, the (1 + 1) EA needs to make a jump across all valley points that have lower fitness; we call this the effective length of the valley.
As a result, the runtime of the (1 + 1) EA is exponential in the effective length of the valley while the runtime of SSWM depends crucially on the depth of the valley. We demonstrate the generality of the presented mathematical tool by using it to prove that the same asymptotic results achieved by SSWM also hold for the well-known Metropolis algorithm (simulated annealing with constant temperature) that, differently from SSWM, always accepts improving moves. Jansen and Wegener [15] previously compared the performance of the (1+1) EA and Metropolis for a fitness valley encoded as a unitation function where the slopes are symmetric and of the same length. They used their fitness valley as an example where the performance of the two algorithms is asymptotically equivalent.
The framework also allows the analysis for concatenated "paths" of several consecutive valleys, creating a rugged fitness landscape that loosely resembles a "big valley" structure found in many problems from combinatorial optimisation [1,19,22,30]. In particular, in Sect. 4 we use it to prove that SSWM and Metropolis can cross consecutive paths in expected time that depends crucially on the depth and number of the valleys. Note that our preliminary work [24] required the more restrictive condition that the slope towards the optimum should be steeper than the one in the opposite direction, i.e. $d_2/\ell_2 > d_1/\ell_1$. In this paper we have relaxed the conditions to consider only the depths of the valleys, i.e. $d_2 > d_1$. This generalisation allows the results to hold for a broader family of functions.

Algorithms
In this paper we present a runtime comparison between the (1 + 1) EA and two non-elitist nature-inspired algorithms, SSWM and Metropolis. While they match the same basic scheme shown in Algorithm 1, they differ in the way they generate new solutions (the mutate(x) function) and in the acceptance probability of these new solutions (the $p_{\mathrm{acc}}$ function). The (1 + 1) EA relies on global mutations to cross the fitness valley, and the function mutate(x) flips each bit independently with probability $1/n$. Conversely, SSWM and Metropolis as analysed here use local mutations, hence the function mutate(x) flips a single bit chosen uniformly at random. Furthermore, the (1 + 1) EA always accepts a better solution, with ties resolved in favour of the new solution. The probability of acceptance is formally described by
$$p_{\mathrm{acc}}^{\mathrm{EA}}(\Delta f) = \begin{cases} 1 & \text{if } \Delta f \ge 0,\\ 0 & \text{if } \Delta f < 0, \end{cases}$$
where $\Delta f$ is the fitness difference between the new and the current solution. SSWM accepts candidate solutions with probability (see Fig. 2)
$$p_{\mathrm{acc}}^{\mathrm{SSWM}}(\Delta f) = p_{\mathrm{fix}}(\Delta f) = \frac{1 - e^{-2\beta \Delta f}}{1 - e^{-2N\beta \Delta f}} \qquad (1)$$
where $N \ge 1$ is the size of the "population" that underlies the biological SSWM process as explained in the following paragraph, $\beta$ represents the selection strength and $\Delta f \ne 0$. For $\Delta f = 0$ we define $p_{\mathrm{acc}}(0) := \lim_{\Delta f \to 0} p_{\mathrm{acc}}(\Delta f) = 1/N$. If $N = 1$, this probability is $p_{\mathrm{acc}}(\Delta f) = 1$, meaning that any offspring will be accepted, and if $N \to \infty$, it will only accept solutions for which $\Delta f > 0$. SSWM's acceptance function depends on the absolute difference in fitness between genotypes. It introduces two main differences compared to the (1 + 1) EA: first, solutions of lower fitness may be accepted with some positive probability, and second, solutions of higher fitness can be rejected.
Equation (1), first derived by Kimura [17], represents the probability that a gene that is initially present in one copy in a population of $N$ individuals is eventually present in all individuals (the probability of fixation). Hence, Algorithm 1 takes a macro view of the adaptation process in that each iteration of the process models the appearance of a new mutation and its subsequent fate: either it is accepted with probability $p_{\mathrm{acc}}$, increasing to frequency 1 and replacing the previous genotype, or it is not and is lost. It is important to note that the population size $N$ refers to the biological SSWM regime [29]. From the algorithmic perspective $N$ is just a parameter of a single evolving individual.
The acceptance function $p_{\mathrm{acc}}$ is strictly increasing with the following limits: $\lim_{\Delta f \to -\infty} p_{\mathrm{acc}}(\Delta f) = 0$ and $\lim_{\Delta f \to \infty} p_{\mathrm{acc}}(\Delta f) = 1$. The same limits are obtained when $\beta$ tends to $\infty$, and thus for large $|\beta \Delta f|$ the probability of acceptance is close to the one of the (1 + 1) EA, as long as $N > 1$, defeating the purpose of the comparison, with the only difference being the tie-breaking rule: SSWM only accepts the new equally good solution with probability $1/N$ [28].
Finally, the Metropolis algorithm is similar to SSWM in the sense that it is able to accept mutations that decrease fitness with some probability. However, unlike SSWM, for fitness improvements it behaves like the (1 + 1) EA in that it accepts any fitness improvement. Formally, Metropolis' acceptance function can be described by
$$p_{\mathrm{acc}}^{\mathrm{MA}}(\Delta f) = \begin{cases} 1 & \text{if } \Delta f \ge 0,\\ e^{\alpha \Delta f} & \text{if } \Delta f < 0, \end{cases}$$
where $\alpha$ is the reciprocal of the "temperature". Temperature in the Metropolis algorithm plays the same role as population size in SSWM: increasing the temperature (decreasing $\alpha$) increases the probability of accepting fitness decreases. The acceptance functions of all three algorithms are shown in Fig. 2.
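The common scheme and the three acceptance rules described in this section can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: it assumes the SSWM rule $(1 - e^{-2\beta\Delta f})/(1 - e^{-2N\beta\Delta f})$ with value $1/N$ at $\Delta f = 0$, the Metropolis rule $e^{\alpha\Delta f}$ for worsenings, and bit strings stored as Python lists. All function names are hypothetical.

```python
import math
import random


def p_acc_ea(delta_f):
    """(1+1) EA: accept improvements and ties, reject worsenings."""
    return 1.0 if delta_f >= 0 else 0.0


def p_acc_sswm(delta_f, beta, N):
    """SSWM fixation probability; the value at delta_f = 0 is the limit 1/N.

    Assumption of this sketch: arguments are moderate. Extremely negative
    N * beta * delta_f would overflow math.expm1 and is not guarded against.
    """
    if delta_f == 0:
        return 1.0 / N
    # expm1(-a) / expm1(-b) equals (1 - e^-a) / (1 - e^-b) and stays
    # accurate for small arguments.
    return math.expm1(-2.0 * beta * delta_f) / math.expm1(-2.0 * N * beta * delta_f)


def p_acc_metropolis(delta_f, alpha):
    """Metropolis: accept improvements, accept worsenings w.p. e^(alpha * delta_f)."""
    return 1.0 if delta_f >= 0 else math.exp(alpha * delta_f)


def local_mutation(x, rng):
    """Flip exactly one bit chosen uniformly at random."""
    y = list(x)
    i = rng.randrange(len(y))
    y[i] ^= 1
    return y


def global_mutation(x, rng):
    """Flip each bit independently with probability 1/n."""
    n = len(x)
    return [b ^ 1 if rng.random() < 1.0 / n else b for b in x]


def run(f, x, mutate, p_acc, steps, rng):
    """Generic loop: mutate, then accept with probability p_acc(delta_f)."""
    for _ in range(steps):
        y = mutate(x, rng)
        if rng.random() < p_acc(f(y) - f(x)):
            x = y
    return x
```

For instance, `run(sum, [0] * 10, local_mutation, p_acc_ea, 2000, random.Random(1))` performs elitist local search on the number of 1-bits; passing `lambda d: p_acc_sswm(d, beta, N)` or `lambda d: p_acc_metropolis(d, alpha)` instead switches to the non-elitist rules.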

Long Paths
Previous work on valley crossing [12,15,28] used functions of unitation to encode fitness valleys, with $1^n$ being a global optimum. The drawback of this construction is that the transition probabilities for mutation heavily depend on the current position.
The closer an algorithm gets to $1^n$, the larger the probability of mutation decreasing the number of ones and moving away from the optimum. We follow a different approach to avoid this mutational bias, and to ensure that the structure of the fitness valley is independent of its position in the search space. This also allows us to easily concatenate multiple valleys.
We base our construction on so-called long k-paths, paths of Hamming neighbours with increasing fitness whose length can be exponential in n. These paths were introduced and investigated experimentally in [11] and subsequently formalised and rigorously analysed in [32]. Exponential lower bounds were shown in [6]. An example of a long k-path is shown in Table 1. The following formal, slightly revised definition is taken from [34, p. 2517].
Definition 1 Let $k \in \mathbb{N}$ and $n$ be a multiple of $k$. The long $k$-path of dimension $n$ is a sequence of bit strings from $\{0,1\}^n$ defined recursively as follows. The long $k$-path of dimension 0 is the empty bit string. Assume the long $k$-path of dimension $n-k$ is given by the sequence $P^k_{n-k} = (p_1, \dots, p_\ell)$, where $p_1, \dots, p_\ell \in \{0,1\}^{n-k}$ and $\ell$ is the length of $P^k_{n-k}$. Then the long $k$-path of dimension $n$ is defined by prepending $k$ bits to these search points: let $S_0 := (0^k p_1, 0^k p_2, \dots, 0^k p_\ell)$, $S_1 := (1^k p_\ell, 1^k p_{\ell-1}, \dots, 1^k p_1)$, and $B := (0^{k-1}1 p_\ell, 0^{k-2}1^2 p_\ell, \dots, 01^{k-1} p_\ell)$. The search points in $S_0$ and $S_1$ differ in the $k$ leading bits and the search points in $B$ represent a bridge between them. The long $k$-path of dimension $n$, $P^k_n$, is the concatenation of $S_0$, $B$, and $S_1$.
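The recursive construction in Definition 1 can be sketched in a few lines. The sketch below builds the path iteratively, one block of $k$ leading bits at a time, and represents search points as strings of '0' and '1'; the function names are hypothetical.

```python
def long_k_path(n, k):
    """Return the long k-path of dimension n as a list of bit strings."""
    if k < 1 or n % k != 0:
        raise ValueError("n must be a multiple of k >= 1")
    path = [""]  # dimension 0: the empty bit string
    for _ in range(n // k):
        last = path[-1]
        s0 = ["0" * k + p for p in path]
        # Bridge: 0^(k-1) 1, 0^(k-2) 1^2, ..., 0 1^(k-1), each followed by
        # the last point of the shorter path.
        bridge = ["0" * (k - j) + "1" * j + last for j in range(1, k)]
        s1 = ["1" * k + p for p in reversed(path)]
        path = s0 + bridge + s1
    return path


def hamming(a, b):
    """Hamming distance between two equally long bit strings."""
    return sum(x != y for x, y in zip(a, b))
```

For $n = 9$ and $k = 3$ this yields 22 points in which consecutive points are Hamming neighbours, matching the range $P_0, \dots, P_{21}$ given in the caption of Table 1.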
An exponential length implies that the path has to be folded in $\{0,1\}^n$ in the sense that there are $i < j$ such that the $i$-th and the $j$-th point on the path have Hamming distance $H(\cdot,\cdot)$ smaller than $j - i$. Standard bit mutations have a positive probability of jumping from the $i$-th to the $j$-th point, hence there is a chance to skip large parts of the path by taking a shortcut. However, long $k$-paths are constructed in such a way that at least $k$ bits have to flip simultaneously in order to take a shortcut of length at least $k$. The probability of such an event is exponentially small if $k = \Theta(\sqrt{n})$, in which case the path still has exponential length. Long $k$-paths turn out to be very useful for our purposes. If we consider the first points of a long $k$-path and assign increasing fitness values to them, we obtain a fitness-increasing path of any desired length (up to exponential in $n$ [34, Lemma 3]).
Table 1 Example of a long $k$-path for $n = 9$ and $k = 3$: $P^3_9 = (P_0, P_1, \dots, P_{21})$
Given two points $P_s$, $P_{s+i}$ for $i > 0$, $P_{s+i}$ is called the $i$-th successor of $P_s$ and $P_s$ is called a predecessor of $P_{s+i}$. Long $k$-paths have the following properties.
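Lemma 1 (Long paths) 1. For every $i \in \mathbb{N}_0$ and path points $P_s$ and $P_{s+i}$, if $i < k$ then $H(P_s, P_{s+i}) = i$, otherwise $H(P_s, P_{s+i}) \ge k$.

2. The probability of a standard bit mutation turning $P_s$ into $P_{s+i}$ (or $P_{s+i}$ into $P_s$) is $n^{-i}(1 - 1/n)^{n-i}$ for $0 \le i < k$, and the probability of a mutation from $P_s$ reaching any search point in $\{P_{s+i} \mid i \ge k\}$ is at most $1/(k!)$.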
Proof The first statement was shown in [34, p. 2517] (refining a previous analysis in [6, p. 73]). The second statement follows from the first one, using that the probability of mutating at least $k$ bits is at most $\binom{n}{k} n^{-k} \le 1/(k!)$.
In the following, we fix $k := \sqrt{n}$ such that the probability of taking a shortcut on the path is exponentially small. We assign fitness values such that all points on the path have a higher fitness than those off the path. This fitness difference is made large enough such that the considered algorithms are very unlikely to ever fall off the path. Assuming that we want to use the first $m$ path points $P_0, \dots, P_{m-1}$, the fitness is given by
$$f(x) := \begin{cases} h(i) & \text{if } x = P_i,\ i < m,\\ -\infty & \text{otherwise,} \end{cases}$$
where $h(i)$ gives the fitness (height) of the $i$-th path point. Then, assuming the algorithm is currently on the path, the fitness landscape is a one-dimensional landscape where (except for the two ends) each point has a Hamming neighbour as predecessor and a Hamming neighbour as successor on the path. Local mutations will create each of these with equal probability $1/n$. If we call these steps relevant and ignore all other steps, we get a stochastic process where in each relevant step we create a mutant up or down the path with probability 1/2 each (for the two ends we assume a self-loop probability of 1/2). The probability that such a move is accepted then depends on the fitness difference between these path points.
It then suffices to study the expected number of relevant steps, as we obtain the expected number of function evaluations by multiplying with the expected waiting time n/2 for a relevant step.

Lemma 2 Let $E(T)$ be the expected number of relevant steps for any algorithm described by Algorithm 1 with local mutations finding a global optimum. Then the respective expected number of function evaluations is $n/2 \cdot E(T)$, unless the algorithm falls off the path.
In the following, we assume that all algorithms start on $P_0$. This behaviour can be simulated from random initialisation with high probability by embedding the path into a larger search space and giving hints to find the start of the path within this larger space [34]. As such a construction is cumbersome and does not lead to additional insights, we simply assume that all algorithms start in $P_0$.

Crossing Simple Valleys
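On the first slope starting at point $P_0$ the fitness decreases from the initial height $d_1 \in \mathbb{R}^+$ until the path point $P_{\ell_1}$ with fitness 0. Then the second slope begins with fitness increasing up to the path point $P_{\ell_1+\ell_2}$ of fitness $d_2 \in \mathbb{R}^+$. The total length of the path is $\ell = \ell_1 + \ell_2$. We call such a path Valley:
$$h_{\text{Valley}}(i) := \begin{cases} d_1 - i \cdot d_1/\ell_1 & \text{if } i \le \ell_1,\\ (i - \ell_1) \cdot d_2/\ell_2 & \text{if } \ell_1 < i \le \ell. \end{cases}$$
Here, $d_1/\ell_1$ and $d_2/\ell_2$ indicate the steepness of the two slopes (see Fig. 1). In this paper we will use $d_2 > d_1$ to force the point $P_\ell$ to be the optimum.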
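As a small illustration of the valley shape just described, the sketch below evaluates the height of each path point under the assumption that both slopes are linear: fitness falls from $d_1$ at $P_0$ to 0 at $P_{\ell_1}$ and then rises to $d_2$ at $P_{\ell_1+\ell_2}$. The function name and argument names are hypothetical.

```python
def valley_height(i, l1, l2, d1, d2):
    """Fitness of the i-th path point of a valley, for 0 <= i <= l1 + l2."""
    if not 0 <= i <= l1 + l2:
        raise ValueError("index outside the valley")
    if i <= l1:
        return d1 - i * d1 / l1   # first slope: from d1 down to 0
    return (i - l1) * d2 / l2     # second slope: from 0 up to d2
```

With `l1=3, l2=4, d1=1.0, d2=2.0` the heights start at 1.0, reach 0.0 at index 3 and end at 2.0 at index 7, so the last point is the optimum because $d_2 > d_1$.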

Analysis for the (1 + 1) EA
We first show that the runtime of the (1 + 1) EA depends on the effective length $\ell^*$ of the valley, defined as the distance between the initial point $P_0$ and the first valley point of greater or equal fitness. Here we restrict parameters to $\ell_1 + \ell_2 \le \sqrt{n}/4$, as then the probability of the (1 + 1) EA taking a shortcut is no larger than the probability of jumping by a distance of $\ell_1 + \ell_2$.
Theorem 3 Assume $\ell_1 + \ell_2 \le \sqrt{n}/4$. The expected time for the (1 + 1) EA starting in $P_0$ to cross the fitness valley is $\Theta(n^{\ell^*})$.
Proof Let us first recall that due to its elitism the (1 + 1) EA cannot fall off the path.
To cross the fitness valley the (1 + 1) EA needs to jump from $P_0$ to a point with higher fitness, thus it has to jump at least a distance $\ell^*$. The probability of such a jump can be bounded from below using Lemma 1 by
$$n^{-\ell^*}\left(1 - \frac{1}{n}\right)^{n-\ell^*} \ge \frac{1}{e n^{\ell^*}},$$
resulting in an expected time needed to jump over the valley of at most $e n^{\ell^*} = O(n^{\ell^*})$.
After jumping over the valley, the (1 + 1) EA has to climb at most the remaining $\ell - \ell^* \le \ell_2$ steps, and each improvement has a probability of at least $1/(en)$. The expected time for this climb is thus at most $e \ell_2 n$. As $\ell_2 < n$ and $\ell^* \ge \ell_1 \ge 2$, this time is $O(n^{\ell^*})$.
Note that, in case $P_{\ell^*}$ has the same fitness as $P_0$, the (1+1) EA can jump back to the beginning of the path, in which case it needs to repeat the jump. However, conditional on leaving $P_{\ell^*}$, the probability that a successor is found is at least $\Omega(1)$. Hence in expectation $O(1)$ jumps are sufficient.
Furthermore, the probability of the jump can be bounded from above by the probability of jumping to any of the next potential $\sqrt{n}$ path points and by the probability of taking a shortcut (see Lemma 1):
$$\sum_{i=\ell^*}^{\sqrt{n}-1} n^{-i} + \frac{1}{(\sqrt{n})!} = O\left(n^{-\ell^*}\right).$$
Thus the expected time is $\Omega(n^{\ell^*})$.
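The two quantities driving this analysis can be computed directly. The sketch below assumes linear slopes, so that the $i$-th point on the second slope has fitness $(i - \ell_1) d_2/\ell_2$, and finds the effective length as the first index whose fitness is at least $d_1$. It also evaluates the probability that standard bit mutation flips exactly a given set of bits. Function names are hypothetical.

```python
def effective_length(l1, l2, d1, d2):
    """Smallest index i >= 1 whose fitness is at least that of P_0 (i.e. d1)."""
    # Points on the first slope are strictly worse than P_0, so start at l1 + 1.
    for i in range(l1 + 1, l1 + l2 + 1):
        # (i - l1) * d2 / l2 >= d1, written without division to avoid rounding.
        if (i - l1) * d2 >= d1 * l2:
            return i
    raise ValueError("no valley point is as fit as P_0 (requires d2 >= d1)")


def jump_probability(n, distance):
    """Probability that standard bit mutation flips exactly `distance` given bits."""
    return (1.0 / n) ** distance * (1.0 - 1.0 / n) ** (n - distance)
```

For `l1=3, l2=4, d1=1.0, d2=2.0` the effective length is 5, and `jump_probability(100, 5)` lies between $1/(e \cdot 100^5)$ and $100^{-5}$, in line with the bounds used in the proof.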

A General Framework for Local Search Algorithms
We introduce a general framework to analyse the expected number of relevant steps of non-elitist local search algorithms (Algorithm 1 with local mutations) for the Valley problem. As explained in Sect. 2.2, in a relevant step mutation creates a mutant up or down the path with probability 1/2, and this move is accepted with a probability that depends only on the fitness difference. For slopes where the gradient is the same at every position, this resembles a gambler's ruin process.
To apply classical Gambler's Ruin theory (see e.g. [7]) two technicalities need to be taken into account. Firstly, two different Gambler's Ruin games need to be considered, one for descending down the first slope and another one for climbing up the second slope. The process may alternate between these two ruin games as the extreme ends of each game at the bottom of the valley are not absorbing states. Secondly, a non-elitist algorithm could reject the offspring individual even when it has a higher fitness than its parent. Hence the probabilities of winning or losing a dollar (i.e., the probabilities of moving one step up or down in the slope) do not necessarily add up to one, and self-loop probabilities of neither winning nor losing a dollar need to be taken into account when estimating expected times (winning probabilities are unaffected by self-loops).
Theorem 4 (Gambler's Ruin with self-loops) Consider a game where two players start with $n_1 \in \mathbb{N}^+$ and $n_2 \in \mathbb{N}^+$ dollars respectively. In each iteration player 1 wins one of player 2's dollars with probability $p_1$, player 2 wins one of player 1's dollars with probability $p_2$, and nothing happens with probability $1 - p_1 - p_2$. Then the probability of player 1 winning all the dollars before going bankrupt is
$$P_1^{GR} = \begin{cases} \dfrac{n_1}{n_1+n_2} & \text{if } p_1 = p_2,\\[2mm] \dfrac{1 - (p_2/p_1)^{n_1}}{1 - (p_2/p_1)^{n_1+n_2}} & \text{if } p_1 \ne p_2. \end{cases}$$
The expected time until either of the players becomes bankrupt, i.e. the expected duration of the game, is
$$E(T^{GR}) = \begin{cases} \dfrac{n_1 n_2}{p_1+p_2} & \text{if } p_1 = p_2,\\[2mm] \dfrac{n_1 - (n_1+n_2) P_1^{GR}}{p_2 - p_1} & \text{if } p_1 \ne p_2. \end{cases}$$
Proof The proof follows directly from the results of the standard problem ($p_1 + p_2 = 1$), see e.g. Chapter XIV in [7], applied to the normalised probabilities $p_1/(p_1+p_2)$ and $p_2/(p_1+p_2)$. The only effect of the self-loops is to add extra iterations in which nothing happens; therefore the winning probabilities are not affected, while the expected duration of the game is increased by the waiting time needed for a relevant iteration, $1/(p_1 + p_2)$.
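The effect of self-loops can be checked exactly on small instances. The sketch below is an illustration under stated assumptions, not part of the original analysis: it takes the standard Gambler's Ruin closed forms applied to the normalised probabilities $p_1/(p_1+p_2)$ and $p_2/(p_1+p_2)$, multiplies the duration by the waiting time $1/(p_1+p_2)$, and compares the result against a direct solution of the first-step equations using exact rational arithmetic. Function names are hypothetical.

```python
from fractions import Fraction


def closed_form(n1, n2, p1, p2):
    """Winning probability of player 1 and expected duration, with self-loops."""
    total = n1 + n2
    if p1 == p2:
        win = Fraction(n1, total)
        duration = Fraction(n1 * n2) / (p1 + p2)
    else:
        r = p2 / p1
        win = (1 - r ** n1) / (1 - r ** total)
        duration = (n1 - total * win) / (p2 - p1)
    return win, duration


def first_step_solve(total, p1, p2, rhs, left, right):
    """Solve (p1+p2)*x_i - p1*x_{i+1} - p2*x_{i-1} = rhs for 0 < i < total.

    Boundary values are x_0 = left and x_total = right. Uses the Thomas
    algorithm for tridiagonal systems; returns [x_1, ..., x_{total-1}].
    """
    m = total - 1
    a, b, c = -p2, p1 + p2, -p1
    d = [Fraction(rhs)] * m
    d[0] += p2 * left
    d[-1] += p1 * right
    cp, dp = [c / b], [d[0] / b]
    for i in range(1, m):
        denom = b - a * cp[i - 1]
        cp.append(c / denom)
        dp.append((d[i] - a * dp[i - 1]) / denom)
    x = [dp[-1]]
    for i in range(m - 2, -1, -1):
        x.append(dp[i] - cp[i] * x[-1])
    x.reverse()
    return x
```

Passing `rhs=0, left=0, right=1` gives winning probabilities and `rhs=1, left=0, right=0` gives expected durations, indexed by the number of dollars player 1 currently holds (entry `n1 - 1` corresponds to the start of the game).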
In order to simplify the calculations we have developed the following notation.
Definition 2 (Framework's notation) The Valley problem can be considered as a Markov chain with states $\{P_0, P_1, \dots, P_{\ell_1-1}, P_{\ell_1}, P_{\ell_1+1}, \dots, P_{\ell_1+\ell_2}\}$. For simplicity we will sometimes refer to these points only by their sub-indices $\{0, 1, \dots, \ell_1-1, \ell_1, \ell_1+1, \dots, \ell_1+\ell_2\}$. For any stochastic process on the Valley problem we will denote by: (1) $p_{i \to j}$ the probability of moving from state $i$ to $j \in \{i-1, i, i+1\}$ in one iteration, (2) $p^{GR}_{i \to k}$ the probability of a Gambler's Ruin process starting in $i$ finishing in $k$ before reaching the state $i-1$, (3) $E(T^{GR}_{i,k})$ the expected duration until either the state $i-1$ or $k$ is reached, (4) $E(T_{i \to m})$ the expected time to move from state $i$ to state $m$.
The following lemmas simplify the runtime analysis of any algorithm that matches the scheme of Algorithm 1 for local mutations and some reasonable conditions on the selection operator.
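Lemma 5 Consider any algorithm described by Algorithm 1 with local mutations and the following properties on Valley with $\ell_1, \ell_2 \in \{2, 3, \dots\}$ and $d_1, d_2 \in \mathbb{R}^+$:
(i) $p_{\ell_1 \to \ell_1-1},\ p_{\ell_1 \to \ell_1+1} = \Omega(1)$,
(ii) $p_{\ell_1 \to \ell_1+1} \ge p_{\ell_1+1 \to \ell_1} + \varepsilon$ for a constant $\varepsilon > 0$,
(iii) $p_{\mathrm{acc}}(\Delta f)$ is non-decreasing.
Then the expected number of relevant steps for such a process to reach the point $P_{\ell_1+\ell_2}$ starting from $P_0$ is
$$E(T_{0 \to \ell_1+\ell_2}) = \Theta\left(E(T_{1 \to \ell_1})\right) + \Theta(\ell_2).$$
Property (iii) describes a common feature of optimisation algorithms: the selection operator prefers fitness increases over decreases (e.g. Randomised Local Search, (1 + 1) EA or Metropolis). Then, the bottleneck of Valley seems to be climbing down the first $\ell_1$ steps since several fitness decreasing mutations have to be accepted.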
Once at the bottom of the valley $P_{\ell_1}$ the process must keep moving. It could be the case that the algorithm climbs up again to $P_0$. But under some mild conditions it will only have to repeat the experiment a constant number of times (property (i) of the following lemma). Finally, the algorithm will have to climb up to $P_{\ell_1+\ell_2}$. This will take linear time in $\ell_2$, provided the probability of accepting an improvement $p_{\ell_1 \to \ell_1+1}$ is by a constant greater than accepting a worsening of the same size $p_{\ell_1+1 \to \ell_1}$, as required by property (ii).
Consider an algorithm with a selection operator that satisfies condition (iii) such as Metropolis or SSWM. In order to satisfy the first two conditions, the selection strength must be big enough to accept the two possible fitness increases of Valley ($d_1/\ell_1$ and $d_2/\ell_2$) with constant probability. As we will see at the end of this section, this condition directly translates to $\beta d_1/\ell_1, \beta d_2/\ell_2 = \Omega(1)$ for SSWM and $\alpha d_1/\ell_1, \alpha d_2/\ell_2 = \Omega(1)$ for Metropolis.
In order to prove the previous lemma we will make use of the following lemma that shows some implications of the conditions from the previous lemma.

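Lemma 6 In the context of Lemma 5, properties (i) and (ii) imply that (i) $p_{\ell_1 \to \ell_1-1} + p_{\ell_1 \to \ell_1+1} = 1/c_1$ for some constant $c_1 \ge 1$, (ii) $1 - c_1 \cdot p_{\ell_1 \to \ell_1+1} \cdot p^{GR}_{\ell_1+1 \to \ell_1} = 1/c_2$ for some constant $c_2 > 1$, and (iii) $1 - c_1 c_2 \cdot p_{\ell_1 \to \ell_1-1} = 1/c_3$ for some constant $c_3 > 1$.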
For the sake of readability the proof of Lemma 6 can be found in the appendix.

Proof of Lemma 5
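Since the algorithm only produces points in the Hamming neighbourhood it will have to pass through all the states on the path. We break down the set of states into three sets and expand the total time as the sum of the optimisation times for those three sets:
$$E(T_{0 \to \ell_1+\ell_2}) = E(T_{0 \to 1}) + E(T_{1 \to \ell_1}) + E(T_{\ell_1 \to \ell_1+\ell_2}). \qquad (4)$$
Note that the lower bound follows directly. Let us now consider the upper bound. We start using a recurrence relation for the last term: once in state $\ell_1$, after one iteration, the algorithm can either move to state $\ell_1+1$ with probability $p_{\ell_1 \to \ell_1+1}$, move to state $\ell_1-1$ with probability $p_{\ell_1 \to \ell_1-1}$ or stay in state $\ell_1$ with the remaining probability (if the mutation is not accepted):
$$E(T_{\ell_1 \to \ell_1+\ell_2}) = 1 + p_{\ell_1 \to \ell_1+1} E(T_{\ell_1+1 \to \ell_1+\ell_2}) + p_{\ell_1 \to \ell_1-1} E(T_{\ell_1-1 \to \ell_1+\ell_2}) + \left(1 - p_{\ell_1 \to \ell_1+1} - p_{\ell_1 \to \ell_1-1}\right) E(T_{\ell_1 \to \ell_1+\ell_2}).$$
Using $E(T_{\ell_1-1 \to \ell_1+\ell_2}) \le E(T_{0 \to \ell_1+\ell_2})$ this expression reduces to
$$E(T_{\ell_1 \to \ell_1+\ell_2}) \le 1 + p_{\ell_1 \to \ell_1+1} E(T_{\ell_1+1 \to \ell_1+\ell_2}) + p_{\ell_1 \to \ell_1-1} E(T_{0 \to \ell_1+\ell_2}) + \left(1 - p_{\ell_1 \to \ell_1+1} - p_{\ell_1 \to \ell_1-1}\right) E(T_{\ell_1 \to \ell_1+\ell_2}).$$
Solving the previous expression for $E(T_{\ell_1 \to \ell_1+\ell_2})$ leads to
$$E(T_{\ell_1 \to \ell_1+\ell_2}) \le \frac{1 + p_{\ell_1 \to \ell_1+1} E(T_{\ell_1+1 \to \ell_1+\ell_2}) + p_{\ell_1 \to \ell_1-1} E(T_{0 \to \ell_1+\ell_2})}{p_{\ell_1 \to \ell_1+1} + p_{\ell_1 \to \ell_1-1}}.$$
Since property (i) of Lemma 5 implies that the denominator is a constant $1/c_1$, we get
$$E(T_{\ell_1 \to \ell_1+\ell_2}) \le c_1 \left(1 + p_{\ell_1 \to \ell_1+1} E(T_{\ell_1+1 \to \ell_1+\ell_2}) + p_{\ell_1 \to \ell_1-1} E(T_{0 \to \ell_1+\ell_2})\right). \qquad (5)$$
Let us now focus on the term $E(T_{\ell_1+1 \to \ell_1+\ell_2})$. Since the acceptance probability is a function of $\Delta f$, for both sides of the valley the probabilities of moving to the next or previous state remain constant during each slope and we can cast the behaviour as a Gambler's Ruin problem. Then, when the state is $P_{\ell_1+1}$ a Gambler's Ruin game (with self-loops) occurs. The two possible outcomes are: (1) the problem is optimised or (2) we are back in $P_{\ell_1}$. Hence,
$$E(T_{\ell_1+1 \to \ell_1+\ell_2}) = E(T^{GR}_{\ell_1+1,\ell_1+\ell_2}) + p^{GR}_{\ell_1+1 \to \ell_1} E(T_{\ell_1 \to \ell_1+\ell_2}). \qquad (6)$$
Now we introduce (6) in (5), obtaining
$$E(T_{\ell_1 \to \ell_1+\ell_2}) \le c_1 \left(1 + p_{\ell_1 \to \ell_1+1} \left(E(T^{GR}_{\ell_1+1,\ell_1+\ell_2}) + p^{GR}_{\ell_1+1 \to \ell_1} E(T_{\ell_1 \to \ell_1+\ell_2})\right) + p_{\ell_1 \to \ell_1-1} E(T_{0 \to \ell_1+\ell_2})\right).$$
Solving for $E(T_{\ell_1 \to \ell_1+\ell_2})$ yields
$$E(T_{\ell_1 \to \ell_1+\ell_2}) \le \frac{c_1 \left(1 + p_{\ell_1 \to \ell_1+1} E(T^{GR}_{\ell_1+1,\ell_1+\ell_2}) + p_{\ell_1 \to \ell_1-1} E(T_{0 \to \ell_1+\ell_2})\right)}{1 - c_1 p_{\ell_1 \to \ell_1+1} p^{GR}_{\ell_1+1 \to \ell_1}}.$$
By Lemma 6, properties (i) and (ii) of Lemma 5 imply that the denominator is a constant $1/c_2$. Hence,
$$E(T_{\ell_1 \to \ell_1+\ell_2}) \le c_1 c_2 \left(1 + p_{\ell_1 \to \ell_1+1} E(T^{GR}_{\ell_1+1,\ell_1+\ell_2}) + p_{\ell_1 \to \ell_1-1} E(T_{0 \to \ell_1+\ell_2})\right).$$
We introduce this into (4), leading to
$$E(T_{0 \to \ell_1+\ell_2}) \le E(T_{0 \to 1}) + E(T_{1 \to \ell_1}) + c_1 c_2 \left(1 + p_{\ell_1 \to \ell_1+1} E(T^{GR}_{\ell_1+1,\ell_1+\ell_2}) + p_{\ell_1 \to \ell_1-1} E(T_{0 \to \ell_1+\ell_2})\right).$$
Solving for $E(T_{0 \to \ell_1+\ell_2})$ leads to
$$E(T_{0 \to \ell_1+\ell_2}) \le \frac{E(T_{0 \to 1}) + E(T_{1 \to \ell_1}) + c_1 c_2 \left(1 + p_{\ell_1 \to \ell_1+1} E(T^{GR}_{\ell_1+1,\ell_1+\ell_2})\right)}{1 - c_1 c_2 p_{\ell_1 \to \ell_1-1}}.$$
Again by Lemma 6, properties (i) and (ii) of Lemma 5 imply that the denominator is a constant $1/c_3$. Hence,
$$E(T_{0 \to \ell_1+\ell_2}) \le c_3 \left(E(T_{0 \to 1}) + E(T_{1 \to \ell_1}) + c_1 c_2 \left(1 + p_{\ell_1 \to \ell_1+1} E(T^{GR}_{\ell_1+1,\ell_1+\ell_2})\right)\right). \qquad (7)$$
Now we consider the last term. Due to property (ii) of Lemma 5, once in $\ell_1+1$ there is a constant probability of moving towards the optimum. Since the algorithm has to cover a distance of $\ell_2 - 1$, we have $E(T^{GR}_{\ell_1+1,\ell_1+\ell_2}) = \Theta(\ell_2)$. Plugging this into (7), and noting that $E(T_{0 \to 1}) = O(E(T_{1 \to \ell_1}))$ because by property (iii) the process at $P_1$ returns to $P_0$ before reaching $P_{\ell_1}$ with probability at least 1/2, proves the claimed upper bound.
Now we estimate the time to move from $P_0$ to $P_{\ell_1}$. As in the previous proof, the main arguments are a recurrence relation and a Gambler's Ruin game.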

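Lemma 7 Consider any algorithm described by Algorithm 1 with local mutations on Valley with $\ell_1, \ell_2 \in \mathbb{N} \setminus \{1\}$ and $d_1, d_2 \in \mathbb{R}^+$. Then the expected number of relevant steps to go from the state $P_1$ to $P_{\ell_1}$ is
$$E(T_{1 \to \ell_1}) = \frac{1}{p^{GR}_{1 \to \ell_1}} \left(E(T^{GR}_{1,\ell_1}) + p^{GR}_{1 \to 0} \, E(T_{0 \to 1})\right), \quad \text{where } E(T_{0 \to 1}) = \frac{1}{p_{0 \to 1}}.$$
Proof At the state $P_1$ a Gambler's Ruin game (with self-loops) occurs. The two possible outcomes are: (1) we have reached the bottom of the valley $P_{\ell_1}$ or (2) we are back to $P_0$. Hence,
$$E(T_{1 \to \ell_1}) = E(T^{GR}_{1,\ell_1}) + p^{GR}_{1 \to 0} \left(E(T_{0 \to 1}) + E(T_{1 \to \ell_1})\right).$$
Solving for $E(T_{1 \to \ell_1})$ leads to
$$E(T_{1 \to \ell_1}) = \frac{E(T^{GR}_{1,\ell_1}) + p^{GR}_{1 \to 0} \, E(T_{0 \to 1})}{1 - p^{GR}_{1 \to 0}},$$
which, by using $1 - p^{GR}_{1 \to 0} = p^{GR}_{1 \to \ell_1}$, simplifies to the claimed expression.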
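The recurrence arguments used in these proofs can be cross-checked numerically on small valleys. The sketch below is an illustration under stated assumptions: linear slopes, each relevant step proposing the predecessor or the successor with probability 1/2 (with a self-loop of 1/2 at $P_0$), and the standard hitting-time recurrence for such chains, $t_k = (1 + p_{k \to k-1} t_{k-1}) / p_{k \to k+1}$, where $t_k$ is the expected time to move from state $k$ to $k+1$. Function names are hypothetical.

```python
import math


def valley_height(i, l1, l2, d1, d2):
    """Fitness of the i-th path point, assuming linear slopes."""
    return d1 - i * d1 / l1 if i <= l1 else (i - l1) * d2 / l2


def expected_relevant_steps(l1, l2, d1, d2, p_acc):
    """Expected relevant steps from P_0 to P_{l1+l2}.

    p_acc maps a fitness difference to an acceptance probability and must be
    positive for every forward move (so the elitist rule is not supported).
    """
    m = l1 + l2
    h = [valley_height(i, l1, l2, d1, d2) for i in range(m + 1)]
    total = 0.0
    t_prev = 0.0  # expected time to move from state k-1 to state k
    for k in range(m):
        p_forward = 0.5 * p_acc(h[k + 1] - h[k])
        p_back = 0.5 * p_acc(h[k - 1] - h[k]) if k > 0 else 0.0
        t_k = (1.0 + p_back * t_prev) / p_forward
        total += t_k
        t_prev = t_k
    return total


def metropolis(alpha):
    """Metropolis acceptance rule with inverse temperature alpha."""
    return lambda df: 1.0 if df >= 0 else math.exp(alpha * df)
```

With an accept-everything rule the chain is a symmetric random walk and the result is $m(m+1)$ for a path of $m$ steps. With `metropolis(alpha)`, increasing `d1` while keeping the lengths fixed increases the expected time, which matches the role of the depth in the analysis.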

Application to SSWM
In this subsection we make use of the previous framework to analyse SSWM on the Valley problem. To apply this framework we need to know how a Gambler's Ruin with the acceptance probabilities of SSWM behaves. When dealing with these probabilities the ratio between symmetric fitness variations appears often. The next lemma will be very helpful to simplify this ratio.
Lemma 8 For every $\beta \in \mathbb{R}^+$, $\Delta f \in \mathbb{R}$ and $N \in \mathbb{N}^+$,
$$\frac{p_{\mathrm{fix}}(-\Delta f)}{p_{\mathrm{fix}}(\Delta f)} = e^{-2(N-1)\beta \Delta f}.$$
Due to the sigmoid expression of the SSWM acceptance probability [Eq. (1)], it can be helpful to use bounds given by simpler expressions. Lemma 1 in [28] provides such bounds.
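Lemma 9 (Lemma 1 in [28]) For every $\beta \in \mathbb{R}^+$ and $N \in \mathbb{N}^+$ the following inequalities hold. If $\Delta f \ge 0$ then
$$\frac{2\beta \Delta f}{1 + 2\beta \Delta f} \le p_{\mathrm{fix}}(\Delta f) \le \frac{2\beta \Delta f}{1 - e^{-2N\beta \Delta f}}.$$
If $\Delta f \le 0$ then
$$\frac{-2\beta \Delta f}{e^{-2N\beta \Delta f}} \le p_{\mathrm{fix}}(\Delta f) \le \frac{e^{-2\beta \Delta f}}{e^{-2N\beta \Delta f} - 1}.$$
The following lemma contains bounds on the expected duration of the game and winning probabilities for SSWM. Although Valley has slopes of $d_1/\ell_1$ and $d_2/\ell_2$, SSWM through the action of the parameter $\beta$ sees an effective gradient of $\beta \cdot d_1/\ell_1$ and $\beta \cdot d_2/\ell_2$. Varying this parameter allows the algorithm to accommodate the slope to a comfortable value. We have set this effective gradient to $\beta |\Delta f| = \Omega(1)$ so that the probability of accepting an improvement is at least a constant.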
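Both the ratio between symmetric fitness variations and the simpler bounds on the acceptance probability can be checked numerically. The sketch below assumes the fixation probability $(1 - e^{-2\beta\Delta f})/(1 - e^{-2N\beta\Delta f})$, the ratio identity $p_{\mathrm{fix}}(-\Delta f)/p_{\mathrm{fix}}(\Delta f) = e^{-2(N-1)\beta\Delta f}$ that follows from it, and bounds of the form $2\beta\Delta f/(1+2\beta\Delta f) \le p_{\mathrm{fix}} \le 2\beta\Delta f/(1 - e^{-2N\beta\Delta f})$ for improvements, with the analogous pair for worsenings. Function names are hypothetical and only moderate arguments are supported.

```python
import math


def p_fix(df, beta, N):
    """SSWM fixation probability, with the limit 1/N at df = 0."""
    if df == 0:
        return 1.0 / N
    return math.expm1(-2.0 * beta * df) / math.expm1(-2.0 * N * beta * df)


def simple_bounds(df, beta, N):
    """Lower and upper bound on p_fix(df) for df != 0."""
    if df == 0:
        raise ValueError("bounds are only evaluated for df != 0")
    x = 2.0 * beta * df
    if df > 0:
        return x / (1.0 + x), x / (-math.expm1(-N * x))
    return -x / math.exp(-N * x), math.exp(-x) / math.expm1(-N * x)


def ratio(df, beta, N):
    """Ratio between the acceptance probabilities of -df and +df."""
    return p_fix(-df, beta, N) / p_fix(df, beta, N)
```

For example, `ratio(0.3, 1.5, 4)` agrees with `math.exp(-2 * 3 * 1.5 * 0.3)` up to floating-point error.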

Lemma 10 (SSWM Gambler's Ruin) Consider a Gambler's Ruin problem as described in
Theorem 4 with starting dollars n 1 = 1 and n 2 = − 1, and probabilities p 1 and p 2 dependant on SSWM's acceptance function as follows where Δf < 0 and (N − 1)β|Δf | = Ω (1). Then the winning probability of player one P GR 1→ 1 can be bounded as follows −2(N − 1)βΔf e −2(N −1)β(n 1 +n 2 )Δf ≤ P GR 1→ 1 ≤ e −2(N −1)βΔf e −2(N −1)β(n 1 +n 2 )Δf − 1 and the expected duration of the game will be E T GR 1, Proof We start with the winning probability. Invoking Theorem 4 and simplifying the ratio of p fix of symmetric fitness variations with Lemma 8 we obtain Notice that this is the same expression as the acceptance probability if we change β for (N − 1)β and N for . Then we can apply the bounds for the original acceptance probabilities from Lemma 9 to obtain the inequalities of the theorem's statement.
Finally, for the expected duration of the game we invoke Theorem 4 again; the last step of that calculation uses Lemma 8.

While the optimisation time of the (1+1) EA grows exponentially with the length of the valley, the following theorem shows that for SSWM the growth is exponential in the depth of the valley.
The conditions $\beta d_1/\ell_1, \beta d_2/\ell_2 = \Omega(1)$ are identical to those in Lemma 10: SSWM must have a selection strength $\beta$ strong enough such that the probability of accepting a move uphill (fitness difference of $d_1/\ell_1$ or $d_2/\ell_2$) is $\Omega(1)$. This is a necessary and sensible condition as otherwise SSWM struggles to climb uphill.
The upper and lower bounds in Theorem 11 are not tight because of the terms $(\ell_1 + 1)/\ell_1$ and $(\ell_1 - 1)/\ell_1$ in the exponents, respectively. However, both these terms converge to 1 as $\ell_1$ grows. The running time, particularly the term $e^{2N\beta d_1(\ell_1+1)/\ell_1}$, crucially depends on $\beta d_1$, the depth of the valley after scaling. Note that the condition $\beta d_1/\ell_1 = \Omega(1)$ is equivalent to $\beta d_1 = \Omega(\ell_1)$, hence Theorem 11 applies if the depth after scaling is at least of the same order of growth as the length (recall that $d_1$ and $\ell_1$ may grow with $n$).
Theorem 11 also indicates how to choose $\beta$ according to the valley function at hand, in order to meet the theorem's condition and to minimise the (upper bounds on the) running time. One can always choose $\beta = \varepsilon\ell_1/d_1$ for some constant $\varepsilon > 0$ and any valley structure (even when $\ell_1 = \omega(d_1)$). This way the theorem's condition becomes $\beta d_1/\ell_1 = \varepsilon$ and the running time simplifies to $O(n \cdot e^{2N\varepsilon(\ell_1+1)}) + \Theta(n \cdot \ell_2)$, where we can choose the constant $\varepsilon > 0$ as small as we like. For $N = O(1)$ we can further simplify the runtime to $O(n \cdot e^{O(\ell_1)}) + \Theta(n \cdot \ell_2)$. For all $\ell_1 \ge 2$ (and reasonable $\ell_2$) this is asymptotically smaller than the expected optimisation time of the (1+1) EA, which is at least $\Omega(n^{\ell_1}) = \Omega(e^{\ell_1 \ln n})$ (see Theorem 3).
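The choice of $\beta$ discussed above can be illustrated with a small Python sketch. It sets $\beta = \varepsilon\ell_1/d_1$, evaluates the exponent $2N\beta d_1(\ell_1+1)/\ell_1$ of the dominant term in the SSWM upper bound, and compares the resulting bound with $n^{\ell_1}$ on a logarithmic scale. Setting every constant hidden in the asymptotic notation to 1 is an assumption made purely for illustration, and the function names are hypothetical.

```python
import math

def beta_for_valley(eps, l1, d1):
    """Selection strength beta = eps * l1 / d1 suggested in the text."""
    return eps * l1 / d1

def sswm_upper_exponent(N, beta, d1, l1):
    """Exponent 2*N*beta*d1*(l1+1)/l1 of the dominant term in the upper bound."""
    return 2 * N * beta * d1 * (l1 + 1) / l1

def log_sswm_upper(n, N, eps, l1, l2, d1):
    """ln(n * e^exponent + n * l2), hidden constants set to 1 (assumption)."""
    beta = beta_for_valley(eps, l1, d1)
    expo = sswm_upper_exponent(N, beta, d1, l1)
    return math.log(n) + math.log(math.exp(expo) + l2)

def log_ea_lower(n, l1):
    """ln(n^l1), the (1+1) EA lower bound with its hidden constant set to 1."""
    return l1 * math.log(n)

# Example values chosen only for illustration
n, N, eps, l1, l2, d1 = 1000, 2, 0.1, 5, 5, 10.0
beta = beta_for_valley(eps, l1, d1)
exponent = sswm_upper_exponent(N, beta, d1, l1)
sswm_log, ea_log = log_sswm_upper(n, N, eps, l1, l2, d1), log_ea_lower(n, l1)
```

With this choice of $\beta$ the exponent equals $2N\varepsilon(\ell_1+1)$, so it no longer depends on $d_1$; for the example values the SSWM expression is many orders of magnitude below $n^{\ell_1}$.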

Proof of Theorem 11
The first part of the proof consists of estimating $E(T_{1\to\ell_1})$ by using the statement of Lemma 7. Then we will check that the conditions from Lemma 5 are met and we will add the $\Theta(\ell_2)$ term. Finally, we will take into account the time needed for a relevant step in the long path to obtain the $n$ factor in the bounds (see Lemma 2).
We start with $E(T_{1\to\ell_1})$, using Lemma 7, and first consider the upper bound.
Using Lemma 10 we bound $p^{\mathrm{GR}}_{1\to\ell_1}$. We then use that $p_{\mathrm{fix}}$ for $\Delta f < 0$ decreases when the parameters $N$, $\beta$ and $|\Delta f|$ increase, and apply Lemma 9 to lower bound $p_{0\to 1}$. Using $(N-1)\beta d_1/\ell_1 = \Omega(1)$ and $\beta d_1/\ell_1 = \Omega(1)$, both terms in the denominator are $\Omega(1)$, which leads to the upper bound.

We now consider the lower bound. It follows by starting again from Lemmas 5 and 7 and bounding $p^{\mathrm{GR}}_{1\to\ell_1}$ with Lemma 10.

Now we need to apply Lemma 5 to add the $\Theta(\ell_2)$ term in both bounds. We start by checking that all the conditions are satisfied. Firstly, since $p_{\mathrm{fix}}$ for $\Delta f > 0$ increases when the parameters ($N$, $\beta$ and $\Delta f$) increase, $N\beta d_2/\ell_2 = \Omega(1)$ implies $p_{\ell_1\to\ell_1+1} = \Omega(1)$. The same holds analogously for $p_{\ell_1\to\ell_1-1}$ with $N\beta d_1/\ell_1 = \Omega(1)$, satisfying the first property. Secondly, property (ii) follows directly from Lemma 8 and the condition $N\beta d_2/\ell_2 = \Omega(1)$. The third property is satisfied since for $N > 1$ the acceptance probability is strictly increasing with $\Delta f$. Considering the time for a relevant step from Lemma 2 completes the proof.
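The Gambler's Ruin winning probability on which this proof relies can be checked numerically. The sketch below assumes the usual form of Eq. (1) for $p_{\mathrm{fix}}$ and the classical winning-probability formula $(1 - (p_2/p_1)^{n_1})/(1 - (p_2/p_1)^{n_1+n_2})$ for $p_1 \ne p_2$; it then confirms that, with $n_1 = 1$ and $n_1 + n_2 = \ell$, the result coincides with $p_{\mathrm{fix}}$ evaluated with $\beta$ replaced by $(N-1)\beta$ and $N$ replaced by $\ell$. Function names and parameter values are hypothetical.

```python
import math

def p_fix(df, beta, N):
    """SSWM acceptance probability (assumed form of Eq. (1))."""
    if df == 0:
        return 1.0 / N
    return (1 - math.exp(-2 * beta * df)) / (1 - math.exp(-2 * N * beta * df))

def gamblers_ruin_win(p1, p2, n1, n2):
    """Classical probability that player one wins, valid for p1 != p2."""
    r = p2 / p1
    return (1 - r ** n1) / (1 - r ** (n1 + n2))

def sswm_ruin_win(df, beta, N, length):
    """Winning probability with n1 = 1, n2 = length - 1 and SSWM-induced p1, p2."""
    down, up = p_fix(df, beta, N), p_fix(-df, beta, N)
    p1, p2 = down / (down + up), up / (down + up)
    return gamblers_ruin_win(p1, p2, 1, length - 1)

# Example values chosen only for illustration (df < 0: player one moves downhill)
df, beta, N, length = -0.5, 1.0, 3, 6
win = sswm_ruin_win(df, beta, N, length)
substituted = p_fix(df, (N - 1) * beta, length)
lower = -2 * (N - 1) * beta * df / math.exp(-2 * (N - 1) * beta * length * df)
upper = math.exp(-2 * (N - 1) * beta * df) / (math.exp(-2 * (N - 1) * beta * length * df) - 1)
```

For these values `win` and `substituted` agree up to rounding, and `win` lies between `lower` and `upper`, in line with the bounds stated in Lemma 10.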

Application to Metropolis
We now apply the framework from Sect. 3.2 to the Metropolis algorithm. Since the analysis closely follows that of SSWM, the proofs for this subsection are provided in the appendix. We first cast Metropolis on Valley as a Gambler's Ruin problem. Like SSWM, Metropolis can make use of its parameter $\alpha$ to accommodate the gradient of Valley. Lastly, we make use of the previous lemma and the framework presented in Sect. 3.2 to determine bounds on the runtime of Metropolis for Valley. Note that the required conditions are similar to those from Theorem 11 for the SSWM algorithm, the only difference being that the parameter $\alpha$ substitutes the selection strength $\beta$. Hence the previous considerations for SSWM translate to Metropolis on Valley by simply applying $\beta \leftarrow \alpha$.
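The behaviour of both non-elitist algorithms on a single valley can be explored with a small simulation. The sketch below is a simplification under stated assumptions: it tracks only the position on the path (relevant steps, so the factor $n$ for hitting the right bit is ignored), it uses a piecewise-linear valley with slopes $d_1/\ell_1$ and $d_2/\ell_2$, it proposes the left or right neighbour with equal probability, and it rejects proposals that would leave the path at $P_0$. The acceptance functions assume the usual forms of Eq. (1) and Eq. (2); all names and parameter values are hypothetical.

```python
import math
import random

def valley(i, l1, l2, d1, d2):
    """Assumed valley: down by d1 over l1 steps, then up by d2 over l2 steps."""
    if i <= l1:
        return d1 - i * d1 / l1
    return (i - l1) * d2 / l2

def p_fix(df, beta, N):
    """SSWM acceptance probability (assumed form of Eq. (1))."""
    if df == 0:
        return 1.0 / N
    return (1 - math.exp(-2 * beta * df)) / (1 - math.exp(-2 * N * beta * df))

def metropolis_acc(df, alpha):
    """Metropolis acceptance probability (assumed form of Eq. (2))."""
    return 1.0 if df >= 0 else math.exp(alpha * df)

def relevant_steps(accept, l1, l2, d1, d2, rng, max_steps=10**6):
    """Count proposals until the end of the valley is reached (None if cut off)."""
    pos, steps, end = 0, 0, l1 + l2
    while pos != end and steps < max_steps:
        steps += 1
        nxt = pos + rng.choice((-1, 1))
        if nxt < 0:
            continue  # proposal leaves the path: rejected in this toy model
        df = valley(nxt, l1, l2, d1, d2) - valley(pos, l1, l2, d1, d2)
        if rng.random() < accept(df):
            pos = nxt
    return steps if pos == end else None

rng = random.Random(1)  # fixed seed so that the run is reproducible
metropolis_steps = relevant_steps(lambda df: metropolis_acc(df, 2.0), 3, 3, 1.0, 2.0, rng)
sswm_steps = relevant_steps(lambda df: p_fix(df, 1.0, 3), 3, 3, 1.0, 2.0, rng)
```

A single run says little about expected runtimes; averaging over many seeds while varying $d_1$ gives an empirical feel for the dependence on the depth of the valley.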

Crossing Concatenated Valleys
We define a class of functions called ValleyPath consisting of $m$ consecutive valleys of the same size. Each of the consecutive valleys is shifted such that the fitness at the beginning of each valley is the same as that at the end of the previous valley (see Fig. 3). Fitness values from one valley to the next valley increase by an amount of $d_2 - d_1$. Here $0 < j \le m$ indicates a valley while $0 \le i \le \ell_1 + \ell_2 = \ell$ indicates the position in the given valley. Hence, the global optimum is the path point $P_{m\cdot\ell}$.
ValleyPath represents a rugged fitness landscape with many valleys and many local optima (peaks). It loosely resembles a "big valley" structure found in many real-world problems [1,19,22,30]: from a high-level view the concatenation of valleys indicates a "global" gradient, i.e. the direction towards valleys at higher indices. The difficulty for optimisation algorithms is to overcome these many local optima and to still be able to identify the underlying gradient. We show here that both SSWM and Metropolis are able to exploit this global gradient and find the global optimum efficiently. Note that ValleyPath is a very broad function class in that it allows for many shapes to emerge, from few deep valleys to many shallow ones. Our results hold for all valley paths with $d_1 < d_2$.
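The shifting of consecutive valleys can be made concrete with a minimal Python sketch of a ValleyPath-style fitness function. It assumes a piecewise-linear valley that descends by $d_1$ over $\ell_1$ steps and then ascends by $d_2$ over $\ell_2$ steps, and it starts the first valley at fitness $d_1$; this offset convention and the function names are assumptions made for illustration and need not match the exact definition used in the paper.

```python
def valley(i, l1, l2, d1, d2):
    """Assumed valley: down by d1 over l1 steps, then up by d2 over l2 steps."""
    if i <= l1:
        return d1 - i * d1 / l1
    return (i - l1) * d2 / l2

def valley_path(j, i, l1, l2, d1, d2):
    """Fitness at position i (0..l1+l2) of valley j (1..m), each valley shifted by d2 - d1."""
    return (j - 1) * (d2 - d1) + valley(i, l1, l2, d1, d2)

# Example values chosen only for illustration: three valleys with l1 = l2 = 2
l1, l2, d1, d2, m = 2, 2, 1.0, 2.0, 3
starts = [valley_path(j, 0, l1, l2, d1, d2) for j in range(1, m + 1)]
ends = [valley_path(j, l1 + l2, l1, l2, d1, d2) for j in range(1, m + 1)]
bottoms = [valley_path(j, l1, l2 and l1, l2, d1, d2) for j in range(1, m + 1)]
```

In this sketch the end of each valley has the same fitness as the start of the next one, and peaks as well as valley bottoms rise by $d_2 - d_1$ from one valley to the next, which is the "global" gradient described above.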
As in the analysis for Valley, instead of considering the whole Markov chain underlying ValleyPath we take a high-level view and consider the chain that describes transitions between neighbouring peaks. Since the peaks have increasing fitness, this chain is quite simple and allows for an easy application of drift arguments. By choosing the number of peaks to the right of the current peak as distance function, the next theorem shows that, if we can find constant bounds for the drift, we will only need to repeat the Valley experiment for as many peaks as there are in ValleyPath.

Theorem 14 Consider any algorithm described by Algorithm 1 with local mutations on ValleyPath. Consider the points in time where the algorithm is on a peak and focus on transitions between different peaks. Let $X_t$ be a random variable describing the number of peaks to the right of the current valley at the $t$-th time a different peak is reached. If the drift over peaks $\Delta$ can be lower bounded by some positive constant

$$\Delta := E(X_t - X_{t+1} \mid X_t = i) \ge c > 0 \qquad (9)$$

then the expected number of function evaluations $E(T_f)$ to reach the optimum starting from any peak is

$$E(T_f) = O\left(m \cdot E(T^{O}_{\mathrm{Valley}})\right) \quad \text{and} \quad E(T_f) = \Omega\left(m \cdot E(T^{\Omega}_{\mathrm{Valley}})\right),$$

where $m$ is the number of valleys that compose ValleyPath, and $E(T^{O}_{\mathrm{Valley}})$ and $E(T^{\Omega}_{\mathrm{Valley}})$ are the upper and lower bounds for Valley respectively.

Proof The lower bound is trivial since the algorithm can only move to a neighbouring peak and has to visit $m$ peaks. The upper bound follows from the standard additive drift theorem [10,18].
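The role of the drift over peaks can be illustrated with an exact calculation on a toy model. The sketch below assumes that, from every peak except the first, the next different peak reached is the right neighbour with a fixed probability `p_right` and the left neighbour otherwise, and that from the first peak it is always the right neighbour. Under this assumption the drift is $2 \cdot p_{\mathrm{right}} - 1$, and the expected number of peak-to-peak transitions can be compared with the additive drift bound $m/\Delta$. The model and the function names are hypothetical simplifications, not the paper's analysis.

```python
def expected_peak_transitions(m, p_right):
    """Exact expected number of peak-to-peak transitions needed to advance m peaks."""
    if not 0.5 < p_right <= 1.0:
        raise ValueError("need p_right in (0.5, 1] for a positive drift")
    total = 0.0
    h = 1.0  # from the first peak the next different peak is always to the right
    for k in range(m):
        if k > 0:
            # h_k = 1 + (1 - p) * (h_{k-1} + h_k), solved for h_k
            h = (1.0 + (1.0 - p_right) * h) / p_right
        total += h
    return total

def drift_bound(m, p_right):
    """Additive drift bound m / Delta with Delta = 2 * p_right - 1."""
    return m / (2.0 * p_right - 1.0)

# Example values chosen only for illustration
transitions = expected_peak_transitions(10, 0.75)
bound = drift_bound(10, 0.75)
```

In this model the exact value lies between the trivial lower bound $m$ and the drift bound $m/\Delta$; each transition then costs at most the time to cross one valley, which is where the Valley bounds enter.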
To compute the drift over the peaks $\Delta$ [see Eq. (9)] needed to use the previous theorem, we perform a slightly different abstraction over the ValleyPath problem. Apart from the peaks (local maxima), we will also consider the points of minimal fitness between them (valleys). For simplicity we will use the following notation. We can break down this term using the four probabilities from Definition 3.

Application for SSWM and Metropolis
In the next two theorems we apply the previous results on ValleyPath to the SSWM and Metropolis algorithms. The application is straightforward when the parameter $\lambda = \Omega(1)$ and the depths of the valley ($d_1$ and $d_2$) differ by some positive constant.
Notice that it could be the case that $d_2 - d_1$ is smaller than a constant but the parameters are big enough to compensate for this effect and still have a positive drift over peaks. However, this increase in the parameters will affect the optimisation time between peaks (i.e. the Valley problem). Note that, by applying Theorem 3, it is easy to see that the runtime of the (1+1) EA will be exponential in the length of the individual valleys, hence the algorithm will be efficient only for valley paths consisting of valleys of moderate length. The remaining conditions that Theorems 17 and 18 require are those already required in the analysis for Valley (see Theorems 11 and 13).

Theorem 17 The expected number of function evaluations $E(T_f)$ for SSWM to reach the optimum starting from any peak on ValleyPath

Proof Due to Lemma 8, SSWM meets the exponential ratio property needed by Lemma 16 with $\lambda = 2\beta(N-1)$.

An equivalent result to that of SSWM for ValleyPath is shown for Metropolis in the following theorem.
Proof The proof follows exactly as the proof of Theorem 17 with the only difference that λ = α [see Eq. (2)].
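The values of $\lambda$ used in the two proofs can be checked numerically. The sketch below reads the exponential ratio property as $p_{\mathrm{acc}}(-\Delta f)/p_{\mathrm{acc}}(\Delta f) = e^{-\lambda\Delta f}$ for $\Delta f > 0$; this reading is an assumption, since the exact statement of the property is not reproduced in this section. The acceptance functions assume the usual forms of Eq. (1) and Eq. (2), and the function names and parameter values are hypothetical.

```python
import math

def p_fix(df, beta, N):
    """SSWM acceptance probability (assumed form of Eq. (1))."""
    if df == 0:
        return 1.0 / N
    return (1 - math.exp(-2 * beta * df)) / (1 - math.exp(-2 * N * beta * df))

def metropolis_acc(df, alpha):
    """Metropolis acceptance probability (assumed form of Eq. (2))."""
    return 1.0 if df >= 0 else math.exp(alpha * df)

def sswm_ratio(df, beta, N):
    """Ratio of acceptance probabilities for symmetric fitness variations (SSWM)."""
    return p_fix(-df, beta, N) / p_fix(df, beta, N)

def metropolis_ratio(df, alpha):
    """Ratio of acceptance probabilities for symmetric fitness variations (Metropolis)."""
    return metropolis_acc(-df, alpha) / metropolis_acc(df, alpha)

# Example values chosen only for illustration
beta, N, alpha = 0.7, 4, 1.5
checks = [
    (sswm_ratio(df, beta, N), math.exp(-2 * beta * (N - 1) * df),
     metropolis_ratio(df, alpha), math.exp(-alpha * df))
    for df in (0.1, 0.5, 1.0)
]
```

Each tuple in `checks` pairs a computed ratio with the corresponding $e^{-\lambda\Delta f}$, using $\lambda = 2\beta(N-1)$ for SSWM and $\lambda = \alpha$ for Metropolis; the pairs agree up to rounding.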
Note that our approach can be extended to concatenations of valleys of different sizes, assuming $d_1 < d_2$ for each valley. In this case the expression of the runtime would be dominated by the deepest valley.

Conclusions
We presented an analysis of randomised search heuristics for crossing fitness valleys where no mutational bias exists and thus the probability for moving forwards or backwards on the path depends only on the fitness difference between neighbouring search points. Our focus was to highlight characteristics of valleys where an elitist selection strategy should be preferred to a non-elitist one and vice versa. In particular, we compared the (1 + 1) EA using standard bit mutation with elitism against two algorithms using local mutations with non-elitism, namely SSWM and Metropolis. To achieve our goals we presented a mathematical framework to allow the analysis of non-elitist algorithms on valleys and paths of concatenated valleys. We rigorously proved that while the (1 + 1) EA is efficient for valleys and valley paths up to moderate lengths, both SSWM and Metropolis are efficient when the valleys and valley paths are not too deep. A natural direction for future work is to extend the mathematical framework to allow the analysis of SSWM with global mutations, thus highlighting the benefits of combining both non-elitism and global mutations for overcoming local optima.