mRNA codon optimization on quantum computers

Reverse translation of polypeptide sequences to expressible mRNA constructs is a NP-hard combinatorial optimization problem. Each amino acid in the protein sequence can be represented by as many as six codons, and the process of selecting the combination that maximizes probability of expression is termed codon optimization. This work investigates the potential impact of leveraging quantum computing technology for codon optimization. An adiabatic quantum computer (AQC) is compared to a standard genetic algorithm (GA) programmed with the same objective function. The AQC is found to be competitive in identifying optimal solutions and future generations of AQCs may be able to outperform classical GAs. The utility of gate-based systems is also evaluated using a simulator resulting in the finding that while current generations of devices lack the hardware requirements, in terms of both qubit count and connectivity, to solve realistic problems, future generation devices may be highly efficient.


Introduction
Protein sequences can be encoded by an enormous multitude of possible nucleotide sequences. The degenerate mapping between amino acids and synonymous codons entails an exponential relationship between the number of potential nucleotide sequences and the length of the polypeptide chain. However, different nucleotide sequences encoding the same protein may exhibit dramatically different outcomes in expression systems. [1][2][3][4] Furthermore, recent studies have shown that codon selection can impact downstream processes such as protein folding and function, 1-3 which is particularly important for use-cases such as recombinant protein therapies. 5 Codon optimization is a procedure designed to increase gene expression based on a heuristic scoring function 6 with many scoring functions having been proposed. [7][8][9][10] Some of the more common scoring functions seek to optimize the fraction of G and C bases, [11][12][13][14][15] match the codon usage bias of the host expression system, [16][17][18][19][20][21] and/or attempt to disrupt the formation of mRNA secondary structure. 7,17,22 The vast solution space is most commonly sampled using genetic algorithms (GA) that seek to evolve solutions by introducing synonymous codon mutations and propagating favorable substitutions through generations. 12,16,20,[23][24][25] However, other methods have been proposed. 18,26,27 While classical approaches such as GAs can be highly performant, the fraction of solution space that is sampled in a fixed number of iterations decreases exponentially as the polypeptide chain length grows. Thorough sampling of the solutions space is therefore often intractable with biologically relevant use-cases. In this study we An alternative quantum computing technology, and the technology that was recently used to demonstrate quantum supremacy, 28,29 centers on gate-based instructions. Current generations of non-error corrected hardware, termed Noisy Intermediate Scale Quantum (NISQ), 36 can be programmed to solve combinatorial optimization problems using variational methods such as the Quantum Approximate Optimization Algorithm (QAOA). 37 Gate based quantum computers are presently, however, less mature than AQCs and even the most capable devices to date lack the number of qubits and connections between qubits needed to solve realistic combinatorial optimization problems. As the technology matures however there is an expectation that qubit count and connections between qubits will improve substantially.
There is also the expectation that error rates will decrease, and general-purpose error correction may become possible. Developing and testing suitable algorithms ahead of the technology development curve is thus a worthwhile endeavor and it is therefore common practice to utilize simulators, running on classical computing hardware, to permit evaluation of quantum algorithmic functions in the absence of real-world hardware to evaluate performance on.
Designing novel algorithms to execute on quantum devices requires deep expert knowledge of the devices, quantum information science, and quantum software stacks.
However, there are a few classes of ubiquitous problems that are readily solvable using tools built into many widely available software packages. The Binary Quadratic Model (BQM) is perhaps the most archetypal example and can be found at the core of many familiar problems such as the Ising model. 38,39 The energy of a BQM can be described by a Hamiltonian with the general form where qi, qj, and qk represent the values of the qubits, which can either be {0, 1} or {-1, 1} for binary or spin representations, respectively, hi are the one-body terms, and Jjk are the two-body interactions. For the Ising model, h represents the physical spins and J represents the energy of the interactions between the spins.
The methods section below shows that the codon optimization problem can be mathematically formulated as a BQM and thus implemented on a variety of competing quantum computing platforms. This representation requires translating the scoring function used in traditional approaches into a quadratic Hamiltonian where the eigenstates represent nucleotide sequences and the eigenenergies represent the scores. Implementing this program on the D-Wave Advantage 1.1 AQC shows that it identifies high quality solutions that are competitive with a basic implementation of a genetic algorithm programmed with an equivalent scoring function. Implementing a version of this program for IBM Q devices, while successful, shows that modelling practical systems requires too many qubits to be run on even the most advanced gatebased devices available (e.g. IBM's 65-qubit Hummingbird device). 40 However, executing the model on an IBM noisy simulator 41 demonstrates the potential of the algorithm. Finally, we comment on the potential usability of the BQM approach on current and future generations of quantum computers.

Results
Codon optimization was implemented as a BQM on quantum devices with a Hamiltonian designed to optimize GC-content, minimize sequentially repeated nucleotides, and optimize codon-usage bias (see equation (15) in Methods for details).
Quantum devices use qubits to store data, which decode digitally to 0's and 1's upon measurement, but which also may be in a superposition of 0 and 1 during the calculation. To encode classical genetic data into a quantum device, every possible codon that can map to the target polypeptide sequence is required to be explicitly represented by a physical qubit. The qubits which return "1" upon measurement represent the codons selected at each position in the polypeptide sequence. Therefore, only 1 qubit (codon) for each position in the polypeptide sequence can be in a "1" state, and the rest must return "0" upon measurement ( Figure 1a). This scheme is enforced by  to qubits. Gray boxes represent qubits labeled q0…q3, and the codon assigned to each qubit is shown next to the qubit label. (b) A penalty matrix is constructed to add infinite energy to cases where more than one codon is in the "1" state. In this example, qubit q2 is in the "1" state and the rest are in the "0" state, which returns an energetic penalty equal to 0. (c) Example mapping of codons for protein sequence GSK… to qubits. One codon is selected for each position in the sequence, highlighted in orange.

AQC accuracy and quality of scores
The goal of the optimization is to find the combination of codons that minimizes the Hamiltonian (or the objective function for the GA). In theory, AQCs should be able to find the ground state of the input Hamiltonian. However, due to thermal fluctuations and limited quantum processing unit (QPU) time, low energy solutions that are near, but not equal to the ground state are expected. Furthermore, as the size of the problem increases, the probability of annealing to an optimal eigenvalue decreases. See the "BQM challenges and limitations" section in the Supplementary Information for further discussion of technical challenges specific to heavily constrained problems such as codon optimization.
While it is not possible to calculate the true ground state for large problems, comparisons can be made to other approximate methods, e.g., GA approaches, to contextualize the results and performance of the AQC approach. Additionally, D-Wave offers hybrid solvers that augment the AQC with classical methods to optimize the results. Direct programming of the AQC yielded excessively noisy results with high variance in the estimation of the ground state. The hybrid solver was thus selected to more reliably optimize the input Hamiltonian.

Peptides of length 20
The baseline performance of the AQC implementation of codon optimization was

Full-length proteins
The Leap Hybrid solver is capable of solving codon optimization problems expressed as a BQM with up to ~1,000 amino acids. A selection of full-length sequences (see Test Applications in Methods) was run on both the AQC and the GA

Relationship between scoring function and scalability on quantum hardware
The choice of scoring function can have a dramatic impact on the size of the model. For example, the Hamiltonian representing GC-content requires explicit calculation of all off-diagonal elements of the two-body interaction matrix. In other words, all logical qubits must be fully connected. The D-Wave Advantage system is state-of-the-art in terms of number of qubits and connectivity, described by a P16 Pegasus graph, 42 and the best minor embedding scheme for a fully connected graph of logical qubits was heuristically found to support a maximum of 180 nodes when interfacing directly with the QPU. However, there are two-body Hamiltonians that do not require couplers between all logical qubits. The Hamiltonian penalizing sequentially repeated nucleotides only requires couplers between qubits mapping to neighboring sequence positions. In this case, the task of minor embedding is straightforward and would only require a few additional physical qubits to represent the required logical qubits on the D-Wave Advantage system.

Gate-based simulator results
The gate-based approach was simulated using the IBM Qasm noisy simulator, which can simulate up to 24 fully connected qubits. The BQM implementation was identical to the scheme described by the AQC (Figure 1), and the optimization was carried out using QAOA. 37 The codon optimization simulation was run on 313 peptides of length four and one peptide of length three, each requiring between 7-24 logical qubits with full connectivity. In the most resource-intensive scenario, there could be four amino acids in a row that each map to six qubits, and the algorithm is tasked with finding the ground state out of 6 4 = 1296 possible states using 24 qubits (the maximum supported by the simulator). The task is therefore small enough to compute the exact result with the NumPyMinimumEigensolver exact solver as a comparison to the simulation.
The simulated scores vs the exact scores are shown in Figure 4a. The quantum algorithm identified the exact solution for 59 peptides, and overall, 84% of the trials yielded valid results. However, there were 51 cases where the quantum algorithm returned results which had an invalid mapping between codons and qubits. In each of the invalid trials there was at least one amino acid position lacking a selected codon.
This type of error was more common in cases where more qubits were required to run the simulation (minimum 16 qubits), and therefore the optimization task was more difficult. The success of the calculation was dependent on the random seed used for the noisy simulator; rerunning trials that failed to return valid results with different random seeds changed the outcome and, in each case, a valid result was eventually found. The instability of the simulation and its dependence on the number of qubits required to run the simulation is an artifact not just of the simulated noise but also an inherent consequence of the variational algorithm employed by the simulator, which is not

Discussion
Codon optimization is a classic example of a biological problem with exponential scaling in solution space. While there exist classical machine learning and artificial intelligence methods that improve sampling, quantum technology may offer an alternative or complimentary approach to enhance the ability to identify optimal samples from these distributions. Current generations of quantum hardware are mature enough to test ideas for novel algorithmic approaches to problems in life sciences but are not yet capable of outperforming classical devices. In this study codon optimization is reformulated to be readily implementable on quantum devices and the viability of the method is demonstrated on both adiabatic and gate-based quantum computers.
The D-Wave Systems Advantage 1.1 system was heuristically determined to be capable of solving codon optimization problems programmed directly into the QPU up to approximately 180 codons with scoring functions requiring all logical qubits to be fully connected, which maps to 30 amino acids in the most resource intensive cases. If a scoring function is selected that does not require full connectivity, such as the Hamiltonian describing repeated nucleotides, then nearly all of the logical qubits could map directly to physical qubits and the system size could theoretically be scaled to accommodate sequences of up to ~1,000 amino acids. There is likely significant room for performance improvements in the quantum, hybrid, and classical methods. As AQC hardware matures it will be possible to address questions of scalability for larger sequences and critically assess the potential for quantum technology to surpass classical techniques.
The IBM Experience provides free access to small quantum devices and noisy simulators. The largest quantum device freely available, Melbourne, contains 15 qubits with 2-3 couplers between them. The BQM presented in this study requires full connectivity between the logical qubits. Melbourne is therefore able to represent a select set of 2-amino acid systems in which each amino acid maps to one or two codons. Meanwhile the IBM Qasm simulator allows up to 24 fully connected simulated qubits, which is generally limited to 4 amino acids. This simulation approach is sufficient to use QAOA to solve small proof of concept problems, but realistically thousands of qubits with high connectivity are required to run biologically relevant sequences. Given IBM's current public Roadmap for Scaling Quantum Technology, 40 devices with this capacity are not expected to be available to the public before 2024. Utilizing the simulator, the QAOA algorithm had variable performance, identifying the true ground state solution in nearly 20% of the trials, and failing to identify a valid solution in 16% of the trials. As noted above the success of the calculation was dependent on the random seed used for the noisy simulator; valid results for each failed run can be obtained by rerunning with different random seeds until a valid result is found. Further investigation on physical devices is required to determine the limit of the accuracy and precision of the method.
There is a vast body of literature discussing codon optimization techniques for protein expression. This study contributes to the field by offering a novel approach to sampling the vast solution space with an emerging technology. The considerations that went into the construction of the Hamiltonian were designed to highlight some of the implementational nuances associated with common scoring functions. For example, the expression for measuring optimality of GC-content requires full connectivity between qubits. However, counting repeated nucleotides between neighbors only requires some qubits to be coupled. Codon usage bias can be factored into the Hamiltonian using onebody terms, but one could imagine a more thorough approach which compares the distribution of codons selected in the sequence compared to the reference distribution, which would require two-body terms. Similarly, implementing the popular "one amino acid-one codon" strategy would require codons from each type of amino acid to be coupled, 18,21,43 adding many two-body terms but not as many as optimization of GC-content. Thus, the particular Hamiltonian studied here serves as a demonstration of the efficacy of the method at evaluating a realistically complicated objective function.
Further studies are required to evaluate these samples beyond simple numerical scores to determine the value this approach could add to the field of protein expression using future generations of quantum hardware. Current quantum hardware is subject to high levels of noise and is therefore not competitive with classical techniques in most practical applications. However, quantum hardware and techniques are advancing exponentially in terms of both scale and tolerance to noise. This rapid advancement coupled with the expectation that scaling with problem size is significantly reduced on quantum hardware compared with classical methods implies that future quantum approaches could provide a significant performance advantage for optimization of NP problems in life sciences.

Codon optimization algorithm
Classical scoring functions can be reinterpreted as a Hamiltonian by separating one-and two-body interaction terms. There is considerable research and ongoing debate of the proper way to score nucleotide sequences for expression, 5,8,[18][19][20][21][24][25][26][27]43,[10][11][12][13][14][15][16][17] as well as arguments against the use of codon optimization in some cases. 5 The purpose of this study is not to contribute to the discussion of whether or not codon optimization is an appropriate tool for any given context or the optimal way in which it should be performed but rather to investigate a novel method of sampling the vast solution space. The optimization task is therefore restricted to three considerations that capture simple countable properties from the sequences themselves. This includes discussion of how to formulate energetic terms in the Hamiltonian that serve to minimize: 1. Codon usage bias.

Incorporating codon usage bias
Codon usage frequency varies by host system. 44 Therefore, the scoring function is tailored to match the expression system. For this study, codon usage frequencies for e. coli are imported from the python-codon-tables 0.1.10 library. 45 Let Ci represent the frequency of finding codon C at position i. The potential is thus designed to return a large penalty for rare codons (where Ci is small) and incur a negligible penalty for codons readily available to the system (where Ci is large). One such function is the log of the inverse multiplied by -1. This function yields the desired behavior of adding large penalties to rare codons and adding small penalties to accessible codons. However, the function is undefined at Ci = 0. If a host system truly does not have access to a given codon, then any sequence containing the codon is not expressible, and the probability of expression would be zero. However, the scoring function must be restricted to finite decimal values, so an infinitesimal value, εf, is added to the denominator to avoid undefined values. For a system containing N possible codons, the Hamiltonian is given by: where cf is a tunable constant, qi ∈ {0,1} is the value of the qubit, and ζ is a vector containing the values of the log inverse codon usage frequencies. Given the binary qi values, the Hamiltonian only penalizes codons that are "selected", represented by qubits with value qi = 1.

Optimize target GC concentration
To optimize the GC concentration of a nucleotide sequence, rGC, a cost function ∆ must be introduced to minimize the difference between rGC and the target GC concentration, rT. The simplest objective function satisfying this constraint is a quadratic function, where cGC is a tunable constant. The GC content is calculated by summing the number of G's and C's in the sequence of length N and normalizing by the number of nucleotides in the sequence where si is an integer representing the number of G's and C's in codon i, and qi represents the value of qubit i. By expanding equation (3), a form similar to a Binary Quadratic Model (BQM) formulation becomes apparent: The matrix represented in the double sum needs to be restricted to a sum over the upper triangular elements, consistent with equation (1). By decomposing the sum into the trace and a term that sums the contributions of the off-diagonal elements, the sum can be restricted to the upper triangular elements. The trace requires a single summation over si 2 . Since qubits map to binary values, they are idempotent with themselves and therefore qi 2 = qi.
Since the matrix is symmetric, all off-diagonal terms are accounted for in a upper  (1)).
To negate the possibility of assigning more than one codon to a given position, a site-specific delta-function is introduced that applies an effectively infinite penalty to pairs of codons assigned to the same position. Expanding and rearranging the terms gives the form in equation (15): This form is consistent with the BQM formalism (equation (1)) and can be directly implemented into BQM frameworks.

Algorithm implementations
The current approach to performing calculations on quantum devices requires the interaction terms to be precomputed on classical devices and read into the quantum devices via specialized APIs. The one-and two-body interaction terms from equation (15) were precomputed in python 3.7 using standard libraries and numpy 46 arrays. The numpy arrays were converted to dictionaries in accordance with the expected input for the quantum device libraries.
The execution of the BQM was carried out using libraries described in the following sections. Each calculation was run 20 times for small peptide fragments (<100 residues), and full-length proteins were only run one time due to QPU resource constraints. Table 1 provides the values of the constants used in the objective function.
The eigenstate returned by the quantum devices was converted back to a nucleotide sequence and translated to a polypeptide sequence to verify the validity of the result.