DNA Replication and Strand Asymmetry in Prokaryotic and Mitochondrial Genomes

Different patterns of strand asymmetry have been documented in a variety of prokaryotic genomes as well as mitochondrial genomes. Because different replication mechanisms often lead to different patterns of strand asymmetry, much can be learned about replication mechanisms by examining strand asymmetry. Here I summarize the diverse patterns of strand asymmetry among different taxonomic groups to suggest that (1) single-origin replication may not be universal among bacterial species, as the genomes of the endosymbionts Wigglesworthia glossinidia and Wolbachia species, the cyanobacterium Synechocystis 6803 and Mycoplasma pulmonis all exhibit strand asymmetry patterns consistent with multiple origins of replication, (2) different replication origins in some archaeal genomes leave quite different patterns of strand asymmetry, suggesting that different replication origins in the same genome may be differentially used, (3) mitochondrial genomes from representative vertebrate species share one strand asymmetry pattern consistent with the strand-displacement replication documented in mammalian mtDNA, suggesting that the mtDNA replication mechanism in mammals may be shared among all vertebrate species, and (4) mitochondrial genomes from primitive forms of metazoans such as the sponge and hydra (representing Porifera and Cnidaria, respectively), as well as those from plants, have strand asymmetry patterns similar to single-origin or multi-origin replications observed in prokaryotes and are drastically different from mitochondrial genomes from other metazoans. This may explain why sponge and hydra mitochondrial genomes, as well as plant mitochondrial genomes, evolve much more slowly than those from other metazoans.


INTRODUCTION
DNA strand asymmetry refers to the differential distribution of nucleotides between the two DNA strands, e.g., one strand has more A or C than the other. This implies a violation of Chargaff's parity rule 2 [1], i.e., A = T and C = G within each strand. Consequently, strand asymmetry is typically measured by nucleotide skews such as GC skew and AT skew [2-9], referred to hereafter as $S_G$ and $S_A$:

$$S_G = \frac{N_G - N_C}{N_G + N_C}, \qquad S_A = \frac{N_A - N_T}{N_A + N_T} \qquad (1)$$

where $N_x$ is the number of nucleotide x in the sequence being examined. Chargaff's parity rule 2 may be satisfied at the genomic level in spite of strong local strand asymmetry. For example, Bacillus subtilis, studied by Chargaff and his colleagues [1], has genomic nucleotide frequencies of 28.18%, 21.81%, 21.71%, and 28.30% for A, C, G and T, respectively, according to the genomic sequence deposited in GenBank (NC_000964). Thus, both $S_G$ and $S_A$ are close to 0 ($S_G$ = -0.0021, $S_A$ = 0.0023). However, B. subtilis genomic DNA exhibits strong local asymmetry (Fig. 1). The asymmetry differs between the leading and the lagging strands, with the leading strand having more G than C and the lagging strand more C than G [2]. The strand compositional asymmetry is strong enough to allow the identification of the bacterial origin of replication (Fig. 1), whose flanking sequences change direction in GC skew [2,10-14] or in the components of the Z-curve [11,15-17]. For this reason, strand asymmetry is often computed locally instead of globally, with the nucleotide skews computed with a sliding window.

*Address correspondence to this author at the Department of Biology and Center for Advanced Research in Environmental Genomics, University of Ottawa, 30 Marie Curie, P.O. Box 450, Station A, Ottawa, Ontario, K1N 6N5, Canada; Tel: (613) 562-5800 ext. 6886; Fax: (613) 562-5486; E-mail: xxia@uottawa.ca
The validity and effectiveness of the in silico methods using strand asymmetry to identify the origin of replication in prokaryotic species are well established by many experimental verifications of the predicted replication origins [18], and the utility of these methods in practice has been demonstrated by many recent studies on prokaryotic genomes [15,17,19-24], mitochondrial genomes [25-28] and plasmid genomes [16].
The nucleotide skews in Eq. (1) have been extended in two ways. The first leads to the cumulative skew [8], which is based on summation of adjacent skew values and is equivalent to nucleotide skews with a wider sliding window than that used to compute individual skew values. For example, the Mycoplasma pneumoniae genome has been used to illustrate the advantage of the cumulative skew method, which detects the replication origin while the original GC skew method does not [8]. The real difference, however, is not between the two methods but between two different widths of the sliding window (Fig. 2). A wide sliding window detects a clear change in polarity of strand asymmetry (Fig. 2A), but a narrow window fails to do so (Fig. 2B). In this review, the window size for skew plots is optimized with the criterion that the autocorrelation between the GC skew values of neighboring sliding windows is maximized. This method is implemented in DAMBE [29,30].
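The sliding-window computation and the window-size criterion described above can be illustrated with a minimal Python sketch. It is not DAMBE's implementation: the function names, the treatment of the genome as circular, and the use of a simple lag-1 autocorrelation over a user-supplied list of candidate window sizes are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def skew(count_m: int, count_rc: int) -> float:
    """Return (N_m - N_rc) / (N_m + N_rc), or 0.0 when both counts are zero."""
    total = count_m + count_rc
    return (count_m - count_rc) / total if total else 0.0


def windowed_gc_skew(genome: str, window: int, step: int) -> List[float]:
    """GC skew for windows starting every `step` sites on a circular genome."""
    genome = genome.upper()
    n = len(genome)
    if not 0 < window <= n or step <= 0:
        raise ValueError("need 0 < window <= genome length and step > 0")
    extended = genome + genome[: window - 1]  # lets late windows wrap around
    values = []
    for start in range(0, n, step):
        counts = Counter(extended[start : start + window])
        values.append(skew(counts["G"], counts["C"]))
    return values


def lag1_autocorrelation(values: Sequence[float]) -> float:
    """Lag-1 autocorrelation of a series (0.0 if the series is constant)."""
    mean = sum(values) / len(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0
    numer = sum(
        (values[i] - mean) * (values[i + 1] - mean)
        for i in range(len(values) - 1)
    )
    return numer / denom


def best_window(genome: str, candidates: Sequence[int], step: int) -> int:
    """Pick the candidate window size whose skew series is most autocorrelated."""
    return max(
        candidates,
        key=lambda w: lag1_autocorrelation(windowed_gc_skew(genome, w, step)),
    )


toy = "G" * 50 + "C" * 50  # hypothetical sequence with one clear polarity change
print(windowed_gc_skew(toy, window=10, step=10))
print(best_window(toy, candidates=[10, 20], step=10))
```

Each value is reported for the window that begins at the given start site, which matches the convention stated for the skew plots in this review. Summing successive values of the series would give a cumulative skew in the sense of [8].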
Fig. (1). Nucleotide skew plot for the Bacillus subtilis genome (NC_000964), with window size = 537179 and step size = 21078. Each data point is at the beginning of its sliding window. The replication origin is identified as the genomic site where the GC skew ($S_G$) changes from negative to positive and the replication termination is the site where $S_G$ changes from positive to negative.

The second extension of the nucleotide skews is to the word or motif skew [31], which is defined as

$$S_m = \frac{N_m - N_{m_{rc}}}{N_m + N_{m_{rc}}} \qquad (2)$$

where m is either a nucleotide (e.g., G or A) or a motif (e.g., ACG), $m_{rc}$ is the reverse complement of m ($m_{rc}$ = C if m = G, or $m_{rc}$ = CGT if m = ACG), and $N_x$ is the number of x in the sliding window (where x is either m or $m_{rc}$). GC skew and AT skew are special cases of $S_m$ when m is equal to either G or A, respectively, i.e., GC skew is $S_G$ and AT skew is $S_A$. It is for this reason that I have denoted GC skew and AT skew by $S_G$ and $S_A$, respectively, in Eq. (1).
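As an illustration of the motif skew, the following Python sketch counts a motif and its reverse complement within one window and forms their normalized difference. The function names are hypothetical, and counting overlapping occurrences is an assumption; the source does not say how overlapping motifs are handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Reverse complement of a DNA word, e.g. ACG -> CGT."""
    return motif.upper().translate(_COMPLEMENT)[::-1]


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of `motif` in `seq`, allowing overlaps (an assumption)."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def motif_skew(window_seq: str, motif: str) -> float:
    """Normalized difference between counts of a motif and its reverse complement."""
    seq = window_seq.upper()
    m = motif.upper()
    n_m = count_overlapping(seq, m)
    n_rc = count_overlapping(seq, reverse_complement(m))
    total = n_m + n_rc
    # Windows containing neither word are reported as 0.0 (a design choice).
    return (n_m - n_rc) / total if total else 0.0


print(reverse_complement("ACG"))
print(motif_skew("GGGC", "G"))          # single nucleotide: ordinary GC skew
print(motif_skew("ACGACGCGT", "ACG"))   # two ACG versus one CGT
```

With a single nucleotide as the motif, the function reduces to the ordinary GC or AT skew. A motif that equals its own reverse complement (e.g., ACGT) always yields 0 and is therefore uninformative.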
While transcription is known to contribute to strand asymmetry [32,33], the most important contributor to strand asymmetry is DNA replication associated with differential strand-specific mutation bias [21,22,34], which is confirmed by a study that assesses the contribution of both transcription and replication to strand asymmetry [23]. Because different replication mechanisms often lead to different patterns of strand asymmetry, much can be learned about replication mechanisms by examining strand asymmetry. In this review I summarize the different patterns of strand asymmetry in different prokaryotic and mitochondrial genomes as a basis to infer the mechanism of DNA replication that gives rise to the diversity of strand asymmetry patterns. Based on the empirical evidence, I argue that (1) the common assumption of single-origin DNA replication in bacterial species may not be valid because bacterial genomes from the endosymbionts Wigglesworthia glossinidia and Wolbachia (from Drosophila melanogaster) exhibit patterns of strand asymmetry strongly indicative of multiple origins of replication, (2) different replication origins in some archaeal genomes leave quite different patterns of strand asymmetry, suggesting that different replication origins in the same genome may be differentially used, (3) the pattern of strand asymmetry from mammalian mitochondrial genomes is consistent with the strand-displacement model of replication well documented in mammalian mitochondria [35-40], and this pattern is shared among mitochondrial genomes from representative vertebrate species, suggesting a similar DNA replication mechanism among vertebrate mitochondrial genomes, and (4) primitive forms of metazoans such as sponge and hydra, as well as plants, have mitochondrial strand asymmetry patterns similar to those of prokaryotes and drastically different from those of higher metazoans, suggesting that mitochondrial genomes in plants and in primitive invertebrates such as sponge and hydra share a similar replication mechanism with their bacterial ancestor, with a much lower replication error rate than that in mammalian mitochondrial genomes, whose strand-displacement replication is highly error-prone. This sheds light on why mitochondrial genomes from mammals evolve much faster than those from sponge, hydra and plants.

DNA REPLICATION AND STRAND ASYMMETRY IN PROKARYOTIC GENOMES
It is generally assumed that bacterial genomes have a single origin of replication [41,42] whereas archaeal genomes tend to have multiple origins of replication [43,44]. However, experimental verification of the exact number of replication origins is difficult, and only a handful of prokaryotic species have had their replication origins experimentally verified. Comparison of strand asymmetry patterns can shed light on different replication mechanisms because different types of DNA replication typically lead to different patterns of strand asymmetry.

BACTERIAL GENOMES
Many studies have documented strand asymmetry in eubacterial genomes associated with their single-origin mode of genome replication [2,9,45-47]. In general, there is an excess of G in the leading strand in many prokaryotic genomes examined [8,17,48-51], with the bias generally attributed to strand-biased deamination of C to U or m$^5$C to T [9,45,52-54]. However, the distributions of nucleotides A and T along the leading and the lagging strands are much less consistent (Fig. 3), as has been documented before [17]. For this reason, $S_G$ has been used much more frequently than $S_A$ in in silico identification of the replication origin and termination.
In general, the pattern of $S_G$ is highly consistent with single-origin replication across a diverse array of bacterial species. This has led to the common assumption that all bacterial genomes replicate with a single origin. The assumption is reinforced by the strong conservation of the molecular machinery for bacterial DNA replication. For example, the DNA replication initiation factor DnaA protein from a marine cyanobacterium (Prochlorococcus marinus CCMP1375) can specifically recognize the chromosomal origin of replication (oriC) of both E. coli and B. subtilis [55]. Thus, given that many bacterial genomes are known to replicate with a single origin of replication, and that all bacterial genomes may be replicated the same way, it is natural for us to assume that all bacterial genomes replicate with a single origin of replication.
The pattern of strand asymmetry in Fig. (3), however, is not universal among bacterial species (Fig. 4). The possibility of multiple origins of replication is particularly strong in the AT-rich genomes of two endosymbionts: Wigglesworthia glossinidia in tsetse flies (Glossina brevipalpis) and Wolbachia in Drosophila melanogaster (Fig. 4). The nucleotide skew plots with multiple changes of polarity are similar to that for the yeast (Saccharomyces cerevisiae) chromosome 1, which is replicated from multiple origins of replication (Fig. 5). Thus, the assumption of single-origin replication in bacteria [41,42] may be questionable.
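Counting changes of polarity in a skew series can be sketched in a few lines of Python. The sketch follows the convention used for Fig. (1), where a negative-to-positive change in GC skew marks a candidate origin and a positive-to-negative change marks a candidate terminus. The function name and the choice to ignore windows whose skew is exactly zero are assumptions, and no statistical filtering of shallow crossings is attempted.

```python
from typing import List, Sequence, Tuple


def polarity_changes(
    skews: Sequence[float], circular: bool = True
) -> List[Tuple[int, str]]:
    """Return (window index, kind) for every sign change in a skew series.

    The index is that of the first window carrying the new sign. Windows whose
    skew is exactly zero are skipped (an assumption of this sketch).
    """
    signed = [(i, v) for i, v in enumerate(skews) if v != 0]
    if len(signed) < 2:
        return []
    pairs = list(zip(signed, signed[1:]))
    if circular:
        pairs.append((signed[-1], signed[0]))  # close the circle
    changes = []
    for (_, prev), (idx, cur) in pairs:
        if prev < 0 < cur:
            changes.append((idx, "neg_to_pos"))  # candidate origin
        elif prev > 0 > cur:
            changes.append((idx, "pos_to_neg"))  # candidate terminus
    return changes


single = [-0.2, -0.1, 0.1, 0.3, 0.2, -0.1]
multi = [-0.1, 0.1, -0.1, 0.1, -0.1, 0.1]
print(polarity_changes(single))
print(len([c for c in polarity_changes(multi) if c[1] == "neg_to_pos"]))
```

More than one negative-to-positive change is only a first hint of multiple origins; such a count says nothing about whether a crossing is statistically meaningful or about which biological process produced it.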
There is no strong theoretical reason against some bacterial species having multiple origins of replication, other than the probably far-fetched possibility that daughter genomes arising from multiple origins of replication may fail to segregate properly into the two daughter cells.
Escherichia coli genomes with an additional oriC inserted about 1 Mb apart from the regular oriC position seem to replicate normally, with both replication origins functioning identically and with no detectable difference in generation time or cell morphology from the wild-type cells [56]. This implies that, if mutation leads to the creation of an additional ectopic replication origin in an E. coli cell, there may be no strong selection against the mutant.
While multiple origins of replication typically would lead to multiple changes in polarity in the nucleotide skew plot, one should be careful in inferring multiple origins of replication based only on the observation of multiple changes in polarity in the nucleotide skew plots, because multiple changes in polarity can result from a variety of factors. For example, horizontal gene transfer is frequent in bacterial species, and a horizontally transferred sequence segment is likely to have quite different strand asymmetry patterns from the host genome, leading to additional changes in polarity in the skew plots. In other words, multiple changes in polarity in the skew plots may not result from multiple origins, but may instead result from the recent incorporation of multiple horizontally transferred genes. Similarly, there might be heterogeneity in strand asymmetry among different genes. For example, RNA genes typically form extensive secondary structure in which stems are double-stranded and require A = T and C = G (except for cases of U/G pairs in RNA). This implies that RNA genes should have different strand asymmetry patterns from the rest of the genome, leading to additional changes in polarity in the skew plot. Also, if an rRNA gene cluster is duplicated in the opposite strand (which is the case for Wigglesworthia glossinidia), and if the rRNA is highly conserved (which is also true in W. glossinidia), then the recipient strand will have an irregular skew value at the position of the new rRNA genes.
To alleviate these potential problems, I generated skew plots that either included or excluded the protein-coding and rRNA genes. Such treatments do not alter the pattern of nucleotide skews in Fig. (4). While the pattern in $S_G$ is indicative of multiple origins of replication (Fig. 4), it is difficult to exclude alternative explanations. If genes switch strands frequently, then the strand asymmetry will be weak, with multiple shallow peaks and valleys. This problem is particularly relevant to Wolbachia because of its mosaic genomic structure resulting from extensive recombination. My point is to highlight what remains unresolved for future studies.
In the cyanobacterium Synechocystis sp. 6803, $S_G$ exhibits no recognizable change of polarity for any width of the sliding window. Its dnaA gene is located at sites 1350236..1351579, and no change in polarity of the strand asymmetry was observed in nearby sequence regions (Fig. 6A). While $S_A$ decreases and increases dramatically (Fig. 6A), its change is typically not indicative of the origin of replication. The nucleotide skew plot in Fig. (6A) does not favor the hypothesis that the Synechocystis sp. 6803 genome has a single origin of replication that is fired consistently in all genome replications.
The nucleotide skew plots for the AT-rich Mycoplasma pulmonis genome (Fig. 6B) also do not suggest a single origin of replication because of multiple changes in polarity in $S_G$. Instead of a sharp change in polarity, there is a long stretch of the genome with $S_G$ values hovering above and below the zero line (Fig. 6B). The genome contains many putative DnaA boxes [57], which is expected given the AT-richness of the genome. The genome is also peculiar in that a plasmid carrying an oriC would, after only a few passages, integrate into the predicted genomic oriC region [57]. This could lead to multiple origins of replication clustered together, with each having the potential to fire during genome replication. Such a hypothesis would potentially explain why there is a long stretch of genomic DNA with $S_G$ values close to zero (Fig. 6B), i.e., no strand asymmetry can be established within genomic regions with closely spaced multiple replication origins.
The bacterial oriC is AT-rich and is expected to occur more frequently in AT-rich genomes. This suggests that AT-rich genomes have a greater tendency to harbor multiple origins of replication than GC-rich genomes. In this context, it is interesting to note that the bacterial species with a strong multi-origin replication signature in their strand asymmetry patterns, i.e., Mycoplasma pulmonis, Wigglesworthia glossinidia and Wolbachia, all have highly AT-rich genomes.
What bacterial genome would benefit from having multiple origins? If the genome is extraordinarily long, if the replication process is slow, or if the replication machinery (DNA-replication initiation and elongation proteins and enzymes) can be produced cheaply in multiple copies, then multiple replication origins would seem beneficial. Genomic data are available to address such a question.
Another point worth making about bacterial nucleotide skew plots is the diversity in the relationship between $S_G$ and $S_A$ (Figs. 1-4, 6). This diversity is unexpected given the common proposal that the main contributor to strand asymmetry is the strand-biased deamination of C to U or m$^5$C to T during DNA replication [9,45,52-54]. If the strand asymmetry is maintained mainly by the C→U/T mutations, then we expect a negative relationship between $S_G$ and $S_A$, because reductions in C and increases in T will cause both an increase in $S_G$ and a decrease in $S_A$. Such a negative correlation is indeed observed in the Buchnera aphidicola genome (not shown), but a strong positive correlation between $S_G$ and $S_A$ is also observed in other cases (e.g., all genomes in the genus Bacillus). Such a positive correlation cannot be explained by the pure C→U/T mutation bias [24,58].
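The expected negative relationship can be checked with simple arithmetic. The Python sketch below starts from hypothetical nucleotide counts on one strand, converts k C sites to T, and recomputes both skews using the same normalized differences as in Eq. (1). The counts and the function name are illustrative assumptions, not data from any genome.

```python
from typing import Tuple


def skews_after_c_to_t(
    n_a: int, n_c: int, n_g: int, n_t: int, k: int
) -> Tuple[float, float]:
    """Return (S_G, S_A) after k C sites on this strand have mutated to T."""
    if not 0 <= k <= n_c:
        raise ValueError("k must lie between 0 and the number of C sites")
    n_c -= k  # every C -> T change removes one C ...
    n_t += k  # ... and adds one T
    s_g = (n_g - n_c) / (n_g + n_c)
    s_a = (n_a - n_t) / (n_a + n_t)
    return s_g, s_a


# Hypothetical strand that starts with no asymmetry at all.
for k in (0, 25, 50, 100):
    s_g, s_a = skews_after_c_to_t(250, 250, 250, 250, k)
    print(k, round(s_g, 4), round(s_a, 4))
```

As k grows, $S_G$ rises and $S_A$ falls, so a pure C→U/T bias acting with different intensity along the genome would produce the negative relationship described above. It cannot by itself produce a positive one.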

ARCHAEAL GENOMES
Multiple replication origins are typically assumed for archaeal genome replication [43,44,59]. Multiple origins of replication imply multiple changes in polarity in nucleotide skew plots, which is well exemplified by several archaeal species with experimentally verified multiple origins of replication (Fig. 7). Sulfolobus solfataricus and S. acidocaldarius both have three origins of replication [60,61]. It is noteworthy that the $S_G$ curve in S. acidocaldarius (Fig. 7A) has valleys of different depths, similar to that for the yeast chromosome 1 (Fig. 5). These valleys of different depths suggest that some replication origins are fired more frequently than others, leading to stronger strand asymmetry around them than around other replication origins. In eukaryotes, different replication origins are not used synchronously or equally frequently [62]. This may also be true for archaeal replication origins. Differential usage of different replication origins has been documented in Haloferax volcanii [63]. In any case, the $S_G$ pattern in Fig. (7A) casts doubt on the claim that the three replication origins in Sulfolobus species fire synchronously in each cell cycle [61].
The genome of Aeropyrum pernix contains two verified origins of replication, which is consistent with the $S_G$ plot (Fig. 7C). The different peaks and valleys again suggest different firing frequencies of different origins of replication. The two origins share some homology with two of the three replication origins in Sulfolobus species [42]. This raises the question of how Sulfolobus species acquired their third replication origin, i.e., whether it arose by accumulated mutations in the genome or was acquired by capturing an extrachromosomal element. The finding of a viral integrase element near the replication origins lends support to the latter [42].
The main chromosome of the halophilic archaeon Haloferax volcanii (which also has three smaller replicons) contains two origins of replication [63], which is also suggested by the two major changes in polarity in the $S_G$ plot (Fig. 7D). The origin of replication has not been identified in the Methanococcus jannaschii genome, but the multiple changes in polarity in the $S_G$ plot (Fig. 8A) from the genome strongly suggest multiple origins of replication. The genome also exhibits multiple peaks and valleys in marker frequency distributions [64], consistent with the interpretation of multiple origins of replication. The shared feature of multiple replication origins among these taxonomically diverse archaeal species suggests that multi-origin replication is the norm in Archaea.
Previous studies suggest only a single origin of replication in the genomes of three archaeal species: Pyrococcus abyssi [65,66], Archaeoglobus fulgidus [64], and Halobacterium NRC1 [67]. While the $S_G$ plot of Halobacterium NRC1 is consistent with single-origin replication (Fig. 8D), the $S_G$ plot for A. fulgidus has two peaks, suggesting two putative replication origins.

DNA REPLICATION AND STRAND ASYMMETRY IN MITOCHONDRIAL GENOMES
Fig. (7). Nucleotide skew plots for genomes of (A) Sulfolobus acidocaldarius (NC_007181, window size = 317575, step size = 11129), (B) Sulfolobus solfataricus (NC_002754, window size = 413369, step size = 14961), (C) Aeropyrum pernix (NC_000854, window size = 238220, step size = 8348), and (D) Haloferax volcanii (NC_013967, window size = 405353, step size = 14238; the species also contains three smaller replicons whose nucleotide skew plots are not shown).

Mitochondrial DNA (mtDNA) replication has been studied most thoroughly in mammals. Mammalian mtDNA has two strands of different buoyant densities, which are consequently named the H-strand and the L-strand. The two strands have different nucleotide frequencies, with the H-strand rich in G and T and the L-strand rich in A and C, which strongly affects the codon usage of genes on the two strands [28]. This strand asymmetry can be well explained by the strand-displacement model of mtDNA replication [35-40].
During mtDNA replication, the L-strand is first used as a template to replicate the daughter H-strand, starting at the origin of replication $O_H$, while the parental H-strand is left single-stranded for an extended period because the complete replication of mtDNA takes nearly two hours [35-37]. After about 2/3 of the daughter H-strand has been synthesized and the second origin of replication ($O_L$) is exposed, the parental H-strand is used as a template to synthesize the daughter L-strand. Thus, different parts of the H-strand are in single-stranded form for different periods of time.
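How the single-stranded period varies along the parental H-strand can be worked out under a deliberately simplified reading of the model above. The Python sketch assumes, purely for illustration, that both daughter strands are synthesized at the same constant speed, that L-strand synthesis starts the moment $O_L$ is exposed, and that it proceeds in the direction opposite to H-strand synthesis. The function name and the coordinate d (distance from $O_H$ along the direction of daughter H-strand synthesis) are hypothetical, and time is expressed in units of "sites replicated".

```python
def single_stranded_duration(d: float, d_ol: float, length: float) -> float:
    """Time a parental H-strand site spends single-stranded (simplified model).

    d      : distance of the site from O_H along daughter H-strand synthesis
    d_ol   : distance of O_L from O_H in the same coordinate
    length : genome length
    Assumes equal, constant synthesis speeds for both strands (an assumption).
    """
    if not (0 <= d < length and length / 2 <= d_ol < length):
        raise ValueError("need 0 <= d < length and length/2 <= d_ol < length")
    if d <= d_ol:
        # Displaced at time d; L-strand synthesis starts at time d_ol and
        # travels back from O_L, reaching the site at time d_ol + (d_ol - d).
        return 2 * (d_ol - d)
    # Beyond O_L: displaced at time d; L-strand synthesis must first return to
    # O_H (time 2 * d_ol) and then continue around the circle to the site.
    return length + 2 * d_ol - 2 * d


length, d_ol = 3000.0, 2000.0  # hypothetical genome with O_L at 2/3 of its length
for d in (0.0, 1000.0, 2000.0, 2500.0):
    print(d, single_stranded_duration(d, d_ol, length))
```

Under these assumptions the exposure time falls linearly with distance from $O_H$, drops to zero at $O_L$, and jumps again just past $O_L$, which is one way to see why sites along the H-strand are exposed to single-strand-specific mutation for different lengths of time.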
Spontaneous deamination of both A and C [52,53] occurs frequently in human mtDNA [68]. Deamination of A leads to hypoxanthine that pairs with C, generating an A/T→G/C mutation. Deamination of C leads to U, generating C/G→U/A mutations. Of these two types of spontaneous deamination, the C→U mutation occurs more frequently than the A→G mutation [53]. In particular, the C→U mutation mediated by spontaneous deamination occurs more than 100 times as frequently in single-stranded DNA as in double-stranded DNA [54]. Note that these C→U sites will immediately be used as a template to replicate the daughter L-strand, leading to a G→A mutation in the L-strand after one round of DNA duplication. Such mutation patterns are expected to leave their footprints on different parts of the H-strand left single-stranded for different periods of time.
While experimental evidence for the strand-displacement model is limited to mammalian species, the nearly identical pattern of strand asymmetry among representative vertebrate species (Fig. 9) suggests that the replication mechanism is most likely shared. The reduction in $S_G$ corresponds to the reduction of C in the H-strand (and the associated G in the L-strand), allowing us to infer the location of replication origins $O_H$ and $O_L$ (Fig. 9).
The pattern of strand asymmetry among mitochondrial genomes in vertebrate species is dramatically different from those of prokaryotic species or the yeast (Figs. 1-8). In particular, the $S_G$ values for the vertebrate species are all negative (and would be all positive for the complementary strand), in contrast to the $S_G$ values of prokaryotic species, which fluctuate above and below the zero line. This suggests not only local strand asymmetry, but also global strand asymmetry in vertebrate mitochondrial genomes. This is confirmed by the genomic $S_G$, computed from genomic C and G frequencies from representative vertebrate mitochondrial genomes (Table 1). Invertebrate mitochondrial genomes also exhibit consistent and strong global strand asymmetry (Table 1), except for the most primitive ones such as the sponge (Oscarella lobularis) and the hydra (Hydra oligactis), representing Porifera and Cnidaria, respectively. The sponge and hydra mtDNAs have $S_G$ values similar to those in plant mtDNA. The two animal groups they represent are also similar to plants in having slower evolutionary rates in their mtDNA than in their nuclear genomes [69], in contrast to other metazoans whose mtDNA evolves much faster than their nuclear genomes. As evolutionary rate is largely determined by mutations introduced during DNA replication, one would expect that mtDNA in plants and in primitive invertebrates such as Porifera and Cnidaria should have DNA replication different from the strand-displacement model established for mammalian mtDNA. The nucleotide skew plots (Figs. 9, 10) are consistent with this suggestion.
The pattern of mtDNA strand asymmetry in higher plants (e.g., Oryza sativa and Cycas taitungensis), as characterized by the $S_G$ plots (Fig. 10A-B), suggests multiple origins of replication, with the $S_G$ curve sharply crossing the zero line multiple times. This is similar to the patterns observed in eukaryotic nuclear genomes or in archaeal genomes with multiple replication origins. Interestingly, for primitive forms of plants such as the liverwort Marchantia polymorpha, or primitive forms of metazoans such as the sponge Oscarella lobularis, the pattern of strand asymmetry (Fig. 10C-D) is indistinguishable from what is typically seen in bacterial genomes with a single origin of replication. The $S_G$ plot of the Hydra oligactis mitochondrial genome is similar to that of Oscarella lobularis except for a slightly more pronounced secondary peak. All these patterns of strand asymmetry are dramatically different from those observed in vertebrate mtDNA (Fig. 9) and may explain why mtDNA evolves so much more slowly in plants and sponges than in higher metazoans. In other words, mitochondrial genomes in plants and primitive invertebrates may maintain the high-fidelity replication of their bacterial ancestor, whereas the error-prone strand-displacement replication evolved, likely as a secondary consequence of some advantageous traits, in a lineage leading to vertebrate mitochondrial genomes. The diversification of mtDNA replication mechanisms has not been thoroughly explored in the context of evolution.
In summary, patterns of strand asymmetry are diverse among different taxonomic groups and can tell us much about the molecular mechanism of DNA replication. Single-origin replication may not be universal among bacterial species, as the endosymbionts (Wigglesworthia glossinidia and Wolbachia species), the cyanobacterium Synechocystis 6803, and Mycoplasma pulmonis all have genomes exhibiting strand asymmetry patterns consistent with the multi-origin mode of replication. Different replication origins in some archaeal genomes leave quite different patterns of strand asymmetry, suggesting that different replication origins in the same genome may be differentially used. Vertebrate species share one strand asymmetry pattern consistent with the strand-displacement replication documented in mammalian mtDNA, suggesting that the mtDNA replication mechanism in mammals may be universal among vertebrates. Mitochondrial genomes from primitive forms of metazoans such as the sponge and hydra, as well as those from plants, have strand asymmetry patterns similar to the single-origin or multi-origin types of DNA replication observed in prokaryotes. This may explain why sponge and hydra mtDNA, as well as plant mtDNA, evolve much more slowly than other metazoan mtDNA.
I should finally emphasize the importance of using statistical criteria when referring to peaks or changes in polarity in the skew plots. Taking $S_G$ as an example, a formula for its standard deviation, s, is given in [2]. A peak in the $S_G$ plot therefore refers specifically to a peak that protrudes above the line of mean $S_G$ + 1.96s, and a valley to one that dips below the line of mean $S_G$ - 1.96s, assuming the 0.05 significance level and that the window is sufficiently wide for the distribution of $S_G$ to approximate the normal distribution. I encourage all programmers to include the 95% confidence intervals in nucleotide or word skew plots.
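The criterion can be sketched in Python as follows. The formula from [2] is not reproduced in this text, so the sketch substitutes a simple binomial null model as a stated assumption: if each of the n = N_G + N_C sites in a window is G or C with equal probability and independently, the standard deviation of the GC skew is 1/sqrt(n). This may differ from the formulation in [2], and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def null_sd(n_g: int, n_c: int) -> float:
    """SD of GC skew under an assumed binomial null with P(G) = P(C) = 0.5."""
    n = n_g + n_c
    if n == 0:
        raise ValueError("window contains no G or C")
    return 1.0 / math.sqrt(n)


def classify_windows(
    skews: Sequence[float], sds: Sequence[float], z: float = 1.96
) -> List[str]:
    """Label each window as 'peak', 'valley' or 'ns' relative to mean +/- z * s."""
    mean = sum(skews) / len(skews)
    labels = []
    for s_g, sd in zip(skews, sds):
        if s_g > mean + z * sd:
            labels.append("peak")
        elif s_g < mean - z * sd:
            labels.append("valley")
        else:
            labels.append("ns")  # not significant at the chosen level
    return labels


sd = null_sd(5000, 5000)  # hypothetical window with 10,000 G + C sites
print(sd)
print(classify_windows([0.05, 0.0, -0.05], [sd, sd, sd]))
```

The band mean ± 1.96s is what a plotting routine would draw as the 95% interval; only excursions outside it would then be referred to as peaks or valleys.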

ACKNOWLEDGEMENT
This study is supported by NSERC's Discovery Grants and the CAS/SAFEA International Partnership Program for Creative Research Teams. This project was completed while I was on sabbatical in Prof. C. Primmer's laboratory at the University of Turku.

APPENDIX 1
How to generate nucleotide skew plots in DAMBE

4. In the ensuing 'Process GenBank File' dialog, the default is 'Whole sequence'. Keep the default and click the 'OK' button.

5. In the next dialog, the default is 'Non-protein nuc. seq'. Keep the default and click the 'Go' button. The sequence will be displayed.

6. Click 'Seq.Analysis|Genome|GC Skew'. In the ensuing dialog, check the 'Circular genome' checkbox and click the 'Go' button.

7. Two plots will be generated, one for $S_G$ and one for $S_A$. The window-specific data underlying the plots are also displayed.