Phylogenomic and Comparative Analyses of Complete Plastomes of Croomia and Stemona (Stemonaceae)

The monocot genus Croomia (Stemonaceae) comprises three herbaceous perennial species that exhibit EA (Eastern Asian)–ENA (Eastern North American) disjunct distribution. However, due to the lack of effective genomic resources, its evolutionary history is still weakly resolved. In the present study, we conducted comparative analysis of the complete chloroplast (cp) genomes of three Croomia species and two Stemona species. These five cp genomes proved highly similar in overall size (154,407–155,261 bp), structure, gene order and content. All five cp genomes contained the same 114 unique genes consisting of 80 protein-coding genes, 30 tRNA genes and 4 rRNA genes. Gene content, gene order, AT content and IR/SC boundary structures were almost the same among the five Stemonaceae cp genomes, except that the Stemona cp genome was found to contain an inversion in cemA and petA. The lengths of five genomes varied due to contraction/expansion of the IR/SC borders. A/T mononucleotides were the richest Simple Sequence Repeats (SSRs). A total of 46, 48, 47, 61 and 60 repeats were identified in C. japonica, C. heterosepala, C. pauciflora, S. japonica and S. mairei, respectively. A comparison of pairwise sequence divergence values across all introns and intergenic spacers revealed that the ndhF–rpl32, psbM–trnD and trnS–trnG regions are the fastest-evolving regions. These regions are therefore likely to be the best choices for molecular evolutionary and systematic studies at low taxonomic levels in Stemonaceae. Phylogenetic analyses of the complete cp genomes and 78 protein-coding genes strongly supported the monophyly of Croomia. Two Asian species were identified as sisters that likely diverged in the Early Pleistocene (1.62 Mya, 95% HPD: 1.125–2.251 Mya), whereas the divergence of C. pauciflora dated back to the Late Miocene (4.77 Mya, 95% HPD: 3.626–6.162 Mya). The availability of these cp genomes will provide valuable genetic resources for further population genetics and phylogeographic studies on Croomia.


Introduction
Croomia Torr. ex Torr. et Gray belongs to the monocot family Stemonaceae Engl (Pandanales, Liliidae) and comprises three herbaceous perennial species: C. pauciflora (Nutt.) Torr., C. japonica Miq. and C. heterosepala (Bak.) Oku. Of these three species, C. japonica and C. heterosepala are endemic to warm-temperate deciduous forests in East Asia, while C. pauciflora grows in temperate-deciduous forests in North America [1][2][3]. There is a considerable difference in morphological traits among this genus. For example, the four tepals of C. japonica are homomorphic with a re-curved edge, while those of C. heterosepala have a flat edge, and one outside tepal is much larger than the other three [4,5]. Compared to two Asian species, C. pauciflora has a smaller flower, shorter petiole, denser underground stem nodes and a more obvious heart-shape leaf base [1]. As the roots of Croomia species contain compounds such as pachysamine, didehydrocroomine and croomine groups, they are used as folk medicine to treat cough and injuries [6,7]. Croomia can reproduce sexually through seed formation via cross-pollination and asexually through underground rhizomes [1,8]. Due to their limited distribution and small population sizes, the three extant species of Croomia are listed as "threatened" or "endangered" in China, Japan and the Americas [9][10][11]. The other three genera of Stemonaceae are Pentastemona, Stemona and Stichoneuron. The species of Stichoneuron are located in India, Thailand and Peninsular Malaysia, while those of Pentastemona are only in Sumatra [3,8]. The genus Stemona comprises ca. 25 species with the widest distribution from Northeast Asia to Southeast Asia and Australia. The roots of Stemona species contain similar medicine compounds as Croomia [7]. Although Croomia and Stemona species have important pharmacological and ecological value, limited molecular markers were available for the utilization, conservation and breeding of these species in the context of population genetics and phylogenetic studies [12].
Croomia exhibits a well-known classic intercontinental disjunct distribution between Eastern Asia (EA) and Eastern North America (ENA) [1,8,13,14]. This continental disjunction pattern was suggested to have resulted from fragmentation of the mid-Tertiary mesophytic forest flora throughout a large part of the Northern Hemisphere, as global temperature cooled down in the late Tertiary and Quaternary [15,16]. For the two East Asian endemics, C. japonica is distributed in East China and southern Japan, while C. heterosepala is in northern Japan, and they have adjacent ranges in South Japan [17,18]. Therefore, Croomia is well suited for testing biogeographic hypotheses about the evolution of both the eastern Asian-eastern North American and eastern Asian-Japanese Archipelago floristic disjunctions. Based on previous molecular phylogenetic analyses using cpDNA sequence variation of the trnL-F region, the two Asian species were identified as sister species that likely diverged in the Mid-to-Late Pleistocene (0.84-0.13 million years ago, Mya), whereas the divergence of C. pauciflora dates back to the Late Plio-/Pleistocene (<2.6 Mya) [12]. However, the previous cpDNA analysis based on a few parsimony informative sites yielded low bootstrap values for the majority of clades [12]. Thus, it is necessary to develop more highly variable genetic markers for determining the phylogenetic relationships and divergence times for Croomia. Nowadays, many phylogenetic relationships that remained unresolved with few loci have been clarified by using whole cp genome sequences [19][20][21]. Thus, whole cp genome sequences are increasingly being used in phylogeny reconstruction and providing hypervariable genetic markers for population genetic studies, especially in a group of recently-diverged species [22,23].
Here, we sequenced three Croomia and two Stemona cp genomes using the next-generation Illumina genome analyzer platform. We compared the cp genomes of two Stemonaceae genera to characterize their structural organization and variations and identify the most variable regions. This information on interspecific variability of each region will help guide further systematic and evolutionary studies of Stemonaceae. In addition, we used the whole cp genomes to resolve the phylogenetic relationships of Croomia and infer the historical biogeography of the genus.

Genome Assembly and Features
Illumina paired-end sequencing yielded 14,163,520-31,094,272-bp clean reads after trimming, and the de novo assembly generated 50,369-123,479 contigs for five Stemonaceae species. With the cp genome of C. palmata as a reference, contigs were combined to generate the draft cp genome for each species. The lengths of determined nucleotide sequences were 154,672, 154,407, 155,261, 154,224 and 154,307 bp for C. japonica, C. heterosepala, C. pauciflora, S. japonica and S. mairei, respectively. (Figure 1, Table S1). All five cp genomes exhibited the typical quadripartite structure of angiosperms, consisting of a pair of IR regions (27,243 bp) separated by an LSC region (81,844-82,429 bp) and an SSC region (17,889-18,346 bp). The cp genomes of three Croomia species and two Stemona species were deposited in GenBank (MH177871, MH191379-MH191382). These five cp genomes contained 134 genes identically, of which 114 were unique and 20 were duplicated in IR regions (Table S1). Those 134 genes were arranged almost in the same order except cemA and petA, which were inverted at the LSC region of two Stemona species. Gene inversions at LSC were also reported in other angiosperm, such as Silene [24], Cymbidium [19] and Acacia dealbata [25]. The 114 unique genes included 80 protein-coding genes, 30 tRNA genes and 4 rRNA genes. In Croomia species, the overall GC content was 38.3%, and the GC contents of the LSC, SSC and IR regions were 36.6%, 32.3-32.5% and 42.8-42.9%, respectively, while those of Stemona were 38.0%, 36.2%, 32.1% and 42.7% (Table S1). In all five genomes, nine of the protein-coding genes and six of the tRNA genes possessed a single intron, while three genes (rps12, clpP and ycf3) contained two introns ( Table 1). The rps12 gene was trans-spliced; the 5′ end exon was located in the LSC region, and the 3′ end exon and intron were located in the IR regions. Compared to many other species, such as Salvia miltiorrhiza [26] and Cornales [27], the SSC region of the five studied species was found to have a different (reverse) orientation. The reverse orientation of the SSC region has also been reported in a wide variety of plant species [28][29][30]. This phenomenon is sometimes interpreted as a These five cp genomes contained 134 genes identically, of which 114 were unique and 20 were duplicated in IR regions (Table S1). Those 134 genes were arranged almost in the same order except cemA and petA, which were inverted at the LSC region of two Stemona species. Gene inversions at LSC were also reported in other angiosperm, such as Silene [24], Cymbidium [19] and Acacia dealbata [25]. The 114 unique genes included 80 protein-coding genes, 30 tRNA genes and 4 rRNA genes. In Croomia species, the overall GC content was 38.3%, and the GC contents of the LSC, SSC and IR regions were 36.6%, 32.3-32.5% and 42.8-42.9%, respectively, while those of Stemona were 38.0%, 36.2%, 32.1% and 42.7% (Table S1). In all five genomes, nine of the protein-coding genes and six of the tRNA genes possessed a single intron, while three genes (rps12, clpP and ycf 3) contained two introns ( Table 1). The rps12 gene was trans-spliced; the 5 end exon was located in the LSC region, and the 3 end exon and intron were located in the IR regions. Compared to many other species, such as Salvia miltiorrhiza [26] and Cornales [27], the SSC region of the five studied species was found to have a different (reverse) orientation. The reverse orientation of the SSC region has also been reported in a wide variety of plant species [28][29][30]. This phenomenon is sometimes interpreted as a major inversion existing within the species [29,31,32]. In fact, the two orientations of the SSC region have been found to occur regularly during the course of chloroplast DNA replication within individual plant cells [33,34]. Thus, the reverse orientation of the SSC region found in the five Stemonaceae cp genomes may represent a form of plastid heteroplasmy [30,35]. Table 1. List of genes in Stemonaceae chloroplast genomes.

Category of Genes
Groups of Genes Names of Genes

Contraction and Expansion of Inverted Repeats
Length variation in angiosperm cp genomes is due most typically to the expansion or contraction of the IR into or out of adjacent single-copy regions and/or changes in sequence complexity due to insertions or deletions of novel sequences [36,37]. Compared to reference cp genome C. palmata, all five species exhibited IR expansion at the IRb/LSC border, leading to entire rpl22 duplication. In a previous study, a partial duplication of the rpl22 gene was reported in some monocot species of Asparagales and Commelinales [38]. Although the gene number and gene order were conserved across these five Stemonaceae species, minor differences were still observed at the boundaries (Figure 2). At the IRa/LSC border, the spacer from rpl22 to this border of Stemona (65 bp) was longer than that of Croomia

Phylogenetic Analysis, Divergence Time and Ancestral Area Reconstruction
CP genome sequences have been successfully used in angiosperm phylogenetic studies [22,53]. The Maximum Likelihood (ML) and Bayesian Inference (BI) analyses of both whole sequences and protein-coding region of three Croomia and two Stemona cp genomes yielded nearly identical tree topologies, with 100% bootstrap and 1.0 Bayesian posterior probabilities at each node (Figure 7). This phylogenetic tree supports the monophyly of Croomia. Two Asian species C. japonica and C. heterosepala formed a clade, being strongly recovered as sisters of the North American species C. pauciflora. This tree topology is largely congruent with that inferred from trnL-F [12], but obtained much higher bootstrap support values. Using average substitution rates of whole cp genomes, the divergence time between the two Asian species, C. japonica and C. heterosepala, was estimated as ca.

Phylogenetic Analysis, Divergence Time and Ancestral Area Reconstruction
CP genome sequences have been successfully used in angiosperm phylogenetic studies [22,53]. The Maximum Likelihood (ML) and Bayesian Inference (BI) analyses of both whole sequences and protein-coding region of three Croomia and two Stemona cp genomes yielded nearly identical tree topologies, with 100% bootstrap and 1.0 Bayesian posterior probabilities at each node (Figure 7). This phylogenetic tree supports the monophyly of Croomia. Two Asian species C. japonica and C. heterosepala formed a clade, being strongly recovered as sisters of the North American species C. pauciflora. This tree topology is largely congruent with that inferred from trnL-F [12], but obtained much higher bootstrap support values. Using average substitution rates of whole cp genomes, the divergence time between the two Asian species, C. japonica and C. heterosepala, was estimated as ca.  The divergence between C. pauciflora and two Asian species coincides with the first sundering of the Bering Land Bridge (BLB) between the late Miocene and early Pliocene, most approximately at 5.4-5.5 Mya (Milne and Abbott, 2002) [54]. The Bayesian Binary MCMC (BBM) analysis of ancestral area reconstruction identified Asia as the most likely ancestral range (Node III, marginal probability: 0.93; Figure A2), indicating a possible intercontinental plant migration from Asia to North America. Indeed, the BLB served as an important route for temperate floristic exchanges between Asia and North America from the Eocene to the early Pliocene [55,56]. Subsequently, as a member of the Tertiary relict flora [15], Croomia species on the two continents experienced disjunct distribution and evolved separately after the Late Miocene. Thus, we conclude that the current distribution and differentiation of Croomia species in eastern Asia and eastern North America likely resulted from a combination of ancient migration and vicariant events. The divergence time between C. japonica and C. heterosepala fell into the Early Pleistocene. Habitat fragmentation resulting from the climatic vicissitudes of the (Late) Quaternary likely led to the speciation of C. japonica and C. heterosepala [12]. The above inferences seem to be consistent with the palaeovegetational and climatic history of eastern Asia and eastern America. However, considering that the cp genome is a haploid, uniparentaly-inherited and single locus [57], a nuclear (biparental) marker is also needed to elucidate the diversification process and demography history of Croomia species.

Sample Preparation, Sequencing, Assembly and Validation
Fresh leaves of C. japonica from China, C. heterosepala from Japan, C. pauciflora from North America and two outgroup species Stemona japonica (Bl.) Miq. and S. mairei (Levl.) Krause from China were sampled and dried with silica gel. The voucher specimens were deposited in the Herbarium of Zhejiang University (HZU). Total genomic DNA was extracted from ~3 mg materials using DNA Plantzol Reagent (Invitrogen, Carlsbad, CA, USA) following the manufacturer's protocol. The quality and concentration of the DNA were detected using agarose gel electrophoresis. Purified DNA was sheared into ~500-bp fragments, and the fragmentation quality was checked on a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). Paired-end sequencing libraries The divergence between C. pauciflora and two Asian species coincides with the first sundering of the Bering Land Bridge (BLB) between the late Miocene and early Pliocene, most approximately at 5.4-5.5 Mya (Milne and Abbott, 2002) [54]. The Bayesian Binary MCMC (BBM) analysis of ancestral area reconstruction identified Asia as the most likely ancestral range (Node III, marginal probability: 0.93; Figure A2), indicating a possible intercontinental plant migration from Asia to North America. Indeed, the BLB served as an important route for temperate floristic exchanges between Asia and North America from the Eocene to the early Pliocene [55,56]. Subsequently, as a member of the Tertiary relict flora [15], Croomia species on the two continents experienced disjunct distribution and evolved separately after the Late Miocene. Thus, we conclude that the current distribution and differentiation of Croomia species in eastern Asia and eastern North America likely resulted from a combination of ancient migration and vicariant events. The divergence time between C. japonica and C. heterosepala fell into the Early Pleistocene. Habitat fragmentation resulting from the climatic vicissitudes of the (Late) Quaternary likely led to the speciation of C. japonica and C. heterosepala [12]. The above inferences seem to be consistent with the palaeovegetational and climatic history of eastern Asia and eastern America. However, considering that the cp genome is a haploid, uniparentaly-inherited and single locus [57], a nuclear (biparental) marker is also needed to elucidate the diversification process and demography history of Croomia species.

Sample Preparation, Sequencing, Assembly and Validation
Fresh leaves of C. japonica from China, C. heterosepala from Japan, C. pauciflora from North America and two outgroup species Stemona japonica (Bl.) Miq. and S. mairei (Levl.) Krause from China were sampled and dried with silica gel. The voucher specimens were deposited in the Herbarium of Zhejiang University (HZU). Total genomic DNA was extracted from~3 mg materials using DNA Plantzol Reagent (Invitrogen, Carlsbad, CA, USA) following the manufacturer's protocol. The quality and concentration of the DNA were detected using agarose gel electrophoresis. Purified DNA was sheared into~500-bp fragments, and the fragmentation quality was checked on a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). Paired-end sequencing libraries were constructed according to the Illumina standard protocol (Illumina, San Diego, CA, USA). Genomic DNAs of five species were sequenced using an Illumina HiSeq TM 2000 (Illumina, San Diego, CA, USA) at Beijing Genomics Institute (BGI; Shenzhen, China). Plastome sequences were assembled using a combination of de novo and reference-guided assembly [58]. Firstly, to obtain clean reads, the CLC-quality trim tool was used to remove low-quality bases (Q < 20, 0.01 probability error). Secondly, we assembled the clean reads into contigs on the CLC de novo assembler. Thirdly, all the contigs were aligned with the reference cp genome of Carludovica palmate Ruiz. & Pav. (NC_026786.1) using local BLAST (http://blast.ncbi.nlm.nih.gov/) (27 December 2016), and aligned contigs were ordered according to the reference cp genome with ≥90% similarity and query coverage. Then, to construct the draft cp genome of each species, the ordered contigs usually representing the whole reconstructed genome were imported into GENEIOUS v9.0.5 software (http://www.geneious.com) (18 March 2017), where the clean reads were remapped onto the contigs.

Genome Annotation and Whole Genome Comparison
The annotation of five species was performed using the Dual Organellar GenoMe Annotator (DOGMA) [59]. The start and stop codons and intron/exon boundaries were manually corrected by comparison to homologous genes from the reference genome of C. palmate. We also verified the transfer RNAs (tRNAs) using tRNAscan-SE v1.21 with default settings [60]. The circular genome maps were drawn using the OrganellarGenome DRAW tool (OGDRAW) [61], followed by manual modification.
Genome comparison among the five Stemonaceae cp genomes was analyzed using mVISTA [62] with C. palmate as a reference. Six genome sequences were aligned in Shuffle-LAGAN mode with default parameters, and the conservation region was visualized in an mVISTA plot. To identify the divergence hotspot regions in the five Stemonaceae cp genomes, the nucleotide variability of protein coding genes, introns and intergenic spacer sequences of five species were evaluated using DNASP v5.10 [63]. The above regions were extracted following two criteria: (a) total number of mutation (Eta) >0; and (b) the aligned length >200 bp. The inverted regions in cemA, cemA-petA and petA were excluded. The top ten most variable noncoding regions with a high Pi value were counted by Potentially Informative Characters (PICs) across species pair of C. japonica and S. japonica following Shaw et al. [64]. Any large structural event of the cp genome, such as gene order rearrangements or IR expansion/contractions, were recorded.

Characterization of Repeat Sequence and SSRs
REPUTER [49] was used to find the location and length of repeat sequences, including forward, palindrome, complement and reverse repeats in the five cp genomes. The minimum repeat size was set to 30 bp, and the sequence identity of repeats was no less than 90% or greater sequence identity with the Hamming distance equal to 3. The MISA perl script was used to detect simple sequence repeats (SSRs) [52] with thresholds of 10 bp in length for mono-, di-, tri, tetra-, penta-and hexa-nucleotide SSRs.

Phylogenetic Analysis, Divergence Time and Ancestral Area Reconstruction
The five cp genome sequences were aligned using MAFFT v7 [65]. Two Stemona species were used as outgroups. ML and BI analysis were used to reconstruct the phylogenetic trees. In order to examine the phylogenetic utility of different regions, two datasets were used: (1) the complete cp genome sequences; (2) 78 protein-coding genes shared by the five cp genomes (two inverted genes of cemA and petA in Stemona species were excluded). Gaps (indels) were treated as missing data. The Akaike Information Criterion (AIC) in JMODELTEST v2.1.4 [66] was used to determine the best-fitting model of nucleotide substitutions. The GTR + I + G model was used for two datasets. The ML tree was constructed using RAXML-HPC v8.2.10 with 1000 replicates on the Cyberinfrastructure for Phylogenetic Research (CIPRES) Science Gateway website (http://www.phylo.org/) (10 May 2017) [67]. BI analysis was conducted in MRBAYES v3.2 [68]. The Markov chain Monte Carlo (MCMC) was set to run 1,000,000 generations and sampled every 1000 generations. The first 25% of generations was discarded as burn-in.
Due to the lack of fossil records, we used the average substitution rate 0.51952 × 10 −9 per site per year (s/s/y) of the whole cp genome in Brassicaceae [69,70] to estimate interspecific divergence time of Croomia. The Bayesian analysis was implemented in BEAST v1.8.4 [71] using the GTR + I + G substitution model. MCMC analysis of 20,000,000 generations was implemented, in which every 1000 generations were sampled, under an uncorrelated lognormal relaxed clock approach using the Yule speciation tree prior with the substitution rate. TRACER v1.6 [72] was used to check the effective population size (ESS) >200. TREEANNOTATOR v.1.8.4 [73] was used to produce maximum clade credibility trees from the trees after burning-in of 25%. The final tree was visualized in FIGTREE v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/) (13 May 2017).
To reconstruct the historical biogeography of Croomia, we performed Bayesian Binary MCMC (BBM) analysis as implemented in RASP v3.1 [74] using trees retained from the BI analysis (see above). According to the distribution of Croomia, we defined the following two areas: A, Asia (East Asia/South Asia); and B, North America. Accounting for phylogenetic uncertainty, we used 500 trees randomly chosen across all post-burn-in trees generated from BEAST analysis and ran the BBM analysis. A fixed JC + G (Jukes-Cantor + Gamma) model was chosen with a null root distribution. The MCMC chains were run for 500,000 generations, and every 100 generations were sampled. The ancestral ranges obtained were projected onto the MCC tree.

Conclusions
Here, we sequenced the first five complete cp genomes in Stemonaceae. Each genome possesses the typical structure shared with other angiosperm species. Several highly variable noncoding cpDNA regions were identified, which should be the best choices for future phylogenetic, phylogeographic and population-level genetic studies in Stemonaceae. The phylogenomic and biogeographic analyses of Croomia reveal that ancient migration and vicariance-driven allopatric speciation resulting from historical climate oscillations most likely played roles in the formation of the disjunct distributions and divergence of these three Croomia species.

Acknowledgments:
The authors thank Shota Sakaguchi from Kyoto University for collecting plant materials in Japan.

Conflicts of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.