The giant diploid faba genome unlocks variation in a global protein crop

Increasing the proportion of locally produced plant protein in currently meat-rich diets could substantially reduce greenhouse gas emission and loss of biodiversity. However, plant protein production is hampered by the lack of a cool-season legume equivalent to soybean in agronomic value. Faba bean (Vicia faba L.) has a high yield potential and is well-suited for cultivation in temperate regions, but genomic resources are scarce. Here, we report a high-quality chromosome-scale assembly of the faba bean genome and show that it has grown to a massive 13 Gb in size through an imbalance between the rates of amplification and elimination of retrotransposons and satellite repeats. Genes and recombination events are evenly dispersed across chromosomes and the gene space is remarkably compact considering the genome size, though with significant copy number variation driven by tandem duplication. Demonstrating practical application of the genome sequence, we develop a targeted genotyping assay and use high-resolution genome-wide association (GWA) analysis to dissect the genetic basis of hilum colour. The resources presented constitute a genomics-based breeding platform for faba bean, enabling breeders and geneticists to accelerate improvement of sustainable protein production across Mediterranean, subtropical, and northern temperate agro-ecological zones.


Abstract
Increasing the proportion of locally produced plant protein in currently meat-rich diets could substantially reduce greenhouse gas emission and loss of biodiversity. However, plant protein production is hampered by the lack of a cool-season legume equivalent to soybean in agronomic value. Faba bean (Vicia faba L.) has a high yield potential and is well-suited for cultivation in temperate regions, but genomic resources are scarce. Here, we report a highquality chromosome-scale assembly of the faba bean genome and show that it has grown to a massive 13 Gb in size through an imbalance between the rates of amplification and elimination of retrotransposons and satellite repeats. Genes and recombination events are evenly dispersed across chromosomes and the gene space is remarkably compact considering the genome size, though with significant copy number variation driven by tandem duplication.
Demonstrating practical application of the genome sequence, we develop a targeted genotyping assay and use high-resolution genome-wide association (GWA) analysis to dissect the genetic basis of hilum colour. The resources presented constitute a genomics-based breeding platform for faba bean, enabling breeders and geneticists to accelerate improvement of sustainable protein production across Mediterranean, subtropical, and northern temperate agro-ecological zones.

Main
Replacing meat or milk protein with plant-based alternatives, even partially, would make a large contribution to reducing carbon emissions 1 and help to mitigate climate change. An obstacle to implementing this scheme is that meat-eating countries do not produce enough plant protein 2 . Common bean (Phaseolus vulgaris), chickpea (Cicer arietinum), and soybean (Glycine max), which supply dietary protein in tropical and subtropical countries, do not attain yield optima in temperate regions of Europe and America, where meat consumption is highest globally 3 . Faba bean (Vicia faba L., 2n=12) was domesticated in the Near East more than 10,000 BP 4,5 and its broad adaptability, value as a restorative crop in rotations and high nutritional density has propelled it to the status of a global crop grown on all continents except Antarctica 6 . It is the highest yielding of all grain legumes 7 and has a favourable protein content (c. 29%) compared to other cool season pulses such as pea, lentil and chickpea, making it a suitable candidate to meet challenging projected future protein demands. Furthermore, faba bean's high biological nitrogen fixation rates 8 and long duration of nectar-rich, pollinatorfriendly flowers 9 provide important ecosystem services which mean that faba bean cultivation is increasingly seen as a key crop in sustainable intensification strategies. On the other hand, its partially allogamous mating system and estimated 13 Gb genome size, which puts it amongst the largest diploid genomes of any major crop, coupled with a low seed multiplication rate, have made it a challenging target for breeders 10 . Significant progress has been made in faba bean genomics and pre-breeding research. The mining of the first faba bean transcriptomes and development of SNP-based genetic maps, which showed strong collinearity with model legumes, set the scene for the identification of the WD40 transcription factor underlying the Zero Tannin1 locus 11 while a combination of high resolution mapping, transcriptomic, and metabolomic approaches led to the cloning of the VC1 gene controlling seed content in vicine-convicine and paved the way for safer exploitation of the crop in the human food chain 12 . The lack of a reference genome sequence greatly complicated these studies, however, and improved faba bean genomic resources are urgently needed to accelerate crop improvement.

The sequence of the giant faba bean genome
The 13 Gb faba bean genome (2n=2x=12) is one of the largest among diploid field crops (Extended Data Figure 1a, 1b) and its dominant repeat family members are longer 13,14 (up to 25 kb) than those in similarly sized polyploid cereal genomes 15 . The biggest of its six chromosomes holds the equivalent of an entire human genome. Although aiding cytogenetics 16 , these properties made genome assembly very challenging before the emergence of long and accurate reads. We sequenced the genome of the inbred line 'Hedin/2' with PacBio HiFi long reads to 20-fold coverage and assembled 11.9 Gb of sequence, more than half of which was represented by contigs longer than 2.7 Mb (Extended Data Table 1). Table 1) and chromosome conformation capture sequencing (Hi-C) data placed 11.2 Gb (94 %) into chromosomal pseudomolecules (Fig. 1a, Extended Figure 2a). Chromatin immunoprecipitation sequencing for centromeric histone H3 pinpointed the locations of the centromeres in the 'Hedin/2' assembly and arm ratios were consistent with karyotypes (Extended Data Fig. 2b). The single metacentric chromosome 1 was the only one to adopt a Rabl configuration, evident from the presence of both a main and an anti-diagonal on that chromosome in Hi-C interaction plots ( Fig. 1a). This supports the notion that chromosome arms need to be of approximately equal size to spatially juxtapose in interphase. Some regions of the Hi-C contact matrices were empty for lack of mapped short reads (Fig. 1a). These white regions coincided with the location of enormous (up to 752 Mb) satellite arrays and aligned well with cytological maps of those repeats (Fig. 1 b-c). We also collected HiFi data (10-fold coverage) for the German variety 'Tiffany' and assembled these into a set of contigs with an N50 of 1.6 Mb and spanning 11.4 Gb (Extended Data Table 1). This level of completeness and contiguity was sufficient to arrange the contigs into pseudomolecules guided by the ´Hedin/2´ reference (Extended Data  Table 1). In the future, the 'Hedin/2' assembly is expected to become the nucleus of a faba pan-genome.

Transposable elements, not polyploidy, have enlarged the genome
The genome sequence of 'Hedin/2' was annotated with RNA sequencing data from eight diverse tissues (Supplementary Table 2), resulting in a total of 34,221 protein-coding genes (Extended Data Table 2). A similar number of gene models (34,043) was also predicted in the 'Tiffany' assembly. The predicted 'Hedin/2' gene models captured 96% of single-copy orthologues conserved in Embryophyta according to the BUSCO metric (Extended Data Table   3). Gene density was uniform along the chromosomes (except for the positions of satellite DNA arrays) without the proximal-distal gradient typically observed for grass chromosomes 17 .
Meiotic recombination displayed a similar distribution with an average of 27 genes per centimorgan (Fig. 1d, Extended Data Figure 4). Thus, despite its large genome, faba bean may be more amenable to genetic mapping than cereals, where up to a third of genes are locked in non-recombining pericentric regions 17 . Gene order was highly collinear and syntenic with other legumes (Fig. 2a). To further validate gene annotation, we aligned 262 Medicago truncatula genes related to symbiosis with rhizobia or arbuscular mycorrhizal fungi and found putative orthologues for them all. In addition, we verified using RNA-seq that a large subset of these genes were responsive to inoculation as expected 18,19,20 (Supplementary Table 3).
In contrast to gymnosperms with similarly gigantic genomes 21,22 introns in faba bean genes were not larger than in angiosperms with smaller genomes (Fig. 2f) but the intergenic space was much expanded (Fig. 2b). Moreover, the number of multi-copy gene families in faba bean was similar to related diploid species (Extended Data Figure 5), in contrast to soybean, which is considered a partially diploidized tetraploid 23 . Likewise, nucleotide substitution rates between paralogous and orthologous gene pairs place the last whole-genome duplication (WGD) event in the faba bean lineage at 55 million years ago (MYA) (Fig. 2c), well before the split from other Papilionoideae 24 (Fig. 2e, Extended Data Figure 6), a taxon that also includes pea and lentil (Lens culinaris), species from which faba bean diverged around 12.2 and 13.8 MYA, respectively. Although we did not find evidence for a recent WGD in faba bean, more genes were duplicated in tandem than in pea and lentil (Extended Data Figure 7a). These duplications post-date the last WGD and occurred later than tandem duplications in Arabidopsis thaliana and M. truncatula (Extended Data Figure 7b), two species whose genomes were also rich in such events, and coincided with recent TE expansion. Overall, there were 1,108 syntenic clusters of tandemly duplicated genes in 'Hedin/2' and 'Tiffany', some of which differed in copy number. Notably, the agronomically relevant family of leghemoglobins had expanded (Supplementary Table 5). Despite this, the absence of a lineage-specific WGD or widespread gene family expansion means that the proliferation of repeat elements largely explains why the faba genome is more than seven times larger than that of its close relative common vetch (V. sativa).
Approximately 79% of the 'Hedin/2' assembly was annotated as transposon-derived (Supplementary Table 6). By far the largest group is the long-terminal repeat (LTR) retrotransposons (RLX), accounting for 63.7 % of the genome sequence. Other groups of TEs represent only minor fractions of the genome (Supplementary Table 6). Among the RLX, those of the Gypsy (RLG) superfamily outnumber Copia (RLC) elements by more than two to one ( Fig. 1d, Extended Data Figure 4). The Ogre family of Gypsy elements alone make up almost half (44 %) of the genome, confirming its status as a major determinant of genome size in the Fabaceae 14 (Fig. 2g). The great length of individual elements (up to 35 kb for Ogre and 32 kb of SIRE, the longest and second-longest elements), together with their abundance, partially explains the large size of the faba bean genome (Extended Data Figure 8). In addition, a large and diverse set of satellite repeat families that differ in their monomer sequences and genome abundance 25 accounted for 9.4% of the total assembly length, with the most abundant satellite family FokI reaching 4% (0.475 Gb). FokI, together with several other highly amplified satellites, forms prominent heterochromatic bands on faba bean chromosomes (Fig. 1c). The TE density was remarkably invariable along all six chromosomes, mirroring gene density and recombination rate, and inverse to the density of satellite arrays (Fig. 1d, Extended Data

Efficient genome-wide methylation
In addition to the relatively slow RLX elimination rate, it is also possible that lower levels of methylation could have accelerated TE proliferation through less efficient silencing. We found that most cytosines in the faba bean genomes were methylated: 95.8 % in CG, 88.2 % in CHG and 14% in CHH contexts, respectively ( Copia, occupied 48% and 11% of the genome respectively, and were heavily methylated, more so in their bodies than their flanking regions (Fig. 3b). The most recent transposon burst occurred less than 1 MYA, but many structurally intact elements were between 3 and 5 million years old (Fig. 3c). Both young and old insertions were invariably methylated in all three sequence contexts (Fig. 3d). In contrast to other plant taxa 29 , RLX insertion times and methylation levels were uncoupled. Conspicuous islands of elevated CHH methylation also coincided with the abundant satellite repeat FabTR-83 (Fig. 3e), which accounts for 1.1% of the genome. Generally, the faba bean methylation machinery appeared fully functional, efficiently methylating all classes of repetitive elements, suggesting that methylation deficiency is unlikely to have played a role in genome expansion.

Integration of QTL and variation data
The faba bean genome sequence provides a unified frame of reference for genetic mapping, gene expression profiling and comparative genomics. To assist the adoption of the new infrastructure among faba bean breeders and geneticists, we mapped markers from two commonly used genotyping platforms, the Illumina Infinium 1,536 single-nucleotide polymorphism (SNP) and the Illumina Oligo Pool Array (OPA) assays. Moreover, we projected genetic maps of both different bi-parental crosses and derived consensus genetic maps onto the genome assembly. This provided physical coordinates to quantitative trait loci (QTL) for disease resistance and phenology. Marker maps and QTL intervals can be browsed interactively at https://pulses.plantinformatics.io (Extended Data Figure 10).
The genome sequence has also paved the way for sequence-based genotyping. We mined the 'Hedin/2' assembly for oligonucleotide probes for use in Single Primer Enrichment Technology (SPET) 30 , a reduced-representation genotyping method with high-throughput and low persample costs. A panel of 197 accessions from a diversity collection were profiled with a 90,000 probe SPET assay with at least one probe in each predicted gene (Supplementary Table 7).
Sequence reads were mapped to the 'Hedin/2' assembly and 1,081,031 segregating variants (SNPs) uniformly distributed along the genome were called, laying the foundations for highresolution GWAS analysis (Fig. 1d, Extended Data Figure 11). Population structure analysis by model-based ancestry estimation and principal component analysis (PCA) divided the diversity panel into four groups, corresponding to their geographic origin (Extended Data Figure 12).

Hilum colour mapping with candidate-gene resolution
To illustrate how a GWAS approach, in conjunction with multiple genome sequences, can help reveal the molecular basis for trait variation, we investigated the genetic control of seed hilum colour (Fig. 4a), which is an important quality trait 31 . We first carried out a GWAS for hilum colour using the diversity panel and identified a single prominent peak that was coincident both with the previously mapped Hilum Colour (HC) locus 31 and peak homozygosity in a recessive pale hilum bulk of segregants from a cross between pale and dark hilum faba bean varieties (Fig. 4b). We found the most highly associated GWA marker in a polyphenol oxidase (PPO) gene residing in a cluster of eight fully intact and highly conserved PPO genes in the 'Hedin/2' assembly. In pea, PPO variation controls hilum colour, and a frameshifted, nonfunctional form of the single PPO copy residing at the syntenic Pl locus conferring a pale hilum is fixed in all modern pea varieties 32 . The pattern of pigmentation (Fig. 4a) and content in oligomeric phenolic compounds at the hilum surface in faba bean ( Fig. 4c and Extended Data Figure 13) were very similar to those observed in pea 32 . Together with the genetic data, this indicates that differential PPO activity is responsible for hilum colour variation in both pea and faba bean, but it was unclear which faba bean PPO(s) may be causative.
To clarify, we compared the phylogeny and structure of the PPO clusters of the two fully sequenced genotypes 'Hedin/2' (dark) and 'Tiffany' (pale). VfPPO-2 shared the highest level of identity with the causal pea gene Psat1g2063360 (Fig. 4d), whereas the most strongly associated GWA marker was found in VfPPO-3 and the pale hilum bulk homozygosity peak sat between VfPPO-2 and VfPPO-3, suggesting that the causal polymorphism resided at the proximal end of the cluster (Fig. 4e). Structurally, apart from large differences in intergenic distances between syntenic PPOs caused mainly by Ogre insertions, the most striking features of the 'Hedin/2'-'Tiffany' comparison were the triplication of VfPPO-4 in 'Tiffany' and the absence of VfPPO-5 in 'Hedin/2' (Fig. 4e), prompting us to investigate whether these structural variations were associated with variation in PPO gene expression. We first established that transcription of the PPO gene cluster was almost exclusively confined to the maternal testa tissue (which encompasses the hilum), rather than the cotyledon in both genotypes (Fig. 4d, Extended Data Figure 14, Supplementary Table 8, Supplementary Table 9). In 'Hedin/2' testa, VfPPO-2, and to a lesser extent VfPPO-3, accounted for nearly all PPO expression. In contrast, 'Tiffany' testa PPO expression was dominated by VfPPO-6 and VfPPO-7 (Fig. 4d). A detailed annotation and comparative repeat analysis of the PPO cluster region (Extended Data Figure   15) highlighted a c. 2kb AT-rich MITE insertion in the TiPPO-2 promoter region (Figure 4f), which interrupts the sequence of predicted VfPPO-2 promoter and belongs to a class of MITE associated with high levels of methylation (Figure 4g). Taken together, our results suggest that regulation of expression of VfPPO-2 controls hilum colour variation in faba bean. Beyond suggesting a causative mechanism for pale hilum in faba bean, our analysis illustrates that increased copy number does not necessarily correlate with trait expression and emphasizes the utility of complete genome sequences from multiple genotypes.

Discussion
Faba bean is one of the earliest domesticated crops. It was part of the Neolithic package of crops that the early farmers took with them as they left the Fertile Crescent 33 and concern about the bean's toxicity was voiced already in classical antiquity 34 . In the 21 st century, nutritional quality remains a central breeding goal: new faba bean varieties should be low in the alkaloid glycosides vicine and convicine as well as in tannins; essential amino acids should be balanced better to accommodate human dietary needs; seed phytate and protease inhibitors should be reduced to improve nutrient bioavailability; all while taking care not to compromise pest resistance and improving yield stability. Faba bean breeders can now face these challenges enabled by genomic resources and insights. Ubiquitous and frequent recombination will allow rapid introgression of new traits into elite material and permits powerful and broadly applicable mapping approaches exploiting the high SNP densities provided by SPET genotyping. Pinpointing causative variants can still be difficult in genomic regions with tandemly duplicated genes, but our investigation of hilum colour demonstrates that these challenges can be overcome using high-quality long-read assemblies coupled with transcriptomics. Repeats and their methylation influence genome evolution, but can also coordinated the project.

Additional information
Supplementary information is available for this paper.

Genome size estimation and quality assessment
The distribution of the k-mer (K=101) frequency was estimated from PacBio HiFi reads using Jellyfish (v2.2.10) 11 . The output histograms were used to estimate the genome size and heterozygosity using findGSE 12 . The completeness of the assembly was assessed by two intendent approaches; i) self-alignment of HiFi reads to the assembly by minimap2 followed by SV calling using Sniffles 13 ; ii) BUSCO (v3.0.2b) 14 analysis with Embryophyta database 9.
The output histograms were used to estimate genome size and heterozygosity using findGSE.
Assembly was performed using hifiasm v0.15.5-r350 (-l 0). The completeness of the assembly was assessed by aligning HiFi reads back to contigs and calling structural variants using cuteSV v1.0.11 16 . Despite there being no obvious heterozygous peak on the k-mer plots ( Figure AS1), we observed a higher proportion of BUSCO duplicate genes in Tiffany compared to Hedin/2 and a slight over-estimation of genome size with findGSE. In addition, we also noted a number of short contigs with read coverage about half of the expected, suggesting the presence of regions of heterozygosity in the otherwise mostly homozygous genome. We therefore performed haplotig purging using purge_haplotigs v1.1.2 17 (purge_haplotigs cov -l 3 -m 7 -h 25). Chromosome-level scaffolds were constructed with RagTag v2.0.1 18 using the haplotigpurged assembly. To confirm the success of scaffolding, Hedin/2 and Tiffany chromosomes were aligned using GSAlign v1.0.22 19 . We compared two approaches for Tiffany annotation in order to choose one most suitable for comparative analyses: (i) individual annotation of genomes; (ii) a "transfer and gap fill" approach (Extended Data Figure 16), as described below.
Overall, we observed that the transfer and gap fill approach resulted in more syntenic genes and more genes with the same CDS length. Visual inspection of annotation suggests that with the individual annotation approach, alternative transcripts of genes are annotated in some cases, which results in different CDS lengths for syntenic genes. The issue was largely resolved by using the transfer and gap fill approach. Completeness of the annotation was assessed for Hedin/2 and Tiffany by aligning one Iso-Seq dataset 28 and assembled transcriptomes produced for cvs. Hiverna, Dozah and Farah.

Gene model annotation
Transcriptomes were mapped using GMAP v2020-10-14 and comparisons between those mappings and the annotations were made using bedtools 29 . Gene models that had been removed by polishing, but which intersected mapped transcripts, were rescued if the transcript wasn't a putative transposable element. R genes were detected on the unpolished and polished annotations using RGAugury 30 . R genes present in the unpolished annotation but not in the polished one were also rescued. The coding potential for each transcript was computed with CPC2 31 . The mRNAs with low coding potential were reclassified as lncRNAs.
Genes of which at least 50% overlapped a transposable element domain were removed.
Finally, any proteins that contained in-frame stop codons after phase correction were also removed. The completeness of the final gene set was assessed using BUSCOv5.2.2 with the embryophyta_odb10 and fabales_odb10 databases.
To ensure maximum comparability between annotations for Tiffany and Hedin/2 we employed the transfer and gap-fill strategy. In this approach, Hedin/2 complete protein coding gene annotations were transferred onto Tiffany assembly using Liftoff v1.6.1 32 . Transcripts with in-frame stop codons were removed and gaps between transferred genes were filled with Tiffany genes from the annotation obtained using the method described above, corresponding to genes removed due to in-frame stop codons and those which were Tiffanyspecific.

Symbiotic gene discovery
Total RNA sequencing was carried out for three biological replicates per condition. Eighteen weighted t-type test statistic. We then calculated a false discovery rate correction for multiplehypothesis tests 34 . Only genes showing a difference of 10 reads between compared conditions were considered as significantly expressed.

Orthologous gene family identification
Genes from 19 legume species (Extended data Table 4) were clustered to determine the orthologues relationship. The protein sequences from these species were aligned to each other using BLASTP v2.2.26 35 with the parameter "evalue 1e-5". The results were then used to cluster the gene families by OrthoMCL v2.0.9 36 .

Phylogenetic analysis and divergence time estimation
The single-copy genes identified from 19 legume species (Extended data

Whole genome duplication
The whole-genome duplication of V. faba, M. truncatula and P. sativum were estimated using the collinearity within each genome. First, synteny regions were identified using MCScanX 39 .
Then, the gene pairs in the synteny regions were used for 4dtv (four-fold degenerate transversions) calculation. The transversion rate was corrected by the HKY 40 model. The synonymous (Ks) and non-synonymous (Ka) substitution was estimated by KaKs_Calculator

Tandem duplicate gene discovery
Tandemly duplicated genes were also discovered using the CRBHits v0.0.4 package 42 function tandemdups. To confirm the results, genes were also classified using DupGen_finder 43 , with Arabidopsis thaliana serving as the outgroup. Vicia sativa was excluded from TD analysis due to suspected fragmentation of its structural annotation, which could result in inflation in the number of genes annotated as TDs (Supplementary Table 4). The age of duplications was estimated using T = K s /2 r, r = 1.5 × 10 −8 . Ks was calculated using CRBHits using method 'Li'.
Synteny between Hedin/2 and Tiffany genes was analysed using CRBHits v0.0.4 package function rbh2dagchainer (type = "idx", gap_length = 1, max_dist_allowed = 20), which internally uses the DAGchainer algorithm 44 . Syntenic TDG clusters were discovered by connecting TDG clusters in individual genomes using the syntenic gene pairs found between Hedin/2 and Tiffany. To minimize the effect of unplaced genes on copy number variation (CNV) analysis, as unplaced genes can result in spurious CNV calls, we corroborated the synteny-based results with Orthofinder 45 analysis. Only clusters that had the same or higher copy number in the same genotype, based on both synteny and Orthofinder results (for Orthofinder only genes on the matching chromosome and unplaced genes were considered), were retained for further analysis. Syntenic clusters were functionally annotated with Human Readable Descriptions (HRDs) using prot-scriber v0.1.0 (https://github.com/usadellab/protscriber).

Hilum colour and Histology
To examine hilum morphology, seed coat containing hilum from inbred lines Hedin/2 (dark hilum) and Tiffany (pale hilum) were dissected from mature dry seed, saturated with 2% sucrose solution under vacuum for 1 hour and embedded in cryo-gel media (Cryo-gel Leica).
Observation and photography was done on an Olympus BX 51 microscope (Olympus Corp., and similarity between sequences was visualised using FlexiDot.

Gene Expression Analysis
RNA was extracted from 100mg of flash-frozen dissected tissue (testa and cotyledon) using a Transcript level abundance was converted to gene level abundance using tximport.

Extended Data Tables
Extended data Table 1 Extended Data Figure 5. A plot of homologous gene number in each of 19 species. Single-copy orthologues, single-copy homologous genes in the gene families shared among species; multiplecopy orthologues, multicopy homologous genes in gene families shared among species; unique paralogues, genes of the strain unique to the family; other orthologs, all other genes; unclustered genes, genes not clustered into any family The horizontal bars represent the number of proteincoding genes.   Number of ancestral populations py Extended Data Figure 11. a, Summary of high-quality SNPs identified in each of the six chromosomes. b, Distribution of SNPs along each chromosome. c, The optimal K value estimated by Admixture analysis.