Intra-bin structural variant segmentation for whole-genome sequencing data using U-net

Motivation: Structure variants (SVs) are complex genomic alterations with affected genome sizes larger than 50 nucleotides. They have been found in both healthy individuals and patients, which contribute to phenotypic traits and are highly correlated with diseases, such as cancer. With the advance of sequencing technologies, more precise proﬁling of SVs becomes available. It is still a challenging task to determine SVs in nucleotide-resolution with satisfying both high accuracy and coverage for a single sample. Result: In this paper, we proposed a novel nucleotide-resolution SV detection method for determining SV fragment inside of a bin with base-pair read-depth (RD) information. Distinguished from traditional RD based SV detectors, which usually require grouping and smoothing RD signals at the scale of a bin, we use a U-net model to directly process base-pair RD signals inside of a bin and learn feature representations for intra-bin SV segmentation. The proposed method can be used solely for searching target SVs with RD signature or integrated with previous bin-based SV detection methods to exceed the bin-size limitation. We did a systematic evaluation of the method using WGS data from 1000 Genomes Project. In both single-sample and cross-sample experiments, the U-net model achieves signiﬁcantly better performances on detecting intra-bin SV fragments when compared with convolutional neural networks (CNN). Furthermore, the model shows a good ability to detect speciﬁc SV subtypes, such as ALU deletion.


Introduction
Structural variants (SVs), including copy number variations (CNVs), are genomic alterations larger than 50 base pairs (bp). When compared with single nucleotide polymorphisms (SNPs), the larger size of SVs makes them more likely to alter genomic structures and have functional consequences. In many diseases, such as Alzheimer's disease and many types of cancer, SVs have been found as important contributors [1,2,3,4]. To identify SV patterns associated with diseases, complete discovery and accurate genotyping of SVs is the first and foremost step.
Currently, there are mainly two types of platforms used for SV detection and genotyping. One is array based, such as array comparative genomic hybridization (aCGH) and SNP array, which is relatively cheap. The other one is next-generation sequencing (NGS) based, which provides higher resolution of SV profiles. For NGS data, many bioinformatics methods have been developed, which utilize different characters of reads related to SVs. Those methods can be categorized into four types according to used read information: read depth (RD)(or read count) [5], paired-end mapping (PEM) [6], split-read (SR) [7] and assembly-based (AS) [8]. An RD based approach assumes the number of reads in a genomic region is proportional to the number of times the region appears in the sample. In general, it fits a random distribution (e.g., negative binomial) to read counts and finds regions with abnormal read counts according to the fitted model. It detects deletion and duplication and is commonly used to detect CNV. PEM uses span and orientation of paired-end reads, which can detect several forms of SVs, such as inversion and tandem duplication, beside deletion and duplication. SR can find nucleotide-resolution breakpoints of SVs based on the observation that an adjoining short-read sequence is separately aligned in different coordinates of a reference genome. AS is a general approach that first does de novo assembly of sample genome based on sequenced reads and then compares the assembled genome with reference genome for structural differences. Theoretically, all types of SVs can be detected by an AS method. In practice, it usually requires high-depth of coverage for genome assembly and has more computational cost. Those SV detection methods vary in both sensitivity and specificity. Methods that integrate more than one type of read signals are also developed. For example, DELLY [6] integrated both PE and SR for SV detection. In this work, we focus on the RD based approach and aim to exceed the resolution of bin signals.
On the other hand, in recent years, deep learning has achieved impressive results for several tasks in computational biology, such as predicting sequence specificity of binding [9,10]. For deep learning applications related to SV detection, Poplin et al. proposed DeepVariant [11], which uses a well-calibrated convolutional neural network (CNN) to call SNP and small-indel variants based on image pileups around putative variant sites. Eghbal-zadeh et al. [12] used SNP-array genomic data and applied deep neural network with attention mechanism for breakpoint detection. Those works formulate the detection task as a classification problem and trained deep networks to predict SV labels.
In this paper, we focus a task of finding somatic SV fragments inside of a bin using base-pair RD signals, which refers to as "intra-bin SV segmentation". Usually, base-pair RD signals are too noisy to be directly used. In previous RD-based approaches, non-overlapping bins with fixedlength are applied to group base-pair RD signals and calculate bin-based RD signals for SV detection. Therefore, the resolution of breakpoints detected by those approaches is naturally limited by the bin-size. For an SV inside of a bin, it is easy to be overwhelmed. To conquer this limitation, we propose to use a U-net model for the task. The U-net [13] is the state-of-the-art deep learning model widely used in many image semantic segmentation tasks. The encode-and-decode architecture makes it possible to directly process noisy base-pair RD signals of wholegenome sequencing alignment data. Meanwhile, instead of formulating as a classification task, we treat the problem as a segmentation task, which can provide higher-resolution of breakpoints inside of a bin. The proposed method can be used solely for searching target SVs inside of a bin or integrated with previous bin-based SV detection methods to exceed the bin-size limitation.
We evaluated model performance on real WGS data and performed systematic experiments for intra-sample and cross-sample applications. The experiment results demonstrate that a U-net model can be trained to learn base-pair RD signatures of annotated SV segments inside of a bin. When compared with a CNN model, the U-net achieves significantly better performance and produces more integrated segments. In addition, the Unet model shows a good ability to detect SV subtypes related to mobile elements, such as ALU deletion. We also evaluate the segmentation results in a simplified SV classification task. The U-net model also achieves better AUC score, when compared with SVM.

Methods
In this section, we formulate the intra-bin SV segmentation task and describe it in an integrated pipeline. Then, we describe the U-net model in detail for the segmentation task.

Overall pipeline
We first describe a pipeline, in which the proposed model is supposed to be used. The overall pipeline is described in Figure 1. It is a two-stage pipeline, which consists of coarse-and-global search part and fine-andlocal segmentation part. For a sample, the first stage applies bin-based SV detection using previous RD-based method. Base-pair RD signals are grouped by a fixed-length bin (e.g., 1000 bp) as the basic unit for SV detection. In this stage, the whole-genome is globally screened and a list of bin-resolution breakpoint candidates are generated. In the second stage, regions surrounding the breakpoint candidates are zoomed-in with nucleotide-resolution. An intra-bin segmentation model is applied to find SV fragments inside of the regions and acquires higher-resolution breakpoints. In the following part, we focus on the second stage of the intra-bin segmentation.

Problem definition
Formally, give read-depth vector X = {d 1 , d 2 , ..., d l } for a target l-bp region, we aim to find its corresponding segmentation Y = {m 1 , m 2 , ..., m l }, where m i ∈ {0, 1} indicates whether local coordinate i is overlapped with any SV. It is a challenging task, due to the fact that boundary signal is noisy and ambiguous. [13] is a deep neural network featuring with a U-shape architecture. It has achieved state-of-the-art performance in many image semantic segmentation tasks. We propose to apply U-net model to genome field for SV detection. The structure of the U-net is described in Figure 2(c). It basically is an encode-and-decode architecture, in which left-U part encodes and right-U part decodes. The left-U contracts input signals through repeatedly applying convolutions, each followed by a rectified linear unit (ReLU), batch normalization (BN) and max-pooling operation. The right-U expands the down-sampled feature map by up-convolution and concatenate the correspondingly feature map from the left-U part. Each concatenated feature map is followed by two additional convolutions with ReLU and BN. At the output layer, 1x1 convolution is applied with sigmoid activation to predict mask vectors. Different from a classic CNN model, there is no fully connection layer in a U-net and tensors before down-sampling are concatenated with the up-sampling tensors to keep the position information. In applying U-net to the intra-bin SV segmentation task, we have to guarantee the output length exactly the same as the input length for locating breakpoints in the reference genome. In our U-net model, we use padded convolutions and select proper max-pooling pool size to avoid inconsistency of hidden numbers in the correspondence level of the network. There is no cropped feature map in the right-U part. The output layer is activated by a softmax function. We transform the value above the threshold of 0.5 to be 1, and others to be 0.

Training
To train a U-net model, we collect WGS data with SV annotation. For each SV, we generate regions (bins) with a fixed-length that contain breakpoint(s) of left-end or right-end. The breakpoint is not required in the centric position of a bin. We take a random shift to each breakpoint as the start position of the bin. For each breakpoint containing bin, SV overlapping coordinates are masked with 1, while the rest are masked with 0. We generate the same number of background regions without overlapping any SV as negative samples.
We use Dice similarity coefficient (DSC) as the loss function, and Adam [14] as the optimizer.
Where s gold and s pred are nucleotide label masks ∈ {0, 1}, and is a tolerance needed when there is no 1-masked label for both gold and predict vector. We apply early stopping with minimum 10 epochs and maximum 100 epochs for the training process.

Testing
The key assumption of applying a supervised model for intra-bin SV segmentation task is that read-depth signals surrounding a breakpoint of a specific SV type share a similar pattern for WGS samples with similar experiment settings (read-depth, especially). After training a U-net model, we apply the trained model to the same sample or a different sample with similar sequencing setting. For the different sample case, the read-depth signals are standardized separately. Putative regions are generated before

Data
In this paper, we evaluated the proposed model based on Illumina WGS data from the 1000 Genomes Project (1kGP). NA12878 was used as the benchmark data as previous works. We mainly used two typical sequencing depths, which refer to as 7x and 60x. In addition to the 7x and 60x data, we used 300x WGS data of NA12878 from NIST's Genome in a Bottle (GIAB) Project. For cross-sample evaluation, we used the following samples of both 7x and 60x data: NA19238 and NA19239 from YRI, NA18939 from JPT, HG00419 from CHS. WGS data were first aligned to a reference genome. BWA-MEM [15] is a choice to align reads to the reference genome of GRCh37 (primary reference including unplaced contigs) as in 1kGP. PCR duplicates were marked for downstream analysis. For those sample with alignment, we directly downloaded the released run-level bam files from 1kGP.
We generated data for the task based on autosomal SV annotation from 1kGP Phase 3 [16]. We parsed the VCF file and selected all SVs containing START and END position information for the above five samples. In total, four types of SV were included: DU P , DEL(_ * ), and CN V , which represent duplication, deletion, multi-allelic CNV, respectively. The insertions of mobile elements (SV A, LIN E1 and ALU ) were excluded for lacking END position information. SVs with length less than 50 bp were filtered out. For deletion, subtypes of DEL_SV A, DEL_ALU and DEL_LIN E1 were included. For each sample, the number of each SV type is summarized in Table 1. For test samples used in the crosssample evaluation, the number of the NA12878 non-overlapping SVs is also shown in the table.
We split SVs into training and test set for two experiment settings, separately. For the intra-sample setting, besides the stratified 5-fold crossvalidation, SVs were randomly shuffled and split into stratified training (80%) and test set (20%). For the cross-sample setting, SVs from one sample (NA12878) were used for the training set, while SVs from the other samples were used for test set. We then generated fixed-length bins used for model training and testing based on the SVs. The genome coordinates of each bin were determined following the method described in section 2.3.1. We extracted base-pair read-depth information for each bin X with pysam (Version 0.15.0). Low base quality (Phred < 30) ones were ignored in the depth counting. For those bins with all-zero values (no signal at all), we filtered them out. For each sample, bin vectors X were standardized using the mean and standard derivation of its read-depth. We split whole-genome into non-overlapping bins with the pre-fixed bin-size and filtered out low mappability (mappability score < 0.9) bins based on the mappability file from ENCODE project. We then randomly sampled 1% of the bins for calculating the mean and standard derivation.

Evaluation metric
For the segmentation task, we evaluated models using Dice similarity coefficient for all bins (DSC-ALL) and SV-containing bins (DSC-BK). For the breakpoints of SV in 1kGP, the precision of each one is different. Some of them do not contain confidence interval (CI) information (CIcontained:32182 SVs, no CI-contained:18828 SVs). Therefore, instead of evaluating the position of those breakpoints exactly, we used Dice similarity coefficient as an approximation. Up to the point of this work, there was few work focusing on predicting SV segments using base-pair RD information for a single sample. To compare model performances with an SVM model, we generated SV-containing and non-SV-containing labels based on the segmentation results for a simplified classification task. The classification performance was evaluated using AUC, sensitivity and false discovery rate (FDR).

Model parameters
Besides the U-net model (UNet), we applied a classic two-layer convolutional neural network (CNN) for the segmentation task. For training those deep learning models, we split 20% of the training data as the development set and used hyperOpt [17]   We first performed 5-fold cross-validation on the benchmark data of NA12878 for single sample evaluation with a bin-size of 1000 bp. The classification and segmentation results are shown in Table 2. For all the models, performance improves with the increase of read-depth. In the binary classification task, there are 9.1%, 11.5% and 14.3% relative increase (with statistical significance) on AUC from 7x to 60x for SVM, CNN and UNet, respectively. From 60x to 300x, there are 5.7%, 6.3% and 3.9% relative increase with statistical significance over AUC. For each model, the sensitivity increases and the FDR reduces with the increase of read-depth. In the segmentation task, the improvement is more significant from 7x to 60x, due to the lower performance on 7x data. There are 24.6% and 26.4% relative improvement over DSC-ALL score for CNN and UNet. From 60x to 300x, the gains reduce to 5.8% and 3.7% for CNN and UNet. On all the three depth data, UNet performs better the other two models. On 7x data, the classification performances are relatively close. UNet achieves a slightly higher AUC with lowest FDR. On 60x and 300x data, UNet performs significantly better with both higher sensitivity and lower FDR. When compared with CNN in segmentation task, DSC-ALL score of UNet is 6.7% and 4.6% higher on 60x and 300x data. We empirically evaluated how read-depth affects model performance in a more refinement step through down-sampling the NA12878 60x WGS data. We conducted 5-fold cross-validation for the WGS data of different down-sampling rate. In Figure 2, we can observe a general trend of performance improvement as the increase of read-depth. Curves in the region of down-sampling rate below 0.5 are relatively more steep when compared with curves in the rest region. It indicates the difficulty of doing intra-bin SV detection for a single sample in nucleotide-resolution using low-depth data. Based on performance gaps shown in the figure, 30x and 40x are empirically suggested minimal read-depth used for for intra-bin SV classification and segmentation for a single sample. Table 3. 5-fold cross-validation results of 100 bp, 500 bp and 1000 bp bin-size on 60x data. For UNet and CNN trained on GPU, we repeated fives times for the same setting and calculated the average results. The underline indicates the best score for each model on the three bin-size and bold font indicates the best performance among all models on all the three bin-sizes.

Bin size effect
We examined model performances with three different bin-sizes: 100 bp, 500 bp and 1000 bp for the WGS data of 60x read-depth. In the binary classification task, the AUCs do not show significant differences among the three bin-size settings for SVM and CNN. For SVM, the AUC slightly decreases as the increase of bin-size from 100 bp to 500 bp. CNN gets its best AUC of 0.8518 with 500 bp bin-size. For UNet, AUC is improved as the increase of bin-size. The best classification result achieved by UNet is 0.8846 at the bin size of 1000 bp. For all the three models, FDR reduces as the increase of bin-size. In the segmentation task, we can observe a significant performance gap between 100 bp and 500 bp (round 4.9% and 7.4% improvement in DSC-ALL from 100 bp to 500 bp for CNN and UNet, respectively), while the performances are close but different with statistical significance between 500 bp and 1000 bp. From a data perspective, as the increase of the bin-size, more context RD signals are considered. This is a benefit for reducing FDR of non-SV-containing bins, while it also has an effect of sensitivity reduction for SV-containing bins. From a model perspective, SVM is a shallow model with equivalent one-flatten-layer. CNN and U-net are deep neural networks consist of hierarchical multiple layers and shared local filters for the input layer. The local shared structure makes them less sensitive to the change of bin size in lower layers when compared with SVM. But the number of hidden units in a higher level reduces significantly as bin size decreases. This could be a reason that the performances of deep learning models decrease significantly when the bin-size is less than 500 bp. This effect is more obvious for UNet, which has more layers. In general, for each of the three bin-size, the deep neural network models (CNN and UNet) perform better than SVM, and UNet achieves the best performance in both classification and segmentation task.

Intra-bin segmentation result analysis
In this section, we investigate the intra-bin segmentation performance in more detail. We show the classification and segmentation result for each SV type in Figure 3 (results on 300x data is shown in the supplement document). As the test set is stratified, the distribution of SV types in the figure also reflects the overall SV distribution on the whole dataset. For SV-overlapping bins, the major SV type is the deletion, which holds around 92%. The rest SV types are CN V (7%) and DU P (0.2%) For deletion, the proportion of four subtypes in all SVs are: DEL(60%), DEL_ALU (31%), DEL_LIN E1(1%) and DEL_SV A(0.2%). Under such data configuration, the trained intra-bin segmentation models mainly learn features associated with the dominant-label of deletion. Consistent with the cross-validation results, the performance improvement brought by higher RD can be also observed for major labels (number >=50) of BG, DEL and DEL_ALU in both CNN and UNet. For non-sv-overlapping bins of BG, more correct predictions are made by UNet than CNN on 60x data. For major sv-overlapping bins of DEL and DEL_ALU , UNet not only classifies more bins correctly (first row of Figure 3), but also gives better segmentation results for those bins on both 7x and 60x (second row of Figure 3), when compared with CNN. For the minor sv-overlapping bins of CN V , DU P , DEL_SV A and DEL_LIN E1, the performances are more variant due to the small number of sample. From the figure, UNet performs slightly better than CNN on the segmentation of CN V bins on both 7x and 60x data. CNN performs slightly better than UNet for DEL_SV A and DEL_LIN E1 on 7x data, and the gap further reduces on 60x data. Both UNet and CNN achieve good performances on DEL_ALU segmentation, which is even better than their performance on DEL bins. On 60x data, UNet correctly detects 98.4% DEL_ALU bins in a total of 248 DEL_ALU bins and achieves 0.957 DSC-BK score. CNN detects 97.1% DEL_ALU bins, but has a lower segmentation performance of 0.893 DSC-BK score.
We demonstrate the examples of the intra-bin segmentation in Figure . We chose examples from the major bin labels of BG, DEL, DEL_ALU and CN V . CNV and UNet segmentation results are given for the selected bin regions. In general, based on the segmentation results on the test data, the UNet shows two advantages. First, UNet predicts more integrated segments. For example, in Figure (e,f,g,h), CNN makes more fragmentary segments, especially around breakpoint boundaries, while the segments given by UNet are integrated (Figure (b,c,d)). Second, UNet has lower false discovery fragment. As shown in Table 2, UNet achieves the lowest FDR. In the Figure (a), for the bin region of Chr18:(73901801, 73902801), the RD signal is even difficult for human to make a judgment. The UNet model successfully avoids any segmentation on the region, while generates closely correct segmentation for a "not so obvious" signals in the bin of Chr11:(134601496, 134602498) (Figure (d)). CNN makes a false positive segment in the region of Chr18:(73901801, 73902801).

data augmentation
We conducted data augmentation to increase the amount of the training data. For each SV breakpoint, we took a different random shift to get a new start position deviating from the SV breakpoints. To keep the data balance, we generated the same amount of bins without overlapping any SV breakpoints through random sampling. Each augmentation epoch generates the same amount of data as the original dataset. We did 10 epochs on the training data and evaluated on the non-augmented test set. The result is summarized in Table 4.
In general, for all the three models, FDR decreases after the data augmentation. But sensitivity does not necessarily increase. On 7x data, the performance of data-augmented UNet even decreases for the significant decrease of the sensitivity. Because the SV annotation used here is not perfect and 7x data is noisy for its low coverage. Although the data augmentation has a general positive effect on overall performance for SVM and CNN, UNet without data augmentation still performs better than SVM and UNet with the data augmentation.

Cross-sample evaluation
To investigate whether the segmentation model can be generalized across different samples, we performed a cross-sample evaluation. We trained on the whole dataset of a sample and tested on the non-overlapping regions on the other. In this experiment, the benchmark data NA12878 was used as the training set, while the other four samples were used for testing. The results on 60x depth WGS are shown in Table 5. Note that 60x depth is not the exact average depth for every sample, it is a rough indicator for samples with similar sequencing depth. For the three samples of NA19239, NA18939, and HG00419, UNet achieves the highest score on AUC and DSC(-ALL/-BK). For NA19238, CNN performed slightly better than UNet (around 0.009 and 0.003 absolute improvement on AUC and DSC-ALL, respectively), but it is not statistically significant (p-value AU C =0.13, pvalue DSC_ALL =0.77). In all the four test samples, UNet has the lowest FDR. The two deep models, CNN and UNet, perform better than SVM in the classification task on the dataset. From the results, we can observe that UNet keeps its low FDR character in the cross-sample evaluation, which is important for detecting recurrent SVs across different samples.

Discussion
In this paper, we focused on the challenging task of segmenting basepair RD signals inside of a bin. We proposed a novel approach of using U-net model to directly process noisy base-pair RD signals, which was impractical for directly applying previous statistical RD-based approaches. The U-net model uses an Encoder-and-decoder network structure and generates segmentation on original sequences. To the best of our knowledge, this is the first attempt to apply the U-net model for genome data. We conducted a set of experiments on real Illumina WGS data. In both segmentation and simplified classification task, the U-net model generally achieves better performances when compared with CNN and SVM. We also empirically evaluated the effect of read-depth and binsize for different model performances. In the cross-sample evaluation, the experiment shows the generalization of the U-net model across different samples with a similar read-depth. The proposed method can be used in the following two scenarios. First, searching for specific SVs inside of a bin for a single sample. Given a set of target SV annotations with base-pair RD signals, we can train a UNet model and search similar RD signatures associated with the target SVs on a whole-genome scale. Second, cross sample SV detection, which is used as an enhancement to traditional bin-based approaches. For the recent work of DeepVariant, it applied CNN on pileup images to detect SNP and small-indel variants in putative regions (the bin-size is usually set to 100 bp). This work is different from the deepVariant in the following two aspects. First, we focused on SVs that are larger than 50 bp and use one-dimensional base-pair RD signals as the input. Second, for SV detection, we formulated the problem as a segmentation task instead of a classification task. Therefore, the proposed U-net model can not only predict whether a bin contains any SV, but also can make SV segmentation for the bin in nucleotide-resolution.
The proposed U-net model can be also applied to the RD signals generated by long-read sequencing, such as the single molecule realtime (SMRT) sequencing technology from Pacific Biosciences (PacBio). When compared with the short-sequencing data, the depth-of-coverage signals in long-sequencing data is relatively more smooth than that of shortsequencing data. Therefore, we expect the long-read sequence would be easier for the U-net model to do the intra-bin segmentation. In this work, we only used RD signals to detect intra-bin SV segments. For the reason that the SV annotation is assembled from multiple sources of information, for some cases, using RD information only is not possible to make exact predictions. As an automatic feature learning framework, it is more flexible to incorporate other sources of biological signals in the U-net. We intend to explore other SV associated biological signatures besides the RD signals in the framework of U-net. Also, we can also apply the U-net framework to other similar tasks, such as transcript boundary detection [18].