Real-time resolution of short-read assembly graph using ONT long reads

A streaming assembly pipeline utilising real-time Oxford Nanopore Technology (ONT) sequencing data is important for saving sequencing resources and reducing time-to-result. A previous approach implemented in npScarf provided an efficient streaming algorithm for hybrid assembly but was relatively prone to mis-assemblies compared to other graph-based methods. Here we present npGraph, a streaming hybrid assembly tool using the assembly graph instead of the separated pre-assembly contigs. It is able to produce more complete genome assembly by resolving the path finding problem on the assembly graph using long reads as the traversing guide. Application to synthetic and real data from bacterial isolate genomes show improved accuracy while still maintaining a low computational cost. npGraph also provides a graphical user interface (GUI) which provides a real-time visualisation of the progress of assembly. The tool and source code is available at https://github.com/hsnguyen/assembly.


Introduction
Sequencing technology has reached a level of maturity which allows the decoding of virtually any piece of genetic material which can be obtained. However, the time from sample to result remains a barrier to adoption of sequencing technology into time critical applications such as infectious disease diagnostics or clinical decision making. While there exists real-time sequencing technology such as Oxford Nanopore Technologies (ONT); algorithms for streaming analyses 15 of such real-time data are still in their infancy. Effective streaming methodology will help bridge the gap between potential and practical use.
One particular strength of ONT technology is the production of ultra long reads. This is complementary to the dominant short read sequencing technology Illumina which is cheaper and has higher per base read quality but is unable to resolve the complex regions of the genome due to its read length limitation. A natural combination of the 20 two technologies is in hybrid assembly of genomes. Previously, we developed npScarf [1] an algorithm to scaffold a draft assembly from Illumina sequencing simultaneously with ONT sequencing. However, npScarf ignores the rich connectivity information in the short read assembly graph, and as a result is prone to mis-assembly.
Here, we present npGraph, a novel algorithm to resolve the assembly graph in real-time using long read sequencing. npGraph uses the stream of long reads to untangle knots in the assembly graph, which is maintained in memory. Because 25 of this, npGraph has better estimation of multiplicity of repeat contigs, resulting in fewer misassemblies. In addition, we develop a visualisation tool for practitioners to monitor the progress of the assembly process.

Results
Resolving assembly graph in real-time npGraph makes use of an assembly graph generated from assembling short reads using a de Bruijin graph method such 30 as SPAdes [2], Velvet [3] and AbySS [4]. The assembly graph consists of a list of contigs, and possible connections among these contigs. In building the assembly graph, the de Bruijin graph assembler attempts to extend each contig as far as possible, until there is more than one possible way of extending due to the repetitive sequences beyond the information contained in short reads. Hence each contig has multiple possible connections with others, creating knots in the assembly graph. npGraph uses the connectivity information from long reads to untangle these knots in real-time. 35 With sufficient data, when all the knots are removed, the assembly graph is simplified to a path which represents the complete assembly.
npGraph aligns long reads to the contigs in the assembly graph. When a long read is aligned to multiple contigs, npGraph constructs candidate paths that are supported by the read. This strategy allows npGraph to progressively update the likelihood of the paths going through a knot. When sufficient data are obtained, the best path is confidently 40 identified and hence the knot is untangled. In general, the bridging algorithm used to determine the best path is a combination of progressive merging, accumulated scoring and decision making modules. It operates on each pair of unique contigs, or anchors, by using a bridge data structure maintaining 2 anchors and a list of steps in-between as shown in Figure 1a. A set of candidate paths are listed and the best one can be selected amongst them given enough  Figure 1: Graph resolving algorithm. 1a the bridges suggested by long reads are merged progressively with dynamic programming to find the best path connecting 2 anchors. 1b A knot (repetitive contig) is unwound following the best path (highlighted in purple) leading to the graph simplification.
evidence. The de Bruijin graph is subsequently simplified when the bridge is replaced by a single edge representing the best path (1b). Figure 2: npGraph user interface including Console (0) and GUI components (1)(2)(3)(4)(5)(6). The GUI consists of the Dashboard (1-5) and the Graph View (6). From the Dashboard there are 5 components as follow: 1 the assembly graph input field; 2 the long reads input field; 3 the aligner settings field; 4 control buttons (start/stop) to monitor the real-time scaffolding process; 5 the statistics plots for the assembly result.
We also provide a Graphical User Interface (GUI) for npGraph. The GUI includes the dashboard for controlling the settings of the program and a window for visualization of the assembly graph in real-time ( Figure 2). In this interface, the assembly graph loading stage is separated from the actual assembly process so that users can check for the graph 5 quality first before carry out any further tasks. A proper combination of command line and GUI can provide an useful streaming pipeline that copes well with MinION output data. The practice is to support the real-time monitoring of the results from real-time sequencing [1,5,6] that allow the analysis to take place abreast to a nanopore sequencing run.
Evaluation using synthetic data To evaluate the performance of the method, npGraph 1.1 was tested along with SPAdes, SPAdes hybrid from version 3.13.1, [7], npScarf (japsa 1.7-02), and Unicycler version 0.4.6 using Unicycler's synthetic data set [8] . The data set is a simulation of Illumina and MinION raw data, generated in silico based on available microbial references. We ran hybrid assembly methods using the entire nanopore data and the reciprocal results were evaluated by QUAST 5.0.2 [9]. 5 To align the long reads to the assembly graph components, both BWA-MEM [10] or minimap2 [11] were used in conjunction with npGraph. These two methods were chosen due to their proven efficiency and compatibility with streaming data. 10 While BWA-MEM is a well-known classic aligner that can be adapted to work with third generation sequencing data, minimap2 has been specially designed for this data type. We observed a slightly higher error rate (comprising the sum of mismatches and indels per 100kb) using BWA-MEM in comparison to minimap2 for all simulations in Table 1. This is due to the fact that bridging paths induced using BWA-MEM were slightly less accurate due to more noise from the smaller steps in-between ( Figure 1a). However, under almost circumstances, using either aligner resulted in final assemblies with 15 comparable qualities. In terms of running time and resources required, minimap2 always proved to be a better option, requiring markedly less CPU time than BWA-MEM. Utilising minimap2, npGraph is now the fastest hybrid assembler available.
Amongst all assemblers, Unicycler applies an algorithm based on semi-global (or glocal) alignments [12] with the consensus long reads generated with the SeqAn library. With all of the data sets tested, Unicyclerrequired the most computational resources, but it also returned fewer mis-assemblies than the other approaches with a comparable rate of error (indels and mismatches) to npGraph. hybridSPAdes reported decent results with high fidelity at base level. As 5 the trade-off, there were fewer connections satisfying its quality threshold, resulting in the fragmented assemblies with lower N50 compared to the other hybrid assemblers. This behaviour was clearly reflected in the last, also the most challenging task of assembly S. dysenteriae.
Of the two streaming algorithms, npScarf utilizes a fast but greedy scaffolding approach that can lead to misassemblies and errors. For bacterial genomes with modest complexity these are minimal (e.g. K. pneumoniae), but for those with severe repetitive elements, extra calibrations are needed to prevent the mis-assembly due to ambiguous alignments. On the other hand, npGraph significantly reduced the errors compared to npScarf, sometimes even proved to be the best option e.g. for M. tuberculosis and K. pneumoniae. For the yeast S. cerevisiae data set, the npGraph assembly best covered the reference genome but the number of mis-assemblies was up to 5. The unfavourable figures, namely mis-assemblies and error, were still high in case of S. dysenteriae, due to the complicated and extremely 15 fragmented graph components containing a large number of small-scaled contigs that were difficult to map with nanopore data. The progressive path finding module tried to induce the most likely solution from a stream of coarse-grained alignments, without fully succeeding.
Hybrid assembly for real data sets A number of sequencing data sets from in vitro bacterial samples [13] were used to further explore differences in 20 performance between npGraph and Unicycler. The data included both Illumina paired-end reads and MinION sequencing based-call data for each sample. Due to the unavailability of reference genomes, there were fewer statistics reported by QUAST for the comparison of the results. Instead, we investigated the number of circular sequences and PlasmidFinder 1.3 [14] mappings to obtain an evaluation on the accuracy and completeness of the assemblies ( Table 2) on three data sets of bacterial species Citrobacter freundii, Enterobacter cloacae and Klebsiella oxytoca. There was high similarity between final contigs generated by two assemblers on all of these datasets. For the Citrobacter freundii dataset, they share the same number of circular sequences, including the chromosomal and other six replicons contigs in the , with only 48 nucleotides difference in the length of the main chromosome. Five out of six identical replicons could be confirmed as plasmids based on the occurence of origin of replication sequences from the PlasmidFinder database. In detail, two megaplasmids (longer than 100Kbp) were classified as IncFIB while the 30 other two mid-size replicons, 85.6Kbp and 43.6Kbp, were incL and repA respectively, leaving the shortest one with 2Kbp of length as ColRNAI plasmid. The remaining circular sequence without any hits to the database was 3.2Kbp long suggesting that it could be phage or a cryptic plasmid. Both assemblers had 14.5Kbp of unfinished sequences split amongst 3 linear contigs from Unicycler and 2 for npGraph.

25
The assembly task for Enterobacter cloacae was more challenging and the chromosomal DNA remained fragmented in two contigs for both methods ( of length 3.324Mbp and 1.534Mbp for npGraph compared to 2.829Mbp and 1.978Mbp for Unicycler). Both methods detected two plasmids ( IncR and repA), and Unicycler returned comlpete circular sequences for both plasmids, while npGraph returned circular sequence for one plasmid, while the other was fragmented into two contigs. Similar to the assembly of Citrobacter freundii , there was around 14Kbp of data which was unable to 5 be finished by the assemblers (split into 2 and 4 contigs for Unicycler and npGraph respectively).
Finally, the assembly for Klebsiella oxytoca saw fragmented chromosome using Unicycler (with 5 contigs) which was a fully complete single contig for npGraph with 6.156Mbp of size. The two assemblers shared 3 common circular sequences of which two were confirmed plasmids. The first identical sequence represented a megaplasmid ( 113Kbp) with two copies of IncFII origin of replication DNA being identified. The other 76Kbp plasmid circularised by both was 10 IncL/M with of length. The third circular contig of length 111Kbp returned no hits to the plasmid database, suggesting the importance of de novo replicon assembly in combination with further interrogation. Unicycler detected another megaplasmid of size 108.4Kbp which was fractured by npGraph. A fragmented contig was also observed in npGraph for the final contig of length 11.6Kbp where it failed to combine two smaller sequences into one.
In addition to what is presented in Table 2, dot plots for the pair-wise alignments between the assembly contigs were 15 generated and can be found in Supplementary Figure 1. This identified a structural difference between npGraph and Unicycler assembly for the E. cloacae CAV1411 genome assembly. This was caused by the inconsistency of a fragment's direction on the final output contigs. Comparison to a reference genome from the same bacteria strain (GenBank ID: CP011581.1 [15]), demonstrated that contigs generated by npGraph produced consistent alignment, but not those generated by Unicycler (Supplementary Figure 2). However, we cannot at this stage rule out genuine structural 20 variation between the two samples .

Assembly performance on streaming data
In order to investigate the rate at which the two streaming hybrid assembly algorithms completed bacterial assemblies, we plot the N50 as a function of long-read coverage on the 4 datasets described in the previous section ( Figure 3) .This revealed that npGraph and npScarf both converge to the same ultimate completeness but at different rates. npScarf 25 converged more quickly than npGraph, due to the fact that it is able to build bridges with only 1 spanning long-read, whereas npGraph requires 3 reads. Unlike npScarf where the connections could be undone and rectified later if needed, a bridge in npGraph will remain unchanged once created. The plot for E. coli data clarifies this behaviour when a fluctuation can be observed in npScarf assembly at 3-folds data coverage. On the other hand, the N50 length of npGraph is always a monotonic increasing function. The sharp jumping patterns suggested that the linking information 30 from long-read data had been stored and exploited at certain time point decided by the algorithm. In addition, at the end of the streaming when the sequencing is finished, npGraph will try for the last time to connect bridges with less than 3 supporting reads which are otherwise not part of conflicting bridges.

Discussion
Streaming assembly methods have been proven to be useful in saving time and resources compared to conventional 35 batch algorithms with examples including e.g. Faucet [16] and npScarf [1]. The first method allows the assembly graph to be constructed incrementally as long as reads are retrieved and processed. This practice is helpful dealing with huge short-read data set because it can significantly reduce the local storage for the reads, as well as save time for a De Bruijn graph (DBG) construction while waiting for the data being retrieved. npScarf, on the other hand, is a hybrid assembler working on a pre-assembly set of short-read assembly contigs. It functions by scaffolding the contigs 40 using real-time nanopore sequencing. The completion of genome assembly in parallel with the sequencing run provides explicit benefits in term of resource control and turn-around time for analysis [1].
Hybrid approaches are still common practice in genome assembly and data analyses while Illumina sequencing retains cost and accuracy benefits over long-read sequencing. On the other hand, the third-generation sequencing methods such as Pacbio or Oxford Nanopore Technology are well-known for the ability to produce much longer reads that can further 45 complete the Illumina assembly. As a consequence, it is rational to combine two sources of data together in a hybrid method that can offer accurate and complete genomes at the same time. npScarf, following that philosophy, had been developed and deployed on real microbial genomes.
However, due to the greedy bridging approach of the contig-based streaming algorithm, npScarf's results can suffer from mis-assemblies [8,17]. A default setting was optimized for microbial genomes input but cannot fit for all data from 50 various experiments in practice. Also, the gap filling step has to rely on the lower accuracy nanopore reads thus the accuracy of the final assembly is also affected. To tackle the quality issue while maintaining the streaming feature of the approach, a bridging method by assembly graph traversing has been proposed in this manuscript. Our approach uses as its starting point a compact DBG assembly graph, followed by graph-traveseral, repeat resolution and identification of the longest possible un-branched paths that would represents contigs for the final assembly.

55
Hybrid assembler using nanopore data to resolve the graph has been implemented in hybridSPAdes [7] as well as Unicycler [8]. The available tools employ batch-mode algorithms on the whole long-read data set to generate the final genome assembly. The SPAdes hybrid assembly module, from its first step, exhaustively looks for the most likely paths (with minimum edit distance) on the graph for each of the long read given but only ones supported by at least two reads are retained. In the next step, these paths will be subjected to a decision-rule algorithm, namely exSPAnder [18], for repeat resolution by step-by-step expansion, before output the final assembly. On the other hand, Unicycler's hybrid assembler will initially generate a consensus long read for each of the bridge from the batch data. The higher quality consensus reads are used to align with the assembly graph to find the best paths bridging pairs of anchored contigs. 5 While this method employs the completeness of the data set from the very beginning for a consensus step, the former only iterates over the batch of possible paths and relies on a scoring system for the final decision of graph traversal. Hence, in theory it can be adapted to a real-time pipeline. The challenge in adapting graph-based approaches into streaming algorithm comes mainly from building a progressive implementation for path-finding and graph reducing module . To achieve this, we apply a modified DFS (depth-first search) mechanism and a dynamic voting algorithm into an on-the-fly graph resolver.
By testing with synthetic and real data, we have shown that npGraph can generate assemblies of comparative quality compared to other powerful batch-mode hybrid assemblers, such as hybridSPAdes or Unicycler, while also providing the ability to build and visualise the assembly in real-time.

15
Due to the limits of current sequencing technology, application of hybrid methods should remain a common practice in whole genome assembly for the near future. On the other hand, the ONT platforms are evolving quickly with significant improvement in terms of data accuracy and yield and cost. Beside, the real-time property of this technology has not been sufficiently exploited to match its potential benefits. npScarf had been introduced initially to address these issues, however, the accuracy of the assembly output was affected by its greedy alignment-based scaffolding approach. Here 20 we present npGraph, a streaming hybrid assembly method working on the DBG assembly graph that is able to finish short-read assembly in real-time while minimizing the errors and mis-assemblies drastically.
Compared to npScarf, npGraph algorithm employs more rigorous approach based on graph traversal. This might reduce the assembly errors because the bridging method is more accurate so that the reporting results are more reliable. The performance of npGraph is comparable to Unicycler while consuming much less computational resources so that it can work on streaming mode. Also, the integrated GUI allows users to visualize its animated output in a more efficient way. On the other hand, similar to Unicycler, npGraph relies on the initial assembly graph to generate the final assembly. 5 The algorithm operates on the assumption of a high quality assembly from a well-supplied source of short-read data for a decent assembly graph to begin with. It then consumes a just-enough amount of data from a streaming input of nanopore reads to resolve the graph. Finally, extra pre-processing and comprehensive binning on the initial graph could further improve the performance of the streaming assembler.

10
The work flow of npGraph mainly consists of 3 stages: (1) assembly graph pre-processing; (2) graph resolving and simplifying; (3) post-processing and reporting results. The first step is to load the assembly graph of Illumina contigs and analyze its components' property, including binning and multiplicity estimation. The second step works on the processed graph and the long read data that can be provided in real-time by ONT sequencer. Based on the paths induced from long reads, the assembly graph will be resolved on-the-fly. Finally, the graph is subjected to the last 15 attempt of resolving and cleaning, as well as output the final results. The whole process can be managed by using either command-line interface or GUI. Among three phases, only the first one must be performed prior to the MinION sequencing process in a streaming setup. The algorithm works on the assembly graph of Illumina contigs, so the terms contigs and nodes if not mentioned specifically, would be used interchangeable throughout this context.

Contigs binning 20
Contigs should belong to single or multiple groups, or bins, that would represent different assembly units, e.g. chromosome , plasmids, of different species if applying to a metagenomics dataset. A binning step is needed to assign membership of each contig to its corresponding group.
The first step is to cluster the big anchors (longer than 10Kbp and in/out degree less than 2) based on their kmer coverage. To achieve this, we applied DBSCAN clustering algorithm [19], which utilises a customised metric function to 25 map contigs into a one-dimensional space. In order to define a customised metric which is sample and fast to calculate, we assumed that a single long contig itself consists of a Poisson distribution of k-mers count with the mean approximated by the contig's coverage. The metric is then determined by a distance function of two Poisson distributions based on Kullback-Leibler divergence (or relative entropy) between the Poisson distribution representing each contig [20].
Formally, assuming there are 2 Poisson distributions P 1 and P 2 with probability mass functions (PMF) x! The Kullback-Leibler divergence from P 2 to P 1 is defined as: or in other words, it is the expectation of the logarithmic difference between the probabilities P 1 and P 2 , where the expectation is taken using P 1 . The log ratio of the PMFs is: Thus the divergence between P 1 and P 2 is: Thus, the metric we used is a distance function defined as: Independent from the contigs clustering in the pre-processing step, additional evidence of nodes' uniqueness can be 30 acquired using the long reads during the assembly process. Given enough data, the multiplicity of an ambiguous node can be determined based on the set of all bridges rooted from itself. On the other hand, external binning tools such as MetaBAT [21], maxbin [22] can be employed in npGraph as well.

Multiplicity estimation
Now bins of the main unique contigs had been identified, however, they only make up a certain proportion of the contigs set. From here, we need to assign bin membership and multiplicity for all other nodes of the graph, especially the repetitive ones. To do so, we relied on the graph's topology and the estimated read coverage of initial contigs from SPAdes. Given all contigs' coverage values as nodes' weight, we need to estimate those of edges and in return, 5 using them to re-estimate the coverage for repetitive nodes if necessary. After this process, we will have a graph with optimized weighted components that would suggest their multiplicities more exactly. Basically the computation is described as in following steps: 0. Initialize every node weight as its corresponding contig coverage, all edges' weight as zeros.
1. Calculate distributed weights for edges by quadratic unconstrained optimization of the least-square function: where l i and c i is the length and weight of a node i in the graph; 10 e + i and e − i indicates sum of weights for incoming and outgoing edges from node i respectively. They are expected to be as close to c i as possible thus the length-weighted least-square should be minimized.
The above function can be rewritten as: and then being minimized by using gradient method.
2. Re-estimate weights of repetitive nodes based on their neighboring edges' measures and repeat previous optimization step. The weights are calculated iteratively until no further significant updates are made or a threshold of loop 15 count is reached.
At this point, we can induce the copy numbers of nodes in the final assembly. For each node, this could be done by investigating its adjacent edges' multiplicity to estimate how many times it should be visited and from which bin(s). Multiplicities of insignificant nodes (of sequences with length less than 1, 000 bp) are less confident due to greater randomness in sequencing coverage. For that reason, in npGraph, we did not rely on them for graph transformation but 20 as supporting information for path finding.

Building bridges in real-time
Bridge is the data structure designed for tracking the possible connections between two anchored nodes (of unique contigs) in the assembly graph. A bridge must start from a unique contig, or anchor node, and end at another when completed. Located in-between are nodes known as steps and distances between them are called spans of the bridge.

25
Stepping nodes are normally repetitive contigs and indicative for a path finding operation later on. In a complicated assembly graph, the more details the bridge, a.k.a. more steps in-between, the faster and more accurate the linking path it would resolve. A bridge's function is complete when it successfully return the ultimate linking path between 2 anchors.
The real-time bridging method considers the dynamic aspect of multiplicity measures for each node, meaning that 30 a n-times repetitive node might become a unique node at certain time point when its (n − 1) occurrences have been already identified in other distinct unique paths. Furthermore, the streaming fashion of this method allows the bridge constructions (updating steps and spans) to be carried out progressively so that assembly decisions can be made immediately after having sufficient supporting data. A bridge in npGraph has several completion levels. When created, it must be rooted from an anchor node which represents a unique contig (level 1). A bridge is known as fully complete 35 (level 4) if and only if there is a unique path connecting its two anchor nodes from two ends. At early stages (level 1 or 2), a bridge is constructed progressively by alignments from long reads that spanning its corresponding anchor(s). In an example from Figure 1a, bridges from a certain anchor (highlighted in red) are created by extracting appropriate alignments from incoming long reads to the contigs. Each of the steps therefore is assigned a weighing score based on its alignment quality. Due to the error rate of long reads, there should be deviations in terms of 40 steps found and spans measured between these bridges, even though they represent the same connection. A continuous merging phase, as shown in the figure, takes advantage of a pairwise Needleman-Wunsch dynamic programming to generate a consensus list based on weight and position of each of every stepping nodes. The spans are calibrated accordingly by averaging out the distances. On the other hand, the score of the merged steps are accumulated over time as well. Whenever a consensus bridge is anchored by 2 unique contigs at both ends and hosting a list of steps with sufficient coverage, it is ready for a path finding in the next step.
Path finding algorithm Given a bridge with 2 anchors, a path finding algorithm is invoked to find all candidate paths between them. Each of these paths is given a score of alignment-based likelihood which are updated immediately as long as there is an appropriate long read being generated by the sequencer. As more nanopore data arrives, the divergence between candidates' score becomes greater and only the top-scored ones are kept for the next round. We implement a modified 5 stack-based version utilizing Dijkstra's shortest path finding algorithm [23] to reduce the search space when using Depth-First Search.
Due to false alignments from shorter contigs to the long reads, not all of the reported step nodes are necessary to be appeared in the ultimate path resolved by the bridge. In most cases, the accumulated score of each step indicates its likelihood to be the true component of the final solution. For that reason, a strategy similar to binary searching is 10 employed to find a path across 2 anchors of a bridge as shown in Algorithm 2.
Before that, we define Algorithm 1 to demonstrates the path finding algorithm for two nodes given their estimated distance. In which, function shortestTree( − −−− → vertex, distance) : (V, Z) → V n from line 3 of the algorithm's pseudo code builds a shortest tree rooted from − → v , following its direction until a distance of approximately d (with a tolerance regarding nanopore read error rate) is reached. This task is implemented based on Dijkstra algorithm. This tree is used 15 on line 4 and in function includedIn() on line 19 to filter out any node or edge with ending nodes that do not belong to the tree. Basically, the algorithm keeps track of a stack that contains sets of candidate edges to discover. During the traversal, a variable d is updated as an estimation for the distance to the target. A hit is reported if the target node is reached with a reasonable distance i.e. close to zero, within a given tolerance (line 21). A threshold for the traversing depth is 20 set (150) to ignore too complicated and time-consuming path searching.
Note that the length() functions for node and edge are totally different. While the former returns the length of the sequence represented by the node, i.e. contig from short-read assembly, the latter is usually negative because an edge models a link between two nodes, which is normally an overlap (except for composite edges). For example, in a k-mers SPAdes assembly graph, the value of an edge is −k + 1.

25
In many cases, due to dead-ends, there not always exist a path in the assembly graph connecting two anchors as suggested by the alignments. In this case, if enough long reads coverage (20X) are met, a consensus module is invoked and the resulting sequence is contained in a pseudo edge.  presents an example of the results before and after graph resolving process in the GUI. The result graph, after cleaning, would only report the significant connected components that represents the final contigs. Smaller fragments, even unfinished but with high remaining coverage, are also presented as potential candidates for further downstream 10 analysis. Further annotation utility can be implemented in the future better monitoring the features of interests as in npScarf.

Result extraction and output
npGraph reports assembly result in real-time by decomposing the assembly graph into a set of longest straight paths (LSP), each of the LSP will spell a contig in the assembly report. The final assembly output contains files in both 15 FASTA and GFAv1 format (https://github.com/GFA-spec/GFA-spec). While the former only retains the actual genome sequences from the final decomposed graph, the latter output file can store almost every properties of the ultimate graph such as nodes, links and potential paths between them.
A path p = {v 0 , e 1 , v 1 , . . . , v k−1 , e k , v k } of size k is considered as straight if and only if each of every edges along the path e i , ∀i = 1, . . . , k is the only option to traverse from either v i−1 or v i , giving the transition rule. To decompose the graph, the tool simply mask out all incoming/outgoing edges rooted from any node with in/out degree greater than 1 as demonstrated in Figure 5. These edges are defined as branching edges which stop straight paths from further extending. Figure 5: Example of graph decomposition into longest straight paths. Branching edges are masked out (shaded) leaving only straight paths (bold colored) to report. There would be 3 contigs extracted by traversing along the straight paths here.
The decomposed graph is only used to report the contigs that can be extracted from an assembly graph at certain time point. For that reason, the branching edges are only masked but not removed from the original graph as they 5 would be used for further bridging.
Other than that, if GUI mode is enabled, basic assembly statistics such as N50, N75, maximal contigs length, number of contigs can be visually reported to the users in real-time beside the Dashboard. The progressive simplification of the assembly graph can also be observed at the same time in the Graph view.