Molecular underpinnings of ssDNA specificity by Rep HUH endonucleases and implications for HUH-tag multiplexing and engineering

Replication initiator proteins (Reps) from the HUH-endonuclease superfamily process specific single-stranded DNA (ssDNA) sequences to initiate rolling circle/hairpin replication in viruses, such as crop ravaging geminiviruses and human disease causing parvoviruses. In biotechnology contexts, Reps are the basis for HUH-tag bioconjugation and a critical adeno-associated virus genome integration tool. We solved the first co-crystal structures of Reps complexed to ssDNA, revealing a key motif for conferring sequence specificity and anchoring a bent DNA architecture. In combination, we developed a deep sequencing cleavage assay termed HUH-seq to interrogate subtleties in Rep specificity, and demonstrate how differences can be exploited for multiplexed HUH-tagging. Together, our insights allowed us to engineer a Rep chimera to predictably alter sequence specificity. These results have important implications for modulating viral infections, developing Rep-based genomic integration tools, and enabling massively parallel HUH-tag barcoding and bioconjugation applications.


Introduction
HUH-endonucleases are diverse enzymes utilizing common ssDNA processing mechanisms that break and join DNA to facilitate fundamental biological processes such as rolling circle replication, bacterial conjugation, DNA transposition, and DNA integration into host genomes (1)(2)(3) . At the heart of DNA processing of all HUH-endonucleases is a structurally defined catalytic nickase domain that first recognizes a specific sequence/structure of DNA; nicks ssDNA to yield a sequestered 5' end that remains covalently bound to the HUH endonuclease and a free 3'OH that can be used as a primer for DNA replication; and finally, it facilitates a strand transfer reaction to resolve the covalent intermediate (1) .
The covalent phosphotyrosine intermediate has recently been exploited for biotechnology applications. "HUH-tag" fusion proteins are emerging as a versatile bioconjugation platform to covalently link proteins to DNA, combining the diverse functionality of proteins with the programmability of DNA (4). HUH-tag applications have permeated into technologies such as DNA origami scaffolded protein assembly (5)(6)(7)(8), receptor-specific cell targeting by adeno-associated virus (9), aptamer-based sandwich detection (10), directed nanoparticle drug-delivery via DNA aptamers (11), and CRISPR-Cas9 genome engineering (12,13), mainly due to their ability to form robust covalent adducts under physiologic conditions. Unlike the SNAP-tag (14), CLIP-tag (15), and Halo-tag (16) systems, which rely on expensive nucleic acid modifications, HUH-tags rely on an inherent single-stranded DNA (ssDNA) binding moiety that promotes the catalysis of a transesterification reaction resulting in a stable phosphotyrosine adduct (1).
Understanding the molecular basis of DNA recognition by HUH-endonucleases could provide much needed solutions for bacterial antibiotic resistance resulting from HUH-endonuclease mediated horizontal gene transfer (17) , as well as prevention or treatment of HUH-endonuclease mediated viral infections, such as geminivirus infections of plants that ravage the agricultural crop industry (18,19) and parvovirus B19 infections of humans (20) that are associated with a range of autoimmune diseases (21,22) . Moreover, the ability to rationally engineer HUH-endonucleases to recognize a desired DNA sequence has huge potential in genome engineering (23) and DNA delivery applications as well as in expanding the multiplexibility of HUH-tagging to meet the demand of the recent explosion of DNA-barcoding applications (24)(25)(26)(27) .
However, while several structures of relaxase HUH-endonucleases involved in bacterial conjugation in complex with their cognate DNA target sequences have been reported (17)(28)(29)(30), there are no structures of the other major class of HUH-endonucleases, replication initiator proteins (Reps), involved in rolling circle and hairpin replication, in complex with ssDNA. Despite structurally superimposable active sites and a common overall core structure (31), several structural elements of relaxases that do not exist in Reps (extensions of the C-terminus or internal loops with respect to Reps) are involved in extensive contacts with the target DNA, underscoring potential differences in DNA recognition mechanisms between Reps and relaxases (32).
In this study, we determined the structural basis for ssDNA recognition by viral Rep HUH-endonucleases by solving two Rep-ssDNA co-crystal structures and highlight a ssDNA "bridging" motif largely responsible for recognition. To further interrogate the ssDNA specificity of Reps, we developed HUH-seq, a high-throughput, next generation sequencing (NGS)-based DNA cleavage assay used to define ssDNA recognition profiles of a panel of ten Reps using a ssDNA library containing 16,384 different target sequences. Despite the high similarity of cognate nonanucleotide ori sequences and the promiscuous nature of Rep ssDNA recognition, HUH-seq analysis surprisingly revealed many examples of orthogonal adduct formation between Reps from different viral families with little to no cross-reactivity. Finally, we rationally engineered a chimeric Rep by swapping a piece of the ssDNA "bridging" motif of one Rep into the backbone of a related Rep, predictably modulating ssDNA sequence specificity.

REP HUH-ENDONUCLEASE CO-CRYSTAL STRUCTURES
To uncover the ssDNA recognition mechanism of Reps and identify potential motifs that might confer sequence specificity, we solved the first high resolution crystal structures of two distinct Rep HUH-endonuclease domains bound to ssDNA encoding minimal ori sequences. The pre-cleavage state was captured by mutating the catalytic tyrosine to phenylalanine. We present two 10-mer bound structures of inactive PCV2 Y96F (1.93 Å resolution) and WDV Y106F (1.80 Å resolution), and one 8-mer bound WDV Y106F structure (2.61 Å resolution) (Table 1). All three structures are in complex with the divalent cofactor manganese, with the catalytic tyrosine (a Phe mutant in the structures) positioned for nucleophilic attack of the scissile phosphate (Fig. 1a and 1b, Supp. Note 1), and share a highly similar ssDNA binding interface, albeit with several distinct features (Fig. 1). Strikingly, despite the absence of hairpin-stem bases in the short DNA target oligos in the co-crystal structures, the ssDNA is bent into a "U-shaped" architecture like one might expect in the context of the hairpin loop. The U-shaped DNA sits in a shallow channel on the surface of one face of the Rep protein, with a distinct topological "nose" that juts out in the center of the U. The bent conformation of the ssDNA in the Rep structures is driven by interactions with the topological "nose" of the protein, by Watson-Crick base pairing between T -4 and A +1 (Fig. 1a and 1b), and by an adjacent hydrogen bond between N3 of T -1 and N3 of A -3 . Moreover, energetically favorable base stacking occurs between the 5 nucleotides at positions -6 through -2. These intramolecular, conformation stabilizing interactions, along with protein-nucleotide interactions, promote proper orientation of the 5' phosphate of the position +1 nucleotide for catalysis.
a, Surface representation of PCV2 Y96F colored in tan and b, WDV Y106F colored in gray, bound to manganese (magenta sphere) and DNA 10-mers (sticks colored orange by element). PCV2 Y96F is bound to the 10-mer 5'-dTAGTATTACC-3' and WDV Y106F is bound to the 10-mer 5'-dTAATATTACC-3', both adopting a U-shaped conformation. Nucleotides are labeled with single letter abbreviations and positions, indicated as subscripts, relative to the scissile phosphate in yellow. A dashed gray curve indicates the base stacking chain that occurs between positions -6 through -2. Intramolecular Watson-Crick (WC) base pairing between A +1 and T -4 is indicated by red dashed lines, and a non-canonical hydrogen bond between T -1 and A -3 is indicated by a black dashed line. Active site side chains are shown as sticks colored by element, PCV2 Y96F in cyan and WDV Y106F in green. The PCV2 Y96F active site coordinates the manganese in an octahedral geometry using Glu48, His57, and Gln59, with a water and two oxygens of the scissile phosphate completing the coordination (black dashed lines). The WDV Y106F active site coordinates the manganese in an octahedral geometry using Glu110, His59, and His61, with two oxygens of the scissile phosphate completing the coordination (black dashed lines). The active site is displayed within the 2Fo-Fc map mesh at σ = 2.
To analyze the contacts between protein and DNA facilitating sequence-specific ssDNA recognition, we generated protein-nucleotide interaction maps utilizing the DNAproDB platform (33,34), which reports contacts within 4 Å between protein and ssDNA (Supp. Fig. 2). The relative positions of residues directly involved in forming the ssDNA docking interface, the catalytic tyrosine, and the divalent metal coordinating residues of the 10-mer bound Rep structures are depicted as a cartoon (Fig. 3a and 3b) and mapped onto a structure-based alignment of several Reps colored by percent residue identity (Fig. 3c). The structural positions of residues involved in protein-DNA contacts in the PCV2 and WDV structures are nearly conserved and are bolded in the alignment, while the residue identities are more divergent. A majority of the ssDNA docking interface is created by a stretch of 9-10 consecutive residues that corresponds to the topological "nose" sticking up in the middle of the U, comprising a turn-4-turn structural motif that resides within a previously defined region termed the geminivirus recognition sequence (GRS) (Fig. 3c) (35). A second prominent cluster of protein-DNA contacts resides within Motif I, which is implicated in DNA binding (36,37).

DEFINING THE SINGLE-STRANDED DNA BRIDGING MOTIF (sDBM)
The consecutive stretch of 9-10 residues in the turn-4-turn structural motif ('ARCHIEKAKG' for PCV2 and 'HPNIQAAKD' for WDV) has two critical functions in the structure. First, it acts as a "bridge" between the 5' and 3' ends of the nonanucleotide sequence, contacting DNA positions -6, -5, +1, and +2 (Fig. 3d and 3e). In combination with the intramolecular base pairing and hydrogen bonding of the ssDNA, this sequence of residues likely contributes to bending and stabilizing the ssDNA into the U-shaped conformation. Second, it makes base-specific contacts: in the WDV Y106F + 10-mer structure, residues His91 and Asn93 in this "bridging" motif specifically contact the base of A -5 (Fig. 3d and 3e), whereas Arg79 and His81 in the PCV2 structure specifically contact the base of G -5 . We hypothesize that these specific contacts play a major role in the difference in specificity at the -5 position. While this sequence motif is fairly conserved within the geminivirus family, where it is known as the geminivirus recognition sequence (GRS) (35), it has thus far not been defined as a motif across the entire Rep family because of sequence diversity between classes (38); we predict that this diversity plays a major role in differential sequence specificity. We therefore term this turn-4-turn structural motif the 'single-stranded DNA Bridging Motif' (sDBM), and suggest it is the main binding moiety responsible for recognition and conformation priming of ssDNA by Rep HUH-endonucleases. Unsurprisingly, the sDBM is involved even in ssDNA binding of relaxases (Supp. Fig. 3a and 3b).
Structures depicted as 2D cartoons with relative positions of residues (green) involved in binding "U-shaped" ssDNA within 4 Å. The catalytic tyrosine 106 is indicated in red with the adjacent phosphate in yellow, the ion coordinating triad is indicated in blue, and the 2+ ion in purple. The single Watson-Crick (WC) base pair is indicated as a dashed red line and the ssDNA intramolecular hydrogen bond is indicated as a black dashed line.
c, Structural alignment of Reps using PROMALS3D including available PCV2 (PDB: 6WDZ), WDV (PDB: 6WE0), TYLCV (1L2M), and FBNYV (6H8O) structures (39)(40)(41) as templates, with conserved residues highlighted: high or absolute conservation (≥90%) in red; moderate conservation (≥70%) in orange; low conservation (≥50%) in yellow; and no conservation (<50%) in white. Amino- and carboxy-terminal ends are trimmed to reflect only structured domains in crystal structures. Bolded residues indicate contacts within 4 Å of DNA 10-mers complexed with PCV2 Y96F (PDB: 6WDZ) and WDV Y106F (PDB: 6WE0). Conserved Rep Motifs I/II/III are shown, as well as the GRS motif within the dashed box for geminivirus Reps. The sDBM we have defined in this study is labelled and highlighted in green. The conserved secondary structural elements (β1-5 and α1-2) are shown approximately as 2D cartoons below the alignment sequences, with the HUH/Q motif and catalytic tyrosine labeled. d, Surface representation of WDV with the sDBM highlighted in green, bound to the 10-mer shown as sticks. e, Major polar interactions between WDV sDBM residues (green sticks) and bases of the 10-mer (orange sticks) are shown as yellow dashes.

REP VERSUS RELAXASE ssDNA RECOGNITION MECHANISM
Reps generally initiate replication of a large number of viruses and plasmids to copy their circular genomes, while relaxases catalyze the transfer of one DNA strand of the plasmid genome to the recipient cell during plasmid conjugation (1); thus, relaxases are thought to recognize DNA with more specificity than Reps. Our structures provide molecular level insights into the different modes of ssDNA recognition used by Reps and relaxases. The two available relaxase structures most comparable to the Rep co-crystal structures are TraI (PDB ID: 2A0I) and TrwC (PDB ID: 2CDM), which are both complexed with ssDNA and have at least one nucleotide bound on the 3' side of the nic site (Fig. 3c and 3d). Structurally, Reps and relaxases share a similar central 5-stranded antiparallel beta-sheet displaying the HUH motif, though the relaxases are circularly permuted with respect to the Reps such that the catalytic tyrosine is near the C-terminus of Reps and the N-terminus of relaxases (31). Relaxases have similar active sites and U-shaped ssDNA architectures to Reps (28,30). However, there are striking differences in how the two families of proteins recognize DNA. Aside from the most obvious difference, that relaxase proteins are substantially larger than Rep proteins and provide a more extended DNA binding interface that includes binding a hairpin structure 5' to the nic site, the most conspicuous difference is that the two relaxase structures contain a protein alpha-helical "clasp" that covers the bound DNA (Fig. 3). The clasp forms extensive contacts with the DNA, suggesting that it helps anchor the DNA to the protein. This is underscored by the fact that in the crystal structure of NES, the relaxase from Staphylococcus aureus, which does not contain a "clasp", the 3' end of the DNA has very few contacts with the protein (17).
Moreover, the DNA in relaxase proteins is embedded in a much deeper channel than in Rep proteins. Indeed, calculations of buried solvent accessible surface area (BASA) between protein and DNA reveal a more substantial surface area buried in the binding of DNA to relaxases, even when accounting for the surface area buried by the clasps (Fig 3). Both Rep and relaxase structures have distinct structurally conserved pockets in the ssDNA docking interface in which individual nucleotide bases are bound. In all structures, the sDBM is a major contributor to the formation of these pockets, which is part of 1 in relaxases and 4 in Reps. TraI and TrwC bury nucleotides -5 and -3 in strikingly deep pockets, #1 and #2, respectively (Fig. 3). Reps have pockets in this structural region, yet these are much shallower and only minimally bury the nucleotides at the -4 and -6 positions. A -6 is bound in the deepest of these Rep pockets, yet is still oriented in a configuration that favors base stacking with neighboring nucleotides rather than a 'knob-in-pocket' interaction as seen in both the TrwC and TraI structures. Conversely, both Reps have a deep pocket #3 where the +2 cytosine base is buried; however, only the TraI structure includes the +2 base, which interestingly is not bound in this pocket (Fig. 3).
are either solid or transparent and outlined in black. Total buried solvent accessible surface area (BASA; Å 2 ) for ssDNA bound to the docking interface was calculated for each structure, including values with or without the contribution from relaxase clasps.

HUH-SEQ REVEALS SUBTLE DIFFERENCES IN REP ssDNA RECOGNITION SPECIFICITY
Structural analysis of the Rep protein-DNA contact maps points to subtle differences that contribute to recognition of nearly identical nonanucleotide sequences, suggesting that Reps may differentially tolerate substitutions in the target DNA sequence. Thus, we developed an NGS-based cleavage assay, HUH-seq, both to examine ssDNA specificity and to explore the use of Reps in multiplexed HUH-tag applications. As a first step in assessing the ssDNA recognition specificity of Reps, we asked whether viral Rep proteins from different families and genera differentially tolerate mutations in the target nonanucleotide sequence by measuring covalent adduct formation in a standard in vitro SDS-PAGE cleavage assay (Supp. Fig. 4). However, it became immediately evident that a low-throughput assay would insufficiently characterize specificity because variable target sequences are widely tolerated. In many cases, a large number of truncations and substitutions within the nonameric sequence had negligible effects on adduct formation (a full analysis of the small oligo library screen is provided in Supp. Fig. 4b and 4c). This realization prompted us to devise a high-throughput method that would reveal unbiased ssDNA recognition profiles for each Rep.

Table 2: Panel of 10 recombinantly expressed and purified Reps
Metrics list of Rep panel used in this study describing abbreviated HUH-tags names, viral species, family, and genus, MW and isoelectric point (p I ) of the recombinant Rep domain (negating contribution from His6x-SUMO-tag), and the cognate nonanucleotide sequence from each respective viral origin of replication ( ori ).
We developed an NGS-based approach, HUH-seq, to establish comprehensive ssDNA recognition profiles of the Reps against a randomized ssDNA library containing 16,384 sequences, or k-mers. In brief, the first seven positions of the nonanucleotide target sequence are randomized in the 7N ssDNA library, while positions A +1 and C +2 are held constant ("7N": N -7 N -6 N -5 N -4 N -3 N -2 N -1 *A +1 C +2 ). The library was constrained to only seven positions to limit the size of the library; further design considerations are discussed in the supplementary information (Supp. Fig 5, Supp. Note 2, and Supp. Equations 1 and 2). In a single HiSeq run, target sequences cleaved by the Reps were detected with high sensitivity and confidence. Reps were individually reacted with the 7N ssDNA library under standard conditions, producing two populations of the library: "sequence cleaved" and "uncleaved". A primer set containing Nextera adapters was used to generate the antisense strand and amplify the "uncleaved" population in a single PCR step, while the "sequence cleaved" population remained unamplified. "Uncleaved" amplicons were barcoded with standard dual-indices and sequenced to obtain read counts for every sequence (k-mer). Read counts from reference replicates (no Rep added to the reaction) were used to calculate log 2 -fold-change (FC) and read count percent reduction, based on the difference between the normalized reference library read counts and the normalized "uncleaved" read counts for each Rep treatment (Fig. 4).
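The two per-k-mer metrics described above can be illustrated with a minimal Python sketch. It assumes that normalization means dividing by the total read count of each sample, that log 2 FC is the log ratio of treated to reference fractions, and that percent reduction is expressed as a fraction of the reference; the function names and the toy read counts are hypothetical, and the handling of replicates and zero counts in the actual analysis script may differ.

```python
import math

def normalize(counts):
    """Scale raw read counts to fractions of the sample's total reads."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def cleavage_metrics(reference_counts, treated_counts):
    """Return {kmer: (log2 fold change, percent reduction)}.

    Assumption: both metrics compare normalized "uncleaved" reads from a
    Rep-treated sample against the no-enzyme reference library.
    """
    ref = normalize(reference_counts)
    trt = normalize(treated_counts)
    metrics = {}
    for kmer, r in ref.items():
        if r == 0:
            continue  # k-mer not observed in the reference; skip it
        t = trt.get(kmer, 0.0)
        # A fully depleted k-mer has no finite log ratio; report -inf.
        log2_fc = math.log2(t / r) if t > 0 else float("-inf")
        percent_reduction = (r - t) / r
        metrics[kmer] = (log2_fc, percent_reduction)
    return metrics

# Toy read counts (made up for illustration only).
reference = {"AAGTATT": 1000, "TAATATT": 1000, "CCCCCCC": 2000}
treated = {"AAGTATT": 100, "TAATATT": 900, "CCCCCCC": 3000}
metrics = cleavage_metrics(reference, treated)
```

In this toy example the strongly depleted k-mer receives a large negative log 2 FC and a percent reduction near 1, while a k-mer that is not cleaved can show an apparent enrichment simply because normalization is relative to total reads.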

Fig. 4: HUH-seq cleavage assay schematic for determining Rep sequence specificity
Schematic describing HUH-seq: an NGS-based approach for quantifying ssDNA specificity profiles of Reps. A synthetic ssDNA library containing 7 random bases (4 bases at each of 7 positions, 4^7 = 16,384 unique k-mers) (yellow) flanked by constant regions (gray) and primer binding sites (PBS) (dark gray) is reacted with a panel of Reps, or no enzyme as a reference, in replicate, generating a two-part pool containing the "uncleaved" library and the "sequence cleaved" library for each reaction. In a single PCR step, the antisense strand for the "uncleaved" pool is generated and amplified, and Nextera adapters (purple) are added with primer overhangs; the "sequence cleaved" library is not amplified due to physical separation of the PBSs. Each set of amplicons is then barcoded with standard i7/i5 Illumina indexing sequences (green) and pooled for a single next generation sequencing run. A custom R-based analysis script generates read counts for all k-mers in each set of replicates, then normalizes based on total read count, and quantifies the extent of k-mer cleavage by each Rep in the panel based on fold change and percent reduction.
We generated weighted sequence logos based on a k-mer reduction analysis with a threshold value of 0.3 or greater to reduce noise (Fig. 5). Percent reduction for each k-mer was calculated by comparing the normalized k-mer read counts for each Rep treatment in triplicate to k-mer read counts from the reference library. For each position in a Rep sequence logo, individual characters were scaled by the average percent reduction of all k-mers containing that character at that position. Because every sequence permutation 5' of the nic site is present in the 7N ssDNA library, sequence logos reveal Rep preferences for nucleotides relative to one another. The most obvious result is that the most preferred nucleotides in the first seven positions of the sequence logos are nearly identical to the cognate nonanucleotide ori sequence found in each respective viral genome (Fig. 5). Though it is not surprising that the preferred target sequence is the cognate nonanucleotide ori sequence cleaved in vivo, this gives high confidence that HUH-seq can be used to quantitatively rank the k-mers cleaved by each Rep, analyze patterns that dictate these ssDNA recognition profiles, and further characterize differences between individual Reps.
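The character scaling described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the analysis script itself: it assumes that percent reduction values below the 0.3 threshold are set to zero before averaging, and that each letter height is the mean over all k-mers carrying that base at that position. The function name and the toy 2-mer values are hypothetical.

```python
from collections import defaultdict

def logo_heights(percent_reduction, threshold=0.3):
    """Average thresholded percent reduction per (position, base).

    percent_reduction: {kmer: fraction reduced relative to reference}
    Returns {(position_index, base): height}, with position_index 0 being
    the 5'-most randomized position (position -7 for a 7-mer library).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for kmer, value in percent_reduction.items():
        # Assumption: values under the threshold are treated as noise.
        value = value if value >= threshold else 0.0
        for pos, base in enumerate(kmer):
            sums[(pos, base)] += value
            counts[(pos, base)] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy example with 2-mers (made-up values).
toy = {"AA": 0.8, "AC": 0.2, "CA": 0.4, "CC": 0.0}
heights = logo_heights(toy)
```

Because every base occurs equally often at every position in a fully randomized library, these averages are directly comparable across bases, which is why the logos can be read as relative nucleotide preferences.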
Within each sequence logo, there are differentially preferred nucleotide positions. Positions T -4 and T -1 are almost unanimously the most preferred, while conservation of the A -3 and T -2 positions has a relatively low impact on cleavage. There are also discernible trends between Reps from different families; for example, geminivirus Reps have a strong preference for adenine at the -5 position, whereas the others prefer thymine or guanine (Fig. 5). The y-axis scale of the weighted sequence logos also indicates the relative overall cleavage efficiency between Reps. For example, PCV2 has a maximum average percent reduction of about 0.35 and cleaves about 10-fold more sequences than FBNYV, which has a maximum value of about 0.035. This indicates that PCV2 ssDNA recognition is more promiscuous than that of FBNYV. CpCDV has the highest maximum average percent reduction of 0.8 and has minimal nucleotide preference, indicating it has the most relaxed sequence specificity (Fig. 5). Other considerations and caveats of HUH-seq analysis are discussed in supplemental information (Supp. Note 4 and Supp. Fig. 6).

Fig. 5 : Weighted sequence logos generated from HUH-seq cleavage data
Weighted sequence logos for nine of the ten HUH-tags, generated using ggseqlogo from percent reduction values obtained in the HUH-seq NGS cleavage assay, with values under 0.3 set to 0.0 to remove noise. Heights are scaled to represent the average percent reduction of each base at each position when compared to the reference library. Sequences in black below each logo are the cognate nonanucleotide ori sequences from each respective virus. Logos are organized by viral families as labeled inside the gray boxes.

REP ssDNA RECOGNITION PROFILES CORROBORATE STRUCTURAL OBSERVATIONS
Next, we quantified and assigned contributions of the ssDNA docking interface in the Rep structures to each nucleotide using DNAproDB by calculating the BASA as well as the total protein-DNA contacts (the sum of hydrogen bonds and van der Waals interactions within 4 Å). Figure 6a and 6b summarize the total BASA for and the total number of contacts with nucleotides corresponding to the cognate nonanucleotide ori sequence, either with the entire nucleotide or with the base only. These measurements, in combination with the ssDNA recognition profiles of WDV and PCV2, were used to search for structural reasons why nucleotides in certain positions of the target sequence are conserved. A comprehensive table containing BASA and contact values of each of the three structures featured in this study is also provided (Supp. Table 1). As expected, higher BASA values generally correlated with higher numbers of contacts.
The PCV2 bound 10-mer and WDV bound 10-mer structures have a similar number of total residues contacting DNA, 28 and 26 residues, respectively, and have a high concentration of base contacts and total contacts near the 5' and 3' termini of the 10-mers (Fig. 6a and 6b). In Figure 6c-6h, significant structural differences are highlighted between the contacts of nucleotides at different positions for both PCV2 (c, e, and g) and WDV (d, f, and h). A -3 and T -2 are the least conserved nucleotides at their indicated positions and have zero contacts with the bases for both PCV2 and WDV. Specific nucleotides are likely less preferred at these positions because the interactions are exclusively with the ribose and phosphate, made by Asn22, Asn23, and Thr55 for PCV2 (Fig. 6e) and by Pro21, Gln22, and Ser57 for WDV (Fig. 6f), rather than with the base of the nucleotide. The two 10-mers differ at position -5, which is guanine in the PCV2 complex and adenine in the WDV complex. Polar contacts of His91 and Asn93 of WDV with A -5 may give WDV more specificity at position -5, whereas the single polar contact with G -5 by His82 may result in less stringent specificity for PCV2 (Fig. 6c and 6d). Finally, C +2 in both structures dwells in a pocket of the protein surface, as is evident from the highest BASA and total contact values (Fig. 6g and 6h). Eight residues have contacts with C +2 in both structures, and 5 of these residues make up the last positions of the sDBM.
In contrast, T -4 is highly conserved as evident in all Rep ssDNA recognition profiles, but we observed only a marginal number of contacts with the base itself (Fig 6a and 6b). We hypothesize that the WC base pairing of T -4 with A +1 is a major contributor to the U-shaped conformation rather than contributing to sequence specificity via residue interactions with the base. Though Reps have specific interactions with bases that contribute to specificity, it is clear from the ssDNA recognition profiles that Rep cleavage is also promiscuous, cleaving a wide range of target sequences.

Fig. 6: Comparison of Rep protein-DNA interactions and HUH-seq specificity profiles
a, PCV2 + 10-mer (6WDZ) and b, WDV + 10-mer (6WE0) BASA values and total number of protein-DNA contacts compared to weighted sequence logos from HUH-seq analysis. Both polar and van der Waals interactions are counted within 4 Å. c-h, Atomic interactions between highlighted nucleotides within the bound 10-mers of PCV2 and WDV structures are shown with yellow dashes for polar contacts and gray dashes for van der Waals interactions within 4 Å. PCV2 cartoon is depicted in tan and residues interacting with DNA as sticks shown in cyan colored by atom. WDV cartoon is depicted in gray and residues interacting with DNA as sticks shown in green colored by atom. The PCV2 Phe96 and WDV Phe106 represent the catalytic tyrosine as sticks in red, divalent ion coordinating residues are in blue, and the manganese ion is shown as a magenta sphere. The 10-mer is shown as sticks in orange and colored by atom as highlighted in the panel.

DISCOVERING INNATELY ORTHOGONAL REP TARGET SEQUENCES USING HUH-SEQ
During initial assessments of the HUH-seq analysis results, we noticed individual target k-mers with drastically different log 2 FC values between different Rep protein treatments. This prompted us to ask whether we could identify pairs of k-mers that would allow us to selectively label two Reps in a single reaction mixture with unique oligos. For instance, the k-mer AGTCAAT (#2884) has a log 2 FC value of -3.44 for PCV2 and a near zero log 2 FC for every other Rep (Supp. Fig. 7). This result was validated using the standard in vitro cleavage assay by reacting PCV2 Rep with a synthetic oligo containing this k-mer sequence. Indeed, only PCV2 formed a covalent adduct with the oligo harboring this target sequence (Supp. Fig. 7). Interestingly, this target sequence contains 4 substitutions with respect to the circovirus ori sequence at positions -6, -5, -4, and -2, again highlighting the promiscuous nature of Reps. This indicated that searching for combinations of Reps and k-mers may result in the discovery of naturally occurring orthogonality despite cross-reactivity between cognate nonanucleotide ori sequences.
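The k-mer numbers quoted in the text (for example, AGTCAAT as #2884) are consistent with a 1-based index into the alphabetically ordered list of all 7-mers over A, C, G, T. The following Python sketch illustrates that numbering and the substitution count above; the scheme is inferred from the quoted examples rather than stated in the text, and the function names are hypothetical. The circovirus reference 7-mer is obtained by decoding #720, the number given for the PCV2 cognate sequence.

```python
BASES = "ACGT"  # assumed alphabetical ordering of the four bases

def kmer_to_index(kmer):
    """1-based rank of a k-mer in lexicographic order (base-4 digits)."""
    index = 0
    for base in kmer:
        index = index * 4 + BASES.index(base)
    return index + 1

def index_to_kmer(index, k=7):
    """Inverse of kmer_to_index for k-mers of length k."""
    value = index - 1
    chars = []
    for _ in range(k):
        value, digit = divmod(value, 4)
        chars.append(BASES[digit])
    return "".join(reversed(chars))

def substituted_positions(kmer, reference):
    """Positions (numbered -k .. -1, 5' of the nic site) that differ."""
    k = len(reference)
    return [i - k for i in range(k) if kmer[i] != reference[i]]

pcv2_cognate = index_to_kmer(720)   # first seven bases of the PCV2 target
changes = substituted_positions("AGTCAAT", pcv2_cognate)
```

Under this numbering, the 16,384 library members map to the integers 1 through 16,384, which makes it straightforward to cross-reference a k-mer number with its sequence.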
To explore the possibility of naturally occurring orthogonality between two wild-type Reps, we wrote a script to extract pairs of k-mer sequences and Reps predicted to not have cross-reactivity based on log 2 FC values. Figure 7a displays a summary heatmap of the number of such k-mer pairs existing for every set of Rep pairs, based on threshold values of -0.3 log 2 FC and greater (likely forming no adduct) and -3.0 log 2 FC and lower (likely having high adduct formation). In one example, we identified the k-mer sequence CATTTCT (#5112), for which DCV had a -4.13 log 2 FC and WDV had a -0.33 log 2 FC, and another k-mer sequence, TAAATCT (#12344), for which DCV had a -0.20 log 2 FC and WDV had a -4.11 log 2 FC, indicating orthogonality between DCV and WDV for these two k-mers. We validated this observation with a standard in vitro cleavage assay including a short time course with 1, 5, and 10 minute time points. DCV formed about 97% adduct with a synthetic oligo harboring k-mer #5112 over the course of 5 min, and WDV formed about 62% adduct with a synthetic oligo harboring k-mer #12344 over the course of 10 min. As expected, no cross-reaction was observed between WDV and k-mer #5112 or between DCV and k-mer #12344 (Fig. 7a).
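The pair search described above can be sketched as follows. This is a hedged illustration in Python, not the script used in the study: the data layout (a nested dictionary of log 2 FC values per Rep), the function names, the placeholder Rep names, and the toy values are all assumptions, while the two cutoffs are those stated in the text.

```python
from itertools import product

CLEAVED = -3.0     # log2 FC at or below this: likely high adduct formation
UNCLEAVED = -0.3   # log2 FC at or above this: likely no adduct formation

def selective_kmers(log2fc, active, inactive):
    """k-mers cleaved by `active` but left uncleaved by `inactive`."""
    return sorted(
        kmer
        for kmer, value in log2fc[active].items()
        if value <= CLEAVED
        # A k-mer missing for the other Rep is never counted as uncleaved.
        and log2fc[inactive].get(kmer, float("-inf")) >= UNCLEAVED
    )

def orthogonal_pairs(log2fc, rep_a, rep_b):
    """All (kmer for rep_a, kmer for rep_b) pairs predicted to be orthogonal."""
    a_only = selective_kmers(log2fc, rep_a, rep_b)
    b_only = selective_kmers(log2fc, rep_b, rep_a)
    return list(product(a_only, b_only))

# Toy log2 FC values for two hypothetical Reps (not measured data).
toy = {
    "RepA": {"AAAAAAA": -4.0, "CCCCCCC": -0.1, "GGGGGGG": -3.5},
    "RepB": {"AAAAAAA": -0.2, "CCCCCCC": -3.2, "GGGGGGG": -2.0},
}
pairs = orthogonal_pairs(toy, "RepA", "RepB")
```

Under this reading, the number reported in each heatmap cell would correspond to `len(pairs)` for that pair of Reps, that is, the product of the two selective k-mer lists.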
We next searched for triple orthogonal sets of Reps from our panel. As an example, the set containing k-mer sequences #1280, #4624, and #12344 is predicted to react orthogonally with DCV, BBTV, and WDV recombinant Reps, respectively, as indicated by log 2 FC values (Fig. 7b). Similar to our method for validating double orthogonal sets, we tested the orthogonality of this set using the standard in vitro HUH-tag reaction and calculated percent adduct formed with each combination of k-mer and Rep over a short time course. The expected orthogonality was achieved, with over 50% covalent adduct formation after 30 minutes for each of the three HUH-tags and only 0-9% cross-reactivity identified (Fig. 7b).
Notably, 23 of the 28 HUH-tag sets from different viral families contained significant k-mer pairs likely to be orthogonal, yet there were no instances of orthogonal k-mer pairs for Reps derived from the same viral family (Fig. 7a). Hence, the ssDNA binding moieties of HUH-tags within the same family may be too similar to yield orthogonal adduct formation. The case of DCV and BBTV is curious: they are from different Rep families but recognize identical cognate nonanucleotide ori sequences, yet we identified 294 potentially orthogonal k-mer pairs (Fig. 7a). This indicates that perhaps DCV and BBTV recognize the same cognate sequence with different interactions, allowing for divergent specificity at each nucleotide position. Indeed, 6 out of 9 residues in the sDBM are different between DCV and BBTV Reps. Together, using HUH-seq, we can pick out subtle differences in Rep specificity to extract double and triple orthogonal sets of k-mers and Reps that can be used in multiplexed HUH-tag technologies, potentially replacing, or being used in combination with, the larger and slower relaxases or commercial fusion tags.

RATIONAL DESIGN OF A WDV CHIMERA CONFERS PCV2-LIKE SEQUENCE SPECIFICITY
The identification of the sDBM, which we hypothesized was responsible for sequence specificity in Reps, together with the discovery of pairs of target sequences that should not cross-react between two Reps, inspired us to attempt to swap sequence specificities between two Reps. We swapped out the first five amino acids of the WDV sDBM for those of PCV2, creating a WDV chimera (WDVc1) as a proof-of-concept that Rep specificity could be altered by rational design in a predictable manner (Fig. 7c). Because many of the amino acid side chains in both Rep structures have direct contacts with bases in the 5' end of the ssDNA, we hypothesized that WDVc1 would have sequence specificity more closely reflecting that of PCV2. We first identified a pair of HUH-seq predicted orthogonal target sequences for PCV2 and WDV. WDVc1 reacts with k-mer #768 (oP), predicted to react only with PCV2, to a greater extent than with k-mer #12318 (oW), predicted to react only with WDV (Fig. 7d). Further, WDVc1 reacts robustly with the cognate nonameric sequence of PCV2, k-mer #720 (wtP), as well as the cognate nonameric sequence of WDV, k-mer #12496 (wtW), similar to PCV2 (Fig. 7c). Thus, we show how the sDBM is a key feature of Reps that may be rationally engineered to predictably alter sequence specificity.

Fig. 7: Discovery of orthogonal Rep target sequences and rational engineering of Rep specificity
a, Heatmap displaying the number of k-mer pairs for a specific HUH-tag set likely to be orthogonal using an asymmetric log2FC threshold based on values from HUH-seq analysis; blank cells indicate zero such k-mer pairs. The threshold values are set to log2FC values greater than -0.3 (indicating no k-mer cleavage) and log2FC values less than -3 (indicating high k-mer cleavage). Each cell of the heatmap represents the total number of possible k-mer pair combinations likely to be orthogonal for a particular set of two Reps. Sets are based on this asymmetric threshold, in which the first Rep in the set has high cleavage of one k-mer in the pair and no cleavage of the other k-mer in the pair, and vice versa for the second Rep in the set, indicating orthogonality. As an example, one k-mer pair (#5112 and #12344) of the 2200 possible combinations indicated by the WDV vs. DCV cell was synthesized in the context of the flanking regions of the 7N ssDNA library, and cleavage orthogonality was validated using the standard HUH in vitro cleavage assay. Recombinant WDV and DCV were reacted under standard conditions over a short time course with synthetic oligos harboring k-mer sequences #5112 or #12344. Percent covalent adduct was calculated. b, Set of three Reps and corresponding orthogonal set of three k-mer sequences as indicated by log2FC values from HUH-seq analysis. Oligos synthesized harboring k-mer sequences (#1280, #4624, #12344) were reacted with DCV, BBTV, and WDV recombinant HUH-tags at room temperature with 1.5x molar excess oligo over HUH-tag protein over a short time course. c, A schematic illustrating the construction of the WDV chimera (WDVc1) containing the first five amino acids of the PCV2 sDBM. d, The heatmap displays HUH-seq log2FC values for PCV2 and WDV reactivity with cognate nonanucleotide sequences (k-mers #720 and #12496) and a pair of k-mers (#768 and #12318) predicted to react orthogonally.
The k-mers #720 (wtP), #12496 (wtW), #768 (oP), and #12318 (oW) were synthesized in the context of the flanking sequences of the 7N ssDNA library and reacted in a 5x molar excess with PCV2, WDV, and WDV chimera (WDVc1) recombinant protein for 30 min at 37°C.

Discussion
We report the first two co-crystal structures of Rep HUH-endonucleases complexed to recognized ssDNA in a pre-cleavage state and developed a complementary NGS cleavage assay termed HUH-seq to determine how ssDNA specificity is encoded. Based on our findings, we discuss new insights into the HUH-endonuclease mechanism of DNA processing, engineering Reps for designer sequence specificity, and multiplexed HUH-tag applications.
Rep HUH-endonucleases initiate and terminate RCR or RHR, and even facilitate integration of foreign DNA into target sites of the host genome, by cleaving and rejoining DNA in cooperation with multimerization and helicase domains (1,(42)(43)(44)(45)(46) . Rep involvement in replication is a longstanding area of research as Reps are the only known viral components absolutely necessary for these processes (47) , yet the protein interfaces responsible for recognition of ssDNA for initiation function are poorly characterized. For their size, Reps have the remarkable ability to specifically recognize both dsDNA and ssDNA using adjacent protein surface interfaces. Reps use one well-defined interface to bind dsDNA downstream of the nic site and initiate localized melting of DNA (48)(49)(50) , and a second ssDNA docking interface near the active site, elucidated here, to bind within a hairpin loop containing the cognate nonanucleotide ori sequence. Additionally, RHR-mediating Reps, also commonly referred to as NS1 proteins, such as AAV Rep, may use a third interface to recognize the tip of inverted terminal repeats (ITRs) to enhance the cleavage of the terminal resolution site (trs) (48) .
Through structural analysis of PCV2 and WDV Rep co-crystal structures, we uncovered the ssDNA docking interface exposing arguably the most important network of interactions necessary for Rep-mediated replication and integration. Reps conform ssDNA into a common U-shaped DNA architecture using interactions mainly facilitated by a structural motif we termed the sDBM. The sDBM carries out ssDNA bending principally by burying a nucleotide 3' of the nic site into a deep pocket and looping the 5' end of the recognized sequence back around into the U-shape conformation. Both the U-shape of the ssDNA and surface pockets are structurally conserved elements with respect to previously reported structures of their relaxase counterparts, TraI and TrwC, which facilitate plasmid transfer conjugation (28,30) . An important distinction is that these conserved pockets are much deeper in relaxase surfaces than in Rep surfaces, likely resulting in higher specificity (32) . One reason relaxases may have higher specificity for the DNA sequence 5' of the nic site is to efficiently catalyze the rejoining of the free 3'OH of the DNA post-transfer, whereas RCR resolution likely requires a second dimerizing Rep for termination (51) . Rep specificity could also simply be constrained due to a smaller interface surface area because of limited gene size, as most viral Rep genomes are under five kilobases (52) .
The Rep structures inspired the development of an NGS-based cleavage assay that would allow for the quantitative readout of Rep cleavage specificity using a ssDNA library. We expected that the combination of structural data and a sensitive approach for profiling Rep specificity would give especially valuable protein engineering insights for Reps as bioconjugation fusion tags and other Rep-based biotechnology applications. Many aspects of HUH-seq analysis corroborate the structural analysis, notably that positions -5 and -6 confer the most specificity via contacts between protein residues and nucleotide bases, while nucleotide positions -2 and -3 confer the least specificity due to a lack of contacts with the nucleotide bases. Excitingly, HUH-seq can be used to distinguish subtle differences in nucleotide preference between Reps despite an overall lack of specificity, so much so that innate orthogonality between non-cognate target sequences can be extrapolated between Reps from different, yet closely related, viral families with highly similar or even the same cognate nonanucleotide ori sequences. This means that non-cognate target sequences and wild-type Reps can be used in multiplexed HUH-tag applications such as protein barcoding (24) .
Perhaps the most exciting potential use of Reps is as multiplexable bioconjugation HUH-tags to covalently append nucleic acid barcodes to protein components of interest as a broad-spectrum molecular counter (24) . Because HUH-tag linkages are specific, covalent, and can occur intra- or extracellularly without additional reagents, it is tempting to speculate that high-throughput barcoding of endogenous proteins is possible with the realization that wild-type Reps can be used as bioorthogonal tags. This may, for example, open the door for development of high-throughput proximity ligation assays (53,54) to accurately detect multiple protein-protein interactions at low abundance simultaneously in situ. Further, because sDBM-engineered WDVc1 has PCV2-like specificity, we have shown that by simply mutating four amino acids within the sDBM, specificity can be predictably altered. We expect specificity toward designer target sequences could be engineered using the sDBM as a target variable region, bringing forth the inception of a library of engineered HUH-tags with defined sequence specificity for massively parallel Rep-based applications. In a gene therapy context, HUH-endonuclease mediated site-specific genome integration is a critical approach but is limited to integration at native sequences existing in the targeted genome (23,46) ; however, we imagine chimeric Reps based on the AAV mechanism with engineered specificity for designer sequences could greatly broaden the diversity of integration sites.
With minor alterations to HUH-seq, it is conceivable for the HUH-seq NGS-based approach to expand to sensitive detection of sequence specificity profiles for enzymes such as dsDNA nucleases using a dsDNA library, RNA-cleaving enzymes by adding a single reverse transcriptase step, or site-specific nucleotide modifying enzymes by relying on the covalent modification blocking PCR amplification. The existing high-diversity library methods used to determine the specificity of zinc finger nucleases (55) , Cas9 (56) , transcription activator-like effector nucleases (TALENs) (57) , and restriction enzymes (58) are powerful and direct cleavage read-out approaches, however they require a number of extra library preparation steps and conceivably may be limited to only dsDNA libraries without more complicated modifications. HUH-seq is limited by the diversity size of the library; however, as NGS read capacity increases, along with computational processing and data storage, library size may become a negligible shortcoming of HUH-seq. Additionally, if sequence binding, rather than cleavage, could be optimized as a readout, HUH-seq could be developed as a facile alternative method to approaches such as SELEX-seq (59) in order to determine binding sequence preference of shorter DNA binding motifs than conventionally determined specificity of enzymes such as transcription factors without the need for multiple rounds of sequence enrichment and validation.
Understanding the molecular basis of Rep specificity is not only important for understanding the RCR mechanism and Rep-based biotechnology applications; Reps remain a main target for modulating viral infections using antivirals (60) . Billions of dollars worldwide are lost in agriculture every year from the decimation of products such as tomato, cassava, cotton, and beans by geminivirus infection (61) . Present antiviral strategies, both viral protein interfering and gene silencing approaches, are eventually subverted by conferred resistance from a rapidly evolving viral genome (60,(62)(63)(64)(65)(66) . Current anti-Rep antibodies (67) and peptide antivirals show initial effectiveness (68,69) , however we imagine development of antivirals specifically targeting the ssDNA binding of Reps could effectively retain long-term resistance. For example, transgenic host plants expressing nanobodies with high affinity for the single-stranded DNA docking interface of Reps could halt geminivirus replication. Moreover, because the ssDNA docking interface containing the sDBM is so structurally conserved, a similar approach may achieve success combating infections for related organisms such as circovirus and nanovirus, or even human disease causing parvoviruses.
In a human disease context, parvovirus B19 infections are common and usually mild (20) ; however, a complex range of clinical manifestations, from serious or fatal outcomes for a fetus (70,71) to associations with autoimmune diseases in adults (21,22) , has sparked a push to develop treatments and vaccines (72) . Parvovirus Reps are structural homologs of Reps that mediate RCR; however, an additional loop in the sDBM thought to interact with dsDNA suggests that more stringent recognition at the trs may enhance specificity with respect to RCR Reps (48) . Even with this difference, the PCV2 and WDV structures lend guidance for rational design of therapeutics targeting the ssDNA docking interface of parvovirus Reps. In the future, antivirals are likely to become even more necessary as novel human disease causing parvoviruses are predicted to emerge at an increasing rate (73,74) ; indeed, human bocavirus (75) and human parvovirus 4 (76) have made interspecies jumps in recent years.

MOLECULAR CLONING, PROTEIN EXPRESSION, AND PURIFICATION
The N-terminal domains of all Reps were synthesized as E. coli codon-optimized gene blocks from Integrated DNA Technologies (IDT) and designed with 15 nucleotides on each end homologous to regions of the linearized pTD68/His6-SUMO parent vector digested with BamHI and XhoI. Final His6-SUMO-Rep constructs were created with the In-Fusion HD Cloning Kit (Takara) and sequence confirmed with Sanger sequencing (Genewiz). Purified plasmids were transformed into BL21(DE3) E. coli competent cells (Agilent), which were initially cultured in 1 L LB broth at 37°C, then induced at OD600 with 0.5 mM IPTG (isopropyl β-D-1-thiogalactopyranoside, Sigma Aldrich), then grown for 16 hours at 18°C. Collected cell pellets were resuspended in 10 mL of lysis buffer (50 mM Tris pH 7.5, 250 mM NaCl, 1 mM EDTA, complete protease inhibitor tablet (Pierce)) and pulse sonicated for several one-minute rounds. The suspension was centrifuged at 24,000 x g for 25 min, and supernatants were batch bound for 1 hour with 2 mL HisPure Ni-NTA agarose beads (ThermoFisher) equilibrated with wash buffer (50 mM Tris pH 7.5, 250 mM NaCl, 1 mM EDTA, 30 mM imidazole). After the lysate cleared the gravity column, beads were washed with 30 mL wash buffer, and proteins were eluted from gravity columns with elution buffer (50 mM Tris pH 7.5, 150 mM NaCl, 1 mM EDTA, 250 mM imidazole). Protein was further purified and buffer exchanged into 50 mM Tris pH 7.5, 150 mM NaCl, 1 mM EDTA using the ENrich SEC70 (Bio-Rad) size exclusion column. Aliquots were stored at -20°C and -80°C at 30 μM. SUMO-cleaved recombinant PCV2 Y96F and WDV Y106F stocks for crystal screening were prepared in a similar manner as above; however, Ni-NTA fractions were dialyzed into 50 mM Tris pH 7.5, 300 mM NaCl, 1 mM EDTA with the addition of 1 mM DTT and SUMO-cleaving protease ULP-1 at 5 U per 1 L of E. coli overnight at 4°C.
Dialyzed samples were batch bound a second time with Ni-NTA beads and were flowed through a gravity column to remove cleaved His6-SUMO and His6-ULP-1. Protein was concentrated with spin concentrators (Amicon Ultra-15 Centrifugal Filter Unit, 3 kDa cut-off) to 16 mg/mL.

STANDARD IN VITRO HUH CLEAVAGE ASSAY
HUH-tag cleavage of the synthetic oligos was carried out using final concentrations of 3 µM HUH-tag and 4.5-30 µM oligo in 50 mM HEPES, 50 mM NaCl, and 1 mM MnCl2 for 30 min at 37°C. The reactions were quenched with 4x Laemmli buffer containing 5% β-ME, boiled for 5 min at 100°C, and run on a 4-12% SDS-PAGE acrylamide gel. For time course reactions, aliquots were removed from an HUH reaction master mix at specified time intervals and immediately quenched in 4x Laemmli buffer containing 5% β-ME. Percent covalent adduct formation was calculated using Bio-Rad ImageLab software. The background subtraction function of ImageJ was used to process all gel images post-analysis.

HUH-SEQ ssDNA LIBRARY CLEAVAGE, LIBRARY PREPARATION, AND SEQUENCING
A 90-nt ssDNA library with a central 7-base randomized region flanked by conserved regions harboring primer binding sites at either terminus (7N ssDNA library) was constructed using the IDT oPools service and consisted of 128 individually synthesized DNA oligos mixed at equal molarity (SI Table). This method produced a minimally biased distribution of all 16,384 possible k-mers (SI, histogram of oPools vs. ultramer). Recombinant Rep HUH-tag cleavage of the 7N ssDNA library was carried out in triplicate with 3 µM HUH-tag and 300 nM (83.4 ng/µL) ssDNA library in 50 mM HEPES, 50 mM NaCl, and 1 mM MnCl2 for 1 hour at 37°C. The HUH-tag enzymes were immediately heat inactivated at 95°C for 3 minutes. The remaining uncleaved ssDNA library from each HUH-tag in vitro cleavage reaction was diluted 10-fold in water and amplified using 0.5 µM TruGrade/HPLC purified primers from IDT containing Nextera adapters and spacer regions with 2x CloneAmp™ HiFi PCR Premix for 30 cycles. The resulting product was a 200 bp dsDNA amplicon, which was run on a 1.5% agarose gel and stained with SybrSafe (SI, agarose gel image of one trial). Each 200 bp product was gel extracted (NucleoSpin Gel and PCR Clean-up kit, Macherey-Nagel) and eluted in 30 µL NE buffer, resulting in samples of 30-60 ng/μL. All samples were barcoded with Illumina dual-indexing sequences via the Nextera adapters (University of Minnesota Genomics Core). Indexed samples were pooled and run on a 1.5% agarose gel; the 270 bp barcoded pooled sample was gel extracted and then sequenced using a single Illumina HiSeq lane (350,000,000 paired-end reads, Genewiz) spiked with 30% PhiX.

HUH-SEQ READ COUNT REDUCTION ANALYSIS AND SEQUENCE LOGO GENERATION
Raw NGS sequence data were processed using R. Non-randomized portions (e.g. adapter sequences) were removed from each read to extract only the randomized 7-mer (k-mer). 7-mers from reverse reads were reverse-complemented, and frequency counts for each of the 16,384 unique 7-mers were generated for the reference library and each of the HUH-tag treatment libraries. Each treatment was then compared against the reference to estimate a log2-fold-change and percent reduction ((reference - treatment) / reference) for each of its 7-mers. The percent reduction data were used to generate weighted sequence logos for each HUH-tag using the ggseqlogo package in R.
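As an illustration of the read count reduction analysis, the following Python sketch tallies 7-mers from forward and reverse reads and compares a treatment against the reference. It is not the authors' R code: the function names are hypothetical, the counts are assumed to be already scaled to comparable sequencing depth, and the optional pseudocount (used only to avoid taking the logarithm of zero) is an assumption that the text does not describe.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(forward_kmers, reverse_kmers):
    """Tally 7-mers, reverse-complementing those extracted from reverse reads."""
    counts = Counter(forward_kmers)
    counts.update(reverse_complement(k) for k in reverse_kmers)
    return counts


def compare_to_reference(reference, treatment, pseudocount=1.0):
    """Return {kmer: (log2FC, percent reduction)} for each k-mer in the reference.

    Assumes both count tables are depth-normalized; pseudocount is an
    illustrative choice, not a parameter taken from the paper.
    """
    results = {}
    for kmer, ref_count in reference.items():
        trt_count = treatment.get(kmer, 0)
        log2fc = math.log2((trt_count + pseudocount) / (ref_count + pseudocount))
        pct_reduction = 100.0 * (ref_count - trt_count) / ref_count
        results[kmer] = (log2fc, pct_reduction)
    return results


# Toy counts (hypothetical): one k-mer is depleted 16-fold, the other is untouched
reference = {"CATTTCT": 400, "TAAATCT": 400}
treatment = {"CATTTCT": 25, "TAAATCT": 400}
for kmer, (lfc, pct) in compare_to_reference(reference, treatment, pseudocount=0.0).items():
    print(kmer, round(lfc, 2), round(pct, 2))
```

In this toy case the depleted k-mer has a log2FC of -4 and a percent reduction of 93.75, while the untouched k-mer has values of 0 for both.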

EXTRACTING ORTHOGONAL HUH-TAG AND K-MER SETS
Orthogonality of HUH-tags was determined in silico using a custom R script. The script first iterates through each HUH-tag and labels it as strongly reactive, moderately reactive, or nonreactive with each of the k-mers, with any log2FC under -3.0 considered strongly reactive and any over -0.3 considered nonreactive. Then, for every possible pairing of HUH-tags, the script counts the k-mers that are strongly reactive with one HUH-tag and nonreactive with the other. Two HUH-tags, A and B, are labeled as "likely orthogonal" if there exists at least one such k-mer in each direction: one where A is strongly reactive and B is nonreactive, and another where A is nonreactive and B is strongly reactive.
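The classification and pairing logic described above can be sketched in Python as follows. This is a minimal illustration rather than the custom R script itself: the function names, the Rep labels, and the log2FC values in the toy example are all hypothetical, and only the two thresholds (-3.0 and -0.3) are taken from the text.

```python
STRONG_THRESHOLD = -3.0       # log2FC below this: strongly reactive
NONREACTIVE_THRESHOLD = -0.3  # log2FC above this: nonreactive


def classify(log2fc):
    """Label a single log2FC value using the thresholds from the text."""
    if log2fc < STRONG_THRESHOLD:
        return "strong"
    if log2fc > NONREACTIVE_THRESHOLD:
        return "nonreactive"
    return "moderate"


def orthogonal_kmer_pairs(log2fc_a, log2fc_b):
    """List (kmer_for_A, kmer_for_B) pairs predicted to be orthogonal.

    kmer_for_A is strongly reactive with A and nonreactive with B;
    kmer_for_B is the reverse. Inputs map k-mer -> log2FC for each HUH-tag.
    """
    shared = [k for k in log2fc_a if k in log2fc_b]
    a_only = [k for k in shared
              if classify(log2fc_a[k]) == "strong" and classify(log2fc_b[k]) == "nonreactive"]
    b_only = [k for k in shared
              if classify(log2fc_a[k]) == "nonreactive" and classify(log2fc_b[k]) == "strong"]
    return [(x, y) for x in a_only for y in b_only]


def likely_orthogonal(log2fc_a, log2fc_b):
    """True if at least one k-mer exists in each direction."""
    return len(orthogonal_kmer_pairs(log2fc_a, log2fc_b)) > 0


# Hypothetical log2FC values for two hypothetical HUH-tags
rep_a = {"AAAAAAA": -4.2, "CCCCCCC": -0.1, "GGGGGGG": -1.5, "TTTTTTT": -3.5}
rep_b = {"AAAAAAA": -0.2, "CCCCCCC": -3.8, "GGGGGGG": -1.0, "TTTTTTT": -3.6}
print(orthogonal_kmer_pairs(rep_a, rep_b))
print(likely_orthogonal(rep_a, rep_b))
```

Because every k-mer selective for A can be combined with every k-mer selective for B, the number of pairs returned is the product of the two list lengths, which matches the idea of counting all possible k-mer pair combinations for a given pair of Reps.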

CRYSTALLIZATION, DATA COLLECTION, AND PROCESSING
An 8-mer oligonucleotide (5'-dAATATTAC-3') from part of the geminivirus origin of replication sequence was reconstituted in ddH2O at 10 mM and mixed with recombinant WDV Y106F. We used Rigaku's CrystalMation system to perform a broad, oil-immersion, sitting drop screen of the protein-DNA mixture in the presence of either magnesium or manganese. Crystals were achieved using 8 mg/mL protein solution containing 1.1-fold 8-mer and 5 mM MnCl2 with a well solution of 12% (w/v) PEG8000 precipitating agent, 0.2 mM zinc acetate, and 0.1 M sodium cacodylate at pH 6.5. The crystals belong to space group P 41 with unit cell dimensions of a = b = 50.63 Å, c = 241.98 Å. Four complexes were present per asymmetric unit. P 41 21 2 was also a solution for this crystal morphology with 2 complexes per asymmetric unit; however, we failed to solve a suitable model at higher symmetry, possibly due to variations in the crystal structure. Addition of any cryoprotectant to these crystals resulted in poor diffraction; the crystals seemed to collapse upon vitrification. Our solution to this issue was to collect datasets using an in-house X-ray diffractometer (Rigaku Micromax-007 Rotating Anode, Rigaku Saturn 944 CCD Detector) at room temperature. Radiation caused minimal crystal damage, and over 100 frames could be obtained from a single crystal. All data were processed with the HKL suite.

STRUCTURE SOLUTION AND REFINEMENT
The WDV Y106F + 8-mer structure was solved with the molecular replacement function in PHENIX using our previously solved structure of apo WDV Rep (PDB ID: 6Q1M) as a model. We visualized the electron density map using Coot (77) and recognized clear patches of unmodeled density as a chain of nucleotides. All 8 nucleotides of the 8-mer oligonucleotide were unambiguously built into well-defined electron density in each of the 4 complexes in the asymmetric unit. Subsequent refinement was performed with default settings of PHENIX auto.refine with NCS applied (78) and alternated with visual inspection and model correction. Final R-work and R-free were 0.167 and 0.236, respectively.
The WDV Y106F + 10-mer structure (space group P 21 21 21) was solved with the Phaser molecular replacement function in PHENIX using the previously solved WDV Y106F + 8-mer structure. The two additional nucleotides were modeled into appropriate density. Again, Coot was used for model building, and PHENIX auto.refine was used for refinement. The final R-work and R-free were 0.173 and 0.224, respectively.
For the PCV2 Y96F + 10-mer structure, a model for molecular replacement was generated in PyMol by superimposing the WDV Y106F + 8-mer structure with the porcine circovirus Rep domain (PDB ID: 5XOR) structure. The 8-mer from the WDV model was added to the PCV Rep domain model and used for Phaser molecular replacement in PHENIX. The two additional nucleotides were modeled, and the oligonucleotide sequence was corrected using Coot. PHENIX auto.refine was used for refinement. During refinement, two of the complexes in the asymmetric unit had well-defined electron density. Density corresponding to the third complex was poorly defined, and modeling was difficult. As a result, R-values are higher than typical for a structure at this resolution. Final R-work and R-free were 0.245 and 0.301, respectively.

Supp. Note 1: Scissile phosphate completes coordination of manganese in Rep active sites
The PCV2 Y96F + 10-mer active site contains an HUQ motif, rather than an HUH motif, that together with the structurally conserved Glu48 coordinates a manganese ion (ref fig.). The octahedral coordination of the ion is completed by an adjacent water molecule, which may be positioned by Arg54 or Glu100, and the O1 and O3' of the scissile phosphate. A sequence-conserved lysine (cite), Lys99, thought to act as a general base (cite), may deprotonate the catalytic tyrosine poised near Phe96. The WDV Y106F + 10-mer active site reveals identical coordination of manganese by the HUH motif and scissile phosphate; in contrast, however, it uses Glu110 to coordinate the metal ion instead of the predicted Glu49 residue, which seems to shift Lys109 out of position to act as a general base. It is unknown whether this coordination is a crystallographic artifact or if geminivirus Reps use a different cleavage mechanism that does not rely on a general base to activate the tyrosine. Rep and relaxase active sites are highly similar and perform the same general DNA cleavage mechanism, using divalent metal ion coordination of O1 and O3' of the scissile phosphate to further polarize the partial positive charge of the phosphate and thereby promote nucleophilic attack by an adjacent active site residue, though relaxases use a second conserved tyrosine for the DNA re-joining reaction (ref).

Supp. Fig. 1: Hairpin structure and consensus cognate sequence recognized by panel of Reps
Consensus cognate nonanucleotide ori sequence of 10 different Reps from Circoviridae, Nanoviridae, and Geminiviridae (see Table 1 and SI). The origin of replication (ori) from these ssDNA viruses contains a stem-loop hairpin, with Rep cleavage occurring between positions -1 and +1 within the nonanucleotide sequence. The viral ori contains a stem that varies in sequence and is 9-11 base pairs in length, while the loop contains the cognate nonanucleotide sequence and is 10-13 nucleotides in length.

Supp. Note 2
Sequential truncations of this geminivirus ori sequence from either the 5' end or 3' end indicated that, at most, the nonameric sequence is necessary for sufficient cleavage activity, because all ten HUH-tags produced high adduct formation when the full nonameric sequence was retained in the target oligo. The position -7 nucleotide may not be necessary, because adduct formation was nearly identical in the absence of this nucleotide. Strikingly, CpCDV, and to a smaller extent PCV2 and CLCV, retain cleavage activity when only nucleotides in positions -2 through +2 are present (T(-2)T(-1)*A(+1)C(+2)). In contrast, TYLCV retained cleavage activity only when 8 of the 9 positions are retained in the target sequence, omitting position -7 (A(-6)A(-5)T(-4)A(-3)T(-2)T(-1)*A(+1)C(+2)).
Single substitutions along the target sequence had no broad effect on the geminivirus HUH-tags, except that TYLCV activity was hampered by substitutions at positions -6 through +2. The nanovirus HUH-tags, FBNYV and BBTV, have nearly identical target sequence profiles to that of TYLCV but tolerate both a transition and a transversion at position -5. Interestingly, several double substitutions flanking the cleavage site of the target sequence were largely tolerated by geminivirus HUH-tags, again with the exception of TYLCV, but cleavage of these sequences by circovirus and nanovirus HUH-tags was abrogated.
Supp. Fig. 4: Standard in vitro cleavage assay with Rep panel and oligo library
a, The standard in vitro HUH reaction schematic, in which the catalytic tyrosine forms a covalent adduct with the scissile phosphate at the +1 position in the target sequence of a synthetic oligo. The phosphotyrosine adduct is stable under denaturing conditions, as evidenced by an upward shift in SDS-PAGE analysis. b, The truncations heatmap displays Rep covalent adduct formation with sequential truncations of synthetic DNA oligos harboring partial geminivirus ori target sequence, with the cognate nonanucleotide ori sequence underlined. c, The substitutions heatmap displays HUH-endonuclease covalent adduct formation with single or multiple substitutions within synthetic DNA oligos harboring the geminivirus ori target sequence, either 26-nt or 17-nt in length (the entire 26-nt sequence is not shown in the figure). Full sequences of synthetic oligos used are provided. The conjugation reaction is carried out under the standard HUH in vitro reaction conditions with a 1:10 Rep:oligo ratio for 30 minutes. Each cell reflects a percent covalent adduct formation for a single reaction.

Supp. Note 3
Because HUH-seq is a read count reduction based assay, every sequence in the library must be present at a high read count abundance, so we limited the library size (SI Eq. 1) based on the total read counts provided by the HiSeq platform. We reasoned the size of the library, number of treatments, number of replicates, total read counts desired per k-mer, and percent of PhiX spike-in would all contribute to how large a library could be used (SI Eq. 2). We settled on creating a ssDNA library with 7 randomized nucleotides, which would produce on average 415 read counts per k-mer for 24 samples in triplicate.

Supp. Equations 1 and 2:
Eq. 1 determines the total number of k-mers of user-specified length in the synthetic ssDNA library, dictated by the number of nucleotide options and the number of randomized positions. Eq. 2 is used to estimate the average read count per k-mer: the estimated total read count given by an NGS platform minus the recommended 30% PhiX spike-in, all divided by the number of k-mers in the ssDNA library, the number of HUH-endonuclease samples plus the reference sample, and the number of replicates. This can be used to help determine which NGS platform will give sufficient read counts per k-mer.
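The two equations are described only in words here, so the following Python sketch writes them out under stated assumptions. The function names are hypothetical. To reproduce the figure of about 415 reads per k-mer quoted in Supp. Note 3, the sketch assumes that both mates of each of the 350,000,000 paired-end reads are counted (700,000,000 reads in total), which is consistent with reverse reads being reverse-complemented and tallied in the analysis but is not stated explicitly in the text.

```python
def library_size(n_nucleotide_options=4, n_randomized_positions=7):
    """Eq. 1 (as described): number of distinct k-mers in the library."""
    return n_nucleotide_options ** n_randomized_positions


def mean_reads_per_kmer(total_reads, phix_fraction, n_kmers, n_samples, n_replicates):
    """Eq. 2 (as described): average read count per k-mer per sample replicate."""
    usable_reads = total_reads * (1.0 - phix_fraction)
    return usable_reads / (n_kmers * n_samples * n_replicates)


n_kmers = library_size()  # 4**7 = 16384
# Assumption: both mates of 350,000,000 paired-end reads contribute counts
total_reads = 2 * 350_000_000
estimate = mean_reads_per_kmer(total_reads, phix_fraction=0.30,
                               n_kmers=n_kmers, n_samples=24, n_replicates=3)
print(n_kmers, round(estimate))  # 16384 and about 415
```

Counting only one mate per pair would halve the estimate, so the choice of read-counting convention matters when comparing sequencing platforms with this calculation.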

Supp. Note 4: HUH-seq cleavage assay caveats and controls
Lower concentrations of WDV had minimal impact on the ssDNA recognition profile of WDV, yet the lower maximum average percent reduction values correspond to fewer k-mer sequences being cleaved. Removing SUMO from WDV minimally affected the ssDNA recognition profiles as well. Interestingly, the inactive WDV Y106F mutant control revealed k-mers with significant percent reduction over the reference library, but to a much lesser extent than WT WDV. We have verified that this inactive WDV mutant does not cleave the 26-nt geminivirus nonanucleotide ori sequence oligo using the standard in vitro cleavage assay (SI gel). We reasoned that WDV Y106F is still able to tightly bind the preferred target sequence and decrease the read count, perhaps by partially blocking amplification of sequences bound to Rep rather than by physical separation of the primer binding sites due to cleavage. This result indicates that decreased read counts may be a consequence of both cleavage and HUH-tag binding, with binding contributing at a detectable yet much lower rate. Further, the ssDNA recognition profile generated for WDV Y106F is almost identical to that of WT WDV (SI Figure X). We could refine this assay in the future with a more effective Rep inactivation step, ensuring that only cleavage is a readout for the assay.
Another caveat we noticed is that the MSMV sequence logo did not show a preferred target sequence profile similar to that of the cognate nonanucleotide ori sequence. We presume that MSMV had a high cleavage rate in the constant region of the 7N ssDNA library somewhere near the 5' end of the randomized region, resulting in a largely random sequence logo (SI Figure X B). Optimizing the sequence of the flanking constant regions of the 7N ssDNA library to limit cleavage in this region may allow us to generate an accurate target sequence cleavage profile of MSMV in the future. Simple optimization could eliminate these caveats, though the resulting advantages would not be critical.
Unknown binding, cleavage, and release kinetic factors may differ for each Rep and could be confounding our ability to compare specificity. Ascertaining a full kinetic profile of each of these steps may give a better comparative picture of sequence specificity. In some cases, a high protein to substrate concentration ratio is known to negatively impact sequence specificity even of highly specific DNA-binding proteins such as zinc finger nucleases (55) . HUH-seq logos corresponding to two lower WDV concentration treatments revealed a slight variation in nucleotide preference; however, this was due to variability between samples rather than a concentration-dependent change in sequence specificity (Supp. Fig. 3a). Interestingly, a profile nearly identical to that of WDV, but with a much lower maximum average percent reduction, was generated from an inactive WDV Y106F treatment that we intended to use as a negative control (Supp. Fig. 3c-d). This indicates that binding, in addition to cleavage, contributes to the HUH-seq readout, though only to a small extent. A largely random sequence logo was produced for the MSMV Rep (Supp. Fig 3e). We hypothesize this is because a large percentage of the library was cleaved within the constant regions rather than within the intended target sequence. Despite these few caveats, HUH-seq is a robust method for determining the specificity of ssDNA recognition by Reps.
Supp. Fig. 6: Effects of enzyme concentration dependence, tag-less enzyme, and inactive enzyme on HUH-seq readout
a, The weighted sequence logos of WDV at two lower concentrations, ten-fold less (x0.1) or equimolar (x1.0) WDV, with respect to the total 7N ssDNA library concentration. b, The weighted sequence logo of SUMO-cleaved WDV. c, Recombinant SUMO-WDV Y106F reacted with and without the 26-nt geminivirus ori sequence oligo under standard HUH in vitro assay conditions with a 1:10 protein:oligo ratio. The SDS-PAGE gel scan is cropped to remove unrelated data. d, The weighted sequence logos of WDV Y106F and e, MSMV generated from HUH-seq analysis. Heights are scaled to represent the average percent reduction of each base at each position when compared to the reference library. Black sequences below each logo are the cognate nonanucleotide ori sequences from each respective virus.

Supp. Table 1: Individual BASA and contact values for protein-DNA interactions in co-crystal structures
Breakdown of all protein:DNA contacts and BASA values calculated by DNAproDB using default positions with ribose and phosphate interactions turned on.

Supp. Figure 7: Discovery and validation of a target sequence exclusively cleaved by PCV2 Rep
Single row heat map shows log2FC for each Rep treatment from HUH-seq analysis for k-mer #2884, which bears an AGTCAAT sequence. There are four substitutions, underlined, at positions -6, -5, -4, and -2 with respect to the circovirus cognate nonanucleotide ori sequence. Below the heatmap is the corresponding gel highlighting covalent adduct formed under the standard HUH in vitro cleavage assay with a 1:2 protein:oligo ratio, where only PCV2 forms adduct with k-mer #2884, and no detectable adduct is formed with any of the other 9 Reps.