Exploring protein sequence and functional spaces using adversarial autoencoder

Shedding light on the relationship between protein sequences and their functions is a challenging task with implications for our understanding of protein evolution, diseases, and protein design. However, due to its complexity, the protein sequence/function space is hard to comprehend and prone to numerous human biases. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificities. Applied to protein sequences, they can help to point out relationships between protein positions and functions, and to capture the different sequence patterns associated with functions. In this study, an unsupervised generative approach based on an adversarial autoencoder (AAE) is proposed to generate and explore new protein sequences with respect to their functions. The AAE is tested on three protein families known for their multiple functions, one of which has manually curated annotations. Functional labeling of sequences encoded in a two-dimensional latent space computed by the AAE for each family shows good agreement regarding the ability of the latent space to capture the functional organization and specificity of the sequences. Furthermore, arithmetic between latent spaces and interpolations between encoded sequences are tested as ways to generate new intermediate protein sequences sharing the sequence and functional properties of families with different sequences and functions. Using structural homology modeling and assessment, the new protein sequences generated by latent space arithmetic display physico-chemical properties and energies intermediate between those of the original families. Finally, protein sequences generated by interpolation between data points of the latent space show the ability of the AAE to generalize smoothly and to produce meaningful biological sequences from uncharted areas of the latent space.
Code and data used for this study are freely available at https://github.com/T-B-F/aae4seq.


Introduction
Protein sequence diversity is the result of a long evolutionary process. This diversity, or sequence space, is constrained by natural selection, and only a fraction of all possible amino acid combinations is observed. Given its huge size, exploring the sequence space and understanding its hidden constraints is very challenging. Protein families are groups of related proteins, or parts of proteins, and represent a useful description to reduce the sequence space complexity. Many resources have been developed over the years to group protein sequences into families whose members share evidence of sequence or structural similarity [8,11,22]. However, diversity also exists inside a family, and one family may group together several proteins with different molecular functions [5]. Navigating the sequence space with respect to the functional diversity of a family is therefore a difficult task, whose complexity is further increased by the low number of proteins with experimentally confirmed functions. In this regard, computer models are needed to explore the relationships between the sequence space and the functional space of protein families. Understanding the relationships between the amino acids responsible for a particular molecular function has many implications in molecular engineering, functional annotation, and evolutionary biology. In this study, an unsupervised deep learning approach is proposed to model and generate protein sequences. The ability of this approach to capture the functional diversity and specificity of proteins is evaluated on different protein families.
Variational autoencoders (VAE) have been applied to biological and chemical data to explore and classify gene expression in single-cell transcriptomics data [17], predict the impact of mutations on protein activity [26,32], or explore the chemical space of small molecules for drug discovery and design [14,25]. Their ability to reduce input data complexity into a latent space and to perform inference on this reduced representation makes them highly suitable to model complex systems in an unsupervised manner. In this study, the Adversarial AutoEncoder (AAE) [19] is proposed as a new and efficient way to represent and navigate the functional space of a protein family. Autoencoders are deep neural networks in which an encoder projects high-dimensional data to a lower-dimensional space (also known as the latent space or hidden code representation), from which a decoder reconstructs the input. The AAE [19] architecture corresponds to a probabilistic autoencoder with an additional constraint forcing the hidden code representation of the encoder to follow a defined prior distribution. This constraint is enforced by a generative adversarial network (GAN) [12] between the latent space and the prior distribution, and ensures that meaningful samples can be generated from anywhere in the latent space defined by the prior distribution. Therefore, applied to the biological sequences of a protein family, it is possible to encode the sequence diversity to any prior distribution and to analyze the ability of the AAE to learn a representation with respect to the functions of the protein family considered. Furthermore, and contrary to dimensionality reduction techniques such as PCA or t-SNE [18], AAE networks can perform inference on the latent space, meaning that any sampled point of this distribution can be decoded as a protein sequence.
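The data flow described above can be sketched with untrained, randomly initialized affine layers standing in for the encoder, decoder, and discriminator (a minimal sketch; all names, dimensions, and the 21-letter alphabet including the gap character are illustrative assumptions, not the architecture used in this study):

```python
import numpy as np

rng = np.random.default_rng(0)
L, A, Z = 120, 21, 2   # alignment length, alphabet size (20 aa + gap), latent dims

def one_hot(seq, alphabet="ACDEFGHIKLMNPQRSTVWY-"):
    """Encode an aligned sequence as a flat one-hot vector of length L * A."""
    idx = {c: i for i, c in enumerate(alphabet)}
    x = np.zeros((len(seq), len(alphabet)))
    x[np.arange(len(seq)), [idx[c] for c in seq]] = 1.0
    return x.ravel()

# Untrained affine maps standing in for the three AAE components.
W_enc = rng.normal(0, 0.01, (L * A, Z))   # encoder: sequence -> latent code
W_dec = rng.normal(0, 0.01, (Z, L * A))   # decoder: latent code -> residue scores
W_dis = rng.normal(0, 0.01, (Z, 1))       # discriminator: latent code -> real/fake score

def encode(x):
    return x @ W_enc

def decode(z):
    scores = (z @ W_dec).reshape(L, A)
    return scores.argmax(axis=1)           # most likely residue index per column

seq = "".join(rng.choice(list("ACDEFGHIKLMNPQRSTVWY-"), L))
z = encode(one_hot(seq))

# Adversarial constraint: the discriminator is trained to tell apart
# encoded sequences (z) from samples of the prior, here a standard Gaussian.
z_prior = rng.standard_normal(Z)
d_fake, d_real = (z @ W_dis).item(), (z_prior @ W_dis).item()

print(z.shape, decode(z).shape)   # (2,) (120,)
```

In the trained model, the GAN objective pushes the distribution of `z` toward the prior, which is what makes any sampled point of the prior decodable into a plausible sequence.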
Sampling of the latent space was therefore analyzed, with a particular focus on latent space arithmetic between proteins of different sub-families with different functions to validate the ability of the model to learn a meaningful representation of the sequence space. Arithmetic operations on latent space have previously been reported to transfer features between images of different classes and may therefore have interesting potential for molecular design. Interpolations between points (encoded protein sequences) of the latent space were also performed to evaluate the ability of the AAE to navigate and generate new protein sequences from unseen data points.
To test these hypotheses, three different protein families were selected, including one, the sulfatase family, for which the functions of some sub-families have been manually curated [2]. The sulfatases are a group of proteins acting on sulfated biomolecules. This group of proteins is found in various protein family databases, such as Pfam (PF00884, Sulfatases). However, as mentioned previously, they can have different substrate specificities despite being in the same family. The SulfAtlas database [2] is a collection of curated structurally-related sulfatases centered on the classification of substrate specificity. The majority of sulfatases (30,726 out of 35,090 in version 1.1, September 2017) is found in family S1, which is sub-divided into 73 sub-families corresponding to different substrate specificities. Sub-families S1-0 to S1-12 possess experimentally characterized EC identifiers. The two other protein families, HUP and TPP, are not manually curated but were selected as they are known to have multiple functions [5].
For each family an AAE network is trained on the protein sequence space and the functional diversity of the family in the AAE latent space is analyzed.

Results
A structurally constrained MSA was computed using Expresso [1] from the T-Coffee web server [9] between the sequences of S1 sulfatase structures. This MSA was converted into a hidden Markov model, and hmmsearch was used to retrieve aligned sequence matches against the UniRef90 sequence database.
A total of 76,427 protein sequence hits were found in UniRef90. The sequences were filtered to remove columns and hits with more than 90% gaps. The final alignment comprised 41,901 sequences. The sulfatase protein dataset was separated into training, validation, and test sets with a split ratio of 0.8/0.1/0.1.
A comparison of architectures for protein sequence family modeling is outside the scope of this study, whose focus is on the ability of the AAE architecture to model and infer protein sequences. Three different AAE architectures were trained (see Method section), but without extensive hyper-parameter optimization. The three architectures were trained on the training set and evaluated on the validation set. The test set was only used on the final selected architecture. Models were evaluated by computing the top-k accuracy, in which a reconstructed position is counted as correct if the true amino acid is among the k highest-scoring amino acids. Table 1 shows the top-k accuracy for k=1 and k=3 computed for the different autoencoders. The accuracy scores scaled with the number of parameters, but without any large differences. To avoid over-fitting, the architecture with the fewest parameters (architecture 3) was therefore selected. The final accuracy scores on the test set were similar to the values observed during model training: 62.5% and 80.2% (k=1 and k=3).
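As a concrete illustration, the top-k accuracy over reconstructed positions can be computed as follows (a minimal sketch on toy scores; the array shapes and names are assumptions, not the evaluation code of this study):

```python
import numpy as np

def top_k_accuracy(scores, targets, k):
    """scores: (n_positions, n_amino_acids) decoder outputs;
    targets: (n_positions,) true residue indices.
    A position counts as correct if the true residue is among
    the k highest-scoring residues at that position."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # k best indices per position
    return float(np.mean([t in row for t, row in zip(targets, topk)]))

# Toy check: 3 positions over a 4-letter alphabet.
scores = np.array([[0.1, 0.5, 0.2, 0.2],
                   [0.4, 0.1, 0.3, 0.2],
                   [0.1, 0.2, 0.3, 0.4]])
targets = np.array([1, 2, 0])
print(top_k_accuracy(scores, targets, 1))   # 1 of 3 positions correct
print(top_k_accuracy(scores, targets, 3))   # 2 of 3 (last true residue scores lowest)
```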
The selected architecture was separately trained on the protein sequences of the HUP and TPP families.

Latent space projection

An AAE can be used as a dimensionality reduction technique by fixing the dimension of the latent space to two or three for plotting and data exploration. Therefore, the final MSA of the sulfatase family was used to train an AAE with two latent dimensions. For comparison, the MSA projection using the first two components of a PCA decomposition was also computed.

Figure 1: Sequences of the SulfAtlas MSA projected using the encoding learned by an AAE (number of latent dimensions: 2) and a PCA (two first components). Gray data points correspond to protein sequences not found in the first 12 sub-families.

Figure 1 shows the protein sequences encoded by the AAE in two latent dimensions and the PCA projections. Each dot corresponds to a protein sequence, and the dots are colored according to their sub-family. Gray dots correspond to protein sequences not belonging to any of the 12 curated sulfatase sub-families. In this figure, the AAE displays a clearly superior ability to disentangle the sequence and functional spaces of the S1 family compared to PCA. Interestingly, some well-separated gray spikes (sequences not belonging to any curated sub-family) can be observed in the AAE projection. These spikes may correspond to groups of enzymes sharing common substrate specificities. In some cases, sub-families with identical functions are projected close together in the encoded space. For instance, sub-families S1-6 (light magenta) and S1-11 (yellow) both have the activity EC 3.1.6.14 (N-acetylglucosamine-6-sulfatase) and are close in the encoded space.
Moreover, some sub-family projections appear entangled, such as the S1-1 sub-family (light blue, cerebroside sulfatase activity, EC 3.1.6.8), the S1-2 (orange) and S1-3 (green) sub-families (sterylsulfatase activity, EC 3.1.6.2), the S1-5 (pink) sub-family (N-acetylgalactosamine-6-sulfatase activity, EC 3.1.6.4), and the S1-10 (gray) sub-family (glucosinolate sulfatase activity, EC 3.1.6.-). These five sub-families correspond to four different functions but are made of eukaryotic protein sequences only, and their entanglement may be due to their shared evolutionary history. This separation based on sequence kingdoms can clearly be visualized in the PCA projections, with eukaryotic sequences on the right side and sub-families with a majority of bacterial sequences on the left side. The example of the protein B6QLZ0_PENMQ is also interesting. The protein is classified in the SulfAtlas database as part of both the S1-4 and S1-11 sub-families but is projected (yellow dot at coordinates (0.733, -1.289)) inside the space of the S1-4 sub-family (red). Similar bi-classification can also be found for proteins shared between the S1-4 and S1-8 sub-families: F2AQN8 (coordinates (0.795, -0.982)), M2AXU0 (coordinates (0.858, -0.994)), and Q7UVX7 (coordinates (0.883, -1.052)).
Projections of sequence spaces using AAEs with 2 latent dimensions were also tested on the HUP and TPP families. The AAE projections can be visualized in Figure 2. There are fewer functional annotations for these two families, but a strong separation between the major functions of each family can clearly be seen. HUP points colored in yellow correspond to proteins with EC 6.1.1.1 and 6.1.1.2, pink-colored points to proteins with EC 6.1.1.1, and violet-colored points to proteins with EC 2.7.11.24 and 6.1.1.2. TPP sequences have more annotated functions than HUP sequences (57 different EC assignments), but a global pattern can be found in the projection, corresponding to two groups of proteins (brown and violet) annotated with EC 2.

Latent spaces were evaluated for each protein family based on enzyme classification (EC) and taxonomic homogeneity. Given a set of protein sequences, the encoded sequences in a latent space of 100 dimensions were clustered using HDBSCAN, and the clusters were evaluated according to the enzyme class and taxonomic group with the highest homogeneity inside each cluster. For the sulfatase family, 27 clusters were found for which taxonomic and EC annotations could be extracted (Supplementary Table 3). All these clusters displayed either strong taxonomic or strong EC homogeneity. Enzymatic homogeneity was higher than taxonomic homogeneity for 16 clusters, equal in one cluster, and lower for 10 clusters. In the HUP family, all clusters have very high EC homogeneity (Supplementary Table 4). Only two clusters out of 47 could be found with higher taxonomic homogeneity than EC homogeneity, and for these two clusters the enzymatic homogeneity values were high and only marginally different (cluster 5, taxonomic homogeneity of 100% and EC homogeneity of 99%; cluster 31, taxonomic homogeneity of 99% and EC homogeneity of 97%). Five clusters were found with equal taxonomic and EC homogeneity.
Similarly, in the TPP family, all clusters also have very high EC homogeneity (Supplementary Table 5). Five clusters out of 51 could be found with higher taxonomic homogeneity than EC homogeneity. For these 5 clusters, the differences between taxonomic and EC homogeneity are higher than the differences observed for the HUP clusters. Six clusters were found with equal taxonomic and EC homogeneity. These results highlight the ability of the encoding space to capture amino acid functional properties rather than signal due to taxonomic relationships.
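The homogeneity criterion used above can be sketched as the fraction of cluster members carrying the majority label. Given cluster assignments (e.g. from HDBSCAN) and EC or taxonomic labels, a minimal version is (function, variable names, and toy data are illustrative assumptions):

```python
from collections import Counter

def cluster_homogeneity(cluster_ids, labels):
    """For each cluster, the fraction of annotated members carrying the
    most frequent label (EC number or taxonomic group) in that cluster."""
    per_cluster = {}
    for cid in set(cluster_ids):
        members = [lab for c, lab in zip(cluster_ids, labels)
                   if c == cid and lab is not None]
        if members:
            per_cluster[cid] = Counter(members).most_common(1)[0][1] / len(members)
    return per_cluster

# Toy example: cluster 0 is pure, cluster 1 is 2/3 homogeneous.
clusters = [0, 0, 0, 1, 1, 1]
ecs      = ["3.1.6.8", "3.1.6.8", "3.1.6.8", "3.1.6.2", "3.1.6.2", "3.1.6.4"]
print(cluster_homogeneity(clusters, ecs))   # cluster 0 -> 1.0, cluster 1 -> 2/3
```

Computing this once with EC labels and once with taxonomic labels, then comparing the two values per cluster, reproduces the comparison reported in the supplementary tables.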

Protein latent space arithmetic
It has been shown that latent space arithmetic is able to transfer learned features between different classes. This ability is evaluated here in the case of protein sequences, as it may allow the exploration of protein families by transferring features of a first sub-family to a second sub-family while maintaining general properties (such as structure). To test this hypothesis, sulfatase sub-families with at least 100 labeled members but fewer than 1000 members (to avoid pronounced imbalance between classes) were selected: S1-0 (308 proteins), S1-2 (462 proteins), S1-3 (186 proteins), S1-7 (741 proteins), S1-8 (290 proteins), and S1-11 (669 proteins). The only sub-family with more than 1000 members is the S1-4 sub-family (2064 proteins). Different arithmetic strategies (see Methods and Figure 4) were tested between latent spaces of a query sub-family and a different source sub-family, with the aim of transferring features of the source sub-family to the query sub-family. In the following sections, the terminology Seq. S1-XmY corresponds to a sequence generated using the latent space of the sub-family S1-Y added to the latent space of the sub-family S1-X. The X sub-family is referred to as the query sub-family and the Y sub-family as the source sub-family. The source contribution is obtained using the mean value of the latent space over the sequences of the source sub-family.

Figure 5 displays logo plots of two regions corresponding to Prosite motifs PS00523 and PS00149, to illustrate the amino acid content of the protein sequences generated by latent space arithmetic (see Supplementary Information for the full logo plots). These regions correspond to the most conserved regions of the sulfatase family and have been proposed as signature patterns for all the sulfatases in the Prosite database. Panels A and D correspond to the biological sequences of the S1-0 sub-family and of the S1-2 sub-family, respectively.
Panels B and C correspond to generated protein sequences using either sub-family S1-0 as the query and S1-2 as the source (Panel B), or S1-2 as the query and S1-0 as the source (Panel C), in both cases with the background latent space subtracted (strategy 2 in Figure 4).
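A background-subtracted transfer in the spirit of strategy 2 can be sketched as follows, assuming encoded sequences are available as arrays of latent codes (the shapes, Gaussian stand-ins, and function name are illustrative assumptions; the decoding step back to a sequence is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = 100                                         # latent dimensionality used for arithmetic

# Stand-ins for encoded sequences (rows = proteins, columns = latent dims).
Z_query      = rng.normal(0.0, 1.0, (308, Z))   # e.g. S1-0 sub-family encodings
Z_source     = rng.normal(2.0, 1.0, (462, Z))   # e.g. S1-2 sub-family encodings
Z_background = rng.normal(1.0, 1.0, (5000, Z))  # encodings of the whole family

def transfer(z_query, Z_source, Z_background=None):
    """Add the mean source latent code to one query code; optionally
    subtract the family-wide background mean (strategy-2-like)."""
    z = z_query + Z_source.mean(axis=0)
    if Z_background is not None:
        z = z - Z_background.mean(axis=0)
    return z

# One 'S1-0m2'-style latent code, ready to pass through the decoder.
z_new = transfer(Z_query[0], Z_source, Z_background)
print(z_new.shape)   # (100,)
```

The other strategies differ only in which mean is subtracted (none, global background, or a local background estimated from latent space neighbors).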
Different amino acid patterns can be observed between the motifs of the different sequence groups. In the first fragment, corresponding to motif PS00523, G55 and T55 are the most frequent amino acids of sub-families S1-0 and S1-2 (Panels A and D), and a "competition" between these two amino acids can be observed in the generated sequences (Panels B and C), with a slightly higher probability for the amino acid of the family used as the query (G in Panel B and T in Panel C). The residue S57, involved in the active site of the sulfatases [3], is less frequent in the query sub-family S1-0 (Panel A) than in the sub-family S1-2 (Panel D). The high frequency of S at position 57 in the sub-family S1-2 compared to the sub-family S1-0 may have an impact when performing the latent space arithmetic, as S is predominant in the generated sequences. This influence of a very frequent amino acid in either the source or the query sub-family on the generated sequences can also be observed at position 70, and is less visible when the amino acid frequencies are more balanced. In the second fragment, corresponding to motif PS00149, residue R at position 101 follows this pattern. It is highly frequent in sub-family S1-0 (Panel A) and less frequent in sub-family S1-2 (Panel D). The generated protein sequences display R at position 101 with high frequency. The inverse can be observed for Y at position 105, highly frequent in sub-family S1-2 (Panel D).
Other positions, however, display much more complex patterns that cannot be summarized as a frequency competition between source and query sub-families. For instance, G at position 71 is very frequent in the sub-family S1-2 but has a frequency comparable to R in sub-family S1-0. The generated protein sequences do not display G as the only possible residue but seem to follow the frequencies of their respective query sub-families. Positions with multiple amino acids of comparable frequencies in the source and query sub-families also have mixed frequencies in the generated sequences, such as positions 53, 59, 67, 98, and 106.

Figure 4: Modeling pipeline used to generate sequences sharing properties of two sub-families. The hmmsearch MSA is filtered and passed to the encoder to project each sequence to the latent space. Latent space projections can be used for visualization (see Figure 3). Different strategies (1 to 4) are tested to generate new points in the latent space and new sequences through the decoder. The new sequences are used in combination with structures of the sub-families to create homology-based structural models, evaluated using the DOPE energy function of MODELLER.
These behaviors can be observed several times throughout the logo plots but remain position-specific, meaning that the bit score patterns observed in the source sub-families (Panels A and D) do not necessarily allow prediction of the amino acid bit scores in the generated sub-families (Panels B and C). For instance, W at position 113 has a high bit score in the MSA of the sub-family S1-2 but little influence on the amino acids of the generated sequences, where T of the sub-family S1-0 is found predominantly. Moreover, these observations are made on residues of the Prosite motifs, which are by definition conserved in sulfatases.

Figure 5: Logo plots of parts of the MSAs from the S1-0 and S1-2 sub-families (Panels A and D), and of generated sequences using S1-0 as query and S1-2 as source (Panel B) and S1-2 as query and S1-0 as source (Panel C).
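The bit scores read off such logo plots follow the usual information-content computation: each column's height is log2(20) minus its Shannon entropy, and each letter's height is its frequency times the column height. A minimal sketch (the column strings are toy data, and the small-sample correction used by real logo tools is omitted):

```python
import math
from collections import Counter

def column_bits(column, alphabet_size=20):
    """Information content (bits) of one MSA column and the per-residue
    letter heights used in a logo plot (gaps are ignored)."""
    counts = Counter(c for c in column if c != "-")
    n = sum(counts.values())
    freqs = {aa: k / n for aa, k in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    info = math.log2(alphabet_size) - entropy          # total column height
    heights = {aa: p * info for aa, p in freqs.items()}
    return info, heights

# Fully conserved column: maximal information, one full-height letter.
info, heights = column_bits("SSSSSSSS")
print(round(info, 3))    # log2(20) ~ 4.322 bits

# Evenly split column (e.g. the G/T competition at position 55): lower information.
info2, _ = column_bits("GGGGTTTT")
print(round(info2, 3))   # ~ 3.322 bits
```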
Protein sequence similarities were computed to evaluate the diversity of the generated sequences and to compare it to the diversity of the original sub-families. Protein sequence similarities were computed between:
• sequences of a group of generated sequences,
• sequences of a sulfatase sub-family used to generate protein sequences,
• generated sequences and their query sulfatase sub-family,
• generated sequences and their source sulfatase sub-family,
• query and source sequences of sulfatase sub-families.
Figure 6 shows the mean and variance of the distributions of protein sequence similarities between these different groups for sequences generated using the first strategy. The distributions for sequences generated using the second, third, and fourth strategies are available in the Supplementary Information. The first, second, and third strategies display a similar pattern. Protein sequence similarities between different sub-families (red upper triangles) have lower similarity scores and lower variance than the other distributions. Protein sequence similarities between sequences of the same sub-family (blue dots) have the highest mean and variance values observed. However, only 6 sub-families were kept for analysis (sub-families 0, 2, 3, 7, 8, and 11), and trends must therefore be interpreted with caution. Generated protein sequences compared to themselves (magenta lower triangles) have higher mean and variance similarities than when compared to their query or source sub-families (orange squares, green x). The latter two have mean and variance values spread between the blue and red distributions. These distributions indicate that protein sequences generated by latent space arithmetic have an intrinsic diversity similar to that of the biological sub-families. Moreover, the generated sequences are less similar to the sequences of their query and source sub-families than to themselves.
The generated sequences are also globally as similar to the sequences of their query sub-family as to the sequences of their source sub-family. The generation process is therefore able to capture features of the selected query and source sub-families and to generate a protein sequence diversity similar to that of the original sub-families.
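The group-wise statistics plotted in Figure 6 can be sketched as pairwise percent identity over aligned sequences, summarized by mean and variance within or between groups (a toy illustration; the identity measure and sequences are assumptions, not the exact similarity score used in the study):

```python
from itertools import combinations
from statistics import mean, variance

def identity(a, b):
    """Percent identity between two sequences from the same MSA
    (columns gapped in both sequences are ignored)."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def group_stats(seqs_a, seqs_b=None):
    """Mean/variance of similarities within one group (seqs_b is None)
    or between two groups of aligned sequences."""
    if seqs_b is None:
        sims = [identity(a, b) for a, b in combinations(seqs_a, 2)]
    else:
        sims = [identity(a, b) for a in seqs_a for b in seqs_b]
    return mean(sims), variance(sims)

generated = ["ACD-GH", "ACD-GY", "ACE-GH"]   # toy generated group
query_fam = ["ACDFGH", "GCDFTH"]             # toy query sub-family
print(group_stats(generated))                # within-group mean/variance
print(group_stats(generated, query_fam))     # generated vs. query sub-family
```

Each marker in Figure 6 corresponds to one such (variance, mean) pair for a given group or group-vs-group comparison.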
Finally, protein structural modeling was performed to assess and compare the properties of the sequences generated by latent space arithmetic with those of the protein sequences of the natural sub-families. For each sub-family, 100 original sequences were randomly selected alongside the corresponding generated sequences. All the generated sequences were aligned to protein structures of their corresponding source and query sub-families, and the alignments were used to create structural models by homology. The models were evaluated with the DOPE function of MODELLER. Figure 7 shows an example of the energy distributions computed from models using the second strategy with query sub-family S1-0 and source sub-family S1-2.
The lowest energies (best models) are found when modeling the original protein sequences of a sub-family on the structures of the same sub-family (Struct. 0 Seq. 0 and Struct. 2 Seq. 2). Inversely, the highest energies are found when modeling the original protein sequences of a sub-family on the structures of another sub-family (Struct. 0 Seq. 2 and Struct. 2 Seq. 0). Interestingly, sequences generated using addition and subtraction of latent spaces have intermediate energy distributions. This can be clearly observed in Figure 8, where generated sequences mostly fall between the two dotted lines. Dots on the right side of the vertical line at 0 correspond to modeled structures using sequences from the latent space with lower energy than the modeled structures using sequences from their original sub-family. Dots on the left side of the vertical line at 0 correspond to modeled structures using sequences from the latent space with higher energy than the modeled structures using sequences from their original sub-family. The diagonal line in the top-left corner corresponds to the difference in energy between modeled structures using sequences from their original sub-family and modeled structures using biological sequences from another sub-family. Generated sequences modeled on structures belonging to the same sub-family as their query latent space sub-family (ex: Struct. 0 Seq. S1-0m2 and Struct. 2 Seq. S1-2m0 in Figure 7, and M QS/Q in Figure 8) have slightly lower energy than when modeled on structures corresponding to the sub-family of their source latent space sub-family (ex: Struct. 0 Seq. S1-2m0 and Struct. 2 Seq. S1-0m2 in Figure 7, and M SQ/Q in Figure 8). This trend holds for all query/source pairs of sub-families and all strategies except for sequences generated using the fourth strategy (local background subtraction of the query latent space using a KD-tree and addition of the source latent space); see Supplementary Figures 12, 14, 16 and Methods.
In this strategy, the modeled structures using generated sequences do not display energy distributions in between the energy distributions of the original sequences modeled on structures of the query or the source sub-families (dotted lines). With the fourth strategy, the energy distribution of generated sequences modeled on structures belonging to the sub-family of their query latent space (ex: Struct. 0 Seq. S1-0m2, blue dots M QS/Q) is closer to the energy distribution of the modeled structures using a sequence and a structure template from the same sub-family. The energy distribution of generated sequences modeled on structures corresponding to the sub-family of their source latent space (ex: Struct. 2 Seq. S1-0m2, orange dots M SQ/Q) is closer to the energy distribution of the modeled structures using a sequence and a structure template from different sub-families. This indicates that the fourth strategy is less robust to latent space arithmetic than the other three strategies. No clear differences could be observed between the first, second, and third strategies.

Figure 6: Distributions of protein sequence similarities. Blue dots correspond to similarities computed between sequences of the same protein sub-family. Orange squares, to similarities computed between generated sequences and the sequences of their query sub-family (ex: S1-0m2 generated sequences and S1-0 sub-family sequences). Green x, to similarities computed between generated sequences and the sequences of their source sub-family (ex: S1-0m2 generated sequences and S1-2 sub-family sequences). Red upper triangles, to similarities computed between sequences of two different sub-families (ex: S1-0 sequences and S1-2 sequences). Magenta lower triangles, to similarities computed between sequences of the same generated-sequence group. The variance and the mean of each distribution are displayed on the horizontal and vertical axes.

Figure 7: Energy distributions of models computed using structures from sub-family S1-0 (reds) or sub-family S1-2 (blues) and sequences from biological proteins or inferred using latent space arithmetic between the spaces encoded by the S1-0 and S1-2 sub-families. Each violin plot corresponds to a specific structure and sequence couple. For example, Struct. 0 Seq. 0 indicates that the energy distribution corresponds to sequences of the S1-0 sub-family modeled on structures of the S1-0 sub-family, and Struct. 2 Seq. S1-0m2 corresponds to the energy distribution of sequences inferred using the latent space of the sub-family S1-2 added to the latent space of the sub-family S1-0 and modeled on structures of the S1-2 sub-family.

Figure 8: Differences between the mean energies of the model distributions (see Figure 7). The x axis corresponds to the difference between the mean values computed for query sequences modeled on structures of the same sub-family and the mean values computed for query sequences to which the latent space of the source sub-family has been added, modeled on structures of the query sub-family (M QS/Q), or for source sequences to which the latent space of the query sub-family has been added, modeled on structures of the source sub-family (M SQ/Q) (ex: difference between the means of the Struct. 0 Seq. S1-0m2 and Struct. 0 Seq. 0 distributions in Figure 7). Points in the red area correspond to mean distribution values from generated sequences whose modeled structures have a higher energy than models created using pairs of sequences/structures from different sub-families. Points in the blue area correspond to mean distribution values from generated sequences whose modeled structures have a lower energy than models created using pairs of sequences/structures from the same sub-family.

Protein latent space interpolation
Interpolation between encoded sequences in the latent space can be used to "navigate" between proteins of two sub-families. In computer vision, interpolation has proven its capacity to generate meaningful intermediate representations between different input images. In this study, ten pairs of sequences from sub-families S1-0 and S1-4 (respectively blue and red data points in Figure 1) were selected to test the capacity of the AAE in this task, and 50 intermediate data points were generated between each pair.
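The interpolation itself reduces to evenly spaced convex combinations of two latent codes, each of which is then decoded into a sequence (a minimal sketch; the decoder call is omitted and the example codes are toy values):

```python
import numpy as np

def interpolate(z_start, z_end, n_points=50):
    """Evenly spaced latent codes between two encoded sequences
    (endpoints excluded); each row can be passed to the decoder
    to obtain an intermediate protein sequence."""
    t = np.linspace(0.0, 1.0, n_points + 2)[1:-1, None]   # (n_points, 1)
    return (1.0 - t) * z_start + t * z_end

# Toy 2D latent codes standing in for two encoded S1-0 / S1-4 sequences.
z_a = np.array([-1.0, 0.5])
z_b = np.array([2.0, -1.0])
path = interpolate(z_a, z_b, n_points=50)
print(path.shape)   # (50, 2)
```

Because the adversarial constraint ties the latent distribution to the prior, the points along this path fall in regions the decoder has learned to cover, which is what makes the intermediate sequences plausible.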
Several interesting points can be observed. When gaps are found in the query sequence but not in the target sequence (and inversely), the gapped area is progressively filled (or broken down), starting from the flanking amino acids toward the center (or inversely from the center toward the flanking amino acids). This indicates an organized and progressive accumulation of amino acids (or gaps) to extend (or shrink) the region of the sequence previously without residues. For instance, gap reduction can be observed in the generated sequences between sequence ID 2 of the sulfatase S1-0 sub-family (query) and sequence ID 2196 of the sulfatase S1-4 sub-family (target) at positions 75 to 86 (Figure 9), and can be found in all the other generated intermediates. Amino acids specific to a family are progressively replaced at key positions. For instance, in the interpolation between these query and target sequences, a replacement of residues S and C by G and A can be observed at positions 21 and 22 of the MSA (Figure 9).
Most transitions are not abrupt and do not occur only at the 25th generated intermediate sequence; they are smooth and correspond to plausible sequences. Abrupt transitions can be observed at position 53, G (S1-0 query) to S (S1-4 target), and position 51, [VI] (S1-0 query) to T (S1-4 target), corresponding to very conserved residues in Prosite motif PS00523 (see Figure 5). The other positions of the motif are less affected by abrupt transitions but also appear to have lower variation than other columns. A similar behavior can only be observed for column 111, T (S1-0 query) to W (S1-4 target), of motif PS00149 (positions 102 to 112). The other positions of this motif are either conserved (T105, G109, K110) or tolerate fluctuations (columns 103, 104, 107).
The ability of the AAE to generate interpolated sequences with emerging or disappearing features of two sub-families highlights its capacity to generalize the decoding to latent space points that do not correspond to encoded sequences and thus were not previously observed, including points outside the structured organization of the computed latent space.

Figure 9: First 130 amino acids of sequences generated using interpolated latent space points. The interpolation is performed between the latent codes of protein ID 2 of the sulfatase S1-0 sub-family (query) and of protein ID 2196 of the sulfatase S1-4 sub-family (target). Amino acid color coding is based on physico-chemical properties. Large transitions from gaps to amino acids and from amino acids to gaps can be observed at positions 75 to 86 and positions 116 to 122. Amino acid column transformations can be observed at multiple positions: 21 (S to G), 51 (V/I to T/S), 53 (G to S), etc.

In this study, a new framework is presented to analyze and explore the protein sequence space with regard to functionality and evolution. Some previous attempts, such as the FunFam database [6,7], were built upon CATH protein families and trained in a supervised manner to construct a model specific to a functionality. Variational autoencoders (VAE) have previously been reported and used to disentangle complex biological information, for classification tasks (such as single-cell gene expression data) [23,35], or for the generation of new molecules (such as drugs) [14,24,25,28,29]. AAEs are able to disentangle the information contained in protein sequences to capture function- and evolution-specific features, and more importantly to do so without supervision. They also have the advantage over VAEs of constraining the latent space to a prior distribution, which allows sampling strategies to explore the whole latent distribution. It can also be noted that Restricted Boltzmann Machines [34] have been proposed for a similar task, showing very promising results despite the difficulty of training RBMs, which requires Markov chain sampling.
AAEs were trained on protein sequence families known to have different sub-functions. The results highlight the ability of AAEs to separate sequences according to functional and taxonomic properties for the three studied families. This emphasizes the ability of AAEs to extract and encode, in the latent space, features that are biologically relevant. Furthermore, and contrary to dimensionality reduction techniques, AAEs can be used to generate new protein sequences. Latent space arithmetic has been used in image generation tasks with a striking ability to produce relevant images with intermediate features. Latent space arithmetic is an interesting concept for protein generation, particularly in the context of extracting and transferring features between protein families. Three out of the four experiments carried out were able to generate sequences with intermediate features, as measured from their protein sequence similarity distributions and modeling energy assessments. Biological experiments will be needed to confirm the functional relevance of the transferred features, but the strategy could have many applications should it be validated.
The absence of measured differences between three out of the four strategies used to generate intermediate sequences may also indicate that more optimal approaches could be designed. In this regard, the model architecture could also be improved. Currently, the model uses a filtered MSA as input, but improved architectures could probably benefit from full protein sequences of different sizes without filtering. It is for instance known that motifs in the TPP and HUP families play important roles in the family sub-functions. As protein-specific motifs, they are not necessarily conserved and may not reach filtering thresholds. Recent advances have been made regarding the application of deep learning to sequential data [13,27,38,39], and transferring this progress to the field of protein sequence generation could greatly benefit the design of functional proteins.
The results of this study show that AAEs in particular, and deep learning generative models in general, can provide original solutions for protein design and functional exploration.

The sulfatase family
An initial seed protein multiple sequence alignment was computed from sequences of the protein structures in SulfAtlas [2] database sub-families 1 to 12. This seed was used to search for homologous sequences in the UniRef90 [33] protein sequence database using hmmsearch [10], with reporting and inclusion e-values set at 1e-3. A label was assigned to each retrieved protein if the protein belongs to one of the 12 sub-families. The MSA computed by hmmsearch was filtered to remove columns with more than 90% gap characters and sequences with more than 75% gap characters. Proteins with multiple hits on different parts of their sequences were merged into a single entry. From the 105181 protein sequences initially retrieved by hmmsearch, the filtering steps led to a final set of 41901 proteins.
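The two filtering thresholds described above (columns with more than 90% gaps, then sequences with more than 75% gaps) can be sketched as follows; this is a minimal illustration, not the original pipeline, and the function and variable names are illustrative.

```python
def filter_msa(msa, max_col_gap=0.90, max_seq_gap=0.75, gap="-"):
    """Filter an MSA given as a list of equal-length aligned sequences.

    First drop columns whose gap fraction exceeds max_col_gap, then drop
    sequences whose remaining gap fraction exceeds max_seq_gap.
    """
    n_seq = len(msa)
    n_col = len(msa[0])
    # Keep columns where the gap fraction does not exceed the threshold.
    keep_cols = [
        j for j in range(n_col)
        if sum(s[j] == gap for s in msa) / n_seq <= max_col_gap
    ]
    trimmed = ["".join(s[j] for j in keep_cols) for s in msa]
    # Keep sequences whose remaining gap fraction is within the threshold.
    return [s for s in trimmed if s.count(gap) / len(s) <= max_seq_gap]
```

Note that the order matters: column filtering changes each sequence's gap fraction, so sequences are re-evaluated on the trimmed alignment.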

HUP and TPP protein families
A similar protocol was followed for the HUP and TPP protein families. Instead of using an initial seed alignment made of sequences of protein structures, the CATH protein domain HMMs [21,31] were used to search for homologous sequences in the UniRef90 database. CATH model 3.40.50.620 corresponds to the HUP protein family and model 3.40.50.970 to the TPP protein family. A sequence filtering pipeline identical to the one used for the sulfatase family was applied to each of the resulting multiple sequence alignments. The final number of proteins in each dataset was 25041 for the HUP family (32590 proteins before filtering) and 33693 for the TPP family (133701 before filtering).

Generative Adversarial Network
A complete description of Generative Adversarial Networks can be found in Goodfellow et al. [12]. To summarize, the GAN framework corresponds to a min-max adversarial game between two neural networks: a generator (G) and a discriminator (D). The discriminator computes the probability that an input x corresponds to a real point in the data space rather than coming from a sampling of the generator. Concurrently, the generator maps samples z from a prior p(z) to the data space with the objective of confusing the discriminator. This game between the generator and discriminator can be expressed as:

min_G max_D E_{x∼p_data(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]   (1)

Adversarial auto-encoder
Adversarial autoencoders were introduced by Makhzani et al. [19]. The proposed model is built from an encoder and a decoder network, together with a GAN network that matches the posterior distribution of the encoded vectors with an arbitrary prior distribution. As a result, the decoder of the AAE learns from the full space of the prior distribution. The model used in this study computes the aggregated posterior q(z|x) (the encoding distribution) using a Gaussian prior distribution; the mean and variance of this distribution, µ(x) and σ(x), are predicted by the encoder network. The re-parameterization trick introduced by Kingma and Welling [16] is used for back-propagation through the encoder network.
Network architecture. Three different architectures were evaluated. The general architecture is as follows, and Table 2 provides an overview of the differences between architectures. A representation of architecture number 3 can be found in Supplementary Figure 10. The encoder comprises one or two 1D convolutional layers with 32 filters of size 7 and a stride of 2, and one or two densely connected layers of 256 or 512 units. The output of the last layer is passed to two different densely connected layers of latent-size units to evaluate µ and σ for the re-parameterization trick [16]. The decoder is made of two or three densely connected layers: the last layer has (sequence length × alphabet size) units, and the first one or two layers have 256 or 512 units. The final output of the decoder is reshaped and a softmax activation function is applied, yielding, for every position, a probability for each possible amino acid. To convert the probability matrix of the decoder into a sequence, a random sampling according to the output probabilities is performed at each position. The selected amino acid at a given position is therefore not necessarily the amino acid with the highest probability.
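The per-position sampling step that converts the decoder's probability matrix into a sequence can be sketched as follows (a numpy sketch; the alphabet, shapes, and function name are illustrative, not from the original code).

```python
import numpy as np

def sample_sequence(prob_matrix, alphabet, rng=None):
    """Sample one residue per position from the decoder's softmax output.

    prob_matrix: array of shape (sequence_length, alphabet_size), each row
    summing to 1. Because sampling follows the probabilities rather than
    taking an argmax, the chosen residue is not necessarily the one with
    the highest probability.
    """
    rng = np.random.default_rng() if rng is None else rng
    indices = [rng.choice(len(alphabet), p=row) for row in prob_matrix]
    return "".join(alphabet[i] for i in indices)
```

With a nearly uniform row, repeated calls will yield different residues at that position, which is what allows the decoder to produce sequence diversity from a single latent point.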
The discriminator network is made of two or three densely connected layers; the last layer has a single unit and corresponds to the discriminator classification decision through a sigmoid activation function, while the first one or two layers are made of 256 or 512 units.

Training. The network was trained independently for each protein family. The autoencoder is trained using a categorical cross-entropy loss between the input data and the sequences predicted by the autoencoder. The discriminator is trained using a binary cross-entropy loss between the encoded input data and the samples from the prior distribution.

Dimensionality reduction
The AAE model can be used to reduce the dimensionality of the sequence space by setting a small latent size. Two latent sizes, 2 and 100, were tested. A latent size of 2 can be easily visualized, while a larger latent size of 100 should represent the input data more efficiently, as more information can be stored.

Clustering
HDBSCAN [4,20] was used to cluster the sequences in the latent space owing to its ability to handle clusters of different sizes and densities and its performance in high-dimensional spaces. The Euclidean distance metric was used to compute distances between points of the latent space. As the number of protein sequences is rather large, a minimal cluster size of 60 was required to consider a group as a cluster. The minimal number of samples in a neighborhood needed to consider a point a core point was set to 15, keeping the clusters relatively conservative.

Functional and taxonomic analyses
Enzyme functional annotations (EC ids) and NCBI taxonomic identifiers were extracted, when available, from the Gene Ontology Annotation portal (January 2019) using the UniProt-GOA mapping [15]. Proteins without annotation were not taken into account in these analyses. The annotation homogeneity was computed for each cluster. For a given cluster, the numbers of different EC ids and taxonomic ids were retrieved, and the percentage of each EC id (taxonomic id) in the cluster was computed. An EC id (taxonomic id) with a value of 90% indicates that 90% of the cluster members share this EC id (taxonomic id). A cluster with high homogeneity values corresponds to functionally or evolutionarily related sequences. Homogeneous clusters computed from the AAE-encoded space therefore indicate the ability of the AAE model to capture and distinguish, without supervision, protein sequences with functionally or evolutionarily relevant features.
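The homogeneity measure described above reduces to per-cluster label percentages; a minimal sketch (function name illustrative) could look like:

```python
from collections import Counter

def cluster_homogeneity(labels):
    """Percentage of each annotation (EC id or taxonomic id) among the
    members of one cluster.

    labels: annotations of the cluster's members; unannotated proteins are
    assumed to have been removed beforehand, as in the text.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}
```

For example, a cluster in which 9 of 10 members carry EC 3.1.6.8 would report {"3.1.6.8": 90.0, ...}, the 90% case described in the text.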

Latent space arithmetic
Subtraction and addition of latent vectors have been shown to be able to transfer features specific to one sub-group of data to another sub-group. This property is tested here in the context of protein sub-families. Different arithmetic strategies (Figure 4) were tested between the latent codes of a query sub-family and of a different source sub-family, with the aim of transferring features of the source sub-family to the query sub-family. The first strategy consists of adding the mean latent vector, computed by applying the encoder to the sequences of the source sub-family, to the encoded sequences of the query sub-family. The second strategy differs from the first by additionally subtracting the mean background latent vector, computed from the latent codes of all sub-families, from the latent codes of the query sub-family. The third strategy differs from the second in that the background is computed using all sub-families except the source and query sub-families. Finally, in the fourth strategy, the subtraction is performed using a local KD-tree to remove only features shared by the closest members of a given query, and the addition is performed by randomly selecting a member of the source family and its 10 closest members. For each strategy, new sequences were generated from the latent codes of all query proteins in the sub-families. Thus, each original encoded protein sequence has a direct correspondence with the original amino acid sequence and with the amino acid sequences generated with the different strategies and different source sub-families. The sequences generated by latent space arithmetic are compared to the initial sub-families in terms of sequence and structural constraints.
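The first three strategies are plain vector arithmetic on latent codes and can be sketched as follows (a numpy sketch under the assumption that each sub-family's latent codes are stacked row-wise; the fourth, KD-tree-based strategy is omitted here since it depends on neighbor queries rather than simple means).

```python
import numpy as np

def strategy_1(z_query, z_source):
    # Add the mean source sub-family latent vector to each query latent code.
    return z_query + z_source.mean(axis=0)

def strategy_2(z_query, z_source, z_all):
    # As strategy 1, but first subtract the mean background latent vector
    # computed over the latent codes of all sub-families.
    return z_query - z_all.mean(axis=0) + z_source.mean(axis=0)

def strategy_3(z_query, z_source, z_others):
    # As strategy 2, but the background mean excludes the source and query
    # sub-families (z_others: latent codes of all remaining sub-families).
    return z_query - z_others.mean(axis=0) + z_source.mean(axis=0)
```

Each function returns one shifted latent code per query protein; decoding these codes yields the candidate intermediate sequences evaluated below.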
To evaluate the sequences generated by latent space arithmetic, they are compared to themselves and to the biological sequences of the two initial sub-families using protein sequence similarity computed with a Blosum62 substitution matrix. The protein sequence similarities inside a sub-family and between sub-families are also computed. The distributions of sequence similarities make it possible to explore the ability of the latent space arithmetic operations and of the decoder to produce meaningful intermediate protein sequences from unexplored encoded data points.
Protein structural models are computed using the structures of the initial sub-families as templates for MODELLER [36] and evaluated using the DOPE score [30]. Models are computed from the sequences generated by latent space arithmetic on template structures of their source and query sub-families. The DOPE energies of the modeled structures are compared to structural models computed as references. The first references are structural models computed using the sequences and template structures of the same sub-family, which should provide the best DOPE energies (best case scenario). The second references are structural models computed using the sequences and template structures of different sub-families (e.g. sequences from the source sub-family and template structures from the query sub-family, or inversely, sequences from the query sub-family and template structures from the source sub-family), which should provide the worst DOPE energies (worst case scenario). If the sequences generated by latent space arithmetic correspond to intermediate proteins with properties from the two sub-families, they should have intermediate DOPE energies when compared to the best and worst case scenarios.

Latent space interpolation
Ten pairs of protein sequences were randomly chosen between sub-families S1-0 and S1-4. These two sub-families were chosen based on their opposite positions in the projection performed with the AAE using two dimensions (see Figure 1). The coordinates of the selected sequences in the 100-dimensional encoded space were retrieved, and spherical interpolation with 50 steps was performed for each pair. Spherical interpolation has previously been reported to provide better interpolations for image generation [37]. A linear interpolation was also tested, but no clear differences could be observed. The interpolated points were given to the decoder and new sequences were generated according to the procedure previously described.
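The spherical interpolation (slerp) used between pairs of encoded sequences can be sketched as follows; this is the standard formulation, with a linear fallback when the two vectors are nearly collinear, and the variable names are illustrative.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors for t in [0, 1]."""
    # Angle between the two latent vectors, clipped for numerical safety.
    omega = np.arccos(np.clip(
        np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1)),
        -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly collinear vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0
            + np.sin(t * omega) * z1) / np.sin(omega)

# A 50-step path between two encoded proteins (names illustrative):
# path = [slerp(z_query, z_target, t) for t in np.linspace(0.0, 1.0, 50)]
```

Each point of the path is then passed to the decoder to produce one intermediate sequence, giving the 50 interpolated sequences per pair described above.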

Acknowledgements
TBF thanks the NVIDIA society for providing a TitanXp GPU to perform computations.

Figure 10: The adversarial autoencoder architecture number 3 presented in Table 2. The discriminator (in red) takes as input data from a prior distribution or the latent space computed by the encoder/generator. Using a sigmoid activation function, the discriminator is trained to distinguish between the two types of data. By updating the weights of the encoder/generator based on the discriminator performance, the encoder/generator learns to approximate the prior distribution and fool the discriminator. The autoencoder architecture (in blue) corresponds to a variational autoencoder. The latent space is decoded by a decoder and new sequences are generated using a softmax activation function.

Strategy 1
The first strategy consists of adding the mean latent vector, computed by applying the encoder to the sequences of the source sub-family, to the encoded sequences of the query sub-family.

Strategy 2
The second strategy differs from the first by additionally subtracting the mean background latent vector, computed from the latent codes of all sub-families, from the latent codes of the query sub-family.

Strategy 3
The third strategy differs from the second in that the background is computed using all sub-families except the source and query sub-families.

Strategy 4
In the fourth strategy, the subtraction is performed using a local KD-tree to remove only features shared by the closest members of a given query, and the addition is performed by randomly selecting a member of the source family and its 10 closest members.