In silico identification of opossum cytokine genes suggests the complexity of the marsupial immune system rivals that of eutherian mammals

Background Cytokines are small proteins that regulate immunity in vertebrate species. Marsupial and eutherian mammals last shared a common ancestor more than 180 million years ago, so it is not surprising that attempts to isolate many key marsupial cytokines using traditional laboratory techniques have been unsuccessful. This paucity of molecular data has led some authors to suggest that the marsupial immune system is 'primitive' and not on par with the sophisticated immune system of eutherian (placental) mammals. Results The sequencing of the first marsupial genome has allowed us to identify highly divergent immune genes. We used gene prediction methods that incorporate the identification of gene location using BLAST, SYNTENY + BLAST and HMMER to identify 23 key marsupial immune genes, including IFN-γ, IL-2, IL-4, IL-6, IL-12 and IL-13, in the genome of the grey short-tailed opossum (Monodelphis domestica). Many of these genes were not predicted in the publicly available automated annotations. Conclusion The power of this approach was demonstrated by the identification of orthologous cytokines between marsupials and eutherians that share only 30% identity at the amino acid level. Furthermore, the presence of key immunological genes suggests that marsupials do indeed possess a sophisticated immune system, whose function may parallel that of eutherian mammals.


Background
The marsupial and eutherian (placental) lineages diverged approximately 180 million years ago. Marsupials are chiefly distinguished from other mammals by their unique reproductive strategies, with young born in an immature state with only the most rudimentary neurolog-ical and immunological systems [1]. At birth, the animal manoeuvres its way to a waiting teat, where it attaches until it reaches a state of maturity that allows it to function independently. Marsupials possess lymphoid tissue and cellular components that are structurally similar to those of other mammals. Key antigen receptor and recognition molecules including Major Histocompatibility (MHC) Class I, II and III [2], T Cell Receptors alpha, beta, gamma and delta [3,4], Toll-like receptors [5] and immunoglobulins [6] have been characterized.
Identification of divergent marsupial immune genes is important for two reasons. Firstly, unsuccessful attempts to isolate T cell derived cytokines in the laboratory has led some authors to suggest that the marsupial immune system is 'primitive' and does not possess the level of complexity demonstrated by eutherians such as humans and mice. The fact that some T cell driven responses are also aberrant adds to this argument. Marsupials appear to have delayed skin graft rejection [18] and antibody class switching [19], together with an apparent lack of an in vitro Mixed Lymphocyte Response [20]. Elucidation of genes involved in specific immunity will help us to determine whether the apparently 'simple' immune responses generated by marsupials are genetically hardwired.
The second reason for identifying divergent immune genes in the marsupial genome is to develop marsupial specific immunological reagents. To date, most assay systems employed to characterise cells and their function rely on eutherian reagents or culture techniques developed in eutherian species. Where low levels of cross reactivity exist between marsupials and these model species, the usefulness of the data generated from such assays is limited. Identification of key cell markers, such as CD4 and CD8 will allow us to generate marsupial-specific reagents, which would then be used to gain a better understanding of the marsupial immune response.
Difficulties associated with identifying rapidly evolving cytokines are not limited to marsupials. The chicken IL-2 gene took seven years of focused effort to identify [21], and was eventually found using expression strategies and not heterologous cloning techniques. The recent sequencing of the complete genomes of a large number of noneutherian vertebrates will expedite the isolation and characterization of these immune genes in distantly related species. However, currently automated annotation techniques are not sensitive enough to identify many of these molecules outside the eutherian lineage.
The first marsupial genome was recently sequenced by the Broad Institute. The subject of this project, Monodelphis domestica, is a South American opossum. It is a well-recognised biomedical model in the study of comparative physiology, immunogenetics, cancer development and disease susceptibility. Two publicly available annotations of this genome have been generated. Ensembl have produced a gene build with their automatic pipeline [22], which relies principally on GeneWise [23], while the UCSC genome browser provides several annotation tracks with similarity features and gene models, for example chained TBLASTN alignments of human proteins, BLAT alignments of Ref-Seq mRNAs, and Genscan [24] and N-SCAN [25] predictions. With the exception of the Genscan predictions, which are ab initio gene predictions based on genomic sequence only, the gene builds rely on cross species homology, as no large-scale opossum EST projects have been completed yet and there are only 425 known opossum protein sequences in GenBank. In most cases, Ensembl and the UCSC genome browser were unable to identify highly divergent cytokine genes such as IL-2, 4 and 13.
To overcome this shortcoming in the automated annotation of the opossum genome and to start to address uncertainties about immune function in marsupials, we have adopted a manual, expert-curated approach to annotating highly divergent genes. The critical first stage of this is the careful identification of the genomic region containing the gene. This is performed using a sensitive TBLASTN search. HMMER [26] can also be useful at this stage. Frequently, it is necessary to first narrow the search to the syntenic region by identifying conserved flanking genes.
Having identified similarity features, gene prediction is performed on genomic sequence extracted from the region. The accuracy of gene prediction is dependent on the prediction method. As with the automated annotations, we favour gene predictors that incorporate information from orthologous sequences into the prediction process. In addition to GeneWise and N-Scan, there are now several such methods available including Procrustes [27], HMMgene, [28] GenomeScan [29], and Augustus+ [30]. Procrustes and the default GeneWise algorithm perform spliced alignment. Augustus+ uses an interesting approach, which constrains predicted genes to incorporate user-supplied hints. However, it is not particularly convenient for manual use or use by biologists lacking scripting skills. While not the only possible choice, we have found GenomeScan to be both convenient and reasonably accurate (based on comparison with known eutherian sequences). It is worth noting that there is another class of gene prediction methods that obtain homology information from syntenic regions of other genomes. These include TwinScan [31], which is asymmetric and predicts genes in one genome only and SLAM [32], which simultaneously aligns two genomes and predicts genes in both. These methods were unlikely to be useful in our study as we were looking for genes that are highly divergent and difficult to align at the genomic level. Finally, a comparison of predicted results with known eutherian sequences and curation of the result was undertaken if required. Our success with this strategy suggests that this method will be applicable to the identification of rapidly evolving gene families in other distant vertebrate species.

Overview
In silico searching revealed a total of 23 cytokine sequences, all of which are described in the opossum for the first time and 5 of which are novel for any marsupial species (see Table 1). A number of critical cytokine recep-tors are also identified, as are the sequences for the hallmark T cell cluster of differentiation markers, CD4 and CD8.
The majority of genes reported in this study were identified using sensitive peptide BLAST searches ( Table 2). The most divergent genes, interleukins 2, 4 and 13, were identified using synteny searches. Properties of the putative proteins identified in this study, predicted structures and comparison with human sequences are summarised in Tables 1 and 2. Sequence data of the predicted proteins are available online [33].

Isolation of interleukins using BLAST and synteny searches
Interleukins 2, 4 and 21 and their common gamma chain receptor were identified using both BLAST and syntenic strategies. IL-21 was identified by a sensitive TBLASTN search (e-value = 2e-18) on Chromosome 5:7034081-7057815. The predicted protein is of similar size and contains the same number of exons as human IL-21 [see Additional File 1]. The signal peptide was predicted to be encoded within the first 21 amino acids (score = 7.6, p = 0.06), with N-linked glycosylation sites predicted at positions 46 and 106 and O-linked glycosylation of threonine predicted at position 55. Instability motifs (ATTTA) were Opossum IL-2 was found by searching genomic sequence flanking IL-21, which is adjacent to, and has significant structural homology with IL-2 in humans [see Additional File 2]. This strategy was adopted since the alignment of human IL-2 against the opossum genome using TBLASTN resulted in no hits. A 395 kb region adjacent to IL-21 was extracted and 15 genes within this region were predicted with GENSCAN. The predicted gene most similar to IL-2 was identified using BLASTP. The sequence was extracted and GenomeScan was used with an IL-2 orthologue to obtain a more accurate prediction. Opossum IL-2 was located on Chromosome 5:7191593-7196834 (Fig 1) and contains several conserved residues essential for biological activities, including two cysteine residues that provide structural stability [34] and the amino acids leucine and aspartic acid within helix A, which are crucial for binding of the ligand to IL-2Rβ in humans [35]. Also well conserved is a glutamine residue in the D helix, which is directly involved in the binding of the IL-2Rγ chain [36]. Similar to the human sequence, the putative peptide is 142 amino acids in length and contains 4 exons. A signal peptide that contains a potential O-linked glycosylation site (position 13 -Thr) is predicted from positions 1-22 (score = 9.9, p = 0.03). A potential N-linked glycosylation site, not found in humans or mice, but present in several eutherians including the cat and dog, is found at position 101. Four mRNA instability motifs (ATTTA) are present upstream of the poly(A)+ signal.
Opossum IL-2Rγ was identified using TBLASTN (e-value = 8e-119) (  (Fig 2). It has low levels of identity to human IL-4 (30.8%). Two putative N-linked glycosylation sites were identified. SPScan was unable to predict a putative signal sequence although two instability motifs (ATTTA) were recognised in the 3' UTR region. Despite the variation in sequence between the predicted opossum and Alignment of IL-2 amino acid sequences Figure 1 Alignment of IL-2 amino acid sequences. Diamonds denote functionally important residues [64]. Inverted triangles indicate cysteine residues involved in disulfide bonds in human protein [64]. human IL-4 protein sequences, disulfide bonds that join helix B to the CD loop and that are important for biological activity are conserved [37].
IL-4 and IL-13 were identified simultaneously using a syntenic approach since they sit adjacent in the human genome [see Additional File 5]. Opossum IL-13 (Chromosome 1:307682382-307686155) is found 74.30 kb upstream from opossum IL-4 and does not contain any glycosylation sites. Alignment with mammal and chicken protein sequences (Fig 3) revealed a truncation of 32 amino acids from the 5' end of the peptide in opossum IL-13. This is probably due to incorrect gene prediction, a fact supported by the absence of signal peptide and any instability motifs.
Opossum IL-6 was identified using a sensitive TBLASTN search (e-value = 0.08). Opossum IL-6 is located on Chro- Opossum IL-6 has maintained significant structural similarities to human and other mammalian IL-6 genes despite its comparatively low sequence identity. The number and position of cysteine residues in opossum IL-6 are identical to those found in eutherian and chicken sequences. An arginine molecule in helix D that is involved in IL-6β binding [38] is also conserved.
Opossum IL-12 alpha chain (chr7:260,616,009-260,626,803) was identified using a TBLASTN search and is predicted to be 58% similar to its human orthologue [see Additional File 7]. Cysteine residues are conserved between the marsupial, eutherians and chicken sequences.
IL-10 family members were identified in two clusters.  [39]. Orthology of the IL-10 family cytokines with their eutherian counterparts was confirmed by phylogenetic analysis [see Additional File 14]. All putative IL-10 family members clustered closely with their eutherian orthologs.

Isolation of cluster of differentiation markers using TBLASTN
CD4 [see Additional File 15] and CD8 [see Additional File 16] were identified by TBLASTN search and found on chromosome 8 (104157682-104183462) and chromosome 1 (716671734-716675645) respectively. Their number of amino acids and potential glycosylation sites are noted in Table 2. Neither we, nor Ensembl, were able to successfully predict the terminal exons of these two genes.

Isolation of interferons using BLAST, synteny and hidden Markov models
Type I IFNs Nine type I IFN coding sequences and 2 pseudogenes were identified in the opossum genome using BLAST strategies. Seven IFN-α genes, along with single copies of IFN-β and IFN-κ were identified. Predicted opossum IFN-α sequences share 68-78% identity and 78-99% similarity at the amino acid level. IFN-α and -β genes were located in a cluster on Chromosome 6, with IFN-κ situated approximately 12 kb away (Fig 4). Phylogenetic analysis revealed that opossum IFN-α sequences were interspersed with known tammar wallaby IFN-αs (Fig 5).

Interferon gamma (IFN-γ) and interferon gamma receptor (IFN-γR2)
The signal transducing chain of the Interferon gamma receptor was identified in the opossum genome on Chromosome 4 (14328267-14355149). It shares 29-46% amino acid identity with eutherian and chicken sequences [see Additional File 17]. The ligand of IFN-γR2, IFN-γ, was not identified in the genome, despite exhaustive searches including searches using the Hidden Markov model (HMM) containing predicted ancestral sequences. According to the gene organization in other vertebrates (including birds and fish), IFN-γ should be adjacent to IL-22 and IL-26 on chromosome 8. A large gap (9.6 kb) was located in this region, suggesting that IFN-γ was simply not sequenced, rather than being absent from the genome. However, the availability of BAC end sequences generated by the genome sequencing project did allow us to identify a BAC (VMRC-18:653P7) that spanned this region. Researchers at the Broad Institute, led by Kerstin Lindblad-Toh and April Cook kindly sequenced this BAC (GenBank Accession: AC190119). IFN-γ was thus identified (Fig 6). It shares 47% amino acid identity with human IFN-γ but predicted glycosylation sites are unique. Alignment of IL-13 amino acid sequences Figure 3 Alignment of IL-13 amino acid sequences. Inverted triangles indicate cysteine residues that form disulfide bonds in the human protein [66]. Dots represent identity to Monodelphis domestica sequence. Sequences used for alignment: Homo sapiens (NP_002179), Mus musculus (NP_032381), Gallus gallus (NP_001007086), Rattus norvegicus (NP_446280), Canis familiaris (NP_001003384), Sus scrofa (Q95J68).

Level of confidence in our gene predictions
Where possible, gene predictions were verified by alignment with known marsupial cDNA sequences, and compared to Ensembl gene predictions and UCSC similarity features. For instance, a known cDNA sequence is available for Trichosurus vulpecula (possum) IL-10 cDNA (Genbank ref: AF026277). Our predicted opossum IL-10 protein shared 76% amino acid identity with possum IL-10, and exon-intron boundaries match. However, despite our use of robust methodologies, we are still not confident with prediction of the most divergent immune genes sequences. Some doubt exists with our predications for IL-4 and IL-13 and the terminal exons of IL-22, CD4 and CD8. Characterisation of their cDNA, together with laboratory-based assays will ultimately confirm the reliability of the predictions reported here.

Discussion
Without EST and protein databases, annotation of distantly related mammalian species such as the marsupials and monotremes is challenging. Neither Ensembl nor UCSC were able to identify IL-2, 4, 13, 22 and IFN-γ. In general, automated gene prediction missed key immune genes because of their low levels of sequence similarity with their eutherian orthologs. We suggest that future studies focusing on in silico mining of divergent genes should take into account gene location and features. Application of this strategy allowed us to successfully identify key immune genes in the opossum genome, which traditional laboratory methods failed to isolate.
Discovery of key cytokines in the opossum genome suggests that a re-examination of immune responses (especially T cell responses) is warranted in marsupials. The peculiarities in class switching and in vitro T cell proliferation, which have previously been observed in marsupials are largely controlled by T cells and their products. The ability to discriminate between classic 'helper' T cells and 'cytotoxic' T cell families will now be possible due to the identification of CD4 and CD8 sequences in the opossum genome. Further, identification of cytokines normally produced by these subsets in eutherian mammals will allow us to investigate Th1 and Th2 profiles that orchestrate immunity to intracellular and extracellular pathogens respectively.
There are a myriad of interactions between cytokines at the cellular level, but the presence of a number of key cytokines orchestrate the global immune response. For example, when the macrophage-derived IL-12 is dominant, Th1 responses predominate resulting in cell-mediated immunity. When the B-cell growth factor IL-4 is dominant, Th2 responses dominate and a humoral immune response is activated [40]. Sequences for both of these genes are present in the opossum genome, along with other classical Th1-(IL-2, IFN-γ) and Th2-(IL-4, IL-5, IL-6, IL-10 and IL-13) associated molecules.
The presence of key cytokines in a marsupial genome strongly suggests that marsupials are capable of complex immune responses comparable to those seen in eutherian mammals. Knowledge of these gene sequences provides a springboard for future studies. For instance, marsupials appear to be susceptible to infection with intracellular pathogens such as herpesvirus and mycobacterial spp [41], indicative of impaired Th1 cytokine responses. The availability of Th1 and Th2 cytokine sequences will allow us to study IL-10 profiles, which are known to play a critical role in the survival of intracellular pathogens by inhibiting the expression of inflammatory cytokines such as IFN-γ and TNF. Meanwhile, studies of Th2 cytokines may focus on protection against parasites. Both American and Australian marsupials co-exist with a range of successful parasites; opossums are reported to have natural trypanosome infection rates of up to 100% [42] and carry nematode burdens in the wild [43], whilst a variety of Genomic organisation and transcriptional directions of type I IFNs on chromosome 6 Figure 4 Genomic organisation and transcriptional directions of type I IFNs on chromosome 6. IFNK  IFNA7  IFNA6  IFNA3  IFNA1  IFNA2  IFNB  IFNA4  IFNA5 Phylogenetic tree showing evolutionary relationship between type I interferon protein sequences   helminth infections are common across a range of Australian marsupials [44].

Kb
Allograft responses can now be studied due to the availability of sequence information for interleukins 2, 4, 21 and IL-2Rγ [45]. The opossum is an important model for tumour immunology since it can be induced to accept melanoma cells at both juvenile and adult life stages [46]. Both IL-2 and IL-24 are associated with melanoma tumour suppression [47] in humans and it is now possible to study the role of these genes in the opossum model, as well as in the maintenance of transmissible allograft tumours in Tasmanian devil facial tumour disease [48].

Conclusion
Here we describe and apply a method to identify divergent immune genes from the genome of a model marsupial, Monodelphis domestica. We are now extending this analysis to characterize the entire opossum immunome. We report here that the opossum genome contains representatives from the major vertebrate immune gene families. These genes appear to be structurally similar, and therefore will most likely prove to be functionally equivalent, to their eutherian homologues. The way is now clear to further probe the genes that orchestrate the marsupial immune response and to investigate the role that these molecules have on maintaining health and influencing disease susceptibility in this unique group of animals.

Data source
Draft sequencing of the genome of a female opossum (Monodelphis domestica) has recently been completed by the Broad Institute [49]. Analysis was performed on assembly MonDom4 (January 2006).

Sequence identification
To optimise the chances of identifying previously undiscovered sequences, our search strategy relied on a preliminary database screen for sequence conservation, together with positional analyses of the gene sequence relative to other genes within the genome (synteny). Finally, the putative sequence was analysed for the presence of biologically significant sites associated with both structure and function in their eutherian homologues.

Similarity searching using BLAST
Sequence similarity searching (TBLASTN) was performed with known eutherian sequences. Positive hits from the BLAST search with good potential were extracted for further structural analysis. When ambiguities existed between alignments from BLAST results, each of the multiple hits were extracted and inspected. Assessment methods for ambiguous hits included tests for reciprocal-besthit where the aligned sequence was blasted against SWISS-PROT and TrEMBL protein databases to confirm preliminary findings. Proteins discovered in BLAST searches were used to mine additional homologues. To do this, parameters were optimised for sensitive searching. In order to increase our ability to detect highly divergent sequences, the BLOSUM 45 similarity matrix [50] was used. Additionally, application of soft-masking and the lowering of the neighbourhood word threshold score to 9 increased the chance of detecting homologous sequences that otherwise might have been overlooked using default parameters.

Synteny analysis
If the protein of interest was not detected by the initial BLAST search, other methods were employed. Similarity searches were performed with genes found in close syntenic regions in the human genome. Syntenic regions were extracted from the opossum database, and passed into GENSCAN [24]. The predicted peptide sequences were analysed by performing similarity searches against the SWISS-PROT and TrEMBL databases using BLASTP and FASTP [51]. In order to improve the accuracy of identified cytokine sequences, the sequence was re-extracted from the opossum database and the putative protein was reevaluated using GenomeScan [29]. Combined results from GenomeScan and GENSCAN were compared with documented structural features of the cytokine.

Additional methods for gene identification
For sequences that were not detected using the above methods, a hidden Markov model (HMM) was built and calibrated using the HMMER 2.3.2 package [52]. The model was built as a multiple local alignment profile with the Krogh/Mitchison substitution weight matrix [53] and used to search the six-frame translation of the opossum genome.
Ancestral sequences were included in the HMM. These were calculated by programs in the Phylip package [54]. PRODIST was used to compute a distance matrix under default settings. After this, the program NEIGHBOUR was used to create a neighbour joining (NJ) tree from the matrix. The tree was rooted with a teleost species. Following this, ProML was set to produce ancestral sequences at each of the NJ tree nodes.

Structural features
Once the gene of interest was located, exon/intron boundaries were identified using the gene prediction programs GENSCAN [24] and GenomeScan [29]. Our experience suggests that some caution is advisable in the interpretation of data from existing gene prediction software; excessively long predicted genes ('thready' gene predictions) due to mis-identification of first exons and merging of adjacent genes, and unlikely predictions of splice sites (based on comparison with orthologous sequences) were the most common problems we observed. Mindful of these limitations, our gene predictions were compared with known gene structures. The presence of signal peptides was predicted by SPScan (Accelrys GCG) and estimation of glycosylation sites were made with NetOGlyc 3.1 [55] and NetNGlyc 1.0 [56]. Finally, sequences were submitted to the PROSITE database [57] for detection of protein family motifs that would confirm gene identify.

Sequence alignments
Sequences from the opossum and other species were aligned using ClustalW [58]. Accession numbers of sequences used in analyses are shown in figure legends. Sequence labels in the alignments are abbreviated by the first letter from the genus with the first two letters from the species name followed by the gene name. In figures, residues with functional importance are highlighted.

Phylogenetic analysis
Neighbour-joining (NJ) trees were constructed using the Jones-Taylor-Thornton substitution model [59] and 500 bootstrap replicates in MEGA 3.1 [60]. The tree, constructed from amino acid sequences, was rooted using chicken sequences.

Sequence identity
Sequence identity and similarity calculations were carried out using GAP (Accelrys GCG), with the Needleman-Wunsch alignment [61], except for IFN-α genes which were calculated in GenDoc [62] using the BLOSUM 35 similarity matrix [50] for comparisons of human and opossum IFN-α genes and BLOSUM 80 matrix [50] for comparisons among opossum sequences. GCG, GENS-CAN, BLASTP and FastA programs were accessed through the Australian National Genomic Information Service (ANGIS) [63].