Prediction of candidate small non-coding RNAs in Agrobacterium by computational analysis

Small non-coding RNAs with important regulatory roles are not confined to eukaryotes. Recent work has uncovered a growing number of bacterial small RNAs (sRNAs), some of which have been shown to regulate critical cellular processes. Computational approaches, in combination with molecular experiments, have played an important role in the identification of these sRNAs. At present, there is no information on the presence of small non-coding RNAs and their genes in the Agrobacterium tumefaciens genome. To identify potential sRNAs in this important bacterium, deep sequencing of the short RNA populations isolated from Agrobacterium tumefaciens C58 was carried out. From a data set of more than 10,000 short sequences, 16 candidate sRNAs have been tentatively identified based on computational analysis. All of these candidates can form stem-loop structures by RNA folding predictions and the majority of the secondary structures are rich in GC base pairs. Some are followed by a short stretch of U residues, indicative of a rho-independent transcription terminator, whereas some of the short RNAs are found in the stem region of the hairpin, indicative of eukaryotic-like sRNAs. Experimental strategies will need to be used to verify these candidates. The study of an expanded list of candidate sRNAs in Agrobacterium will allow a more complete understanding of the range of roles played by regulatory RNAs in prokaryotes.

species [1]. Identification of conserved regions outside of protein-coding genes [4], combined with conservation of sequence characteristics expected of non-coding RNAs (stem-loops) [5], has been used successfully to identify them. In other searches, intergenic regions were scanned for promoters and the characteristic DNA sequence and structure of a rho-independent terminator of transcription [6]. Most previous studies employed predominantly computational approaches to predict sRNA genes [7]. These screens were primarily based on searches for sequence conservation among closely related bacteria or searches for promoter and terminator sequences in intergenic regions [4][5][6][8]. The expression of many of the predicted sRNAs was confirmed by northern analysis of total RNA isolated under a set number of growth conditions [7].
Agrobacterium is well known for its natural capability of trans-kingdom DNA transfer [9]. Although a large body of literature has been accumulated from research on this important bacterium, there is currently no information about the presence of small non-coding RNAs and their genes in the Agrobacterium genome. The aim of this study was therefore to identify potential Agrobacterial sRNAs using computer-based methods from a short RNA sequence data set recently obtained by a high-throughput sequencing technology.

Short RNA sequences obtained
A set of approximately four million short RNA sequences was provided by Dr Mingbo Wang (CSIRO Plant Industry, Canberra). They were obtained by high-throughput deep sequencing, using the Solexa sequencing technology (Illumina Company, USA), of total RNAs extracted from the plant Arabidopsis and from Agrobacterium tumefaciens strain C58 grown in the presence of acetosyringone. The RNA samples from the two species had been mixed together in a 10:1 ratio prior to sequencing to reduce the cost. Each sequence in the data set was up to 36 nt in length, including the adaptor used in the Solexa sequencing technology (http://www.illumina.com).

Data sorting and species origin search
Since the millions of short RNA sequence reads were derived from two species and each read may contain part of the adaptor at its 3' end, the first tasks were to identify the sequences that belong to Agrobacterium and to remove the adaptor sequence. Duplicate sequences were tallied and then collapsed to form a non-redundant set of sequences. Agrobacterium tumefaciens C58 genome sequences were retrieved from NCBI (http://ncbi.nlm.nih.gov), and the sequence data were compared against them using SOAP (http://soap.genomics.org.cn), a program designed for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The adaptor sequences were filtered by SOAP, and the remaining short RNAs were aligned to the Agrobacterium genome sequences. Short RNA sequences were annotated if an adaptor sequence was detected and the insert sequence matched well with Agrobacterium gene sequences. A read was converted to its complementary sequence if it mapped to the reverse strand. The results were filtered to retain only the top match for each read. Sequences shorter than 18 nt were ignored, and one mismatch was allowed in this study.
The output of SOAP included each short RNA sequence and its ID, sequence quality, total number of matches found for the sequence (number of hits), sequence length, strand matched, match name, match start coordinate, number of mismatches and mismatch description. The results generated by SOAP were combined with the information obtained from Illumina-Solexa sequencing (such as the number of reads, match stop coordinate and so on) by a program designed by Mr. Andrew Spriggs of CSIRO Plant Industry to create a table that combines all of the datasets.
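The preprocessing steps described above (adaptor removal, tallying of duplicates, the 18 nt length cutoff and strand conversion) can be illustrated with a minimal Python sketch. It is not the SOAP implementation or the program used in this study. The adaptor string, the minimum overlap for a truncated adaptor and all function names are assumptions introduced for illustration only.

```python
from collections import Counter

# Hypothetical 3' adaptor sequence; a placeholder, not taken from this study.
ADAPTOR = "TCGTATGCCGTCTTCTGCTTG"


def trim_adaptor(read, adaptor=ADAPTOR, min_overlap=6):
    """Return the insert preceding the adaptor, or None if no adaptor is found."""
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # The adaptor may be cut off at the 3' end of a fixed-length read,
    # so also accept a read that ends with a prefix of the adaptor.
    for k in range(len(adaptor) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return None


def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet read (for reverse-strand matches)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def collapse_reads(reads, min_len=18):
    """Trim each read, drop short inserts, and tally identical inserts."""
    counts = Counter()
    for read in reads:
        insert = trim_adaptor(read)
        if insert is not None and len(insert) >= min_len:
            counts[insert] += 1
    return counts
```

The resulting counter holds the non-redundant sequence set together with the number of reads per sequence; alignment to the genome with one allowed mismatch is left to the aligner and is not reproduced here.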

Computational searches of candidate sRNAs
To identify candidates, we took a computational approach based on sequence conservation and on the prediction of secondary structures by RNA folding.

Selection of intergenic regions
The short RNA sequences residing in intergenic regions of the Agrobacterium genome were identified based on the gene annotations, using the GenBank accession numbers of their matched names from NCBI. A coding region (CDS) was defined as a genomic region that contains an open reading frame (ORF) on either of the two strands, whereas an intergenic region was defined as one not overlapping with any annotated CDS, rRNA, tRNA, or miscellaneous RNA feature on either strand. Putative 5'UTR regions, 3'UTR regions, flanking regions, and function-unknown regions were also considered in this study. Functional regions include protein-coding genes, rRNA genes, tRNA genes, promoters, terminators, regulatory regions and repeats.
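The definition of an intergenic region given above amounts to an interval-overlap test against the annotated features. The following Python sketch shows one way to express it. The tuple layout of the features, the 1-based inclusive coordinates and the function names are assumptions for illustration, not the procedure actually used.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def classify_read(start, end, features):
    """Return the kind of the first overlapping feature, else 'intergenic'.

    features is a list of (start, end, kind) tuples in 1-based inclusive
    coordinates. Strand is ignored on purpose, because a region counts as
    intergenic only if it is free of features on either strand.
    """
    for f_start, f_end, kind in features:
        if overlaps(start, end, f_start, f_end):
            return kind
    return "intergenic"
```

A linear scan is adequate for a few thousand candidate reads; for larger data sets a sorted feature list with binary search would be the natural refinement.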
Identification of candidate sRNA genes by homology
It was assumed that homologous RNA structures would show a reasonable degree of conservation at the sequence level for a given set of genomes. A file containing all known small non-coding RNA sequences of E. coli as well as other bacterial non-coding RNA sequences was downloaded from the noncoding RNA database (http://ncrnadb.trna.ibch.poznan.pl/download.html) and used as a starting point for our homology search. The sequences residing in intergenic regions, putative 5'UTR and 3'UTR regions, flanking regions, as well as function-unknown regions in Agrobacterium were used as queries and compared by BLAST (http://ncrnadb.trna.ibch.poznan.pl/blast.html) to the known small non-coding RNA sequences of these bacterial genomes.

Screening candidate sRNAs by RNA folding predictions
Since bacterial sRNAs are usually longer than 49 nt, 70 nt of adjacent sequence from both the upstream and the downstream region of each selected candidate was added, based on the Agrobacterium genome data from NCBI. The "extended" short RNA sequences were then used for secondary structure predictions. Each strand of the extended sequence was treated as an RNA molecule and folded using the RNAfold web server (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi) with the program's default settings.
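The extension step can be sketched in Python as follows. The sketch only prepares the two RNA strands for folding and does not call RNAfold. The coordinate convention (1-based, inclusive), the clipping at the ends of the sequence and the function names are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def extend_hit(genome, start, end, flank=70):
    """Return the matched region plus `flank` nt on each side.

    start and end are 1-based inclusive coordinates of the short RNA match.
    The window is clipped at the sequence ends, i.e. the replicon is
    treated as linear here (a simplifying assumption).
    """
    left = max(start - 1 - flank, 0)
    right = min(end + flank, len(genome))
    return genome[left:right]


def strands_as_rna(dna):
    """Return both strands of a DNA window written in the RNA alphabet."""
    forward = dna.replace("T", "U")
    reverse = reverse_complement(dna).replace("T", "U")
    return forward, reverse
```

Each of the two returned strings can then be submitted to a folding program with its default settings.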

Basic data obtained
After searching against the Agrobacterium full genome sequence and sorting by the SOAP program, a file of more than 10,000 short RNA sequences was generated. The file listed each short RNA sequence of Agrobacterium origin (based on the homology search), its ID, matched gene name, total number of matches found for the sequence (number of hits), strand matched, match start coordinate, match stop coordinate, sequence length, mismatch description and number of sequence reads. After adaptor subtraction, these short RNA sequences were in the size range between 18 nt and 33 nt. They were presumed to be authentic sRNAs or degradation products of long RNAs from normal gene transcripts. Most of the sequences matched the Agrobacterium 16S ribosomal RNA gene sequence, presumably due to its high abundance in the genome. The total number of reads in the file was 30 998 and some sequences had very high read counts (more than 1 000), indicating their high frequencies in the RNA population. The size distribution is summarized in Fig. 1.
In general, the genome of Agrobacterium is quite compact. Consistent with this, only 4% of the short RNA sequences resided in intergenic regions, whereas more than 90% were from functional regions, and the other 5% occurred in putative 5'UTR, 3'UTR, and flanking regions, as well as function-unknown regions. Sequences which did not match the functional regions were used for further analysis. In total, 869 identified short RNA sequences reside in intergenic regions, putative 5'UTR and 3'UTR regions, flanking regions, as well as function-unknown regions.

Identification of candidate sRNA genes by homology
As a starting point for detecting sRNAs in Agrobacterium, we considered a number of common properties of the previously identified sRNAs of E. coli that might serve as a guide. We define sRNAs as relatively short RNAs that do not function by encoding a complete ORF. Since all known sRNAs are encoded within intergenic regions (defined as regions between ORFs), for the conservation analysis, sequences of potential sRNAs in Agrobacterium were used as queries and compared against the bacterial non-coding RNA database using BLAST. However, in this homology search, none of the potential Agrobacterial sRNAs showed any significant sequence similarity.

Screening candidate sRNAs by RNA folding predictions
Among the 869 identified sRNA sequences, many overlapped, matching the same genomic regions. For these overlapping RNA sequences, the relatively long ones with relatively high frequencies were selected (about 50) and subjected to RNA folding analysis for secondary structure prediction. Since known bacterial sRNAs generally range from 50 to 250 nucleotides in length, adjacent sequences needed to be added to the selected Agrobacterium sRNAs for RNA structure prediction. Out of the 50 short RNA sequences analyzed, 16 candidate small non-coding RNAs in Agrobacterium were tentatively identified, based on their putative secondary structures (Table 1 and Appendix). The remaining 34 selected sRNAs did not form significant secondary structures, and were therefore not investigated further. The 16 candidate sRNAs ranged from 148 to 178 nucleotides in length and 7 were from intergenic regions. Two of them showed very high sequence frequencies (1871124 and 1167533, Table 1).
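The selection of one representative per group of overlapping short RNAs can be sketched in Python as below. The text does not state the exact selection rule, so the ranking used here (longest sequence first, read count as tie-breaker) is an assumption, as are the tuple layout and function names. Strand and replicon are ignored for brevity.

```python
def cluster_overlapping(hits):
    """Group hits whose genomic intervals overlap.

    hits is a list of (start, end, reads) tuples with inclusive coordinates.
    """
    clusters = []
    cluster_end = None
    for hit in sorted(hits):
        start, end, _reads = hit
        if clusters and start <= cluster_end:
            # Overlaps the current cluster: extend it.
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, end)
        else:
            # No overlap: start a new cluster.
            clusters.append([hit])
            cluster_end = end
    return clusters


def representative(cluster):
    """Pick the longest hit; ties are broken by the higher read count."""
    return max(cluster, key=lambda h: (h[1] - h[0] + 1, h[2]))
```

Applying `representative` to each cluster returned by `cluster_overlapping` yields one sequence per genomic region for the subsequent folding analysis.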
Putative secondary structures of the 16 candidates are shown in the Appendix. The majority of the structures are rich in GC base pairs (ranging from 50% to 72%). Three of them (345524, 473095 and 1010101) are followed by a short stretch of U residues (> 3 uridines), indicative of a rho-independent transcription terminator, whereas some of the short RNAs are found in the stem region of the hairpin (such as 473095, 583983, 1076360 and 1871124), indicative of eukaryotic-like sRNAs. The differences in GC content could point to the different structural requirements associated with the function of the potential sRNAs. Two of the structures contain extensive duplex regions in which the sequenced short RNA fragments reside, raising the possibility that they may be eukaryotic miRNA-like sRNAs.

Limitations of this study
In our study, intergenic regions as well as putative 5'UTR and 3'UTR regions were considered. However, potential small non-coding RNAs that reside within annotated regions (such as coding regions) were not included in our search. This was based on the consideration that all known bacterial sRNAs are encoded at genetic loci other than those of the target genes (trans-encoded) [10]. However, all currently known plasmid-borne sRNAs are encoded at the same genetic loci as the target genes (cis-encoded) and act as antisense RNAs [11]. The restriction of the computer search to intergenic regions and 5' and 3' UTRs may have excluded the class of cis-encoded antisense RNAs that are encoded complementary to their target [2].
The Solexa sequencing technology used in this study is not best suited to finding E. coli-like bacterial sRNAs, as it gives sequences of only up to 36 nt in length. Since the sRNAs were size-fractionated by gel electrophoresis before adaptor ligation and sequencing, most sRNAs longer than 36 nt may have been excluded from the sequenced population. Thus, the number of bacterial sRNAs that can be found in the present study could be well below the real number existing in Agrobacterium. However, as Agrobacterium interacts closely with host plants, it is possible that it encodes eukaryotic-like sRNAs of 20-30 nt in size. Such sRNAs are likely to be identified in the Solexa sequencing data. Considering this possibility, some of the 16 candidate sRNAs from Agrobacterium found in this study could be eukaryotic-like sRNAs rather than E. coli-like sRNAs. Further work, such as RNA gel blot analysis, is needed to examine this possibility.

The limits of sequence homology searching
In this study, we have tried but failed to identify Agrobacterium sRNAs through sequence conservation analysis against known bacterial non-coding RNAs. The vast majority of bacterial sRNAs known to date have been identified in E. coli. With the exception of a few highly conserved sRNAs such as tmRNA and RnpB, most E. coli sRNAs are well conserved only among closely related species such as Salmonella sp. and Yersinia sp. [12]. Consequently, relatively few putative sRNAs have been identified in other species based solely on primary sequence homology with known E. coli sRNAs [13]. This could also account for the failure in our analysis. Furthermore, several recent studies have shown that functional sRNA homologues in different species often lack significant sequence similarity [2]. Directly applying the bioinformatics approaches used in E. coli to identify sRNAs in other bacterial species has had only limited success [14]. The principal impediment to applying these approaches to other bacteria is that accurately predicting either promoters or transcription factor binding sites requires reliable species-specific consensus sequences, and few of these have been experimentally determined in bacterial species other than E. coli [14].

Secondary structure prediction of candidate sRNAs
One important criterion in predicting E. coli sRNA genes is that they should be inspected for strong rho-independent termination signals, defined as GC-rich (> 59%) stem-loop structures followed by a stretch of > 3 uridines [6]. In this study, the putative secondary structure of candidate sRNA 345524 does form a stem-loop structure with 62% GC content followed by a short stretch of U residues, which could be indicative of an E. coli-like sRNA. Among the candidates found in this study, nearly a quarter of the short RNAs are embedded in the stem region of the hairpin (such as candidate sRNAs 473095 and 1871124), which resembles a feature of eukaryotic-like sRNAs. In Solexa sequencing, the sequence contained in each of the 36 nt reads most likely comes from the 5' terminus of an RNA molecule. Interestingly, most of the hairpin-like structures are formed by the sequences downstream of the short RNAs. This may suggest that either the short RNAs are the 5' terminal regions of longer bacterial-like sRNA molecules or they are eukaryotic-like sRNAs processed by an RNase III-like enzyme from the hairpin-loop structures.
[Figure: Putative secondary structure of 880854. GC content, 63%]
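The terminator criterion quoted above (a stem-loop with more than 59% GC followed by more than 3 uridines) can be turned into a simple check on a sequence and its predicted dot-bracket structure, as in the Python sketch below. It is a heuristic illustration only. Computing GC content over the paired bases of the 3'-most hairpin, and requiring the U stretch to start immediately after the stem, are assumptions that the text does not specify.

```python
import re


def pair_table(structure):
    """Map each paired position to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            partner = stack.pop()
            pairs[partner] = idx
            pairs[idx] = partner
    return pairs


def looks_like_terminator(seq, structure, min_gc=0.59, min_u=4):
    """Heuristic test for a rho-independent terminator-like signal.

    seq is an RNA sequence and structure its dot-bracket annotation
    (for example, as produced by a folding program).
    """
    pairs = pair_table(structure)
    if not pairs:
        return False
    close = max(pairs)      # last paired base in the molecule
    open_ = pairs[close]    # its partner, i.e. the start of that stem
    stem = [seq[k] for k in range(open_, close + 1) if k in pairs]
    gc = sum(base in "GC" for base in stem) / len(stem)
    u_run = re.match(r"U+", seq[close + 1:])
    n_u = len(u_run.group()) if u_run else 0
    return gc > min_gc and n_u >= min_u
```

The default `min_u=4` encodes "more than 3 uridines"; both thresholds are exposed as parameters so that other cutoffs can be tried.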

Identification of sRNAs based on computational predictions
While the most broadly used approach in eukaryotes has been to create cDNA clones of small transcripts [15], a parallel approach has also been used in bacteria [7]. Cloning and sequencing are the methods of choice for small regulatory RNA identification. By using deep sequencing technologies one can now obtain up to a billion nucleotides, and tens of millions of sRNAs, from a single library [16].
Several characteristics of sRNAs make them difficult to identify by experimental techniques or by straightforward computational approaches [2]. RNA genes have not been annotated during genome sequence analysis due to their lack of defined sequence features [7]. RNA genes are also poor targets for mutation screens due to their small size and because they are resistant to frameshift and nonsense mutations since they do not encode proteins [17]. However, in the past few years, several systematic searches have led to the identification of more than 100 small RNA genes in E. coli [12].
The first systematic genome-wide screens employed computational approaches to predict sRNA genes [18]. Rivas et al. [5] developed an algorithm that relied on conservation of RNA structure elements rather than on primary sequence conservation. A computer program (Intergenic Sequence Inspector (ISI)) designed by Pichon and Felden (2003), which automatically selects candidate intergenic regions and displays sequence and structural signatures of RNA genes, can help in identifying bacterial sRNAs. Overall, hundreds of putative sRNA genes were predicted and are awaiting examination [18].
Future prospects: target identification of sRNAs in bacteria
sRNAs have now made their impressive debut in a variety of bacteria, and are proving to be important components of many regulatory circuits [19][20][21]. Although methods for finding the sRNAs themselves are available and continue to develop, the next challenge will be to develop equally effective methods for finding the targets of these RNAs [1]. For most purposes, in particular for an understanding of the mechanisms of regulation, as well as other potential functions, primary (true) targets should be identified [22].

CONCLUSION
The current work serves as a blueprint for the initial prediction of a group of potential novel sRNAs in bacteria. In order to find candidate sRNAs in Agrobacterium, a computational approach was used in this study and eventually 16 candidates were predicted. However, we only compiled a list of the candidates that were not yet verified experimentally. Therefore, in future work, these candidates need to be tested and examined using experimental strategies. Also, sequencing technologies that allow for longer sequence reads, such as the 454 technology, may have to be used to search for more Agrobacterial sRNAs. We anticipate that the study of an expanded list of candidate sRNAs in Agrobacterium will allow a more complete understanding of the range of roles played by regulatory RNAs in prokaryotes and their interactions with host organisms.