Patterns of Insertion and Deletion in Mammalian Genomes

Nucleotide insertions and deletions (indels) are responsible for gaps in sequence alignments and are one of the major sources of evolutionary change at the molecular level. We examined the patterns of insertions and deletions in 19 mammalian genomes and found that deletion events are more common than insertions. The numbers of both insertions and deletions decrease rapidly as gap length increases, and single-nucleotide indels are the most frequent of all indel events. The frequencies of both insertions and deletions are well described by a power law.


INTRODUCTION
With the successful completion of the genome sequencing projects, the challenge now is to understand the instructions encoded in the genomes. Comparative genomic analysis by cross-species alignment is one of the most powerful ways to decipher the evolutionary history of mammalian genomes. One major aim of genomics research is to identify differences between the genomes of species or individuals. Such differences arise from genetic variation, and one mechanism that increases genetic variation is mutation. There are many kinds of mutations. A mutation in which one "letter" of the genetic code is changed to another is a point mutation. When a length of DNA is removed from or added to a gene, the mutation is a deletion or an insertion, respectively. Finally, genes or parts of genes can become inverted or duplicated. Previous studies revealed that insertions and deletions, rather than substitutions, account for the majority of genomic divergence [1][2][3][4]. Therefore, studying the patterns of insertion and deletion is necessary for understanding mammalian evolution.
By examining homologous protein sequences, de Jong and Rydén (1981) observed that deletions of amino acids occurred about four times more frequently than insertions [5]. Deletion events also outnumbered insertions in processed pseudogenes [6][7][8][9]. Deletions are about twice as frequent as insertions in nuclear DNA, whereas in mitochondrial DNA deletions occur at only a slightly higher frequency than insertions [10]. Deletion events were also found to be more common than insertions in both mouse and rat [11][12][13].
Several studies have focused on the size distribution of insertions and deletions. Exhaustive matching of the protein sequence database found that a power law with an exponent of 1.7 closely approximates the observed gap (insertion and deletion) length distribution [14]. Studies of pseudogenes suggested that the size distribution of insertions and deletions can be described empirically by a power law [7,9]. Qian and Goldstein (2001) examined gaps occurring in the FSSP database [15], using alignments based on common structures, and fitted the probability distribution of gap length to a quadruple exponential function [16]. Goonesekere and Lee (2004) examined the pattern of gaps in 3992 structurally aligned protein domain pairs from the SCOP database [17] and found that the logarithm of the gap probability varies linearly with gap length, with a break at a gap length of 3 [18].
In this research, the multiple alignments of 19 mammalian genomes were used to analyze the patterns of insertions and deletions. We first tested whether deletions always occur more frequently than insertions, and then studied the length distributions of insertions and deletions.
*Address correspondence to this author at the Bioinformatics Center, College of Life Science, Northwest A&F University, Yangling, Shaanxi 712100, China; Tel: (0086) 029 87091060; Fax: (0086) 029 87092262; E-mail: shihengt@nwsuaf.edu.cn

MATERIALS AND METHODS
The multiple alignments of 28 vertebrate species were downloaded from the UCSC Genome Bioinformatics website [19]. Table 1 lists the genome assemblies included in the 28-way multiple alignments. Table 2 shows the data used in this research.
The 28-way multiple alignments were built as follows. First, lineage-specific repeats were removed, and pairwise alignments with the human genome were generated for each species using BLASTZ [20] on repeat-masked genomic sequence. The pairwise alignments were then linked into chains using AXTCHAIN [21], which finds maximally scoring chains of gapless subsections of the alignments organized in a k-dimensional tree. CHAINNET [21] was then used to produce an alignment net. Finally, the resulting best-in-genome pairwise alignments were progressively aligned using MULTIZ [22], guided by the phylogenetic tree [23] shown in Fig. (1), to produce the multiple alignments.
Only the multiple alignments of the 19 mammalian species were studied. Triple alignments of human, chicken and one of the other 18 mammalian species were used to assign each insertion or deletion to human or to the other mammalian species by the parsimony principle, using chicken as the outgroup. Four types of event were inferred as insertions or deletions in this study (Fig. 2).
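The outgroup-based assignment can be illustrated with a minimal Python sketch. It assumes three equal-length aligned strings with "-" as the gap character, and it assumes that the four event types are an insertion or a deletion in either human or the other mammal; the function names and labels are hypothetical and are not taken from the original pipeline.

```python
from itertools import groupby

GAP = "-"

def classify_column(human, other, chicken):
    """Label one alignment column by parsimony, with chicken as the outgroup."""
    h, o, c = human != GAP, other != GAP, chicken != GAP
    if h == o:
        return None  # no gap between human and the other mammal (or both gapped)
    if not h:
        # Human is gapped. If the outgroup has the bases, human lost them;
        # otherwise the other mammal gained them.
        return "deletion_in_human" if c else "insertion_in_other"
    # The other mammal is gapped.
    return "deletion_in_other" if c else "insertion_in_human"

def infer_events(human, other, chicken):
    """Merge consecutive columns with the same label into (event, length) pairs."""
    labels = [classify_column(*col) for col in zip(human, other, chicken)]
    events = []
    for label, run in groupby(labels):
        length = len(list(run))
        if label is not None:
            events.append((label, length))
    return events

# Toy triple alignment (made-up sequences, for illustration only).
events = infer_events(
    "ACG-TTAC--GT",   # human
    "ACGATT--CCGT",   # other mammal
    "ACGATTAC--GT",   # chicken (outgroup)
)
```

Each returned pair gives the inferred event and its gap length, which is the quantity tallied in the length distributions.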
The probability of an insertion or deletion of length k was calculated by equation 1, where f_k is the probability of an insertion or deletion with gap length k and N_k is the number of insertions or deletions with gap length k. The power law is then defined by equation 2 [9].
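The two numbered equations are not reproduced in this text. A reconstruction consistent with the definitions above is sketched here: equation 1 follows directly from the definitions of the gap probability and the gap count, while equation 2 is written in the usual power law form with the two fitted parameters a and b. The sign convention of the exponent is an assumption.

```latex
\begin{align}
f_k &= \frac{N_k}{\sum_{j \ge 1} N_j} \tag{1} \\
f_k &= a\,k^{-b} \tag{2}
\end{align}
```

Taking logarithms of equation 2 gives \ln f_k = \ln a - b \ln k, so a power law appears as a straight line on a log-log plot of gap probability against gap length.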

RESULTS
Fig. (3) shows the length distributions of insertions and deletions in the 18 mammalian genomes. Deletions occur more frequently than insertions over all gap lengths, with one exception: in opossum, insertions occur more frequently than deletions at every gap length except 2. The ratio of deletions to insertions varies from 0.85 to 12.82 (Table 3); only in opossum is the ratio less than 1. In rabbit, deletions far outnumber insertions. The total length of deletions is larger than that of insertions in all species except hedgehog, elephant, tenrec and opossum.
The numbers of both insertions and deletions decrease rapidly as gap length increases. Single-nucleotide insertions and deletions are the most frequent of all events. The percentage of single-nucleotide insertions varies from 28.63% to 71.00%, and the percentage of single-nucleotide deletions varies from 26.54% to 46.74% (Table 3).
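The summary quantities reported per species (the ratio of deletions to insertions, the percentage of single-nucleotide events, and the total inserted and deleted length) can be computed from per-length event counts. The sketch below assumes the counts are held in dictionaries mapping gap length to number of events; the function name and the toy numbers are hypothetical and are not values from Table 3.

```python
def summarize_indels(ins_counts, del_counts):
    """Summarize indel counts given as {gap length k: number of events N_k}."""
    n_ins = sum(ins_counts.values())
    n_del = sum(del_counts.values())
    return {
        # Ratio of event counts, not of total lengths.
        "del_to_ins_ratio": n_del / n_ins,
        # Share of events with gap length 1, in percent.
        "pct_single_ins": 100.0 * ins_counts.get(1, 0) / n_ins,
        "pct_single_del": 100.0 * del_counts.get(1, 0) / n_del,
        # Total number of inserted or deleted nucleotides.
        "total_ins_length": sum(k * n for k, n in ins_counts.items()),
        "total_del_length": sum(k * n for k, n in del_counts.items()),
    }

# Made-up counts for illustration only.
summary = summarize_indels(
    ins_counts={1: 50, 2: 30, 3: 20},
    del_counts={1: 80, 2: 70, 5: 50},
)
```

Keeping the count ratio and the total lengths separate matters because a species can have more deletion events than insertion events and still have a larger total inserted length.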
The probability of insertions and deletions, as a function of gap length, fits the power law equation given above very well. Regression analysis of the data, using SPSS 15.0 [24], gave the values of a, b and R^2 (Table 4). SPSS was also used to perform the Kolmogorov-Smirnov goodness-of-fit test for power law distributions; Table 4 shows the results of the test. Fig. (4) plots f_k against k for deletions, and Fig. (5) plots f_k against k for insertions.
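One common way to estimate power law parameters is ordinary least squares on log-transformed data. The sketch below is an illustration in plain Python under that assumption; it is not the SPSS procedure used in this study, whose exact settings are not stated, and it assumes the power law is written as f_k = a * k^(-b). The function names are hypothetical.

```python
import math

def gap_probabilities(counts):
    """Convert {gap length k: count N_k} into {k: f_k}, with f_k = N_k / sum of N_j."""
    total = sum(counts.values())
    return {k: n / total for k, n in sorted(counts.items())}

def fit_power_law(fk):
    """Fit ln f_k = ln a - b ln k by least squares; return (a, b, R^2 on the log scale)."""
    points = [(math.log(k), math.log(p)) for k, p in fk.items() if p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return math.exp(intercept), -slope, 1.0 - ss_res / ss_tot

# Synthetic data that follow an exact power law (illustrative values only).
synthetic = {k: 0.5 * k ** -1.7 for k in range(1, 51)}
a, b, r2 = fit_power_law(synthetic)
```

Because the fit is done on the log scale, the R^2 returned here describes the straight-line fit in log-log space and need not match an R^2 computed on the untransformed probabilities.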

DISCUSSION
Nucleotide substitution, insertion and deletion (indel) events are the major driving forces that have shaped genomes [9]. Furthermore, recent studies found that insertions and deletions, rather than substitutions, are the major contributors to genomic divergence [1][2][3][4]. The study of the patterns of insertion and deletion in genomes is therefore important. Previous studies found a preponderance of deletions over insertions [5][6][7][8][9][10][11][12][13]. Using the extensive genome data in this study, we have shown that deletions occur more frequently than insertions in genomes. Although insertions are more frequent than deletions in opossum, the difference is not significant. Therefore, the higher frequency of deletions relative to insertions can be regarded as a general genomic feature.
Fig. (3). Length distributions of insertions and deletions.
Single-nucleotide insertions and deletions are the most frequent of all events, and the frequency of insertions and deletions decreases quickly as gap length increases. A high occurrence of single-nucleotide gaps was also observed in studies of 22 human and 30 rodent processed pseudogenes [6], 78 human processed pseudogenes [7], 1726 human ribosomal protein pseudogene sequences [9], noncoding nucleotide sequences of primates [10], Escherichia coli [25], and chloroplast noncoding nucleotide sequences of nine monocot plants [26]. The high percentage of single-nucleotide insertions and deletions therefore seems to be a common phenomenon in genomic evolution.
Benner et al. (1993) studied alignments of homologous protein sequence pairs and concluded that the distribution of gap length follows a power law [14]. Gu and Li (1995) aligned 78 human processed pseudogenes, the human functional genes and the reference, and found that the size distributions of insertions and deletions fitted a power law very well [7]. More recently, Zhang and Gerstein (2003) examined the patterns of insertions and deletions in 1726 processed ribosomal protein pseudogenes and found that the frequencies of both insertions and deletions followed characteristic power law behavior with respect to gap length [9]. In this study, the probability distributions of insertions and deletions in the 18 mammalian genomes can both be described by a power law. The results suggest that the gap penalty should be log-affine [27], i.e., g(k) = a + bk + c ln k, where g(k) is the gap penalty for an insertion or deletion and k is its length.
Fig. (4). f_k vs k plot for deletions.
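The log-affine gap penalty named in the Discussion can be written as a small function. The reasoning behind the logarithmic term is that, if gap frequencies follow a power law, the negative log-probability of a gap grows linearly in ln k rather than in k. The sketch below is only an illustration; the parameter values are hypothetical and are not fitted values from this study.

```python
import math

def log_affine_gap_penalty(k, a, b, c):
    """Gap penalty g(k) = a + b*k + c*ln(k) for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return a + b * k + c * math.log(k)

# Hypothetical parameters: opening cost a, linear term b, logarithmic term c.
penalties = [log_affine_gap_penalty(k, a=2.0, b=0.5, c=1.0) for k in (1, 2, 10)]
```

Setting c to 0 recovers the ordinary affine penalty a + bk, and setting b to 0 gives a purely logarithmic penalty, so the log-affine form covers both as special cases.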