One of the most basic tasks in bioinformatics is analyzing the relative representation of nucleotide sequences (words), from single letters (GC content) to di-, tri-, and tetranucleotides and longer words, within and across genomic sequences. With the development of reliable sequencing techniques and the accumulation of sequence data, important observations on DNA content were made. As genomic databases grow rapidly, these observations need to be updated and extended to a wide range of species. To this end we analyzed DNA content and the relative representation of nucleotide words 1-15 letters long in over a hundred eukaryotic species. The method suggested by Karlin and Ladunga [1] was used to estimate under- and overrepresentation of words in each genome. We found that some words previously considered universally over- or underrepresented show considerable exceptions, while other words follow a more universal trend. In addition, we studied word content in the human genome in more detail and compared different types of sequence: coding, non-coding, and masked for different repeats. The best-known underrepresented two-letter word is CpG. It is accepted that the activity of a CpG-specific methyltransferase increases the rate of mutation from CpG to TpG. Indeed, a CpG deficit is universal for all studied Viridiplantae and most Metazoa. Among Metazoa, some species of insects and nematodes show no CpG deficit or even show overrepresentation of this dinucleotide. In nematodes this correlates with the loss of methyltransferase in some species, but in insects the situation is more complicated: in the honey bee, active methyltransferases coexist with CpG overrepresentation. The studied fungi also show a diverse spectrum of CpG representation. The next most universally underrepresented dinucleotide is TpA, which is underrepresented in all studied genomes except Plasmodium falciparum.
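As an illustration of how under- and overrepresentation of a two-letter word can be quantified, here is a minimal sketch of an observed/expected odds ratio for dinucleotides, in the spirit of the measure cited as [1]. It assumes the expected frequency of a word XY is the product of the single-letter frequencies of X and Y, and that counts are pooled over the sequence and its reverse complement. The function names and the `symmetrize` flag are hypothetical, and the longer words analyzed in this work would need higher-order corrections that are not shown.

```python
from collections import Counter

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement; characters other than A/C/G/T are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def dinucleotide_odds_ratios(seq: str, symmetrize: bool = True) -> dict:
    """Return observed/expected ratios f_xy / (f_x * f_y) for each dinucleotide.

    Values below 1 suggest underrepresentation, values above 1 suggest
    overrepresentation. Dinucleotides whose expected frequency is zero
    are omitted from the result.
    """
    seq = seq.upper()
    # Assumption: pool both strands so that a word and its reverse
    # complement (e.g. CA and TG) receive the same ratio.
    strands = [seq, reverse_complement(seq)] if symmetrize else [seq]

    mono, di = Counter(), Counter()
    for strand in strands:
        mono.update(c for c in strand if c in BASES)
        for a, b in zip(strand, strand[1:]):
            # Skip pairs containing ambiguous symbols such as N.
            if a in BASES and b in BASES:
                di[a + b] += 1

    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")

    ratios = {}
    for x in BASES:
        for y in BASES:
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            if expected > 0:
                ratios[x + y] = (di[x + y] / n_di) / expected
    return ratios


if __name__ == "__main__":
    # Toy sequence with no CpG at all, for demonstration only.
    toy = "CATGCATG" * 10
    for word, rho in sorted(dinucleotide_odds_ratios(toy).items()):
        print(word, round(rho, 3))
```

On the toy sequence the ratio for CG is zero while TG and CA are strongly enriched, which is the qualitative signature expected when CpG is converted to TpG.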
The tendency of T and A nucleotides to form long homogeneous stretches (AA…A, TT…T) may contribute to TpA underrepresentation, because the expected TpA frequencies do not take such stretches into account. Indeed, TpA underrepresentation is negatively correlated with ApA (TpT) overrepresentation. This explanation would predict the same behavior for the ApT sequence. Nevertheless, TpA is underrepresented compared with ApT in all species. ApC and GpT are underrepresented in most analyzed genomes. The exceptions are three chordate groups: lamprey, lancelet, and Ciona. We do not know of any statistical reason for the underrepresentation of ApC (GpT). An actual biological mechanism appears to be responsible for the CpG deficit. It would be interesting to know whether a specific mechanism also produces the ApC (GpT) or TpA deficit. We also took advantage of the human SNP database to determine trends of nucleotide substitution in words of different lengths. To determine the direction of single-nucleotide mutations in the human genome, we selected SNPs that mapped to a corresponding Pan troglodytes genomic region. We compared the alleles of each human SNP with the chimpanzee variant: the allele matching the chimpanzee variant was considered ancestral. For instance, our analysis showed that the rate of (G or C) to (A or T) mutations is higher than the rate of (A or T) to (G or C) mutations. When these rates are taken into account, the equilibrium nucleotide content of the human genome turns out to be 38.6% (G+C) and 61.4% (A+T). The current nucleotide content of the human genome is 42.1% (G+C) and 57.9% (A+T). The nucleotide content of the human genome has therefore not reached equilibrium, and a further decrease in (G+C) is predicted.

Acknowledgements: supported by RFBR 08-04-00478, 08-04-91975 and MCB RAS.

1. S. Karlin, I. Ladunga (1994) Comparisons of eukaryotic genomic sequences, Proc Nat Acad Sci, 91:12832-12836.
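A note on the equilibrium composition reported in the abstract: the sketch below shows one simple way such a figure can arise from two substitution rates. It assumes a two-state model in which every G or C site mutates to A or T at a per-site rate u and every A or T site mutates to G or C at a per-site rate v, so that the equilibrium (G+C) fraction is v / (u + v). This model, the function names, and the absolute rate values are illustrative assumptions; the abstract does not state how the authors derived their estimate, and only the 38.6% and 42.1% figures come from the text.

```python
def equilibrium_gc(rate_gc_to_at: float, rate_at_to_gc: float) -> float:
    """Equilibrium (G+C) fraction under the assumed two-state model: v / (u + v)."""
    return rate_at_to_gc / (rate_gc_to_at + rate_at_to_gc)


def implied_rate_ratio(gc_equilibrium: float) -> float:
    """Ratio u / v that would produce the given equilibrium (G+C) fraction."""
    return (1.0 - gc_equilibrium) / gc_equilibrium


def gc_trajectory(gc_start: float, u: float, v: float, steps: int) -> list:
    """Iterate the (G+C) fraction: lose u * gc, gain v * (1 - gc) per step."""
    gc = gc_start
    trajectory = [gc]
    for _ in range(steps):
        gc = gc - u * gc + v * (1.0 - gc)
        trajectory.append(gc)
    return trajectory


if __name__ == "__main__":
    # Figures quoted in the abstract.
    gc_now, gc_eq = 0.421, 0.386

    ratio = implied_rate_ratio(gc_eq)
    print(f"implied u/v ratio: {ratio:.2f}")

    # Hypothetical absolute rate per arbitrary time step; only the ratio
    # u / v is constrained by the equilibrium value.
    v = 1e-3
    u = ratio * v
    path = gc_trajectory(gc_now, u, v, steps=1000)
    print(f"start {path[0]:.3f}, after 1000 steps {path[-1]:.3f}")
```

Under these assumptions the stated equilibrium corresponds to a (G or C) to (A or T) rate roughly 1.6 times the reverse rate, and a genome starting at 42.1% (G+C) drifts monotonically downward toward 38.6%, consistent with the predicted decrease.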