Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 1995
May;51(5):5084-91.
Long-range correlation properties of coding and noncoding DNA sequences:
GenBank analysis.
Buldyrev SV, Goldberger AL, Havlin S, Mantegna RN, Matsa ME, Peng CK, Simons M,
Stanley HE.
Department of Physics, Boston University, Massachusetts 02215, USA.
An open question in computational molecular biology is whether long-range
correlations are present in both coding and noncoding DNA or only in the latter.
To answer this question, we consider all 33301 coding and all 29453 noncoding
eukaryotic sequences--each of length larger than 512 base pairs (bp)--in the
present release of GenBank to determine whether there is any statistically
significant distinction in their long-range correlation properties. Standard
fast Fourier transform (FFT) analysis indicates that coding sequences have
practically no correlations in the range from 10 bp to 100 bp (spectral exponent
beta=0.00 +/- 0.04, where the uncertainty is two standard deviations). In
contrast, for noncoding sequences, the average value of the spectral exponent
beta is positive (0.16 +/- 0.05), which unambiguously shows the presence of
long-range correlations. We also separately analyze the 874 coding and the 1157
noncoding sequences that have more than 4096 bp and find a larger region of
power-law behavior. We calculate the probability that these two data sets
(coding and noncoding) were drawn from the same distribution and we find that it
is less than 10^-10. We obtain independent confirmation of these findings
using the method of detrended fluctuation analysis (DFA), which is designed to
treat sequences with statistical heterogeneity, such as DNA's known mosaic
structure ("patchiness") arising from the nonstationarity of nucleotide
concentration. The near-perfect agreement between the two independent analysis
methods, FFT and DFA, increases the confidence in the reliability of our
conclusion.
PMID: 9963221 [PubMed - indexed for MEDLINE]
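
The two analyses named in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The abstract does not say how nucleotides are converted to numbers, so the purine/pyrimidine mapping below (+1 for C or T, -1 for A or G) is an assumption, as are the 512-bp averaging window, the DFA box sizes, and the function names dna_to_walk, spectral_exponent and dfa_exponent, which are hypothetical. The first estimator fits the slope of the averaged power spectrum S(f) ~ 1/f^beta over periods of 10 bp to 100 bp; the second fits the DFA scaling F(n) ~ n^alpha after removing a linear trend in each box.

```python
import numpy as np


def dna_to_walk(seq):
    """Map a DNA string to +1 (pyrimidine C/T) or -1 (purine A/G).

    Assumed mapping; any other character is skipped.
    """
    table = {"C": 1.0, "T": 1.0, "A": -1.0, "G": -1.0}
    return np.array([table[b] for b in seq.upper() if b in table])


def spectral_exponent(u, min_period=10, max_period=100, window=512):
    """Estimate beta in S(f) ~ 1/f**beta from window-averaged periodograms."""
    u = np.asarray(u, dtype=float)
    n_win = len(u) // window
    # Cut the sequence into non-overlapping windows and remove each mean.
    segs = u[: n_win * window].reshape(n_win, window)
    segs = segs - segs.mean(axis=1, keepdims=True)
    # Average the squared FFT magnitude over all windows.
    power = (np.abs(np.fft.rfft(segs, axis=1)) ** 2).mean(axis=0)
    freqs = np.fft.rfftfreq(window, d=1.0)  # cycles per base pair
    # Keep only frequencies whose period lies in [min_period, max_period].
    mask = (freqs >= 1.0 / max_period) & (freqs <= 1.0 / min_period)
    slope, _ = np.polyfit(np.log(freqs[mask]), np.log(power[mask]), 1)
    return -slope


def dfa_exponent(u, box_sizes=(8, 16, 32, 64, 128)):
    """Estimate alpha in F(n) ~ n**alpha with linear detrending per box."""
    u = np.asarray(u, dtype=float)
    y = np.cumsum(u - u.mean())  # integrated profile (the "DNA walk")
    fluct = []
    for n in box_sizes:
        n_box = len(y) // n
        boxes = y[: n_box * n].reshape(n_box, n).T  # shape (n, n_box)
        x = np.arange(n)
        # Least-squares line in every box at once (one column per box).
        coeffs = np.polyfit(x, boxes, 1)  # shape (2, n_box)
        trend = np.outer(x, coeffs[0]) + coeffs[1]
        fluct.append(np.sqrt(np.mean((boxes - trend) ** 2)))
    alpha, _ = np.polyfit(np.log(box_sizes), np.log(fluct), 1)
    return alpha


if __name__ == "__main__":
    # Uncorrelated +/-1 control sequence, not real GenBank data.
    rng = np.random.default_rng(0)
    control = rng.choice([-1.0, 1.0], size=2 ** 16)
    print("beta  (control):", spectral_exponent(control))
    print("alpha (control):", dfa_exponent(control))
```

For an uncorrelated control sequence, beta should come out near 0 and alpha near 0.5. For power-law correlated noise the two exponents are commonly related by alpha = (1 + beta) / 2, which is one way to check that the two methods agree on the same data.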