Statistical Analysis of DNA Sequences Using Overlapping Windows
Amy Hauth, Murray K. Clayton
2000
Motivation: Our analysis of DNA sequences uses a k-length sliding window and considers all overlapping windows along the sequence. The k consecutive nucleotides in a window are called a word or k-word. Statistical analysis of this collection of words often assumes independence between words. Since words can overlap, strict independence is not a valid assumption. We derive a statistic to incorporate both the independent and dependent components of overlapping, k-length words.

Results: The expected number of occurrences for a k-word in an N-length sequence is easily calculated given the probabilities of the nucleotides within the word. However, the variance is not straightforward since overlapping occurrences are not independent. We present a derivation of the variance when sequence analysis uses overlapping, k-length windows. The variance can be determined for a word in the entire sequence or at a single position in the sequence. Our analysis assumes that each nucleotide is independent. It does not assume a specific probability of occurrence for each nucleotide.
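The abstract does not reproduce the derivation itself, so the following is only a minimal sketch of how such a variance can be computed. It uses the standard decomposition of the count into per-window indicator variables plus covariance terms for windows that overlap, and it is not necessarily the authors' formulation. It assumes nucleotides are independent and share the same distribution at every position. The function name `word_count_moments` and the `probs` dictionary are hypothetical names introduced for illustration.

```python
from math import prod


def word_count_moments(word, seq_len, probs):
    """Mean and variance of the number of overlapping occurrences of `word`.

    Assumptions (illustrative, not taken from the paper): nucleotides are
    independent and identically distributed, with probabilities given by
    `probs`, e.g. {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}.
    """
    k = len(word)
    n = seq_len - k + 1  # number of overlapping k-length windows
    if n <= 0:
        return 0.0, 0.0

    # Probability that the word occupies one given window.
    p_word = prod(probs[c] for c in word)
    mean = n * p_word

    # Independent component: variance of n Bernoulli indicators.
    var = n * p_word * (1.0 - p_word)

    # Dependent component: windows whose starts differ by d < k share k - d
    # nucleotides, so their indicators are correlated.
    for d in range(1, min(k, n)):
        if word[d:] == word[:k - d]:
            # The word can overlap itself at shift d; the joint event fixes
            # k + d nucleotides: the word followed by its last d letters.
            joint = p_word * prod(probs[c] for c in word[k - d:])
        else:
            # The word cannot start at both positions simultaneously.
            joint = 0.0
        # There are n - d pairs of windows at shift d, each counted twice.
        var += 2.0 * (n - d) * (joint - p_word ** 2)

    return mean, var
```

For example, with uniform nucleotide probabilities and a sequence of length 3, the self-overlapping word "AA" and the non-overlapping word "AC" have the same expected count (0.125) but different variances (9/64 versus 7/64), which illustrates why treating overlapping windows as independent gives the wrong variance.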
