DNA Sequence Analysis: The Genome Is More than the Sum of Its Parts

September 2011
Topics: Genetics, Genetic Engineering
DNA sequencing technology has made rapid strides in recent years, generating a deluge of data.
Planks vs. Boat

One of the pioneers of microbial genomics, Antoine Danchin, recalls in his book The Delphic Boat an ancient question originally posed to the oracle of Delphi. The question asked whether a boat whose planks have all been replaced one by one over time is still the same boat. Danchin echoes this question by asking whether it is the individual components of a gene or the relationships among them that define a gene. Should we attempt to deduce the biological function and the genetic family tree of an organism simply by comparing individual parts, or should we set a more ambitious goal and try to infer the overall blueprint for the genomic construction?

This question is of both fundamental and timely importance now that DNA sequencing figures so heavily in such important endeavors as DNA fingerprinting, pathogen detection, gene finding, genealogy study, and evolutionary tree reconstruction. These tasks all rely, in large part, on comparing an unfamiliar DNA sequence to one that is already known, establishing islands of similarity, and then forming a hypothesis about the function or meaning of the new sequence.

Graduating from Spelling

A DNA sequence is analogous to a sentence in English in which the letters correspond to four types of organic molecules, called bases. These bases are adenine (A), guanine (G), cytosine (C), and thymine (T). In DNA fingerprinting, for example, an unknown collection of DNA fragments, typically a few tens to thousands of bases long, is compared with one of several known collections of DNA fragments contained in a library. Either or both of these collections might be incomplete or unordered, or might contain errors, including symbol insertions and symbol deletions. Finding a match between collections establishes genome identity.
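To make the effect of insertions and deletions concrete, here is a minimal sketch in Python that scores a fragment against library entries by edit distance, the smallest number of single-base insertions, deletions, and substitutions that turns one string into the other. Edit distance is a standard textbook measure chosen here for illustration; it is not the method described in this article, and the library contents and the names `edit_distance` and `closest_entry` are hypothetical.

```python
def edit_distance(a, b):
    """Smallest number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the first i-1 bases of a
    # and the first j bases of b.
    previous = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete base_a
                current[j - 1] + 1,                     # insert base_b
                previous[j - 1] + (base_a != base_b),   # match or substitute
            ))
        previous = current
    return previous[-1]


def closest_entry(fragment, library):
    """Name of the library entry with the smallest edit distance to fragment."""
    return min(library, key=lambda name: edit_distance(fragment, library[name]))


# Hypothetical two-entry library; real fragments run to thousands of bases.
library = {"sample_a": "ACGTTGCA", "sample_b": "TTTTCCCC"}
print(closest_entry("ACGTGCA", library))  # fragment is sample_a with one base deleted
```

The table-filling step costs time proportional to the product of the two lengths, which is one reason letter-by-letter comparison becomes expensive on large collections.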

The analogy with the English language can be taken further. To determine the message contained in a text or the author of the text, we concentrate on the words in the text, what they mean, and how they fit together, rather than on the individual letters of the text. And even when we can't accurately read certain letters or sentence fragments, we can still decipher much of the meaning of the text.

It is therefore appropriate that in processing DNA sequence data, even data plagued by missing or corrupted information, we should similarly identify the key constituent components of genomes and investigate how they fit into the overall genomic design. Instead of comparing the letters of DNA sequences, we should compare the words and the sentences they form.
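One simple way to see the difference between comparing letters and comparing words is to treat every run of k consecutive bases as a "word" and ask how many words two sequences share. The Python sketch below does this. The choice of fixed-length words, the value k = 4, and the function names are assumptions made for illustration; the article does not say how genomic words should be defined.

```python
def kmers(sequence, k=4):
    """Set of all length-k words that occur in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def word_similarity(a, b, k=4):
    """Shared words divided by all words seen in either sequence (Jaccard index)."""
    words_a, words_b = kmers(a, k), kmers(b, k)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def positional_identity(a, b):
    """Fraction of positions, up to the shorter length, where the letters agree."""
    length = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / length


original = "ACGTTGCAAGGCTTACG"
damaged = original[:5] + original[6:]  # one base deleted

print(positional_identity(original, damaged))
print(word_similarity(original, damaged))
```

In this example a single deleted base shifts every later letter out of register, so the letter-by-letter score drops sharply, while most of the four-letter words survive intact.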

Speed Reading

The first step toward decoding a genome is to identify the atomic components of the genomic sequence.

MITRE researchers began this journey by marking the genomic atoms with short random strings associated with algebraic constructs. Then the team explored the utility of these strings in a DNA sequence homology evaluation, which means a sequence comparison performed by matching these marker versions of the original sequences rather than the sequences themselves. Such an evaluation lets us, at a reduced computational cost, pick out from a large pool of data those sequences that are in some way close to the query sequence, and perform a rough sequence alignment that indicates the overall sequence similarity.

This procedure consists of two essentially algebraic computations: the construction of the homology marker sequences and the alignment of these sequences by a fast implementation of the cross-correlation method. Replacing the analysis of DNA sequences with an analysis of homology marker sequences significantly reduces the number of computations required. This reduction is proportional to the ratio of the lengths of the two kinds of sequence. MITRE chose this design for its low computational complexity and because it places no limit on the size of the DNA sequence it can analyze.
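The alignment step can be illustrated with a small Python sketch of cross-correlation computed through a fast Fourier transform. This is not MITRE's implementation: the marker-sequence construction is not specified here, so the sketch works directly on DNA letters, using one 0/1 indicator sequence per base as a stand-in encoding. The function names and the example strings are likewise assumptions.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 transform; len(values) must be a power of two.

    With invert=True this is the inverse transform without the 1/n scaling.
    """
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts(reference, query):
    """counts[s] = number of bases that agree when query is placed at offset s."""
    size = 1
    while size < len(reference) + len(query):  # zero-pad to avoid wrap-around
        size *= 2
    totals = [0.0] * size
    for base in "ACGT":
        ref = [1.0 if ch == base else 0.0 for ch in reference]
        qry = [1.0 if ch == base else 0.0 for ch in query]
        ref += [0.0] * (size - len(ref))
        qry += [0.0] * (size - len(qry))
        # Correlation theorem: multiply one spectrum by the conjugate of the other.
        spectrum = [f * g.conjugate() for f, g in zip(fft(ref), fft(qry))]
        correlation = fft(spectrum, invert=True)
        for s in range(size):
            totals[s] += correlation[s].real / size
    return [round(t) for t in totals]


def best_offset(reference, query):
    """Offset with the most agreeing bases; assumes query is no longer than reference."""
    counts = match_counts(reference, query)
    lags = range(len(reference) - len(query) + 1)
    offset = max(lags, key=lambda s: counts[s])
    return offset, counts[offset]


print(best_offset("TTGACCGTAAGGCT", "CCGTTAG"))  # (4, 6): six of seven bases agree
```

Each transform of a length-N sequence takes on the order of N log N operations, whereas sliding the query along the reference one position at a time takes a number of comparisons proportional to the product of the two lengths. Running the same alignment on shorter sequences shrinks N itself, which is where the saving described above comes from.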

MITRE's research into the mathematics of DNA sequence analysis is far from a mere academic exercise. Future progress in molecular biology will bear directly on the development of effective strategies for mitigating such threats as bio-terrorism and bio-warfare. And the rate of that progress will ultimately depend on an effective merger of mathematical and life sciences.

—by Andrzej Brodzik

