Linking Biological Literature, Information, and Knowledge

Lynette Hirschman, Alex Yeh, Alex Morgan, and Marc Colosimo
The field of biology has undergone a profound revolution over the past 15 years in response to the successful sequencing of the human genome. This revolution has fostered rapid development of the biotechnology industry to support the sequencing of additional organisms. Whereas with the human genome, the bottleneck was the actual sequencing of the genome, the current bottleneck is largely in the interpretation of the sequences. It is as if someone had finally invented a way of accurately transcribing millions of texts in an unknown script; now the challenge is to understand what these texts say. Thus molecular biology has become an information science as well as an experimental science.

To tackle the understanding of the genome, biologists have been generating reams of biomedical literature and building more and more biological databases. The literature comprises accumulated knowledge: the archival record of biological experiments, their methods, their results, and interpretation of those results. The explosion of literature makes it nearly impossible for a working biologist to keep up with developments in one field, let alone relevant work across organisms or on related genes or proteins.

Biologists have responded to the need to organize all this information by developing and maintaining specialized databases, which they make accessible to the research community. Most of these databases were originally designed by individual researchers for their own searches. However, other biologists are increasingly tapping into them to do systematic sequence searches for further processing by sophisticated programs. Once supplied with the correct data, these programs can find genes in the sequenced DNA, predict the protein folding that gives each protein its specific function, and even predict the function of new proteins.
There are now hundreds of databases covering specific model organisms (e.g., fly, mouse, worm), classes of proteins, binding sites, microarray experiments, pathogens, and many other subjects. To help deal with the challenge of managing so much information, MITRE is contributing its knowledge of information management and data mining to the biology community. For example, we're developing tools to help researchers maintain their databases effectively through text mining of the biological literature. We're also leading the way in the systematic evaluation of the state of the art for text mining of this literature.

Managing Growth

The ever-growing biological databases are becoming increasingly expensive and difficult to maintain. Each model organism database, for example, is maintained by a team of specialized biologists ("curators") who track the literature and transfer relevant new findings into appropriate database entries in a process called "data curation." Typically, these databases lag behind the literature because the curators have difficulty keeping up with the flow of information. The curators need interactive tools to:
Biologists rely heavily on the information in biological databases and often need to compare information among them. For this reason, databases must use the same terms to describe the same item. To represent information uniformly across different organisms, curators are increasingly making use of ontologies to provide a set of "computable" semantic categories. For example, the most widely used biological ontology, the Gene Ontology, provides three hierarchies to describe biological process, molecular function, and subcellular localization in terms of "is_a" and "has_part" relations. There are now databases for more than 15 organisms that annotate genes using the Gene Ontology.

When we began our efforts a few years ago, a number of groups were claiming to be able to extract information from the biological literature; however, no two groups had evaluated their systems on the same data sets, so no results were comparable. Also, none of these research systems had been applied to real applications. Thus it was impossible to tell whether text mining for biology was ready for application to biological problems or whether more research was needed.

To assess the state of the art, we began by organizing a Challenge Cup for the Association for Computing Machinery Knowledge Discovery and Data Mining Conference. The "challenge" was to create a system that was able to identify articles containing experimental evidence on specific genes and gene products. We worked with FlyBase (the Drosophila genome database) to provide sample data sets of articles for both system training and evaluation of system performance on new ("blind") test data. The successful event attracted 18 systems from eight countries fielded by teams, most of which consisted of both biologists and computer scientists. Overall, the results were promising: the winning system was able to return documents with the relevant information 78 percent of the time. However, the technology will need further refinement before it can truly help the database curators in their work.

For our second evaluation activity, we organized BioCreAtIvE: Critical Assessment of Information Extraction for Biology, held in Spain in March 2004. Our goal was to assess the state of the art for text mining systems applied to real biological problems, focusing again on curation aids for biological databases. The assessment, in which 27 groups participated, focused on two tasks:
For the first task, the best systems were able to extract general gene names from sentences of MEDLINE abstracts with more than 80 percent accuracy and map the gene names to unique database identifiers. These results may be good enough to provide tools to help database curators. The second task proved to be more difficult since it required finding mentions of proteins and then extracting the function or location of the protein from the information provided in the text. Systems were able to achieve only 10 to 20 percent accuracy on this more challenging task.

We are now actively exploring how to improve performance on both of these tasks. We are currently creating a generalized system to locate and normalize gene names, given a lexicon of gene names and identifiers for a particular organism. We are also working on improved information extraction for complex biological information (such as assignment of Gene Ontology codes) by providing a much richer lexicon of terms and phrases for biological functions and processes.

These tools would have wide applicability beyond biology. They could be used to identify geographic names, company names, and even names of "people of interest" in running text.

In today's Information Age, the challenge is increasingly becoming not the lack of information but the excess of it. Data are most valuable when they can be easily and quickly accessed, digested, and understood. MITRE is working on the information management challenges of many of our sponsors, helping to channel the flow of information so that it reaches its intended destinations, but without flooding them.
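
As a rough illustration of the gene name normalization described above (locating gene names in running text and mapping them to identifiers from an organism-specific lexicon), here is a minimal Python sketch. It is not the system discussed in this article: the lexicon contents, the placeholder identifiers such as GENE:0001, and the function names build_index and normalize_genes are all assumptions made for the example. It uses simple case-insensitive dictionary lookup and skips any name that is shared by more than one gene.

```python
import re

# Hypothetical lexicon: identifier -> names and synonyms. The identifiers are
# placeholders for illustration, not real database accessions.
LEXICON = {
    "GENE:0001": ["wingless", "wg"],
    "GENE:0002": ["hedgehog", "hh"],
    "GENE:0003": ["decapentaplegic", "dpp"],
}


def build_index(lexicon):
    """Map each lowercased name to the set of identifiers that use it."""
    index = {}
    for gene_id, names in lexicon.items():
        for name in names:
            index.setdefault(name.lower(), set()).add(gene_id)
    return index


def normalize_genes(text, lexicon):
    """Return (mention, start offset, identifier) for each unambiguous match."""
    index = build_index(lexicon)
    # List longer names first so a full name wins over a shorter overlapping one.
    names = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        ids = index[match.group(1).lower()]
        if len(ids) == 1:  # skip names shared by several genes
            hits.append((match.group(1), match.start(), next(iter(ids))))
    return hits


sentence = "Expression of wg depends on hedgehog signaling."
for mention, offset, gene_id in normalize_genes(sentence, LEXICON):
    print(mention, offset, gene_id)
```

Exact dictionary lookup of this kind is only a starting point. Gene names often coincide with ordinary English words or appear in spelling variants, so a practical tool would likely need additional filtering and disambiguation.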
For more information, please contact Lynette Hirschman, Alex Yeh, Alex Morgan, or Marc Colosimo using the employee directory.

Page last updated: May 24, 2005