Linking Biological Literature, Information, and Knowledge

Lynette Hirschman, Alex Yeh, Alex Morgan, and Marc Colosimo

The field of biology has undergone a profound revolution over the past 15 years in response to the successful sequencing of the human genome. This revolution has fostered rapid development of the biotechnology industry to support the sequencing of additional organisms.

With the human genome, the bottleneck was the sequencing itself; the current bottleneck lies largely in interpreting the sequences. It is as if someone had finally invented a way of accurately transcribing millions of texts in an unknown script; now the challenge is to understand what these texts say. Thus molecular biology has become an information science as well as an experimental science.

To tackle the understanding of the genome, biologists have been generating reams of biomedical literature and building more and more biological databases. The literature comprises accumulated knowledge: the archival record of biological experiments, their methods, their results, and interpretation of those results. The explosion of literature makes it nearly impossible for a working biologist to keep up with developments in one field, let alone relevant work across organisms or on related genes or proteins.

Biologists have responded to the need to organize all this information by developing and maintaining specialized databases, which they make accessible to the research community. Most of these databases were originally designed by individual researchers for their own searches. However, other biologists are increasingly tapping into them to do systematic sequence searches for further processing by sophisticated programs. Once supplied with the correct data, these programs can find genes in the sequenced DNA, predict the protein folding that gives each protein its specific function, and even predict the function of new proteins. There are now hundreds of databases covering specific model organisms (e.g., fly, mouse, worm), classes of proteins, binding sites, microarray experiments, pathogens, and many other subjects.

To help deal with the challenge of managing so much information, MITRE is contributing its knowledge of information management and data mining to the biology community. For example, we're developing tools to help researchers maintain their databases effectively through text mining of the biological literature. We're also leading the way in the systematic evaluation of the state of the art for text mining of this literature.

Managing Growth

The ever-growing biological databases are becoming increasingly expensive and difficult to maintain. Each model organism database, for example, is maintained by a team of specialized biologists ("curators") who track the literature and transfer relevant new findings into appropriate database entries in a process called "data curation." Typically, these databases lag behind the literature because the curators have difficulty keeping up with the flow of information. The curators need interactive tools to:

  • Help in the timely and consistent transfer of information from the literature into the databases
  • Identify, prioritize, and add new articles
  • Identify which genes and/or proteins have experimental findings associated with them
  • Associate function information for genes and proteins
  • Identify which genes and/or proteins interact to regulate various functions.

Biologists rely heavily on the information in biological databases and often need to compare information among them. For this reason, databases must use the same terms to describe the same item. To represent information uniformly across different organisms, curators are increasingly making use of ontologies to provide a set of "computable" semantic categories. For example, the most widely used biological ontology, the Gene Ontology, provides three hierarchies to describe biological process, molecular function, and subcellular localization in terms of "is_a" and "part_of" relations. There are now databases for more than 15 organisms that annotate genes using the Gene Ontology.
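To make the idea of "computable" categories concrete, here is a minimal sketch of a term hierarchy stored as labeled edges and walked upward to collect every broader term. The terms, the relation labels, and the function names are illustrative assumptions, not actual Gene Ontology entries or any real tool's API.

```python
from collections import defaultdict

# Toy edges: (child term, relation, parent term).
# These are illustrative only, not real Gene Ontology records.
EDGES = [
    ("mitochondrial membrane", "part_of", "mitochondrion"),
    ("mitochondrion", "is_a", "organelle"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
]


def build_parents(edges):
    """Index the edges so each term maps to its (relation, parent) pairs."""
    parents = defaultdict(set)
    for child, relation, parent in edges:
        parents[child].add((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
print(sorted(ancestors("mitochondrial membrane", PARENTS)))
```

Under this sketch, a gene annotated with a specific term can also be found by a query on any of the broader terms, which is what allows annotations from different organism databases to be compared.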

Our main focus is on tools that will improve the currency, consistency, and completeness of biological databases. To find out what is needed, we have identified key partners to guide our research priorities and provide feedback on the use of the tools we develop. We have also involved these partners in setting up critical assessments of the state of the art in text mining technology.

When we began our efforts a few years ago, a number of groups were claiming to be able to extract information from the biological literature; however, no two groups had evaluated their systems on the same data sets, so no results were comparable. Also, none of these research systems had been applied to real applications. Thus it was impossible to tell whether text mining for biology was ready for application to biological problems—or whether more research was needed.

To assess the state of the art, we began by organizing a Challenge Cup for the Association for Computing Machinery Knowledge Discovery and Data Mining Conference. The "challenge" was to create a system that could identify articles containing experimental evidence on specific genes and gene products. We worked with FlyBase (the Drosophila genome database) to provide sample data sets of articles, both for system training and for evaluating system performance on new ("blind") test data. The event attracted 18 systems from eight countries; most of the teams fielding them included both biologists and computer scientists. Overall, the results were promising: the winning system was able to return documents with the relevant information 78 percent of the time. However, the technology will need further refinement before it can truly help the database curators in their work.

For our second evaluation activity, we organized BioCreAtIvE: Critical Assessment of Information Extraction for Biology, held in Spain in March 2004. Our goal was to assess the state of the art for text mining systems applied to real biological problems, focusing again on curation aids for biological databases. The assessment, in which 27 groups participated, focused on two tasks:

  1. Extracting gene or protein names from text and converting these mentions into standardized gene identifiers for inclusion into three model organism databases.
  2. Extracting Gene Ontology annotations from text for protein function, biological process, and localizations. (This task was put together by our collaborators at CNB-CSIC, Autonomous University of Madrid.)

For the first task, the best systems were able to extract general gene names from sentences of MEDLINE abstracts with more than 80 percent accuracy and map the gene names to unique database identifiers. These results may be good enough to provide tools to help database curators. The second task proved to be more difficult since it required finding mentions of proteins and then extracting the function or location of the protein from the information provided in the text. Systems were able to achieve only 10 to 20 percent accuracy on this more challenging task.
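As an illustration of how a system's output on the first task might be scored, the sketch below compares the gene identifiers a system returns for one abstract against a curated reference list. The article reports results simply as accuracy; precision, recall, and F-measure are one common way to score such lists and may differ from the measure actually used in the assessment. The identifiers and the function name are hypothetical.

```python
def score(predicted, gold):
    """Compare predicted identifiers with the curated (gold) identifiers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical identifiers for a single abstract.
system_output = {"TOY:0001", "TOY:0002", "TOY:0003"}
curated_list = {"TOY:0001", "TOY:0002", "TOY:0004", "TOY:0005"}
precision, recall, f_measure = score(system_output, curated_list)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

Scoring every system against the same reference lists is what makes results comparable across groups.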

We are now actively exploring how to improve performance on both of these tasks. We are currently creating a generalized system to locate and normalize gene names, given a lexicon of gene names and identifiers for a particular organism. We are also working on improved information extraction for complex biological information (such as assignment of Gene Ontology codes) by providing a much richer lexicon of terms and phrases for biological functions and processes. These tools would have wide applicability beyond biology. They could be used to identify geographic names, company names, and even names of "people of interest" in running text.
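The lexicon-driven approach can be sketched as follows: given a table of gene names and synonyms mapped to identifiers, find each mention in running text and return its identifier. This is a simplified illustration under stated assumptions, not the system described in this article; the identifiers are made up, the lexicon is tiny, and the function names are hypothetical.

```python
import re

# Toy lexicon: lowercase gene name or synonym -> identifier.
# The identifiers are invented for illustration.
LEXICON = {
    "wingless": "TOY:0001",
    "wg": "TOY:0001",
    "hedgehog": "TOY:0002",
    "sonic hedgehog": "TOY:0003",
}


def compile_lexicon(lexicon):
    """Build one regex that prefers longer names over their substrings."""
    names = sorted(lexicon, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def normalize_genes(text, lexicon):
    """Return (mention, identifier) pairs in order of appearance."""
    matcher = compile_lexicon(lexicon)
    return [
        (match.group(1), lexicon[match.group(1).lower()])
        for match in matcher.finditer(text)
    ]


sentence = "Sonic hedgehog and Wg interact with hedgehog."
for mention, identifier in normalize_genes(sentence, LEXICON):
    print(mention, "->", identifier)
```

Trying longer names first keeps "Sonic hedgehog" from being read as a mention of "hedgehog". A practical system would also have to handle names that double as ordinary words and names shared by more than one gene, which plain lookup cannot resolve. The same lookup pattern applies to geographic or company names if the lexicon is swapped.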

In today's Information Age, the challenge is increasingly becoming not the lack of information but the excess of it. Data are most valuable when they can be easily and quickly accessed, digested, and understood. MITRE is working on the information management challenges of many of our sponsors—helping to channel the flow of information so that it reaches its intended destinations, but without flooding them.

 

For more information, please contact Lynette Hirschman, Alex Yeh, Alex Morgan, or Marc Colosimo using the employee directory.


Page last updated: May 24, 2005

Copyright © 1997-2013, The MITRE Corporation. All rights reserved.
MITRE is a registered trademark of The MITRE Corporation.
Material on this site may be copied and distributed with permission only.
