MITRE
 
About Us Our Work Employment News & Events
MITRE Remote Access for MITRE Employees Site Map
Home > Our Work > Technical Papers >

Adaptive Web-page Content Identification

July 2007

John Gibson, The MITRE Corporation
Ben Wellner, The MITRE Corporation
Susan Lubar, The MITRE Corporation

ABSTRACT

Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Webbased applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of web-pages anywhere from 80-97% of the time depending on experimental factors such as ensuring the absence of duplicate documents and application of the model against unseen sources.

» Download Paper [PDF, 548KB]

Additional Search Keywords

Conditional random fields, content identification, maximum entropy markov models, sequence labeling

 

Page last updated: July 31, 2007   |   Top of page

Homeland Security Center Center for Enterprise Modernization Command, Control, Communications and Intelligence Center Center for Advanced Aviation System Development

 
 
 

Serving as Architects of Information Advantage.™
Copyright © 1997-2008, The MITRE Corporation. All rights reserved.
MITRE is a registered trademark of The MITRE Corporation.
Material on this site may be copied and distributed with permission only.

 

Privacy Policy | Contact Us

Boston Business Journal Best Places to Work 2007 Computerworld Best Places to Work in IT 2005-2007 Fortune 100 Best Places to Work 2002-2008