
Adaptive Web-page Content Identification

July 2007

John Gibson, The MITRE Corporation
Ben Wellner, The MITRE Corporation
Susan Lubar, The MITRE Corporation

ABSTRACT

Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web-pages 80-97% of the time, depending on experimental factors such as whether duplicate documents are excluded and whether the model is applied to previously unseen sources.
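To make the sequence labeling framing concrete, the following is a minimal sketch, not the authors' implementation. It assumes a page has already been split into text blocks, scores each block as "content" or "other" with a linear-chain model, and decodes the best label sequence with Viterbi. The feature names, the 20-word threshold, and all weights are hypothetical and hand-set for illustration; a trained Conditional Random Field would learn its weights from labeled pages, and the abstract does not specify the feature set actually used.

```python
LABELS = ("content", "other")

def block_features(text, link_words):
    """Hypothetical per-block features (the abstract does not list the real ones)."""
    n = len(text.split())
    return {
        "long_block": 1.0 if n >= 20 else 0.0,
        "short_block": 1.0 if n < 20 else 0.0,
        "link_heavy": 1.0 if n and link_words / n > 0.5 else 0.0,
    }

# Hand-set illustrative weights; a trained CRF would estimate these from data.
EMISSION = {
    "content": {"long_block": 2.0, "short_block": -1.0, "link_heavy": -2.0},
    "other": {"long_block": -1.0, "short_block": 1.0, "link_heavy": 2.0},
}
TRANSITION = {
    ("content", "content"): 1.0, ("content", "other"): -0.5,
    ("other", "content"): -0.5, ("other", "other"): 1.0,
}

def emission_score(label, feats):
    return sum(EMISSION[label][name] * value for name, value in feats.items())

def viterbi(feature_seq):
    """Return the highest-scoring label sequence under the linear-chain scores."""
    if not feature_seq:
        return []
    # Each column maps label -> (best score ending in label, previous label).
    columns = [{lab: (emission_score(lab, feature_seq[0]), None) for lab in LABELS}]
    for feats in feature_seq[1:]:
        column = {}
        for lab in LABELS:
            prev, prev_score = max(
                ((p, columns[-1][p][0] + TRANSITION[(p, lab)]) for p in LABELS),
                key=lambda item: item[1],
            )
            column[lab] = (prev_score + emission_score(lab, feats), prev)
        columns.append(column)
    # Follow back-pointers from the best final label.
    label = max(LABELS, key=lambda lab: columns[-1][lab][0])
    path = [label]
    for column in reversed(columns[1:]):
        label = column[label][1]
        path.append(label)
    path.reverse()
    return path

# Toy page: (block text, number of words inside links).
paragraph = "word " * 25
blocks = [
    ("Home News Sports Weather", 4),
    (paragraph, 0),
    ("He declined to comment.", 0),
    (paragraph, 0),
    ("Privacy Policy Contact Us", 4),
]
labels = viterbi([block_features(text, links) for text, links in blocks])
print(labels)  # ['other', 'content', 'content', 'content', 'other']
```

The short quote in the middle of the article looks like navigation text when scored in isolation, but the transition scores favor keeping it with its neighbors. That dependence on context is what a sequence model adds over labeling each block independently.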


Additional Search Keywords

Conditional random fields, content identification, maximum entropy Markov models, sequence labeling

 

Page last updated: July 31, 2007
