Adaptive Web-page Content Identification

July 2007

John Gibson, The MITRE Corporation
Ben Wellner, The MITRE Corporation
Susan Lubar, The MITRE Corporation

ABSTRACT

Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web-pages 80-97% of the time, depending on experimental factors such as whether duplicate documents are excluded and whether the model is applied to previously unseen sources.
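To make the sequence labeling framing concrete, the sketch below labels each block of a page as "content" or "other" and picks the best label sequence with Viterbi decoding over a linear-chain score, which is the inference step a Conditional Random Field would use. It is only an illustration under stated assumptions: the feature names, the 12-word threshold, and all weights are hypothetical and hand-set here, whereas a real model would learn its weights from labeled pages, and none of these values come from the paper.

```python
from typing import Dict, List, Tuple

LABELS: Tuple[str, ...] = ("content", "other")

# Hypothetical, hand-set weights for illustration only.
# A trained CRF would learn these from labeled Web-pages.
EMISSION: Dict[str, Dict[str, float]] = {
    "content": {"long_text": 2.0, "short_text": -1.5, "has_link": -1.0},
    "other": {"long_text": -1.0, "short_text": 1.0, "has_link": 1.0},
}
# Transition weights reward staying in the same label, so article
# text tends to be labeled as one contiguous run of blocks.
TRANSITION: Dict[Tuple[str, str], float] = {
    ("content", "content"): 1.0,
    ("content", "other"): -0.5,
    ("other", "content"): -0.5,
    ("other", "other"): 1.0,
}


def features(block: str) -> List[str]:
    """Extract a few toy features from one block of page text."""
    feats = ["long_text" if len(block.split()) >= 12 else "short_text"]
    if "<a " in block.lower():
        feats.append("has_link")
    return feats


def emission_score(label: str, feats: List[str]) -> float:
    """Sum the weights of the active features for a label."""
    return sum(EMISSION[label].get(f, 0.0) for f in feats)


def label_blocks(blocks: List[str]) -> List[str]:
    """Return the highest-scoring label sequence (Viterbi decoding)."""
    if not blocks:
        return []
    feats = [features(b) for b in blocks]
    scores = [{lab: emission_score(lab, feats[0]) for lab in LABELS}]
    backpointers: List[Dict[str, str]] = []
    for t in range(1, len(blocks)):
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for lab in LABELS:
            # Best previous label for ending in `lab` at position t.
            prev = max(LABELS, key=lambda p: scores[-1][p] + TRANSITION[(p, lab)])
            row[lab] = (scores[-1][prev] + TRANSITION[(prev, lab)]
                        + emission_score(lab, feats[t]))
            ptr[lab] = prev
        scores.append(row)
        backpointers.append(ptr)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda lab: scores[-1][lab])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


EXAMPLE_BLOCKS = [
    '<a href="/">Home</a> <a href="/news">News</a>',
    "The city council voted on Tuesday to approve a new budget for public transit.",
    "Officials said the plan would add bus routes and extend service hours across the region.",
    "Privacy Policy | Contact Us",
]

if __name__ == "__main__":
    for label, block in zip(label_blocks(EXAMPLE_BLOCKS), EXAMPLE_BLOCKS):
        print(f"{label:8s} {block[:50]}")
```

Because the decision for each block depends on its neighbors through the transition weights, a short line inside an article can still be labeled as content, which is the kind of context a per-block rule cannot capture.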

» Download Paper [PDF, 548KB]

Additional Search Keywords

Conditional random fields, content identification, maximum entropy Markov models, sequence labeling

 

Page last updated: July 31, 2007

Copyright © 1997-2009, The MITRE Corporation. All rights reserved.
MITRE is a registered trademark of The MITRE Corporation.
Material on this site may be copied and distributed with permission only.
