Adaptive Web-page Content Identification
July 2007
John Gibson, The MITRE Corporation
Ben Wellner, The MITRE Corporation
Susan Lubar, The MITRE Corporation
ABSTRACT
Identifying which parts of a Web-page contain target content (e.g.,
the portion of an online news page that contains the actual article)
is a significant problem that must be addressed for many Web-based
applications. Most approaches to this problem involve
crafting hand-tailored rules or scripts to extract the content,
customized separately for particular Web sites. Besides requiring
considerable time and effort to implement, hand-built extraction
routines are brittle: they fail to properly extract content in some
cases and break when the structure of a site's Web-pages changes.
In this work we treat the problem of identifying content as a
sequence labeling problem, a common problem structure in
machine learning and natural language processing. Using a
Conditional Random Field sequence labeling model, we correctly
identify the content portion of Web-pages between 80% and 97%
of the time, depending on experimental factors such as whether
duplicate documents are removed and whether the model is applied
to unseen sources.
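
To make the sequence labeling framing concrete, the following is a minimal sketch and not the model evaluated in this work. It treats a page as a sequence of text blocks and decodes the best label sequence with the Viterbi algorithm over a linear chain, which is how a trained Conditional Random Field is applied at prediction time. The two features (word count and link ratio), every weight, and the sample page are hypothetical and hand-set for illustration; a real CRF would learn its weights from labeled pages.

```python
LABELS = ("content", "other")


def emission(block, label):
    """Score one block for one label (hypothetical hand-set weights)."""
    evidence = (1.0 if block["words"] >= 15 else -0.5) - 3.0 * block["link_ratio"]
    return evidence if label == "content" else -evidence


def transition(prev_label, label):
    """Reward neighbouring blocks that share a label (hypothetical weights)."""
    return 0.5 if prev_label == label else -0.5


def viterbi(blocks):
    """Return the highest-scoring label sequence for a list of blocks."""
    if not blocks:
        return []
    best = [{lab: emission(blocks[0], lab) for lab in LABELS}]
    back = []
    for block in blocks[1:]:
        scores, pointers = {}, {}
        for lab in LABELS:
            # Best previous label given that the current block takes `lab`.
            prev = max(LABELS, key=lambda p: best[-1][p] + transition(p, lab))
            scores[lab] = best[-1][prev] + transition(prev, lab) + emission(block, lab)
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical news page: navigation, headline, three article blocks, footer.
PAGE = [
    {"words": 5, "link_ratio": 1.0},    # navigation bar
    {"words": 8, "link_ratio": 0.0},    # headline
    {"words": 60, "link_ratio": 0.05},  # article paragraph
    {"words": 6, "link_ratio": 0.0},    # short pull quote inside the article
    {"words": 45, "link_ratio": 0.0},   # article paragraph
    {"words": 10, "link_ratio": 0.9},   # footer links
]

sequence_labels = viterbi(PAGE)
independent_labels = [max(LABELS, key=lambda lab: emission(b, lab)) for b in PAGE]
print(sequence_labels)
print(independent_labels)
```

With these illustrative weights, the short pull quote is labeled "other" when each block is classified on its own, but "content" when the sequence is decoded jointly, because its neighbours on both sides are article text. That use of context is the property that motivates a sequence model over per-block rules.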

Additional Search Keywords
Conditional random fields, content identification, maximum
entropy Markov models, sequence labeling