Adaptive Web-page Content Identification
July 2007
John Gibson, The MITRE Corporation
Ben Wellner, The MITRE Corporation
Susan Lubar, The MITRE Corporation
ABSTRACT
Identifying which parts of a Web-page contain target content (e.g.,
the portion of an online news page that contains the actual article)
is a significant problem that must be addressed for many Web-based
applications. Most approaches to this problem involve
crafting hand-tailored rules or scripts to extract the content,
customized separately for particular Web sites. Besides requiring
considerable time and effort to implement, hand-built extraction
routines are brittle: they fail to properly extract content in some
cases and break when the structure of a site's Web-pages changes.
In this work we treat the problem of identifying content as a
sequence labeling problem, a common problem structure in
machine learning and natural language processing. Using a
Conditional Random Field sequence labeling model, we correctly
identify the content portion of Web-pages 80-97% of the time,
depending on experimental factors such as whether duplicate
documents are excluded and whether the model is applied to
unseen sources.
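
To illustrate what treating content identification as sequence labeling means in practice, the following is a minimal sketch, not the authors' model. It assumes a page has already been split into text blocks, assigns each block one of two labels, and picks the best label sequence with Viterbi decoding over a linear-chain score. The label names, the single word-count feature, and all weights are hypothetical and hand-set for illustration; a Conditional Random Field would learn its weights from labeled pages and use much richer features.

```python
from typing import Dict, List, Optional, Tuple

LABELS = ("other", "content")

# Hypothetical hand-set transition weights (assumption, not from the paper):
# staying in the same label is rewarded, switching labels is penalized.
TRANSITION: Dict[Tuple[str, str], float] = {
    ("other", "other"): 0.5,
    ("content", "content"): 0.5,
    ("other", "content"): -0.5,
    ("content", "other"): -0.5,
}


def emission_score(block: str, label: str) -> float:
    """Toy feature: blocks with more than 8 words look like article content."""
    score = (len(block.split()) - 8) / 4.0
    return score if label == "content" else -score


def viterbi(blocks: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for the given blocks."""
    if not blocks:
        return []
    # Each column maps label -> (best path score ending here, previous label).
    columns: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {lab: (emission_score(blocks[0], lab), None) for lab in LABELS}
    ]
    for block in blocks[1:]:
        prev = columns[-1]
        column: Dict[str, Tuple[float, Optional[str]]] = {}
        for lab in LABELS:
            prev_lab, prev_score = max(
                ((p, prev[p][0] + TRANSITION[(p, lab)]) for p in LABELS),
                key=lambda item: item[1],
            )
            column[lab] = (prev_score + emission_score(block, lab), prev_lab)
        columns.append(column)
    # Backtrack from the best final label using the stored previous labels.
    label = max(LABELS, key=lambda lab: columns[-1][lab][0])
    path = [label]
    for column in reversed(columns[1:]):
        label = column[label][1]
        path.append(label)
    path.reverse()
    return path


# Made-up page blocks: navigation, article text, a short article line, footer.
EXAMPLE_BLOCKS = [
    "Home | News | Sports",
    "The city council voted on Tuesday to approve a new budget "
    "for road repairs across downtown",
    "The mayor declined to comment further",
    "Officials said the work would begin next spring and take "
    "roughly two years to finish completely",
    "Copyright 2007",
]

if __name__ == "__main__":
    print(viterbi(EXAMPLE_BLOCKS))
    # → ['other', 'content', 'content', 'content', 'other']
```

Scored in isolation, the short third block would be labeled "other" by the toy feature. The transition weights keep it inside the surrounding run of content, which is the kind of context a sequence model can exploit and a per-block classifier cannot.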
Additional Search Keywords
Conditional random fields, content identification, maximum
entropy Markov models, sequence labeling