Reliable Electronic Text: The Elusive Prerequisite for a Host of Human Language Technologies
April 2011
Paul M. Herceg, The MITRE Corporation
Catherine N. Ball, The MITRE Corporation
ABSTRACT
Electronic text for use by human language technologies originates from a number of sources: direct keyboard entry, optical character recognition, speech recognition, and text-containing computer files. In particular, text-containing computer files may elude processing by an array of human language technology applications (e.g., search, language ID, machine translation, and text analytics). This paper brings to light the effort required to extract electronic text from these files, preserve its integrity, and, for some use cases, preserve its structure. It explores a series of specific human language technologies, highlighting the following aspects for each: relevant use cases, the impact of text extraction or conversion errors, the criticality of dependable text extraction and reliable electronic text, and the importance of experimentation and/or testing prior to use. Overall, this paper promotes the successful use of human language technology by equipping the reader to be discerning about the use of human language technology applications with text-containing files.

Additional Search Keywords
Electronic text, human language technology, computer files, search, language ID, machine translation, text analytics, content extraction