Inter-rater Agreement Measures and the Refinement of Metrics in the PLATO MT Evaluation Paradigm
July 2005
Keith J. Miller, The MITRE Corporation
Michelle Vanni, U.S. Army Research Laboratory
ABSTRACT
The PLATO machine translation (MT) evaluation (MTE) research program
has as a goal the systematic development of a predictive relationship
between discrete, well-defined MTE metrics and the specific information
processing tasks that can be reliably performed with MT output. Traditional
measures of quality, informed by International Standards for Language
Engineering (ISLE), namely, clarity, coherence, morphology, syntax,
general and domain-specific lexical robustness, and named-entity translation,
as well as a DARPA-inspired measure of adequacy, are at the core of the
program. Because robust validation is indispensable for refining the tests
and guidelines, we conduct inter-rater reliability tests on the assessments.
Here we describe the development of our inter-rater reliability tests and
report on their results, focusing on Clarity and Coherence,
the first two assessments in the PLATO suite, and we present our method
for iteratively refining our linguistic metrics and the guidelines for
applying them within the PLATO evaluation paradigm. Finally, we discuss
reasons why kappa might not be the best measure of inter-rater agreement
for our purposes, and suggest directions for future investigation.
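For readers unfamiliar with the statistic, kappa corrects observed agreement for the agreement expected by chance. In its standard two-rater (Cohen) form, a sketch of the definition is

\[
\kappa = \frac{P_o - P_e}{1 - P_e},
\]

where \(P_o\) is the observed proportion of agreement between raters and \(P_e\) is the proportion expected by chance from the raters' marginal rating distributions. The abstract does not specify which variant (e.g., weighted or multi-rater) is used in the PLATO studies.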
