How to Evaluate Your Question Answering System Every Day—and Still Get Real Work Done
April 2000
Eric J. Breck, John D. Burger, Lisa Ferro, Lynette Hirschman,
David House, Marc Light, Inderjeet Mani
The MITRE Corporation
ABSTRACT
In this paper, we report on Qaviar, an experimental automated evaluation
system for question answering applications. The goal of our research
was to find an automatically calculated measure that correlates well
with human judges' assessment of answer correctness in the context of
question answering tasks. Qaviar judges a response by computing recall
against the stemmed content words in the human-generated answer key,
counting the answer correct if its recall exceeds a given threshold.
We determined that the answer correctness predicted by Qaviar agreed
with the human judgments 93% to 95% of the time. Forty-one
question-answering systems were ranked by both Qaviar and by human
assessors; these rankings correlated with a Kendall's tau of 0.920,
compared to a correlation
of 0.956 between human assessors on the same data.
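
To make the scoring rule concrete, the rough Python sketch below computes recall over the answer key's stemmed content words and applies a correctness cutoff. The stopword list, the suffix stripper, and the threshold value are illustrative placeholders; the actual system presumably used a real stemmer and a fuller content-word filter, and its threshold may differ.

import re

# Illustrative stand-ins for the system's components: a tiny stopword
# list and a crude suffix stripper (a real stemmer, e.g. Porter's,
# would be used in practice).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was"}

def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def content_stems(text):
    """Lowercase, tokenize, drop stopwords, and stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(t) for t in tokens if t not in STOPWORDS}

def judge(response, answer_key, threshold=0.5):
    """Count the response correct if recall of the answer key's stemmed
    content words meets the threshold (the value 0.5 is illustrative)."""
    key = content_stems(answer_key)
    if not key:
        return False
    recall = len(key & content_stems(response)) / len(key)
    return recall >= threshold

# Example: both content stems of the key appear in the response.
print(judge("Neil Armstrong was the first man on the moon",
            "Neil Armstrong"))  # True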
