Robust Language Identification in Short, Noisy Texts: Improvements to LIGA
July 2012
John Vogel, Brandeis University
David Tresner-Kirsch, Brandeis University and The MITRE Corporation
ABSTRACT
Language identification (LI) is a crucial preprocessing step for natural language processing tasks and other secondary uses of documents from multilingual sources. Conventional machine learning approaches to LI perform very well on long documents written in standard language, but relatively poorly on short documents rife with non-standard orthography. This limitation is an obstacle to secondary uses of social media data from Twitter and similar sources. We propose several linguistically motivated modifications to the LIGA algorithm and evaluate them empirically. Our results show that a modified algorithm achieves 99.8% accuracy in disambiguating among six European languages, reducing baseline LIGA's error rate by roughly an order of magnitude.

Additional Search Keywords
Twitter, microblogs, language identification, natural language processing, machine learning