Dropped Personal Pronoun Recovery in Chinese SMS
July 2016
Topics: Computing Methodologies, Social and Behavioral Sciences
In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres such as Short Message Service (SMS) messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of the task in which only the person number of each dropped pronoun is identified. After applying a word segmenter, we used a linear-chain conditional random field (CRF) to predict which words were at the start of an independent clause. Then, using the independent clause start information together with lexical and syntactic information, we applied a CRF or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person number of the dropped pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese SMS messages. Our machine-learning–based approaches substantially outperformed a rule-based approach that drew partially on rules developed by Chung and Gildea (2010). Features derived from parsing largely did not help our approaches. We conclude that parse information is largely superfluous for identifying dropped personal pronouns when independent clause start information is available.
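The per-word prediction task described above can be sketched as a sequence-labeling encoding. The following is a minimal illustration, not the authors' implementation: the `*pro:...*` marker format, the label strings (`1sg`, `2sg`), the feature names, and the example sentence are all hypothetical, chosen only to show how each word might carry a label saying whether a dropped pronoun immediately precedes it, and how a clause-start flag could enter the feature set.

```python
from typing import Dict, List, Tuple, Union

NONE = "NONE"  # label for words not preceded by a dropped pronoun


def to_sequence_labels(tokens: List[str],
                       marker_prefix: str = "*pro:") -> List[Tuple[str, str]]:
    """Convert tokens with hypothetical dropped-pronoun markers
    (e.g. "*pro:1sg*") into (word, label) pairs.

    Each real word is labeled with the marker that immediately precedes it,
    or NONE. A marker with no following word is discarded.
    """
    pairs = []
    pending = NONE
    for tok in tokens:
        if tok.startswith(marker_prefix) and tok.endswith("*"):
            # Remember the person-number label for the next real word.
            pending = tok[len(marker_prefix):-1]
        else:
            pairs.append((tok, pending))
            pending = NONE
    return pairs


def token_features(words: List[str],
                   clause_starts: List[bool],
                   i: int) -> Dict[str, Union[str, bool]]:
    """Build an assumed, purely lexical feature dict for word i,
    plus the independent-clause-start flag from the first stage."""
    return {
        "word": words[i],
        "prev": words[i - 1] if i > 0 else "<BOS>",
        "next": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "clause_start": clause_starts[i],
    }


# Invented example: "(I) arrived, where are (you)"
annotated = ["*pro:1sg*", "到", "了", "，", "*pro:2sg*", "在", "哪"]
pairs = to_sequence_labels(annotated)
words = [w for w, _ in pairs]
labels = [label for _, label in pairs]

# In the pipeline described above these flags would come from the
# clause-start CRF; here they are written by hand.
clause_starts = [True, False, False, True, False]
features = [token_features(words, clause_starts, i) for i in range(len(words))]
```

Feature dicts of this shape, paired with `labels`, are the kind of input a linear-chain CRF or maximum-entropy toolkit typically accepts. The actual features and label inventory used in the experiments are not specified in this abstract.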