Improving collocation extraction for high frequency words

David Wible, Chin Hwa Kuo, Nai Lung Tsao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The purpose of this paper is to introduce an alternative word association measure aimed at addressing the under-extraction of collocations that contain high-frequency words. While measures such as MI make the important contribution of filtering out the sheer high frequency of words in the detection of collocations in large corpora, one side effect of this filtering is that it becomes correspondingly difficult for such measures to detect true collocations involving high-frequency words. As an alternative, we propose normalizing the MI measure by dividing the frequency of a candidate lexeme by the number of senses of that lexeme. We premise this alternative approach on the one-sense-per-collocation assumption of Yarowsky (1992; 1995). Ten verb-noun collocations involving three high-frequency verbs (make, take, run) are used to compare the extraction results of traditional MI and the proposed normalized MI. Results show that the ranking of these high-frequency verbs as candidate collocates with the target focal nouns is raised by normalizing MI as proposed. Side effects of these improved rankings are discussed, such as an increase in false positives resulting from higher recall. It is found that overall rank precision remains quite stable even with the increased recall of normalized MI.
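The normalization described in the abstract can be sketched in a few lines of Python. This is only an illustration built from the abstract's wording, not the paper's exact formula: it assumes MI is computed pointwise from raw corpus counts as log2(P(x, y) / (P(x) P(y))), and that normalization replaces the candidate lexeme's frequency with that frequency divided by its number of senses. The function names, the counts, and the sense count for "make" are all hypothetical values invented for the example.

```python
import math


def pmi(pair_freq, freq_x, freq_y, total):
    """Pointwise MI in bits, log2(P(x, y) / (P(x) * P(y))), from raw counts."""
    return math.log2(pair_freq * total / (freq_x * freq_y))


def sense_normalized_pmi(pair_freq, freq_x, senses_x, freq_y, total):
    """PMI with x's frequency divided by its sense count (assumed form)."""
    return pmi(pair_freq, freq_x / senses_x, freq_y, total)


# Toy counts, invented for illustration only (not taken from the paper).
total = 1_000_000          # corpus size in tokens
freq_make = 20_000         # frequency of the high-frequency verb
senses_make = 40           # hypothetical number of senses for the verb
freq_decision = 500        # frequency of the focal noun
pair = 50                  # co-occurrences of the verb with the noun

plain = pmi(pair, freq_make, freq_decision, total)
normalized = sense_normalized_pmi(
    pair, freq_make, senses_make, freq_decision, total
)
print(f"MI = {plain:.2f}, normalized MI = {normalized:.2f}")
```

Under these assumptions the normalized score is simply the plain score plus log2 of the sense count, so a verb with many senses gains more than a verb with few. That is consistent with the abstract's report that high-frequency verbs move up the ranking, and also with its note that false positives can increase.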

Original language: English
Title of host publication: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004
Editors: Maria Francisca Xavier, Rute Costa, Fatima Ferreira, Maria Teresa Lino, Raquel Silva
Publisher: European Language Resources Association (ELRA)
Pages: 1855-1858
Number of pages: 4
ISBN (Electronic): 2951740816, 9782951740815
Publication status: Published - 2004
Event: 4th International Conference on Language Resources and Evaluation, LREC 2004 - Lisbon, Portugal
Duration: 26 May 2004 → 28 May 2004

Publication series

Name: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004

Conference

Conference: 4th International Conference on Language Resources and Evaluation, LREC 2004
Country/Territory: Portugal
City: Lisbon
Period: 26/05/04 → 28/05/04
