Improving collocation extraction for high frequency words

David Wible, Chin Hwa Kuo, Nai Lung Tsao

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

The purpose of this paper is to introduce an alternative word association measure aimed at addressing the under-extraction collocations that contain high frequency words. While measures such as MI provide the important contribution of filtering out sheer high frequency of words in the detection of collocations in large corpora, one side effect of this filtering is that it becomes correspondingly difficult for such measures to detect true collocations involving high frequency words. As an alternative, we propose normalizing the MI measure by dividing the frequency of a candidate lexeme by the number of senses of that lexeme. We premise this alternative approach on the one sense per collocation assumption of Yarowsky (1992; 1995). Ten verb-noun collocations involving three high frequency verbs (make, take, run) are used to compare the extraction results of traditional MI and the proposed normalized MI. Results show the ranking of these high-frequency verbs as candidate collocates with the target focal nouns is raised by normalizing MI as proposed. Side effects of these improved rankings are discussed, such as increase in false positives resulting from higher recall. It is found that overall rank precision remains quite stable even with the increased recall of normalized MI.

Original languageEnglish
Title of host publicationProceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004
EditorsMaria Francisca Xavier, Rute Costa, Fatima Ferreira, Maria Teresa Lino, Raquel Silva
PublisherEuropean Language Resources Association (ELRA)
Pages1855-1858
Number of pages4
ISBN (Electronic)2951740816, 9782951740815
StatePublished - 2004
Event4th International Conference on Language Resources and Evaluation, LREC 2004 - Lisbon, Portugal
Duration: 26 May 200428 May 2004

Publication series

NameProceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004

Conference

Conference4th International Conference on Language Resources and Evaluation, LREC 2004
Country/TerritoryPortugal
CityLisbon
Period26/05/0428/05/04

Fingerprint

Dive into the research topics of 'Improving collocation extraction for high frequency words'. Together they form a unique fingerprint.

Cite this