TY - JOUR
T1 - Task reformulation and data-centric approach for Twitter medication name extraction
AU - Zhang, Yu
AU - Lee, Jong Kang
AU - Han, Jen Chieh
AU - Tsai, Richard Tzong Han
N1 - Publisher Copyright:
© 2022 The Author(s). Published by Oxford University Press.
PY - 2022
Y1 - 2022
N2 - Automatically extracting medication names from tweets is challenging in the real world. There are many tweets; however, only a small proportion mention medications. Thus, datasets are usually highly imbalanced. Moreover, tweets are very short, which makes it hard to recognize medication names from the limited context. This paper proposes a data-centric approach for extracting medications in BioCreative VII Track 3 (Automatic Extraction of Medication Names in Tweets). Our approach reformulates the sequence labeling problem as text entailment and question-answering tasks. As a result, without using the dictionary or ensemble method, our single model achieved a Strict F1 of 0.77 (the official baseline system scored 0.758, and the average performance of participants was 0.696). Moreover, combining dictionary filtering and the ensemble method achieved a Strict F1 of 0.804, the highest performance among all participants. Furthermore, domain-specific and task-specific pretrained language models, as well as data-centric approaches, are proposed for further improvements.
UR - http://www.scopus.com/inward/record.url?scp=85137124259&partnerID=8YFLogxK
U2 - 10.1093/database/baac067
DO - 10.1093/database/baac067
M3 - Journal article
C2 - 35998105
AN - SCOPUS:85137124259
SN - 1758-0463
VL - 2022
JO - Database: The Journal of Biological Databases and Curation
JF - Database: The Journal of Biological Databases and Curation
M1 - baac067
ER -