TY - JOUR
T1 - Cross-language article linking with deep neural network based paragraph encoding
AU - Wang, Yu Chun
AU - Chuang, Chia Min
AU - Wu, Chun Kai
AU - Pan, Chao Lin
AU - Tsai, Richard Tzong Han
N1 - Publisher Copyright: © 2021
PY - 2022/3
Y1 - 2022/3
AB - Cross-language article linking (CLAL), the task of generating links between articles in different languages from different encyclopedias, is critical for facilitating sharing among online knowledge bases. Some previous CLAL research has been done on creating links among Wikipedia wikis, but much of this work depends heavily on simple language patterns and encyclopedia format or metadata. In this paper, we propose a new CLAL method based on deep learning paragraph embeddings to link English Wikipedia articles with articles in Baidu Baike, the most popular online encyclopedia in mainland China. To measure article similarity for link prediction, we employ several neural networks with attention mechanisms, such as CNN and LSTM, to train paragraph encoders that create vector representations of the articles’ semantics based only on article text, rather than link structure, as input data. Using our “Deep CLAL” method, we compile a data set consisting of Baidu Baike entries and corresponding English Wikipedia entries. Our approach does not rely on linguistic or structural features and can be easily applied to other language pairs by using pre-trained word embeddings, regardless of whether the two languages are on the same encyclopedia platform.
KW - Convolutional neural network
KW - Cross-language article linking
KW - Deep learning
KW - Link discovery
KW - Long short-term memory
KW - Paragraph encoding
UR - http://www.scopus.com/inward/record.url?scp=85115025465&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2021.101279
DO - 10.1016/j.csl.2021.101279
M3 - Journal article
AN - SCOPUS:85115025465
SN - 0885-2308
VL - 72
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101279
ER -