Multi-Task Neural Sequence Labeling for Zero-Shot Cross-Language Boilerplate Removal

Yu Hao Wu, Chia Hui Chang

Research output: Chapter in book/report/conference proceeding, conference contribution, peer-reviewed

Abstract

Although web pages are rich in resources, their main content is usually intertwined with advertisements, banners, navigation bars, footer copyrights and other templates, which are often not of interest to users. In this paper, we study the problem of extracting the main content and removing irrelevant information from web pages. The common solution is to classify each web component as either boilerplate (noise) or main content. State-of-the-art approaches such as BoilerNet use neural sequence labeling to achieve impressive scores on the CleanEval EN dataset. However, the model uses only the top 50 HTML tags as input features, which does not fully utilize the power of tag information. In addition, representing text content with the 1,000 most frequent words cannot effectively support a real-world environment in which web pages appear in multiple languages. We propose a multi-task learning framework based on two auxiliary tasks: depth prediction and position prediction. We explore HTML tag embedding for tag path representation learning. Further, we employ multilingual Bidirectional Encoder Representations from Transformers (BERT) for text content representation so that web pages in any language can be handled. The experiments show that HTML tag embedding and the multi-task learning framework achieve much higher scores than BoilerNet on the CleanEval EN dataset. The pre-trained text block representation based on multilingual BERT degrades performance on the EN test sets; however, zero-shot experiments on three languages (Chinese, Japanese, and Thai) perform consistently with five-fold cross-validation on the respective language, which indicates the possibility of providing cross-lingual support in one model.
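
As a rough illustration of the multi-task setup described in the abstract, the sketch below derives two auxiliary targets for each text block and combines the task losses into one objective. The abstract does not define the auxiliary tasks precisely, so the definitions here are assumptions: depth is taken as the number of tags on the block's tag path, and position as the block's relative index in the page. The names `BlockTargets`, `build_targets`, `multitask_loss` and the loss weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockTargets:
    """Training targets for one text block of a page."""
    tag_path: List[str]  # tags from the document root down to the text node
    is_content: int      # main task label: 1 = main content, 0 = boilerplate
    depth: int           # auxiliary target (assumed): length of the tag path
    position: float      # auxiliary target (assumed): relative index in [0, 1]


def build_targets(tag_paths: List[str], labels: List[int]) -> List[BlockTargets]:
    """Derive the assumed auxiliary targets from page structure alone."""
    n = len(tag_paths)
    targets = []
    for i, (path, label) in enumerate(zip(tag_paths, labels)):
        tags = path.split("/")
        # A single-block page gets position 0.0 to avoid dividing by zero.
        position = i / (n - 1) if n > 1 else 0.0
        targets.append(BlockTargets(tags, label, len(tags), position))
    return targets


def multitask_loss(main: float, depth: float, position: float,
                   w_depth: float = 0.5, w_position: float = 0.5) -> float:
    """Weighted sum of the main loss and two auxiliary losses.

    The weights are illustrative defaults, not values from the paper.
    """
    return main + w_depth * depth + w_position * position


# Hypothetical page with three text blocks, listed in document order.
paths = ["html/body/nav/ul/li/a", "html/body/div/p", "html/body/footer/span"]
targets = build_targets(paths, [0, 1, 0])
```

Both auxiliary targets come for free from the HTML structure, so no extra annotation is needed beyond the boilerplate/content labels. In a real model the three losses would be computed by separate output heads over a shared sequence encoder.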

Original language: English
Title of host publication: Proceedings - 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021
Publisher: Association for Computing Machinery
Pages: 326-334
Number of pages: 9
ISBN (electronic): 9781450391153
Publication status: Published - 14 Dec 2021
Event: 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021 - Virtual, Online, Australia
Duration: 14 Dec 2021 to 17 Dec 2021

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021
Country/Territory: Australia
City: Virtual, Online
Period: 14/12/21 to 17/12/21
