Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition

Chia Hui Chang, Cheng Ju Wu, Tzu Ping Lin

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

Information extraction, transformation and loading (ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as “Dexi.io” or “Import.io” allow users to specify where to fetch a page and what information or data to extract from it. Although these commercial services already provide a graphical user interface to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page as “NEXT”, “PAGE” or “Other”, where the first two labels guide the system to pages similar to the seed URL. To support multiple languages, we exploit the attribute contents of the links as well as Language-Agnostic SEntence Representations (LASER) for anchor text embedding. The experimental results show that the proposed model achieves an average micro-F1 score of 0.834 and macro-F1 score of 0.818 on pagination recognition. In terms of practical deployment, we are able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 minutes.
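To make the labeling task in the abstract concrete, the following is a minimal sketch of assigning “NEXT”, “PAGE” or “Other” to each clickable link. It is not the paper's neural sequence model and does not use LASER: it is a hand-written rule-based stand-in that looks at the same two signals the abstract mentions (link attributes and anchor text). The names `LinkCollector`, `label_link`, `label_links` and the `NEXT_WORDS` list are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical, hand-picked anchor texts that often mark a next-page link.
NEXT_WORDS = {"next", "next page", ">", ">>", "»", "›", "下一頁", "次へ"}


class LinkCollector(HTMLParser):
    """Collect the attributes and anchor text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            if "href" in attr_map:
                self._current = {"href": attr_map["href"], "attrs": attr_map, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def label_link(link):
    """Rule-based stand-in for the learned classifier (an assumption, not the paper's method)."""
    text = link["text"].lower()
    # Attribute values can be None for valueless attributes, so skip those.
    attr_blob = " ".join(str(v) for v in link["attrs"].values() if v).lower()
    if text in NEXT_WORDS or "next" in attr_blob:
        return "NEXT"
    if text.isdigit():
        return "PAGE"
    return "Other"


def label_links(html):
    """Return (href, label) pairs for all clickable links in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [(link["href"], label_link(link)) for link in parser.links]


sample = (
    '<a href="/about">About</a>'
    '<a href="/events?page=1">1</a>'
    '<a href="/events?page=2">2</a>'
    '<a class="pager-next" href="/events?page=2">下一頁</a>'
)
labels = label_links(sample)
```

A crawler could follow the links labeled “NEXT” or “PAGE” from a seed URL to reach similar pages. Fixed rules like these break on unseen languages and page layouts, which is the gap a learned cross-lingual model is meant to close.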

Original language: English
Title of host publication: Web Engineering - 22nd International Conference, ICWE 2022, Proceedings
Editors: Tommaso Di Noia, In-Young Ko, Markus Schedl, Carmelo Ardito
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 117-131
Number of pages: 15
ISBN (Print): 9783031099168
Publication status: Published - 2022
Event: 22nd International Conference on Web Engineering, ICWE 2022 - Bari, Italy
Duration: 5 Jul 2022 to 8 Jul 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13362 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd International Conference on Web Engineering, ICWE 2022
Country/Territory: Italy
City: Bari
Period: 5/07/22 to 8/07/22
