Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition

Chia Hui Chang, Cheng Ju Wu, Tzu Ping Lin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Information extraction, transformation and loading (abbreviated as ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as “Dexi.io” or “Import.io” allow users to specify where to fetch the page and what information or data to extract from it. Although these commercial services already provide a graphical user interface to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page as either “NEXT”, “PAGE” or “Other”, where the first two guide the system to pages similar to the seed URL. To support multiple languages, we exploit the attribute contents of the links as well as Language-Agnostic SEntence Representations (LASER) for anchor text embedding. The experimental results show that the proposed model achieves average micro and macro F1 scores of 0.834 and 0.818, respectively, on pagination recognition. In terms of practical deployment, we are able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 minutes.
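The way “NEXT” and “PAGE” labels lead from a seed URL to its similar pages can be illustrated with a minimal sketch. This is not the authors' implementation: `collect_similar_pages`, `label_links`, and the toy site are hypothetical names introduced here, and `label_links` merely stands in for the trained sequence model by returning each clickable link on a page together with its predicted label.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

# (href, label) pairs; the label set follows the abstract: "NEXT", "PAGE", "Other".
LabeledLink = Tuple[str, str]


def collect_similar_pages(
    seed_url: str,
    label_links: Callable[[str], Iterable[LabeledLink]],
    max_pages: int = 50,
) -> List[str]:
    """Breadth-first walk over links labeled NEXT or PAGE, starting at seed_url.

    `label_links` is a hypothetical stand-in for the trained model: given a
    URL, it returns the clickable links on that page with predicted labels.
    """
    seen = {seed_url}
    order = [seed_url]
    queue = deque([seed_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        for href, label in label_links(url):
            # Only pagination links are followed; "Other" links are ignored.
            if label in ("NEXT", "PAGE") and href not in seen:
                seen.add(href)
                order.append(href)
                queue.append(href)
                if len(order) >= max_pages:
                    break
    return order


# Toy in-memory "site" (assumed data, for illustration only).
TOY_SITE = {
    "/events?p=1": [("/events?p=2", "NEXT"), ("/events?p=3", "PAGE"), ("/about", "Other")],
    "/events?p=2": [("/events?p=3", "NEXT"), ("/events?p=1", "PAGE"), ("/login", "Other")],
    "/events?p=3": [("/events?p=1", "PAGE"), ("/events?p=2", "PAGE")],
}

pages = collect_similar_pages("/events?p=1", lambda url: TOY_SITE.get(url, []))
print(pages)
```

In this sketch the `max_pages` cap guards against sites with unbounded pagination; a real crawler would also need fetching, URL normalization, and politeness controls, none of which are shown.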

Original language: English
Title of host publication: Web Engineering - 22nd International Conference, ICWE 2022, Proceedings
Editors: Tommaso Di Noia, In-Young Ko, Markus Schedl, Carmelo Ardito
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 117-131
Number of pages: 15
ISBN (Print): 9783031099168
State: Published - 2022
Event: 22nd International Conference on Web Engineering, ICWE 2022 - Bari, Italy
Duration: 5 Jul 2022 → 8 Jul 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13362 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd International Conference on Web Engineering, ICWE 2022
Country/Territory: Italy
City: Bari
Period: 5/07/22 → 8/07/22

Keywords

  • Announcement extraction
  • Neural sequence labeling
  • Pagination recognition
  • Web Data ETL scalability
