Abstract
Information extraction, transformation, and loading (ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as “Dexi.io” or “Import.io” allow users to specify where to fetch a page and what data to extract from it. Although these commercial services already provide a graphical user interface to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page as “NEXT”, “PAGE”, or “Other”, where the first two labels guide the system to pages similar to the seed URL. To support multiple languages, we exploit the attribute contents of the links as well as Language-Agnostic SEntence Representations (LASER) for anchor text embedding. The experimental results show that the proposed model achieves an average micro-F1 score of 0.834 and macro-F1 score of 0.818 on pagination recognition. In terms of practical deployment, we were able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 min.
Original language | English |
---|---|
Title of host publication | Web Engineering - 22nd International Conference, ICWE 2022, Proceedings |
Editors | Tommaso Di Noia, In-Young Ko, Markus Schedl, Carmelo Ardito |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 117-131 |
Number of pages | 15 |
ISBN (Print) | 9783031099168 |
DOIs | |
State | Published - 2022 |
Event | 22nd International Conference on Web Engineering, ICWE 2022, Bari, Italy; Duration: 5 Jul 2022 → 8 Jul 2022 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 13362 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 22nd International Conference on Web Engineering, ICWE 2022 |
---|---|
Country/Territory | Italy |
City | Bari |
Period | 5/07/22 → 8/07/22 |
Keywords
- Announcement extraction
- Neural sequence labeling
- Pagination recognition
- Web Data ETL scalability
Fingerprint
Dive into the research topics of 'Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Eventgo: Constructing an Event Search Engine via Event Extraction from Social-Media Posts and Event Source Discovery (2/3)
Chang, C.-H. (PI)
1/08/21 → 31/07/22
Project: Research