Probabilistic parallelisation of blocking non-matched records for big data

Chenxiao Dou, Daniel Sun, Yi Cheng Chen, Guoqiang Li, Jianquan Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Scopus citations

Abstract

Blocking is a technique for filtering out unlikely matching pairs in record matching, which aims to collect all pairs of records that refer to the same entities across different data sources. Blocking has been broadly adopted in data mining and databases. However, for big data there is not yet a fast and effective blocking algorithm, because the number of candidate pairs between large data sets is enormous. In this paper, we report on a probabilistic parallelisation of a recently proposed blocking method, a sequential algorithm for efficient record matching on a single machine. Our approach runs blocking processes in a distributed fashion on partitioned input data. To reduce data exchange among those blocking processes, we adopt a probabilistic technique that ensures the processes can run independently while the aggregated result remains correct with respect to common metrics. Our experimental analysis supports the advantage of our technique and shows its novel scalability on a Hadoop MapReduce system physically deployed in a cloud.

Original language: English
Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
Editors: Ronay Ak, George Karypis, Yinglong Xia, Xiaohua Tony Hu, Philip S. Yu, James Joshi, Lyle Ungar, Ling Liu, Aki-Hiro Sato, Toyotaro Suzumura, Sudarsan Rachuri, Rama Govindaraju, Weijia Xu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3465-3473
Number of pages: 9
ISBN (Electronic): 9781467390040
DOIs
State: Published - 2016
Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: 5 Dec 2016 to 8 Dec 2016

Publication series

Name: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016

Conference

Conference: 4th IEEE International Conference on Big Data, Big Data 2016
Country/Territory: United States
City: Washington
Period: 5/12/16 to 8/12/16
