Probabilistic parallelisation of blocking non-matched records for big data

Chenxiao Dou, Daniel Sun, Yi Cheng Chen, Guoqiang Li, Jianquan Liu

Research output: Chapter in book/report/conference proceeding, conference contribution, peer-reviewed

3 citations (Scopus)

Abstract

Blocking is a technique for filtering out unlikely matches in record matching, the task of collecting all pairs of records that refer to the same entity across different data sources. Blocking has been broadly adopted in data mining and databases. For big data, however, there is not yet a fast and effective blocking algorithm, because the number of candidate pairs between large data sets is enormous. In this paper, we report on a probabilistic parallelisation of a recently proposed sequential blocking algorithm designed for efficient record matching on a single machine. Our approach runs blocking processes in a distributed fashion on partitioned input data. To reduce data exchange among those processes, we adopt a probabilistic technique that lets them run independently while keeping the aggregated result correct with respect to common metrics. Our experimental analysis supports the advantage of our technique and demonstrates its scalability on a Hadoop MapReduce system physically deployed in a cloud.
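
As a rough illustration of the setting the abstract describes, the following minimal Python sketch runs a simple key-based blocking step independently on each partition and then unions the results. It is not the paper's algorithm: the record layout, the `blocking_key` rule (first three letters of a name), and the function names are all hypothetical, and the probabilistic technique mentioned in the abstract is not reproduced because the abstract does not specify it.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lower-cased name.
    return record["name"].strip().lower()[:3]


def block_partition(records):
    """Map-like step: candidate pairs found inside a single partition."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records sharing a key are kept as candidate pairs.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def blocked_pairs(partitions):
    """Reduce-like step: union of the per-partition candidate pairs."""
    result = set()
    for part in partitions:
        result |= block_partition(part)
    return result


records = [
    {"id": 1, "name": "Alice Smith"},
    {"id": 2, "name": "alice smyth"},
    {"id": 3, "name": "Bob Jones"},
    {"id": 4, "name": "Alicia Keys"},
    {"id": 5, "name": "bob johnson"},
]

single_machine = blocked_pairs([records])
two_partitions = blocked_pairs([records[:3], records[3:]])
```

In this toy data the two-partition run finds fewer candidate pairs than the single-partition run, because records that share a key but land in different partitions never meet. That gap is the kind of cost a scheme with fully independent processes has to control; how the paper does so is not stated in the abstract.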

Original language: English
Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
Editors: Ronay Ak, George Karypis, Yinglong Xia, Xiaohua Tony Hu, Philip S. Yu, James Joshi, Lyle Ungar, Ling Liu, Aki-Hiro Sato, Toyotaro Suzumura, Sudarsan Rachuri, Rama Govindaraju, Weijia Xu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3465-3473
Number of pages: 9
ISBN (electronic): 9781467390040
Publication status: Published - 2016
Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: 5 Dec 2016 to 8 Dec 2016

Publication series

Name: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016

Conference

Conference: 4th IEEE International Conference on Big Data, Big Data 2016
Country/Territory: United States
City: Washington
Period: 5 Dec 2016 to 8 Dec 2016
