A Suit of Record Normalization Methods, From Naive Ones, Globally Mine a Group of Duplicate Records

Mummidi Siva Sankar, Nadella Sunil

Abstract


The promise of Big Data hinges on addressing several big data integration challenges, for example, record linkage at scale, real-time data fusion, and integrating the Deep Web. Although much work has been conducted on these problems, there is limited work on creating a uniform, standard record from a group of records corresponding to the same real-world entity. We refer to this task as record normalization. Such a record representation, termed a normalized record, is important for both front-end and back-end applications. In this paper, we formalize the record normalization problem and present an in-depth analysis of normalization granularity levels (e.g., record, field, and value component) and of normalization forms (e.g., typical versus complete). We propose a comprehensive framework for computing the normalized record. The proposed framework includes a suite of record normalization methods, from naive ones, which use only the information gathered from the records themselves, to complex strategies, which globally mine a group of duplicate records before selecting a value for an attribute of a normalized record.
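To make the idea of a naive, field-level method concrete, the following minimal sketch builds one normalized record from a group of duplicate records by choosing, for each field, the most frequent non-empty value. The selection rule, the tie-break on value length, the function name `normalize_records`, and the sample records are all illustrative assumptions; the abstract does not specify the methods the framework actually uses.

```python
from collections import Counter


def normalize_records(records):
    """Build one normalized record from a group of duplicate records.

    Hypothetical field-level strategy (an assumption, not the paper's method):
    for each field, pick the most frequent non-empty value; ties on frequency
    go to the longer value, on the guess that it is the more complete one.
    """
    # Collect field names in first-seen order across all records.
    fields = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)

    normalized = {}
    for name in fields:
        # Keep only non-empty string values, with surrounding whitespace removed.
        values = [r[name].strip() for r in records if r.get(name) and r[name].strip()]
        if not values:
            normalized[name] = None  # no record supplies this field
            continue
        counts = Counter(values)
        normalized[name] = max(counts, key=lambda v: (counts[v], len(v)))
    return normalized


# Invented sample input: three records assumed to describe the same entity.
duplicates = [
    {"title": "Incremental record linkage", "venue": "PVLDB", "year": "2014"},
    {"title": "Incremental record linkage",
     "venue": "Proceedings of the VLDB Endowment", "year": "2014"},
    {"title": "Incremental Record Linkage.", "venue": "PVLDB", "year": ""},
]
normalized = normalize_records(duplicates)
```

A rule like this uses only the records in the group. The complex strategies mentioned in the abstract would instead mine the whole group of duplicates before selecting each attribute value, which this sketch does not attempt.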






Copyright © 2013, All rights reserved. | ijseat.com

International Journal of Science Engineering and Advance Technology is licensed under a Creative Commons Attribution 3.0 Unported License. Based on a work at IJSEat. Permissions beyond the scope of this license may be available at http://creativecommons.org/licenses/by/3.0/deed.en_GB.