Bayesian Data Cleaning for Web Data

Computer Science – Databases

Scientific paper


Details

6 pages, 7 figures


Data cleaning is a long-standing problem that is growing in importance with the mass of uncurated web data. State-of-the-art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns (CFDs) from a clean sample of the data and use them to rectify the dirty or inconsistent data. While obtaining a clean training sample is feasible in enterprise data scenarios, it is infeasible in web databases, where there is no separate curated data. CFD-based methods are unfortunately particularly sensitive to noise; we empirically demonstrate that the number of CFDs learned falls drastically with even a small amount of noise. To overcome this limitation, we propose a fully probabilistic framework for cleaning data. Our approach involves learning both the generative and error (corruption) models of the data and using them to clean the data. For generative models, we learn Bayesian networks from the data. For error models, we consider a maximum entropy framework for combining multiple error processes. The generative and error models are learned directly from the noisy data. We present the details of the framework and demonstrate its effectiveness in rectifying web data.
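The idea of combining a generative model with an error model can be illustrated with a small noisy-channel sketch: for an observed tuple, choose the candidate clean tuple that maximizes P(candidate) × P(observed | candidate). This is not the paper's implementation. As simplifying assumptions, an empirical frequency table learned from the noisy rows stands in for the Bayesian network, and a single similarity-based corruption model stands in for the maximum entropy combination of error processes. The names `learn_prior`, `error_likelihood`, `clean`, the parameter `keep_prob`, and the sample rows are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_prior(rows):
    """Estimate P(tuple) from the noisy rows themselves.

    Assumption: a plain frequency table replaces the Bayesian network.
    """
    counts = Counter(rows)
    total = sum(counts.values())
    return {row: count / total for row, count in counts.items()}


def error_likelihood(observed, candidate, keep_prob=0.9):
    """Approximate P(observed | candidate) under a toy corruption model.

    Assumption: each attribute is left intact with probability keep_prob;
    otherwise it is corrupted, and the corrupted value is scored by its
    string similarity to the clean value.
    """
    likelihood = 1.0
    for obs_value, clean_value in zip(observed, candidate):
        if obs_value == clean_value:
            likelihood *= keep_prob
        else:
            similarity = SequenceMatcher(None, obs_value, clean_value).ratio()
            likelihood *= (1.0 - keep_prob) * similarity
    return likelihood


def clean(observed, prior):
    """Return the candidate tuple with the highest posterior score."""
    return max(
        prior,
        key=lambda cand: prior[cand] * error_likelihood(observed, cand),
    )


# Hypothetical noisy sample: one misspelled make among mostly consistent rows.
rows = (
    [("Honda", "Civic")] * 20
    + [("Honda", "Accord")] * 15
    + [("Hnoda", "Civic")]
)
prior = learn_prior(rows)
print(clean(("Hnoda", "Civic"), prior))
```

In this toy sample the rare misspelled tuple is outweighed by the frequent, similar tuple `("Honda", "Civic")`, while tuples that are already frequent are left unchanged. Both models are estimated from the noisy rows alone, with no clean training sample.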
