Sketch-Based Estimation of Subpopulation-Weight

Computer Science – Databases

Scientific paper

Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of a subpopulation, specified by a predicate over the records' attributes. Bottom-k sketches are a powerful summarization format for weighted items that includes priority sampling and the classic weighted sampling without replacement. They can be computed efficiently for many representations of the data, including distributed databases and data streams. We derive novel unbiased estimators and efficient confidence bounds for subpopulation weight. Our estimators and bounds are tailored by distinguishing between applications (such as data streams) where the total weight of the sketched set can be computed by the summarization algorithm without significant use of additional resources, and applications (such as sketches of network neighborhoods) where this is not the case. Our rigorous derivations are based on clever applications of the Horvitz-Thompson estimator, and are complemented by efficient computational methods. We demonstrate their benefit on a wide range of Pareto distributions.
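
To make the setting concrete, here is a minimal sketch of a bottom-k sketch built with priority-sampling ranks, together with a Horvitz-Thompson style estimate of a subpopulation weight. It shows the standard priority-sampling estimator, not the novel estimators or confidence bounds derived in the paper. The function names `bottom_k_priority_sketch` and `estimate_subpopulation_weight`, the dictionary input, and the use of Python's `random.Random` are all assumptions made for illustration.

```python
import heapq
import random
from typing import Callable, Dict, Hashable, List, Tuple


def bottom_k_priority_sketch(
    weights: Dict[Hashable, float], k: int, rng: random.Random
) -> Tuple[List[Tuple[Hashable, float]], float]:
    """Return (sketch, tau) for a bottom-k sketch with priority-sampling ranks.

    Each item with weight w > 0 gets rank u / w, with u uniform on (0, 1].
    The sketch keeps the k items of smallest rank; tau is the inverse of the
    (k+1)-th smallest rank, or 0.0 when every item fits in the sketch.
    """
    ranked = []
    for key, w in weights.items():
        if w <= 0:
            continue  # items without positive weight are never sampled
        u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
        ranked.append((u / w, key, w))
    smallest = heapq.nsmallest(k + 1, ranked, key=lambda t: t[0])
    if len(smallest) <= k:
        # The sketch holds every item, so no threshold is needed.
        return [(key, w) for _, key, w in smallest], 0.0
    tau = 1.0 / smallest[k][0]
    return [(key, w) for _, key, w in smallest[:k]], tau


def estimate_subpopulation_weight(
    sketch: List[Tuple[Hashable, float]],
    tau: float,
    predicate: Callable[[Hashable], bool],
) -> float:
    """Sum the adjusted weights max(w, tau) of sketched items matching predicate.

    Conditioned on the (k+1)-th smallest rank, an item of weight w is included
    with probability min(1, w / tau), so w / min(1, w / tau) = max(w, tau) is
    its Horvitz-Thompson adjusted weight.
    """
    return sum(max(w, tau) for key, w in sketch if predicate(key))
```

Calling `bottom_k_priority_sketch(weights, k, random.Random(seed))` and passing the result to `estimate_subpopulation_weight` with any predicate on the keys gives an estimate of that subpopulation's weight. The predicate can be chosen after the sketch is built, which is what makes the summary useful for approximate query processing. When the input has at most k items with positive weight, the estimate equals the exact subpopulation weight.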
