Computer Science – Information Retrieval
Scientific paper
2011-11-03
11 pages, 8 figures, 5 tables
With the explosion of information stored worldwide, data-intensive computing has become a central area of research. Efficient management and processing of this exponentially growing volume of data from diverse sources, such as telecommunication call data records, online transaction records, etc., has become a necessity. Removing redundancy from such huge (multi-billion record) datasets, resulting in resource and compute efficiency for downstream processing, constitutes an important area of study. "Intelligent compression", or deduplication, in streaming scenarios, for precise identification and elimination of duplicates from an unbounded data stream, is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) address this problem to a certain extent. However, SBF suffers from a high false negative rate (FNR) and a slow convergence rate, rendering it inefficient for applications with low FNR tolerance. In this paper, we present a novel Reservoir Sampling based Bloom Filter (RSBF) data structure, which combines the concepts of reservoir sampling and Bloom filters for approximate detection of duplicates in data streams. Using detailed theoretical analysis, we prove analytical bounds on its false positive rate (FPR), false negative rate (FNR), and convergence rate with low memory requirements. We show that RSBF offers the currently lowest FNR and convergence rates, which are better than those of SBF while using the same memory. Using empirical analysis on real-world datasets (3 million records) and synthetic datasets with around 1 billion records, we demonstrate up to 2x improvement in FNR with better convergence rates as compared to SBF, while exhibiting comparable FPR. To the best of our knowledge, this is the first attempt to integrate the reservoir sampling method with Bloom filters for deduplication in streaming scenarios.
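The abstract names two building blocks, Bloom filters and reservoir sampling, without spelling out how RSBF combines them. The following is a minimal sketch of those two building blocks in isolation only; it is not the RSBF algorithm from the paper. The names `BloomFilter`, `seen_before`, and `reservoir_sample`, the SHA-256 double-hashing scheme, and the one-byte-per-bit layout are all illustrative assumptions.

```python
import hashlib
import random


class BloomFilter:
    """Plain Bloom filter (illustrative; not the paper's RSBF)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Assumed hashing scheme: derive k positions by double hashing
        # from a single SHA-256 digest of the item.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def seen_before(self, item):
        """Report whether the item looks like a duplicate, then record it."""
        positions = self._positions(item)
        duplicate = all(self.bits[p] for p in positions)
        for p in positions:
            self.bits[p] = 1
        return duplicate


def reservoir_sample(stream, size, rng):
    """Classic reservoir sampling (Algorithm R): a uniform sample of
    `size` items from a stream whose length is not known in advance."""
    reservoir = []
    for index, item in enumerate(stream):
        if index < size:
            reservoir.append(item)
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < size:
                reservoir[j] = item
    return reservoir
```

A plain Bloom filter like this one never reports a false negative, but on an unbounded stream its bits eventually saturate and every element starts to look like a duplicate. Approaches such as SBF and RSBF trade some false negatives for a bounded false positive rate by limiting what stays in the filter; the exact policy RSBF uses is not given in the abstract and is not reproduced here.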
Bhattacherjee Souvik
Dutta Sourav
Narang Ankur
Towards "Intelligent Compression" in Streams: A Biased Reservoir Sampling based Bloom Filter Approach