Factorization-based Lossless Compression of Inverted Indices

Computer Science – Information Retrieval

Scientific paper

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

To Appear as a short paper in CIKM'11

Scientific paper

Many large-scale Web applications that require ranked top-k retrieval such as Web search and online advertising are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document association. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space that minimizes the resulting index size as a matrix factorization problem, and prove that finding the optimal factorization is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is increasing the number of terms in the index, which may negatively affect query evaluation performance. To eliminate such effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific properties of our compression approach. Our experimental evaluation demonstrates that our approach achieves an index size reduction of 20%, while maintaining the same query response times. Higher compression ratios up to 35% are achievable, however at the cost of slightly longer query response times. Furthermore, combining our approach with other lossless compression techniques, namely variable-byte encoding, leads to index size reduction of up to 50%.

No associations

LandOfFree

Say what you really think

Search LandOfFree.com for scientists and scientific papers. Rate them and share your experience with other people.

Rating

Factorization-based Lossless Compression of Inverted Indices does not yet have a rating. At this time, there are no reviews or comments for this scientific paper.

If you have personal experience with Factorization-based Lossless Compression of Inverted Indices, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Factorization-based Lossless Compression of Inverted Indices will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFWR-SCP-O-70422

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.