Computer Science – Databases
Scientific paper
2011-06-03
Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 4, pp. 322-333 (2011)
41 pages including the appendix. Shorter version (without appendix) to appear as a full research paper in VLDB 2012
The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then store the resulting text in a relational database. OCR is a hard problem, so its output often contains errors, and queries over that output may in turn fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process; this uncertainty is discarded only when the results are stored in the database. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system (RDBMS). A key technical challenge is that the probabilistic data produced by OCR software is very large (a single book grows from 400 kB as ASCII to 2 GB). As a result, a baseline solution that integrates these models with an RDBMS is over 1000x slower than standard text processing for single-table select-project queries. However, many applications have quality-performance needs that lie between the two extremes of plain ASCII and the complete model output by the OCR software. We therefore propose a novel approximation scheme called Staccato that allows a user to trade recall for query performance. Additionally, we provide a formal analysis of our scheme's properties and describe how we integrate it with standard RDBMS text indexing.
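The recall-versus-size trade-off described in the abstract can be illustrated with a toy sketch. This is not the Staccato algorithm: it assumes a hypothetical OCR result in which each character position carries independent alternatives with probabilities, and it simply keeps the k most probable strings. The data in `ocr_word` and the names `top_k_strings` and `match_probability` are invented for illustration.

```python
import heapq
from itertools import product
from math import prod

# Hypothetical OCR output for one word: each position lists
# (text, probability) alternatives. Treating positions as independent
# is a simplifying assumption made only for this sketch.
ocr_word = [
    [("F", 0.9), ("E", 0.1)],
    [("0", 0.6), ("o", 0.4)],
    [("r", 1.0)],
    [("d", 0.8), ("cl", 0.2)],
]

def top_k_strings(positions, k):
    """Return the k most probable strings as (probability, string) pairs."""
    candidates = (
        (prod(p for _, p in combo), "".join(c for c, _ in combo))
        for combo in product(*positions)
    )
    return heapq.nlargest(k, candidates)

def match_probability(positions, k, term):
    """Sum the probability of the retained strings that contain `term`."""
    return sum(p for p, s in top_k_strings(positions, k) if term in s)

if __name__ == "__main__":
    for k in (1, 2, 8):
        print(k, match_probability(ocr_word, k, "Ford"))
```

With k = 1 only the single most likely string ("F0rd") is stored, so a query for "Ford" finds nothing; with k = 2 the correct reading is retained and the query matches, at the cost of storing and scanning more data.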
Kumar Arun
Re Christopher
Probabilistic Management of OCR Data using an RDBMS does not yet have a rating. At this time, there are no reviews or comments for this scientific paper.