OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set

Computer Science – Computation and Language

Scientific paper

Details

LACSC - Lebanese Association for Computational Sciences, http://www.lacsc.org/; American Journal of Scientific Research, Issue

Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were abundant and being widely published at that time; hence, there was a need to convert them into digital format. OCR, short for Optical Character Recognition, was conceived to translate paper-based books into digital e-books. Regrettably, OCR systems remain error-prone: they produce misspellings in the recognized text, especially when the source document is of low print quality. This paper proposes a context-sensitive OCR post-processing method for detecting and correcting both non-word and real-word OCR errors. The cornerstone of the proposed approach is the use of the Google Web 1T 5-gram data set as a dictionary of words against which OCR text is spell-checked. The Google data set incorporates a very large vocabulary and word statistics gathered entirely from the Internet, making it a reliable source for dictionary-based error correction. The core of the proposed solution is a combination of three algorithms: an error detection algorithm, a candidate spellings generator, and an error correction algorithm, all of which exploit information extracted from the Google Web 1T 5-gram data set. Experiments conducted on scanned images of text written in different languages showed a substantial improvement in the OCR error correction rate. As a future development, the proposed algorithm is to be parallelised so as to support parallel and distributed computing architectures.
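The three stages named in the abstract (error detection, candidate generation, context-based correction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give algorithmic details, so the edit-distance-1 candidate generator, the trigram scoring rule, and the function names `detect_errors`, `candidates`, and `correct` are all assumptions made for illustration. The tiny `UNIGRAMS` and `NGRAMS` tables are hypothetical stand-ins for counts that would come from the Google Web 1T 5-gram data set, and the sketch handles only non-word errors, not the real-word errors the paper also targets.

```python
from collections import Counter

# Hypothetical toy counts standing in for Google Web 1T statistics.
UNIGRAMS = Counter({
    "the": 1000, "quick": 50, "brown": 40, "fox": 30,
    "box": 35, "jumps": 20, "over": 200, "lazy": 15, "dog": 60,
})
NGRAMS = Counter({
    ("the", "quick", "brown"): 14,
    ("quick", "brown", "fox"): 10,
    ("brown", "fox", "jumps"): 12,
})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def detect_errors(tokens):
    """Stage 1 (assumed rule): flag tokens absent from the unigram vocabulary."""
    return [i for i, tok in enumerate(tokens) if tok not in UNIGRAMS]


def candidates(word):
    """Stage 2 (assumed rule): vocabulary words within one edit of `word`."""
    edits = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        if tail:
            edits.add(head + tail[1:])                            # deletion
            edits.update(head + c + tail[1:] for c in ALPHABET)   # substitution
        if len(tail) > 1:
            edits.add(head + tail[1] + tail[0] + tail[2:])        # transposition
        edits.update(head + c + tail for c in ALPHABET)           # insertion
    return sorted(w for w in edits if w in UNIGRAMS)


def correct(tokens, n=3):
    """Stage 3 (assumed rule): pick the candidate whose n-gram with the
    preceding n-1 tokens is most frequent; break ties by unigram count."""
    out = list(tokens)
    for i in detect_errors(out):
        options = candidates(out[i])
        if not options:
            continue  # no in-vocabulary candidate; leave the token unchanged
        left = tuple(out[max(0, i - (n - 1)):i])
        out[i] = max(options, key=lambda c: (NGRAMS[left + (c,)], UNIGRAMS[c]))
    return out


if __name__ == "__main__":
    print(" ".join(correct("the quick brown lox jumps".split())))
```

In this toy setup the misrecognized token "lox" has two in-vocabulary candidates, "box" and "fox". The unigram counts alone would favor "box", but the trigram context "quick brown" selects "fox", which is the sense in which the correction is context-sensitive.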
