OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set

Computer Science – Computation and Language

Scientific paper

Details

Date: 2012-04-01
Source: arxiv.org/abs/1204.0188v1
Field: Computer Science
Subfield: Computation and Language

Comments: LACSC - Lebanese Association for Computational Sciences, http://www.lacsc.org/; American Journal of Scientific Research, Issue
Type: Scientific paper
Abstract: Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were abundant and widely published at that time, so there was a need to convert them into digital format. Optical Character Recognition (OCR) was conceived to translate paper-based books into digital e-books. However, OCR systems remain error-prone: they produce misspellings in the recognized text, especially when the source document is of low print quality. This paper proposes a context-sensitive OCR post-processing method for detecting and correcting both non-word and real-word OCR errors. The cornerstone of the approach is the use of the Google Web 1T 5-gram data set as a dictionary of words against which OCR text is spell-checked. The data set contains a very large vocabulary and word statistics drawn entirely from the Internet, making it a reliable source for dictionary-based error correction. The core of the proposed solution combines three algorithms: error detection, candidate spelling generation, and error correction, all of which exploit information extracted from the Google Web 1T 5-gram data set. Experiments conducted on scanned images written in different languages showed a substantial improvement in the OCR error correction rate. As future work, the proposed algorithm is to be parallelised so as to support parallel and distributed computing architectures.
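
The three stages named in the abstract (error detection, candidate generation, error correction) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the small in-memory `UNIGRAMS` and `FIVEGRAMS` tables are made-up stand-ins for the Google Web 1T counts, the edit-distance-1 candidate generator is an assumption (the abstract does not say how candidates are produced), and every function name is hypothetical. The sketch covers only non-word errors; real-word errors would also require scoring in-vocabulary tokens against their context.

```python
from collections import Counter

# Toy stand-ins for Google Web 1T counts (made-up numbers, illustration only).
UNIGRAMS = Counter({
    "the": 1000, "quick": 50, "brown": 40, "fox": 30, "fix": 80,
    "fax": 45, "jumps": 20, "over": 200,
})
FIVEGRAMS = Counter({
    ("the", "quick", "brown", "fox", "jumps"): 12,
    ("quick", "brown", "fox", "jumps", "over"): 9,
})

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def detect_errors(tokens):
    """Error detection: flag positions whose token is not a known unigram."""
    return [i for i, tok in enumerate(tokens) if tok not in UNIGRAMS]


def generate_candidates(word):
    """Candidate generation (assumed strategy): known words one edit away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return {w for w in deletes + replaces + inserts + transposes if w in UNIGRAMS}


def correct(tokens):
    """Error correction: choose the candidate best supported by 5-gram context."""
    out = list(tokens)
    for i in detect_errors(tokens):
        candidates = generate_candidates(tokens[i])
        if not candidates:
            continue  # nothing plausible found; leave the token unchanged

        def score(cand):
            trial = out[:i] + [cand] + out[i + 1:]
            # Sum the counts of every 5-gram window that covers position i.
            context = 0
            for start in range(max(0, i - 4), min(i, len(trial) - 5) + 1):
                context += FIVEGRAMS[tuple(trial[start:start + 5])]
            # Fall back to unigram frequency when context gives no signal.
            return (context, UNIGRAMS[cand])

        out[i] = max(candidates, key=score)
    return out


if __name__ == "__main__":
    print(correct("the quick brown fcx jumps over".split()))
```

In this toy setup the misrecognized token `fcx` has three in-vocabulary candidates (`fax`, `fix`, `fox`). The most frequent unigram alone would be `fix`, but the surrounding 5-gram counts favor `fox`, which is the point of making the correction context-sensitive.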

Affiliated with

Alwani Mohammad

Computer Science – Software Engineering

Scientist

Bassil Youssef

Computer Science – Software Engineering

Scientist
