Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

Computer Science – Digital Libraries

Scientific paper

Details

A short version appeared in the CASCON 2007 proceedings, available from http://portal.acm.org/citation.cfm?id=1321246

Collaborative work on unstructured or semi-structured documents, such as literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users or over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
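
To make the idea of "frequent occurrences" concrete, here is a minimal sketch of one way such a statistical filter could work. It is not the method from the paper: it assumes that a line appearing in a large fraction of documents is template text, and it only trims such lines from the start and end of each document. The function names, the `min_doc_fraction` parameter, and the toy marker lines are hypothetical and chosen for illustration only.

```python
from collections import Counter


def frequent_lines(documents, min_doc_fraction=0.5):
    """Return lines that occur in at least min_doc_fraction of the documents.

    Assumption (not from the paper): document frequency of an exact line
    is a usable signal for template text.
    """
    doc_counts = Counter()
    for text in documents:
        # Count each distinct non-blank line once per document.
        unique = {line.strip() for line in text.splitlines() if line.strip()}
        doc_counts.update(unique)
    threshold = min_doc_fraction * len(documents)
    return {line for line, n in doc_counts.items() if n >= threshold}


def strip_boilerplate(text, boilerplate):
    """Drop leading and trailing lines that are blank or flagged as boilerplate."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    while start < end and (
        not lines[start].strip() or lines[start].strip() in boilerplate
    ):
        start += 1
    while end > start and (
        not lines[end - 1].strip() or lines[end - 1].strip() in boilerplate
    ):
        end -= 1
    return "\n".join(lines[start:end])


# Toy corpus with hypothetical marker lines (not real Project Gutenberg text).
docs = [
    "*** START OF EBOOK ***\nCall me Ishmael.\n*** END OF EBOOK ***",
    "*** START OF EBOOK ***\nIt was the best of times.\n*** END OF EBOOK ***",
    "*** START OF EBOOK ***\nAll happy families are alike.\n*** END OF EBOOK ***",
]
common = frequent_lines(docs)
bodies = [strip_boilerplate(d, common) for d in docs]
```

Exact-match counting like this would miss preambles that were retyped with small variations, which is the inconsistency the abstract describes, so a practical version would likely need some normalization or fuzzy matching on top.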
