Index wiki database: design and experiments

Computer Science – Information Retrieval

Scientific paper


Details

18 pages, 4 tables, 4 figures; FLINS'08, Corpus Linguistics'08, AIS/CAD'08; v2: table 3 changed


With the rapid growth of Internet usage, searching for information in "wiki pages", documents written in a simple markup language, has become an important problem. This paper describes a software architecture for indexing wiki texts in three languages (Russian, English, and German) and the interaction between its software components (GATE, Lemmatizer, and Synarcher). The inverted file index database was designed using the visual tool DBDesigner. The rules for parsing Wikipedia texts are illustrated by examples. Two index databases, one for Russian Wikipedia (RW) and one for Simple English Wikipedia (SEW), were built and compared. RW is an order of magnitude larger than SEW (in number of words and lexemes), although the growth rate of the number of pages in SEW was found to be 14% higher than in RW, and the rate of acquisition of new words in the SEW lexicon was 7% higher, over a period of five months (from September 2007 to February 2008). Zipf's law was tested on both the Russian and the Simple English Wikipedia. The entire source code of the indexing software and the generated index databases are freely available under the GPL (GNU General Public License).
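To make the idea of an inverted file index and a Zipf-style rank-frequency check more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the real system stores the index in a database and lemmatizes words with GATE and Lemmatizer, whereas this sketch keeps everything in memory and only lowercases tokens. The function names (`tokenize`, `build_inverted_index`, `rank_frequency`) and the two sample pages are hypothetical and exist only for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (no lemmatization)."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(pages):
    """Map each token to {page title: term frequency in that page}."""
    index = defaultdict(dict)
    for title, text in pages.items():
        for token, freq in Counter(tokenize(text)).items():
            index[token][title] = freq
    return index

def rank_frequency(index):
    """Return (rank, token, corpus frequency), most frequent token first."""
    totals = {token: sum(postings.values()) for token, postings in index.items()}
    # Ties are broken alphabetically so the ordering is deterministic.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, token, freq) for rank, (token, freq) in enumerate(ordered, start=1)]

# Hypothetical sample pages, standing in for parsed wiki texts.
pages = {
    "Wiki": "A wiki is a website. A wiki page uses simple markup.",
    "Index": "An inverted index maps a word to the pages that contain the word.",
}
index = build_inverted_index(pages)
table = rank_frequency(index)
```

On a real corpus, Zipf's law predicts that frequency falls off roughly as the inverse of rank, so plotting the `table` values on log-log axes should give an approximately straight line. The toy pages above are far too small to show that behavior; they only demonstrate the data structures.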


