The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages

Computer Science – Computation and Language

Scientific paper

Details

A multilingual textual resource with meta-data freely available for download at http://langtech.jrc.it/JRC-Acquis.html

We present a new, unique and freely available parallel corpus containing European Union (EU) documents, mostly of a legal nature. It is available in all 20 official EU languages, with additional documents available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Because of the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable for cross-language research of all types, as well as for testing and benchmarking text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).
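
The figure of 190 language pairs follows from counting unordered pairs among the 20 official languages: 20 × 19 / 2 = 190. The short Python sketch below reproduces that arithmetic. The function names and the sample language codes are illustrative assumptions and are not taken from the corpus distribution; the "+" in "190+" presumably reflects the extra pairs contributed by candidate-country languages.

```python
from itertools import combinations
from math import comb


def pair_count(n_languages: int) -> int:
    """Number of unordered language pairs among n_languages languages."""
    return comb(n_languages, 2)


def language_pairs(languages):
    """List every unordered pair of language codes (hypothetical helper)."""
    return list(combinations(sorted(languages), 2))


if __name__ == "__main__":
    print(pair_count(20))  # 190
    # Illustrative subset of language codes, not the corpus's own identifiers.
    print(language_pairs(["en", "de", "fr"]))
    # [('de', 'en'), ('de', 'fr'), ('en', 'fr')]
```

Each language added beyond the 20 increases the pair count by the number of languages already present, so 22 languages would give 231 pairs.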
