Similarity-Based Estimation of Word Cooccurrence Probabilities

Computer Science – Computation and Language

Scientific paper


Details

13 pages, to appear in proceedings of ACL-94


In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
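The general idea of borrowing probability from "most similar" words can be illustrated with a minimal sketch. The abstract does not give the model's formulas, so everything here is an assumption made for illustration: cosine similarity over successor counts stands in for the paper's distributional similarity measure, the function names and the parameter `k` are hypothetical, and the Katz back-off integration is omitted.

```python
from collections import Counter, defaultdict
import math


def count_pairs(pairs):
    """Count how often each second word follows each first word."""
    counts = defaultdict(Counter)
    for w1, w2 in pairs:
        counts[w1][w2] += 1
    return counts


def mle(counts, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 for unseen pairs."""
    successors = counts.get(w1, {})
    total = sum(successors.values())
    return successors.get(w2, 0) / total if total else 0.0


def cosine(counts, a, b):
    """Cosine similarity of the successor-count vectors of a and b.

    Assumption: a stand-in similarity, not the measure used in the paper.
    """
    ca, cb = counts.get(a, {}), counts.get(b, {})
    dot = sum(ca[w] * cb[w] for w in ca if w in cb)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_estimate(counts, w1, w2, k=2):
    """Estimate P(w2 | w1) as a similarity-weighted average of
    P(w2 | w1') over the k words w1' most similar to w1."""
    neighbors = sorted(
        ((cosine(counts, w1, other), other) for other in counts if other != w1),
        reverse=True,
    )[:k]
    norm = sum(sim for sim, _ in neighbors)
    if norm == 0:
        return 0.0
    return sum(sim * mle(counts, other, w2) for sim, other in neighbors) / norm


# Toy data (invented for illustration): "eat pear" never occurs.
pairs = [
    ("eat", "peach"), ("eat", "apple"),
    ("chew", "peach"), ("chew", "pear"),
    ("drive", "car"),
]
counts = count_pairs(pairs)
unseen_mle = mle(counts, "eat", "pear")
unseen_sim = similarity_estimate(counts, "eat", "pear")
implausible_sim = similarity_estimate(counts, "eat", "car")
```

In this toy data the pair "eat pear" has a maximum-likelihood estimate of zero, but because "chew" shares successors with "eat", the similarity-weighted estimate is nonzero, while the implausible "eat car" remains at zero. In a back-off setting, an estimate of this kind would be used only for unseen bigrams and rescaled so that the conditional distribution still sums to one.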
