Towards the quantification of the semantic information encoded in written language

Physics – Physics and Society

Scientific paper

Details

19 pages, 4 figures

10.1142/S0219525910002530

Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, on the order of a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words that contribute most to the overall information are the ones most closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size, within which their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information.
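As a rough illustration of the kind of measurement the abstract describes, the sketch below splits a word sequence into equal-sized segments and computes the mutual information between word identity and segment index, then subtracts the average value obtained from shuffled copies of the same words. The abstract does not specify the estimator, so this formulation is an assumption, and the function names `segment_information` and `excess_information` are hypothetical rather than taken from the paper.

```python
import math
import random
from collections import Counter


def segment_information(words, segment_size):
    """Mutual information (in bits) between word identity and segment index.

    Assumption: the text is cut into equal segments of `segment_size` words
    and any trailing partial segment is dropped.
    """
    n = (len(words) // segment_size) * segment_size
    if n == 0:
        raise ValueError("text is shorter than one segment")
    words = words[:n]
    num_segments = n // segment_size
    total = Counter(words)  # word counts over the whole (truncated) text
    info = 0.0
    for s in range(num_segments):
        segment = Counter(words[s * segment_size:(s + 1) * segment_size])
        for word, count in segment.items():
            # p(w, s) = count / n, p(w) = total[w] / n, p(s) = 1 / num_segments
            info += (count / n) * math.log2(count * num_segments / total[word])
    return info


def excess_information(words, segment_size, shuffles=20, seed=0):
    """Information of the ordered text minus the mean over shuffled copies.

    Shuffling keeps word frequencies but destroys word order, so the
    difference isolates the part of the information that is due to ordering.
    """
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(shuffles):
        shuffled = list(words)
        rng.shuffle(shuffled)
        baseline += segment_information(shuffled, segment_size)
    return segment_information(words, segment_size) - baseline / shuffles


if __name__ == "__main__":
    # Toy text: each word is confined to its own domain of four words.
    toy = ["sea"] * 4 + ["sky"] * 4
    print(segment_information(toy, 4))
    print(excess_information(toy, 4))
```

Evaluating `excess_information` over a range of segment sizes on a long tokenized text, and looking for the size at which it peaks, would be one way to probe for a characteristic scale of the kind the abstract reports. The per-word terms inside the sum can likewise be accumulated separately to rank words by their contribution.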
