Computer Science – Computation and Language
Scientific paper
2008-10-07
Verbum ex machina: Actes de la 13eme conference annuelle sur le Traitement Automatique des Langues Naturelles (TALN 2006), p.
In French. 10 pages, 5 figures, LaTeX 2e using EPSF and custom package taln2006.sty (designed by Pierre Zweigenbaum, ATALA). P
We propose a theoretical framework within which information on the vocabulary of a given corpus can be inferred from statistical information gathered on that corpus. Inferences can be made about the categories of the words in the vocabulary and about their syntactic properties within particular languages. From the same statistical data, it is possible to build matrices of syntagmatic similarity (bigram transition matrices) or paradigmatic similarity (the probability for any pair of words to share common contexts). When clustered with respect to their syntagmatic similarity, words tend to group into sublanguage vocabularies; when clustered with respect to their paradigmatic similarity, they group into syntactic or semantic classes. Experiments have explored the first of these two possibilities. Their results are interpreted within the framework of a Markov chain model of the corpus's generative process(es): we show that the results of a spectral analysis of the transition matrix can be interpreted as probability distributions of words within clusters. This method yields a soft clustering of the vocabulary into sublanguages, each of which contributes to the generation of heterogeneous corpora. As an application, we show how multilingual texts can be visually segmented into linguistically homogeneous segments. Our method is particularly useful in the case of related languages that happen to be mixed in corpora.
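The pipeline described in the abstract (bigram transition matrix, spectral analysis, segmentation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `bigram_transition_matrix` and `two_way_spectral_split`, the toy English/French corpus, and the use of the sign of the second eigenvector to obtain a hard two-way split are all assumptions made for illustration. The paper itself describes a soft clustering, which this sketch simplifies.

```python
import numpy as np

def bigram_transition_matrix(tokens):
    """Row-stochastic matrix: P[i, j] estimates P(next = word j | current = word i)."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(tokens, tokens[1:]):
        counts[index[a], index[b]] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # Words with no observed successor get a uniform row (assumption of this sketch).
    safe = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, counts / safe, 1.0 / len(vocab))
    return vocab, P

def two_way_spectral_split(vocab, P):
    """Hard split of the vocabulary by the sign of the second right eigenvector of P.

    When P is close to block-diagonal (few transitions between two sublanguages),
    the eigenvector of the second-largest eigenvalue is roughly constant on each
    block, with opposite signs on the two blocks.
    """
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)      # largest real part first (eigenvalue 1)
    second = eigvecs[:, order[1]].real
    return {w: int(x > 0) for w, x in zip(vocab, second)}

# Toy mixed corpus (hypothetical data): alternating English and French stretches.
english = "the cat sees the dog".split()
french = "le chat voit le chien".split()
tokens = english * 20 + french * 20 + english * 20 + french * 20

vocab, P = bigram_transition_matrix(tokens)
labels = two_way_spectral_split(vocab, P)

# Segment the text: a boundary is any position where the cluster label changes.
segments = [labels[w] for w in tokens]
boundaries = [i for i in range(1, len(segments)) if segments[i] != segments[i - 1]]
```

The cluster numbering (0 or 1) is arbitrary, because an eigenvector is only defined up to sign; only the grouping of words is meaningful. On real corpora with more than two sublanguages, several leading eigenvectors would be needed, and their entries would be read as graded memberships rather than thresholded.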
Henry Claudia
Nock Richard
Vaillant Pascal
Analyse spectrale des textes: détection automatique des frontières de langue et de discours (Spectral analysis of texts: automatic detection of language and discourse boundaries)