Abstract
Recently, methods from the statistical physics of complex systems have been applied successfully to identify universal features in the long-range correlations (LRCs) of written texts. However, in real texts, these universal features are intermingled with language-specific influences. This paper aims to characterize and better understand the interplay between universal and language-specific effects on the LRCs in texts. To this end, we apply the language-sensitive mapping of written texts to word-length series (wls) and analyse large parallel corpora (i.e., of the same content) in 10 languages classified into four families (Romanic, Germanic, Greek and Uralic). The autocorrelation functions of the wls reveal tiny but persistent LRCs that decay at large scales as a power law with a language-independent exponent ∼0.60–0.65. The impact of language appears in the amplitude of the correlations, where a relative standard deviation >40% among the analysed languages is observed. The classification into language families seems to play a significant role, since the Finnish and Germanic languages exhibit stronger correlations than the Greek and Romanic families. To reveal the origins of the LRCs, we focus on the long words and perform burst and correlation analysis of their positions along the corpora. We find that the universal features are linked more to the correlations of the inter-long-word distances, while the language-specific aspects are related more to their distributions.
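
As an illustration of the mapping described in the abstract, the following minimal Python sketch turns a text into a word-length series and evaluates its normalized autocorrelation at a given lag. It is not the authors' implementation: the function names `word_length_series` and `autocorrelation`, the letter-run tokenization rule, and the choice of estimator are all assumptions made here for illustration.

```python
import re


def word_length_series(text):
    """Map a text to its word-length series (wls): one integer per word."""
    # Assumption: a "word" is a maximal run of Unicode letters, so digits,
    # underscores, whitespace and punctuation all act as separators.
    words = re.findall(r"[^\W\d_]+", text)
    return [len(w) for w in words]


def autocorrelation(series, lag):
    """Normalized sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Average the lagged products over the n - lag available pairs.
    cov = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return cov / var


wls = word_length_series("The cat sat on the mat.")
print(wls)  # [3, 3, 3, 2, 3, 3]
```

On a real corpus one would evaluate `autocorrelation` over a range of lags and inspect the decay on log-log axes. A power law with the exponent reported in the abstract would appear there as an approximately straight line at large lags.
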
Original language | English |
---|---|
Journal | International Journal of Modern Physics B |
Volume | 30 |
Issue number | 15 |
Early online date | 20 Aug 2015 |
DOIs | |
Publication status | Published - 2016 |
Publication type | A1 Journal article-refereed |
Publication forum classification
- Publication forum level 1