Abstract
Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization -- i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety -- as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained byT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.
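As a rough illustration of the "sliding windows of three words" input format combined with character-level tokenization, the following minimal sketch may help. The abstract does not specify how the windows are constructed, so the stride of one word, the underscore used as a word-boundary marker, the function names `sliding_windows` and `to_char_tokens`, and the example sentence are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def sliding_windows(words: List[str], size: int = 3) -> List[List[str]]:
    """Return overlapping windows of `size` consecutive words (stride 1).

    Sentences with at most `size` words are returned as a single window.
    The stride of 1 is an assumption; the abstract does not state it.
    """
    if len(words) <= size:
        return [words]
    return [words[i:i + size] for i in range(len(words) - size + 1)]


def to_char_tokens(window: List[str], boundary: str = "_") -> List[str]:
    """Split a window into single-character tokens.

    Words are joined with `boundary` (a hypothetical marker) so that
    word boundaries remain visible to a character-level model.
    """
    return list(boundary.join(window))


# Invented example sentence, used only to show the data shape.
sentence = "mie oon täällä nyt".split()
for window in sliding_windows(sentence):
    print(" ".join(to_char_tokens(window)))
```

Each printed line is one character-tokenized window that could serve as a source sequence for a sequence-to-sequence model; the full-sentence setting mentioned in the abstract would instead pass the whole sentence as a single sequence.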
Original language | English |
---|---|
Title | Findings of the Association for Computational Linguistics: EMNLP 2023 |
Editors | Houda Bouamor, Juan Pino, Kalika Bali |
Place of publication | Singapore |
Publisher | Association for Computational Linguistics |
Pages | 13814-13828 |
Number of pages | 15 |
ISBN (print) | 979-8-89176-061-5 |
DOI - permanent links | |
Status | Published - 1 Dec 2023 |
OKM publication type | A4 Article in conference proceedings |
Event | Conference on Empirical Methods in Natural Language Processing, Singapore. Duration: 6 Dec 2023 → 10 Dec 2023 |
Conference
Conference | Conference on Empirical Methods in Natural Language Processing |
---|---|
Abbreviated title | EMNLP |
Country/Territory | Singapore |
Period | 6/12/23 → 10/12/23 |
Funding
Academy of Finland through project No. 342859 “CorCoDial – Corpus-based computational dialectology”
Funders | Funder number |
---|---|
Suomen Akatemia / Academy of Finland | 342859 |
Publication Forum level
- JUFO level 1