Murre24: Dialect Identification of Finnish Internet Forum Messages

Research output: Conference article, Scientific, peer-reviewed

Abstract

This paper presents Murre24, a collection of dialectal messages posted on the largest Finnish internet forum, Suomi24. The messages posted in Finnish on the forum between 2001 and 2020 are classified as representing either the standard language, one of the seven traditional dialects, a colloquial style, or Helsinki slang. We present a manually annotated dataset used to train dialect identification models, as well as the automatic annotation of almost 94 million messages in total. We experiment with five different dialect identification methods and evaluate them on dialectally balanced and random test samples. The best-performing method for differentiating standard Finnish from non-standard Finnish is a character n-gram based support vector machine (SVM), while fine-tuning a BERT-based model achieves the best scores in the final dialect identification task. According to the automatic classification, most of the messages written on the forum are in standard Finnish, and most of the non-standard messages are in a colloquial variety typically used by young speakers in Finland. Moreover, we show that the proportion of non-standard messages declines over time, while the proportion of the traditional dialects stays relatively steady.
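To make the idea of character n-gram features concrete, the following is a minimal sketch in plain Python. It is not the paper's method: the abstract reports an SVM over character n-grams, whereas this sketch swaps in a much simpler nearest-centroid classifier with cosine similarity so that it runs without external libraries. The function names, the n-gram range of 1 to 3, and the two toy training sentences are all assumptions made for illustration; they do not come from the Murre24 dataset.

```python
from collections import Counter
import math


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams (lengths n_min..n_max) of a lowercased,
    space-padded string. The n-gram range is an illustrative assumption."""
    padded = f" {text.lower()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(examples):
    """Sum the n-gram counts of all training texts per label.
    This is a nearest-centroid stand-in, NOT the SVM used in the paper."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(char_ngrams(text))
    return centroids


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    features = char_ngrams(text)
    return max(centroids, key=lambda label: cosine(features, centroids[label]))


# Hypothetical toy examples, not taken from the Murre24 data.
toy_examples = [
    ("minä olen kotona", "standard"),
    ("mä oon kotona", "colloquial"),
]
centroids = train_centroids(toy_examples)
print(classify("mä oon", centroids))
```

Character n-grams suit this kind of task because dialectal and colloquial variants often differ from the standard form in a few characters (for example "minä olen" versus "mä oon"), which word-level features would treat as unrelated tokens.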
Original language: English
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Place of publication: Torino, Italy
Publisher: European Language Resources Association (ELRA)
Pages: 12003-12015
Number of pages: 13
ISBN (electronic): 978-2-493814-10-4
Status: Published - May 2024
OKM publication type: A4 Article in conference proceedings
Event: Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20 May 2024 to 25 May 2024

Publication series

Name: LREC proceedings
Publisher: European Language Resources Association (ELRA)
ISSN (electronic): 2522-2686
Name: International Conference on Computational Linguistics
Publisher: International Committee on Computational Linguistics
ISSN (print): 2951-2093

Conference

Conference: Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Country/Territory: Italy
City: Torino
Period: 20/05/24 to 25/05/24

Funding

This work has been supported by the Academy of Finland through project No. 342859 "CorCoDial – Corpus-based computational dialectology" and by the Kone Foundation through project "LANGAWARE".

Publication Forum level

  • Jufo level 1
