Text Representation Methods for Big Social Data

Research output: Book/Report > Doctoral thesis > Collection of Articles

Abstract

The widespread use of digital platforms and the exponential growth of online user-generated content have increased the need for efficient systems to manage and analyze vast amounts of information. Part of this information is generated by users in the form of posts, which consist primarily of textual data. To make such text usable, text representation methods transform the raw text into numerical representations that machine learning algorithms can process. This thesis evaluates the suitability of various text representation methods in the data context of online digital platforms, defined as Big Social Data. First, the research aims to identify suitable text representation methods for various text analysis tasks. Second, it explores how text representations can enhance matching in microblogging platforms. Third, it presents an approach that integrates multiple text representation methods to leverage their individual strengths and improve performance.

Text representation methods were assessed in two multi-label classification (MLC) applications and one duplicate-post classification task. The first MLC application used six text representation methods to predict post tags in an online question-and-answer forum. The second application evaluated four classification models as keyword-suggestion tools for labeling social science survey studies. The traditional text representation methods performed better in the second MLC application, whereas neural network-based methods showed better results in the duplicate-post classification task.

To address certain limitations of text representation methods, this thesis investigates integrating multiple text representation methods, thus leveraging their individual strengths. Specifically, it evaluates a proposed framework that combines multiple text representation methods, based on the ensemble learning approach, in the duplicate-post classification task. The approach achieved higher accuracy when employing several text representation methods than when using any single one. Furthermore, the results showed that combining methods with different properties can further improve performance.
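
The idea of combining representations can be illustrated with a minimal sketch. It is not the framework evaluated in the thesis: the two representations (a set of words and character trigram counts), the averaging rule, and the 0.5 threshold are all assumptions chosen only to keep the example short and dependency-free.

```python
from collections import Counter
from math import sqrt


def word_jaccard(a: str, b: str) -> float:
    """Similarity computed from a set-of-words representation."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def char_ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity computed from character n-gram counts."""
    def grams(text: str) -> Counter:
        t = text.lower()
        return Counter(t[i:i + n] for i in range(len(t) - n + 1))

    ga, gb = grams(a), grams(b)
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0


def ensemble_is_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Soft voting: average the per-representation scores, then threshold.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    scores = [word_jaccard(a, b), char_ngram_cosine(a, b)]
    return sum(scores) / len(scores) >= threshold
```

Because the word-level and character-level views fail in different ways (the first misses misspellings, the second is noisier on short posts), averaging them is one simple way to combine methods with different properties.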

Finally, this research designed an approach to complement social network analysis with text representation methods for enhancing user matching in microblogging platforms. Through a combination of user-centered social networks and content analysis, various matching strategies were defined in the context of a popular microblogging platform. A text representation method was used to enable measuring the text-based content similarity between users. The results obtained from a user experience evaluation indicated favorable feedback for the employed approach.
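
A text-based similarity between users can be sketched as follows. The abstract does not name the representation method used, so this example assumes plain bag-of-words counts and cosine similarity; the function names and sample data are hypothetical, and the social network side of the matching strategies is not modeled.

```python
from collections import Counter
from math import sqrt


def user_vector(posts: list[str]) -> Counter:
    """Bag-of-words counts over all of a user's posts (assumed representation)."""
    return Counter(word for post in posts for word in post.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def rank_matches(target: str, posts_by_user: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Rank every other user by content similarity to the target user."""
    vectors = {name: user_vector(posts) for name, posts in posts_by_user.items()}
    scores = [
        (name, cosine(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical sample data for illustration only.
posts_by_user = {
    "ana": ["graph neural networks", "text embeddings"],
    "ben": ["text embeddings for search"],
    "cy": ["sourdough baking tips"],
}
ranking = rank_matches("ana", posts_by_user)
```

In a fuller pipeline, a score like this could be combined with network-based signals such as shared connections before candidates are suggested.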

This thesis assessed text representation methods in various online digital platforms and identified their strengths and limitations. It proposed an ensemble learning-based framework that combines multiple text representation methods to overcome these limitations, achieving higher accuracy in a classification task. Additionally, the thesis presented an approach that complements social network analysis with text representation methods for user matching in microblogging platforms, which received positive feedback in a user experience evaluation. These findings contribute to the development of more effective text representation methods and their applications in similar contexts.

Original language: English
Place of Publication: Tampere
Publisher: Tampere University
ISBN (Electronic): 978-952-03-2976-1
ISBN (Print): 978-952-03-2975-4
Publication status: Published - 2023
Publication type: G5 Doctoral dissertation (articles)

Publication series

Name: Tampere University Dissertations - Tampereen yliopiston väitöskirjat
Volume: 830
ISSN (Print): 2489-9860
ISSN (Electronic): 2490-0028
