Assessing Text Representation Methods on Tag Prediction Task for StackOverflow

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

A large part of knowledge evolves outside of the operations of an organization. Online question-and-answer social platforms provide an important source of information for exploring the underlying communities. StackOverflow (SO) is one of the most popular question-and-answer platforms for developers, with more than 23 million questions asked. Organizing and categorizing data is crucial for managing knowledge in such large quantities. Questions posted on SO are assigned a set of tags, and the textual content of each question may contain code syntax. In this paper, we evaluate the performance of multiple text representation methods on the task of predicting tags for SO questions and empirically demonstrate the impact of code syntax on text representations. The SO dataset was sampled and questions without code syntax were identified. Two classical text representation methods, BoW and TF-IDF, were selected along with four other methods based on pre-trained models: fastText, USE, Sentence-BERT and Sentence-RoBERTa. A multi-label k-Nearest Neighbors classifier was used to learn and predict tags based on the similarities between feature-vector representations of the input data. Our results indicate a consistent superiority of the representations generated by Sentence-RoBERTa. Overall, the classifier achieved an improvement of 17% or more in F1 score when predicting tags for questions whose content contains no code syntax.
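The pipeline described in the abstract (represent each question as a feature vector, then predict tags from its nearest neighbours) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy questions and tags, the whitespace tokenizer, the smoothed IDF formula, the values of `k` and `threshold`, and all function names. The voting rule is a simplified stand-in for a multi-label k-Nearest Neighbors classifier, not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real pipelines do more cleaning).
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency for every term seen in the training docs.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse, L2-normalised TF-IDF vector; terms unseen in training are dropped.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Dot product of two L2-normalised sparse vectors equals cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def predict_tags(question, train_vectors, train_tags, idf, k=3, threshold=0.5):
    # Simplified multi-label kNN: keep tags carried by at least
    # `threshold` of the k most similar training questions.
    q = tfidf_vector(question, idf)
    nearest = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(q, train_vectors[i]),
        reverse=True,
    )[:k]
    votes = Counter()
    for i in nearest:
        votes.update(train_tags[i])
    return {tag for tag, count in votes.items() if count / k >= threshold}


# Hypothetical toy data, invented for this sketch.
train_questions = [
    "how to merge two pandas dataframes on a column",
    "pandas dataframe groupby sum by column",
    "python list comprehension with condition",
    "center a div with css flexbox",
    "css grid layout columns not aligning",
    "javascript promise async await error handling",
]
train_tags = [
    {"python", "pandas"},
    {"python", "pandas"},
    {"python"},
    {"css"},
    {"css"},
    {"javascript"},
]

idf = fit_idf(train_questions)
train_vectors = [tfidf_vector(q, idf) for q in train_questions]

print(sorted(predict_tags("pandas dataframe merge column", train_vectors, train_tags, idf)))
```

Swapping `tfidf_vector` for a function that returns dense sentence embeddings (with a matching similarity function) would leave `predict_tags` unchanged, which is what makes this kind of setup convenient for comparing representation methods against one fixed classifier.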

Original language: English
Title of host publication: Proceedings of the 56th Annual Hawaii International Conference on System Sciences, HICSS 2023
Editors: Tung X. Bui
Pages: 585-594
Number of pages: 10
ISBN (Electronic): 9780998133164
Publication status: Published - 2023
Publication type: A4 Article in conference proceedings
Event: Hawaii International Conference on System Sciences - Maui, Hawaii, United States
Duration: 3 Jan 2023 to 6 Jan 2023

Publication series

Name: Proceedings of the Annual Hawaii International Conference on System Sciences
ISSN (Electronic): 2572-6862

Conference

Conference: Hawaii International Conference on System Sciences
Country/Territory: United States
City: Maui, Hawaii
Period: 3/01/23 to 6/01/23

Keywords

  • Knowledge-intensive work
  • Q&A forums
  • StackOverflow
  • Tag prediction
  • Text representation

Publication forum classification

  • Publication forum level 1

ASJC Scopus subject areas

  • General Engineering
