SQuaD: The Software Quality Dataset - Dataset

  • Mikel Robredo (Creator)
  • Matteo Esposito (Creator)
  • Davide Taibi (Creator)
  • Rafael Peñaloza (Creator)
  • Valentina Lenarduzzi (University of Oulu) (Creator)

Dataset

Description

This Zenodo record is a landing page for "SQuaD: The Software Quality Dataset", submitted to the MSR 2026 Data and Tool Showcase Track. It links to each of the supplementary materials (see below).

Version: 1.0
DOI: https://doi.org/10.5281/zenodo.17566690
Authors: Mikel Robredo, Matteo Esposito, Davide Taibi, Rafael Peñaloza, Valentina Lenarduzzi
Affiliations: University of Oulu, University of Southern Denmark, University of Milano-Bicocca

Access and Usage

The dataset and all supplementary materials are available through Zenodo and IDA* repositories:

CSV Raw Data (IDA): https://doi.org/10.23729/fd-c528d131-2c8c-3e61-91f1-a075931e73dc

MongoDB BSON (IDA): https://doi.org/10.23729/fd-f9dc7d2c-0465-3991-961f-56128ee518d0

Replication Package (Zenodo): https://doi.org/10.5281/zenodo.17541471

* On IDA: IDA (ida.fairdata.fi) is a research data storage service organized by the Finnish Ministry of Education and Culture and produced by CSC – IT Center for Science. It is intended for storing stable research data, both raw and processed, that are part of research datasets published in the FAIRdata (FAIR: Findable, Accessible, Interoperable, and Reusable) Etsin service. The service is free of charge for users affiliated with Finnish universities, polytechnics, and research institutes.

Each link corresponds to a specific data access format, along with replication scripts and diagrams for database structure.

Main abbreviations:

Static Analysis Tool (SAT): A software static analysis tool is an automated program that examines a software's source code without executing it to find potential bugs, security vulnerabilities, and deviations from coding standards.

Issue Tracking System (ITS): A software issue tracking system is a tool used to manage and track software bugs, feature requests, and other problems from initial report to final resolution. It acts as a centralized database, allowing teams to create, assign, and monitor issues, ensuring a structured and organized approach to problem-solving and collaboration.

Overview

The Software Quality Dataset (SQuaD) is a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel.

SQuaD integrates nine state-of-the-art Static Analysis Tools (SATs) and combines both product and process metrics to support large-scale empirical research on software quality, maintainability, evolution, and technical debt.

The dataset, submitted in 2025 to the MSR 2026 Data and Tool Showcase Track, is the result of a seven-month large-scale mining effort.

Dataset Summary

Projects analyzed: 450 open-source projects
Releases analyzed: 63,586 releases/tags
Static Analysis Tools: 9 tools (SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, PyRef)
Unique metrics: 725 metrics
Defect tickets: 628,178
Commits analyzed: 2,622,413
Detected vulnerabilities: 1,479 CVEs and 175 CWEs
Average project age: 9 years
Average LOC per project: 125,500
Average GitHub stars: 2,465
Average contributors: 104

Data Contents

The dataset includes a variety of entities and metric tables, covering product, process, and vulnerability information. Each entity corresponds to a CSV table or a MongoDB collection:

PROJECTS: GitHub repository metadata
COMMITS: Commit hash, message, date, author alias
ISSUES: Issue tickets from GitHub, Jira, and Bugzilla
RELEASES: Identifiers of project releases and related commit hashes
GITHUB_METRICS: Stars, contributors, watchers, and project statistics
PRJ_ITS_VLN_LINKAGE: Links between projects, issue trackers, and detected vulnerabilities
CVE / CWE: Official vulnerability and weakness data from NIST and MITRE
PROCESS_METRICS: 14 process metrics computed for each release
TOOL tables: Output metrics from each SAT at method, class, file, and project levels

Available Formats

SQuaD is distributed in two complementary formats to facilitate different research and analysis needs:

1. CSV Format

  • Each entity is provided as a separate CSV file.
  • Ideal for direct exploration, statistical analysis, and integration into scripts or notebooks.
  • Mirrors the same relational structure as the MongoDB database.
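
Before loading a large CSV table into a script or notebook, it can help to look at its column names first. The following is a minimal sketch, not part of the dataset's own tooling. The helper name csv_columns and the file name COMMITS.csv are illustrative assumptions, and the sketch assumes the first line is a comma-separated header with no quoted commas.

```shell
# csv_columns FILE: print each header column of FILE on its own line.
# Assumption: the first line is a plain comma-separated header (no quoted commas).
csv_columns() {
  # head reads only the first line, so this stays fast on very large files;
  # tr -d '\r' drops a Windows line ending if one is present.
  head -n 1 "$1" | tr -d '\r' | tr ',' '\n'
}

# Hypothetical usage (file name is an assumption):
#   csv_columns COMMITS.csv
```

Row counts are deliberately left out of the sketch: fields such as commit messages may contain line breaks, so counting lines with wc -l would not give a reliable number of records.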

2. MongoDB Format

  • A NoSQL version of the dataset is provided as a Zstandard-compressed BSON dump.

  • Can be imported into MongoDB for scalable querying and time-aware analyses.

  • Recommended for researchers dealing with large-scale data analytics or custom pipelines.

NOTE: The full dataset is approximately 1.9 TB, so ensure sufficient storage and RAM before extraction and import.
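
A quick way to act on this note is to check the free space on the target volume before starting. The sketch below is one possible check, not a required step. The function name has_free_space, the /data path, and the 2000 GB threshold in the usage line are assumptions; choose a threshold that covers the archive, the extracted files, and the restored database on your system.

```shell
# has_free_space DIR REQUIRED_GB: succeed if DIR has at least REQUIRED_GB GiB available.
has_free_space() {
  local dir="$1" required_gb="$2"
  local avail_kb
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # on the second line, column 4 is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Hypothetical usage (path and threshold are placeholders):
#   has_free_space /data 2000 || echo "Not enough free space on /data" >&2
```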

Step 1 — Decompress the Archive (Zstandard)

The dataset is distributed as a .tar.zst file. To extract it, install Zstandard and decompress as follows:

# Install Zstandard on Debian/Ubuntu (if not already installed)
sudo apt install zstd

# Decompress the archive (this may take several hours)
unzstd SQuaD_MongoDB_Dump.tar.zst

# Extract the BSON dump files
tar -xvf SQuaD_MongoDB_Dump.tar
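
The two commands above first write an uncompressed .tar file and then unpack it, so the data temporarily occupies disk space twice. As an alternative sketch, the decompressed stream can be piped straight into tar. The function name extract_zst_tar and the destination directory are assumptions. The flags used are zstd -d -c (decompress to standard output) and tar -x -f - (read the archive from standard input).

```shell
# extract_zst_tar ARCHIVE DEST: stream-decompress ARCHIVE and unpack it into DEST
# without writing an intermediate .tar file.
extract_zst_tar() {
  local archive="$1" dest="$2"
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  mkdir -p "$dest" || return 1
  # zstd -dc writes the decompressed bytes to stdout;
  # tar -xf - reads the tar stream from stdin and unpacks it under DEST.
  zstd -dc "$archive" | tar -xf - -C "$dest"
}

# Hypothetical usage (destination path is a placeholder):
#   extract_zst_tar SQuaD_MongoDB_Dump.tar.zst /data/squad_dump
```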


Step 2 — Import into MongoDB

Once extracted, restore the dump with mongorestore (part of the MongoDB Database Tools):

# Example: restore entire database
mongorestore --db squad_db /path/to/SQuaD_MongoDB_Dump
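
The command above restores the whole database. When storage is limited or only a few tables are needed, restoring one collection at a time may be enough. The sketch below wraps mongorestore's --db and --collection options, which accept a single .bson file. It assumes each collection is stored as COLLECTION.bson directly inside the dump directory. That layout, the function name restore_collection, and the DRY_RUN switch are illustrative assumptions, so check the extracted files before relying on it.

```shell
# restore_collection DUMP_DIR DB COLLECTION: restore one collection from its BSON file.
# Assumption: the dump directory contains COLLECTION.bson for each collection.
# Set DRY_RUN=1 to print the command instead of running it.
restore_collection() {
  local dump_dir="$1" db="$2" coll="$3"
  local bson="$dump_dir/$coll.bson"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "mongorestore --db $db --collection $coll $bson"
    return 0
  fi
  [ -f "$bson" ] || { echo "BSON file not found: $bson" >&2; return 1; }
  mongorestore --db "$db" --collection "$coll" "$bson"
}

# Hypothetical usage (collection name taken from the table list; path is a placeholder):
#   restore_collection /path/to/SQuaD_MongoDB_Dump squad_db COMMITS
```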

Methodology Overview

The dataset construction follows four key stages (illustrated in the paper’s Figure 1):


1. Mining version control data

  • Cloned 501 repositories (filtered to 450 active, mature projects).
  • Retrieved commits, tags, issues, and metadata from issue tracking systems (ITS) such as GitHub, Jira, and Bugzilla.

2. Mining software quality metrics

  • Applied nine SATs in parallel across all releases.
  • Extracted metrics at multiple granularity levels (method, class, file, project).

3. Extracting vulnerabilities

  • Parsed CVE and CWE references from issue tickets.
  • Fetched official vulnerability descriptions via NIST and MITRE APIs.

4. Collecting process metrics

  • Computed 14 release-level process metrics (e.g., churn, contributor count, commit density) using GitPython.
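
To give an idea of what parsing CVE and CWE references from issue tickets can look like, here is a minimal sketch based on a regular expression. It is not the authors' implementation, whose exact matching rules are not described here. The function name extract_vuln_ids and the sample ticket text are assumptions. The pattern relies on the public identifier formats: CVE-YYYY-NNNN with four or more digits in the last part, and CWE- followed by a number.

```shell
# extract_vuln_ids: read ticket text on stdin and print the unique
# CVE and CWE identifiers it mentions, one per line, sorted.
# Matching is case-sensitive, so lowercase mentions are not picked up.
extract_vuln_ids() {
  # grep -o prints only the matched identifiers; sort -u removes repeats.
  grep -oE 'CVE-[0-9]{4}-[0-9]{4,}|CWE-[0-9]+' | sort -u
}

# Hypothetical usage with made-up ticket text:
#   printf 'Fixes CVE-2021-44228 (see CWE-502)\n' | extract_vuln_ids
```

The identifiers found this way would then serve as lookup keys for the official descriptions, which the dataset obtains from NIST and MITRE.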

Research Opportunities

SQuaD provides a comprehensive foundation for a variety of software engineering research domains:

  • Software evolution and maintainability analysis
  • Defect prediction and Just-In-Time learning
  • Technical debt and code smell benchmarking
  • Refactoring impact analysis
  • Software vulnerability detection and risk assessment
  • Transformer-based and AI-driven quality modeling

Its combination of product and process metrics supports both statistical and machine learning–based investigations.

Acknowledgments

This work was supported by:

  • CSC – IT Center for Science, Finland (Mahti Supercomputer, Allas Cloud Storage, cPouta services)
  • FAST Doctoral Research Network, funded by the Finnish Ministry of Education and Culture
  • SciTools, for providing academic support and licenses for Understand

Date made available: 8 Nov 2025
Publisher: Zenodo

Field of science, Statistics Finland

  • 113 Computer and information sciences
