Design of a Data Preprocessing Agent Program for Data Mining: Variability Viewpoint

Markus Vattulainen

Research output: Book/Report › Doctoral thesis › Collection of Articles

Abstract

The objective of data preprocessing as part of data mining is to enable the execution of a data mining task such as classification and to support the achievement of high-quality data mining task outcomes. Preprocessing data to support the achievement of high-quality data mining task outcomes requires, first, the identification of multiple and possibly covarying problematic data quality characteristics (e.g., missing values, outliers, irrelevant features, duplicates), and second, finding the best combination of preprocessing techniques to apply. The combinatorial search problem is substantial, because there are alternative preprocessing techniques for each problematic data quality characteristic, and interaction effects of techniques applied in a sequence are common. Data preprocessing is, consequently, still often time-consuming manual work, and its outcomes remain suboptimal.
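
The size of this search space can be illustrated with a minimal sketch. The phase names, technique names, and counts below are hypothetical and chosen only for illustration; they are not taken from the thesis. The sketch also fixes the order of the phases, whereas interaction effects in a sequence would enlarge the space further if the order were allowed to vary.

```python
from itertools import product

# Hypothetical phases and technique names, for illustration only.
PHASES = {
    "missing_values": ["none", "mean_impute", "knn_impute"],
    "outliers": ["none", "winsorize", "remove"],
    "feature_selection": ["none", "variance_filter", "correlation_filter"],
    "duplicates": ["keep", "drop"],
}


def enumerate_combinations(phases):
    """Return every pipeline as a tuple with one technique per phase, in phase order."""
    return list(product(*phases.values()))


combinations = enumerate_combinations(PHASES)
print(len(combinations))   # 3 * 3 * 3 * 2 = 54 candidate pipelines
print(combinations[0])     # ('none', 'none', 'none', 'keep')
```

Each additional phase or alternative technique multiplies the number of candidate pipelines, which is why evaluating all of them quickly becomes expensive.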

Preprocessing agents are a promising solution for preprocessing automation and optimization, but two gaps remain: first, system variability, or the ability to change or configure a system, and second, the evaluation of agent performance with real industrial data sets. The variability gap limits the generality of the agent programs, as application domain-specific preprocessing requirements cannot be met without recoding. The evaluation gap leaves uncertainty regarding the viability of the agent approach and, thus, limits efforts to build and study such agents.

Design research is used as an approach and the problem is conceptualised as an artefact design task: that is, to construct a high variability data preprocessing agent design artefact that can be materialised as a computer program. The system variability viewpoint in the context of building the specific artefact is broken down into five sub-questions derived from variation point abstraction levels: Where is variability needed? What are the critical components to support variability? What are the internal variants of the critical components? What kinds of classes are needed to implement interfaces for variability? What runtime variability points are there?

The building phase characterises the problem domain (the agent task environment) and organises the preprocessing concepts so that related research contributions can be used for the design task. For solution domain understanding, system components are designed and tested, and the existing agent models are analysed for comparison. The evaluation phase addresses three questions: How well does it (the partially implemented system) work? Is it an improvement over the existing agent models? Is the design process justifiable?

A system component model (design artefact) is presented as a primary contribution to knowledge addressing the variability gap. Variability needs are identified in setting the preprocessing agent’s goal, in how the agent perceives problematic data quality characteristics, in setting preprocessing phases and techniques as actions of the agent, and in creating synthetic training data. A mechanism is defined for implementing variability as a single component that encapsulates the expected changes or configurations to the other system components.
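
The idea of a single component that encapsulates expected changes can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's design: the class name `VariabilityComponent`, its fields, and the `drop_duplicates` technique are all hypothetical. The sketch only shows how new phases and techniques could be registered in one place so that other parts of an agent need no recoding.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]
Technique = Callable[[List[Row]], List[Row]]


@dataclass
class VariabilityComponent:
    """Hypothetical single point of configuration for goal, phases and techniques."""
    goal: str = "classification_accuracy"
    phases: Dict[str, Dict[str, Technique]] = field(default_factory=dict)

    def register(self, phase: str, name: str, technique: Technique) -> None:
        # Adding a technique touches only this component.
        self.phases.setdefault(phase, {})[name] = technique

    def options(self, phase: str) -> List[str]:
        return sorted(self.phases.get(phase, {}))

    def apply(self, choices: Dict[str, str], rows: List[Row]) -> List[Row]:
        # Run the chosen technique of each phase in the order given by `choices`.
        for phase, name in choices.items():
            rows = self.phases[phase][name](rows)
        return rows


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Illustrative technique: keep the first occurrence of each identical row."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


config = VariabilityComponent()
config.register("duplicates", "keep", lambda rows: list(rows))
config.register("duplicates", "drop", drop_duplicates)
cleaned = config.apply({"duplicates": "drop"}, [{"x": 1.0}, {"x": 1.0}, {"x": 2.0}])
```

In this sketch, a domain-specific requirement such as a new imputation method would be met by one further `register` call rather than by changes to the search or evaluation code.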

Evidence of agent performance is provided from six industrial data sets from the business performance measurement system domain as a secondary contribution addressing the evaluation gap. By using a heuristic search over preprocessing combinations, the partially implemented design artefact achieved near-optimal results in classification tasks (median 98% of the accuracy of the best preprocessing combination) in 10% of the exhaustive search time. Partially synthetic data sets were used to generalise beyond the six cases. A limitation is that execution time was, on average, 10 minutes for a heuristic search from 600 preprocessing combinations for small- and medium-size data sets with no data updates.
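
How such a comparison with exhaustive search can be measured is sketched below. The abstract does not specify the heuristic, so uniform random sampling with a 10% evaluation budget is used here purely as a stand-in. The function `toy_accuracy` is a synthetic scoring function, not real classification results, and the 5 × 5 × 4 × 6 grid is a hypothetical way of obtaining 600 combinations.

```python
import random
from itertools import product


def toy_accuracy(combo):
    """Synthetic stand-in for a measured classification accuracy (not real data)."""
    a, b, c, d = combo
    return 0.5 + 0.04 * a + 0.03 * b + 0.02 * c + 0.01 * d - 0.02 * (a * b % 3)


def random_search(candidates, score, budget_fraction=0.1, seed=0):
    """Evaluate a random subset of the candidates and return the best one found."""
    rng = random.Random(seed)
    budget = max(1, round(len(candidates) * budget_fraction))
    sampled = rng.sample(candidates, budget)
    best = max(sampled, key=score)
    return best, score(best), budget


# Hypothetical grid of technique indices giving 600 combinations.
candidates = list(product(range(5), range(5), range(4), range(6)))

best_combo, best_score, budget = random_search(candidates, toy_accuracy)
exhaustive_best = max(toy_accuracy(c) for c in candidates)
ratio = best_score / exhaustive_best  # share of the best achievable score

print(f"evaluated {budget} of {len(candidates)} combinations, ratio = {ratio:.3f}")
```

The ratio plays the same role as the reported "percentage of the best preprocessing combination accuracy", although the value printed here reflects only the toy scoring function and says nothing about the thesis's results.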

The results imply that a single component encapsulating changes or configurations to the system makes it easier to accommodate data preprocessing research contributions such as novel preprocessing techniques into a more generic, application domain-independent data preprocessing agent architecture. Data preprocessing agents can achieve high fitness of data with real data sets in a classification task, so efforts to build such agents for industrial use can be justified.
Original language: English
Place of Publication: Tampere
Publisher: Tampere University
ISBN (Electronic): 978-952-03-2226-7
ISBN (Print): 978-952-03-2225-0
Publication status: Published - 2022
Publication type: G5 Doctoral dissertation (articles)

Publication series

Name: Tampere University Dissertations - Tampereen yliopiston väitöskirjat
Volume: 528
ISSN (Print): 2489-9860
ISSN (Electronic): 2490-0028
