Preventing keystroke based identification in open data sets

Juho Leinonen, Petri Ihantola, Arto Hellas

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

    8 Citations (Scopus)

    Abstract

    Large-scale courses such as Massive Open Online Courses (MOOCs) can be a great data source for researchers. Ideally, the data gathered on such courses should be openly available to all researchers, so that studies could be easily replicated and novel studies on existing data could be conducted. However, very fine-grained data such as source code snapshots can contain hidden identifiers. For example, distinct typing patterns that identify individuals can be extracted from such data. Hence, simply removing explicit identifiers such as names and student numbers is not sufficient to protect the privacy of the users who have supplied the data. At the same time, removing all keystroke information would decrease the value of the shared data significantly. In this work, we study how keystroke data from a programming context could be modified to prevent keystroke-latency-based identification whilst still retaining information that can be used, for example, to infer programming experience. We investigate the degree of anonymization required to render identification of students based on their typing patterns unreliable. Then, as a case study of whether the anonymized typing patterns retain at least some informative value, we study whether the modified keystroke data can still be used to infer the programming experience of the students. We show that it is possible to modify data so that keystroke-latency-based identification is no longer accurate, but the programming experience of the students can still be inferred, i.e. the data still has value to researchers. In a broader context, our results indicate that information and anonymity are not necessarily mutually exclusive.
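
    The abstract does not say which modification the authors apply to the keystroke data, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes keystrokes arrive as (key, timestamp in milliseconds) pairs and that coarsening means flooring every inter-key latency to a multiple of a chosen bucket width. The function names and the bucketing scheme are hypothetical illustrations, not the paper's method.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (key, timestamp in milliseconds); an assumed input format


def digraph_latencies(events: List[Event]) -> List[Tuple[Tuple[str, str], int]]:
    """Return ((first_key, second_key), latency_ms) for each consecutive key pair."""
    return [((k1, k2), t2 - t1) for (k1, t1), (k2, t2) in zip(events, events[1:])]


def coarsen(latency_ms: int, bucket_ms: int) -> int:
    """Floor a latency to the nearest lower multiple of bucket_ms."""
    return (latency_ms // bucket_ms) * bucket_ms


def anonymize(events: List[Event], bucket_ms: int) -> List[Event]:
    """Rebuild timestamps from zero so every inter-key latency is coarsened.

    Key order is preserved; only the timing detail is reduced. A larger
    bucket_ms removes more of the typing rhythm (an illustrative knob only).
    """
    if not events:
        return []
    out: List[Event] = [(events[0][0], 0)]
    for (_, t1), (k2, t2) in zip(events, events[1:]):
        out.append((k2, out[-1][1] + coarsen(t2 - t1, bucket_ms)))
    return out


# Hypothetical sample: typing "int " with latencies of 130, 85 and 275 ms.
sample = [("i", 1000), ("n", 1130), ("t", 1215), (" ", 1490)]
coarse = anonymize(sample, bucket_ms=100)
```

    With a 100 ms bucket, the sample latencies of 130, 85 and 275 ms become 100, 0 and 200 ms. The bucket width is the parameter that trades identifiability against the timing information that remains available for analysis.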

    Original language: English
    Title of host publication: L@S 2017 - Proceedings of the 4th (2017) ACM Conference on Learning at Scale
    Publisher: ACM
    Pages: 101-109
    Number of pages: 9
    ISBN (Electronic): 9781450344500
    DOIs
    Publication status: Published - 12 Apr 2017
    Publication type: A4 Article in conference proceedings
    Event: ACM Conference on Learning @ Scale
    Duration: 1 Jan 2000 → …

    Conference

    Conference: ACM Conference on Learning @ Scale
    Period: 1/01/00 → …

    Keywords

    • Data anonymization
    • Data privacy
    • Keystroke dynamics
    • Programming experience inference
    • Source code snapshots

    Publication forum classification

    • Publication forum level 1

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Education
    • Software
    • Computer Science Applications
