Understanding the Process of Data Labeling in Cybersecurity
Braun, T., Pekaric, I., Apruzzese, G. — ACM Symposium on Applied Computing (SAC), 2024.
One-liner: Nobody ever questioned "how labeling is done by cybersecurity practitioners". We try to uncover this mystery.
Abstract. Many domains now leverage the benefits of Machine Learning (ML), which promises solutions that can autonomously learn to solve complex tasks by training over some data. Unfortunately, in cyberthreat detection, high-quality data is hard to come by. Moreover, for some specific applications of ML, such data must be labeled by human operators. Many works “assume” that labeling is tough/challenging/costly in cyberthreat detection, thereby proposing solutions to address such a hurdle. Yet, we found no work that specifically addresses the process of labeling from the viewpoint of ML security practitioners. This is a problem: to date, it is still mostly unknown how labeling is done in practice—thereby preventing one from pinpointing “what is needed” in the real world.
In this paper, we take the first step to build a bridge between academic research and security practice in the context of data labeling. First, we reach out to five subject-matter experts and carry out open interviews to identify pain points in their labeling routines. Then, by using our findings as a scaffold, we conduct a user study with 13 practitioners from large security companies, and ask detailed questions on subjects such as active learning, costs of labeling, and revision of labels. Finally, we perform proof-of-concept experiments addressing labeling-related aspects in cyberthreat detection that are sometimes overlooked in research. Altogether, our contributions and recommendations serve as a stepping stone to future endeavors aimed at improving the quality and robustness of ML-driven security systems. We release our resources.