IACyC Proceedings - Ensemble-Based Machine Learning Models for Cybersecurity: Theoretical Guarantees and Empirical Insights

Authors

Zhivko Atanaskoski, Stefan Mirchevski, Vesna Dimitrova, and Aleksandra Popovska-Mitrovikj

Abstract

This paper provides a comparative study of ensemble classifiers (Random Forests, Rotation Forests, Bagging, AdaBoost, and Gradient Boosted Trees) on cybersecurity classification tasks. We first outline key theoretical aspects of these methods, including their bias-variance behavior, risk bounds, and relation to the Bayes optimal classifier. Building on this overview, we conduct an experimental analysis on a malware dataset, interpreting the empirical results in light of these theoretical properties, particularly with respect to excess risk. The framework incorporates feature preprocessing and multiple evaluation metrics (ROC AUC, PR AUC, accuracy, precision, recall, and F1 score). The results highlight the strong bias-variance balance of Random Forests and Gradient Boosted Trees, as well as the decorrelation advantage of Rotation Forests, and offer practical guidance for applying ensemble learning to vulnerability analysis. To our knowledge, this is the first comparative study that grounds ensemble classifiers on CIC-MalMem-2022 within a theoretical risk framework while also extracting feature-level insights relevant for operational malware detection. This integration highlights not only which models perform best, but why they succeed in adversarial cybersecurity settings.
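The evaluation framework described above (ensemble classifiers, feature preprocessing, and the six metrics) can be sketched as follows. This is an illustrative outline only, not the authors' code: synthetic data stands in for CIC-MalMem-2022, and Rotation Forests are omitted because scikit-learn provides no built-in implementation.

```python
# Sketch of the comparative evaluation pipeline (illustrative assumptions:
# synthetic binary data in place of the malware dataset; default
# hyperparameters; Rotation Forests omitted).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score)

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature preprocessing step (here, standardization fit on the training split)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # scores for threshold-free metrics
    pred = model.predict(X_te)               # hard labels for the rest
    results[name] = {
        "roc_auc": roc_auc_score(y_te, proba),
        "pr_auc": average_precision_score(y_te, proba),
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
    }
```

Comparing the per-model metric dictionaries in `results` then mirrors the kind of side-by-side analysis the abstract describes.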

Keywords

malware classification, ensemble methods, machine learning models, excess risk, convergence guarantees, cybersecurity