IACyC Proceedings - Robust Environmental Sound Classification via CNNs on a Unified, Imbalance-Aware Audio Dataset

Authors

Rim Tafech, Subrahmanya Rajesh Nayak, Vinay Vardhan Reddy Eega, Made Praveen Sombathina and Klaus Schwarz

Abstract

Environmental sound classification (ESC) underpins applications such as smart-city monitoring, security, and wildlife tracking. Traditional approaches based on handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCCs) often fail to capture the variety and complexity of real-world sounds and generalize poorly to new acoustic environments. Deep learning, particularly Convolutional Neural Networks (CNNs) applied to spectrograms, has markedly improved results in this field; however, building models that generalize across diverse acoustic settings and cope with the class imbalance typical of real-world data remains challenging. This work introduces a robust ESC system that addresses these problems. A large, unified dataset was created by combining UrbanSound8K [3], ESC-50 [2], and VocalSound [1], yielding over 22,000 well-balanced samples across 59 environmental and vocal sound classes. The audio clips were converted into 128-bin Mel-spectrograms and used as inputs to a modified ResNet18 CNN. To improve robustness and counter class imbalance, SpecAugment-style time and frequency masking was applied for data augmentation, and a class-weighted Focal Loss was used during training. The final model reached an accuracy of 91.43% on a held-out test set. The results show that the system handles a wide range of sound types and performs well even on under-represented classes, demonstrating that combining multiple datasets with modern deep learning techniques can produce a high-performing, generalizable ESC system.
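The two imbalance-handling techniques named in the abstract (SpecAugment-style masking of Mel-spectrograms and a class-weighted Focal Loss) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the function names, mask counts, mask widths, and the gamma value are assumptions, and masks here are forced to width at least 1 for simplicity.

```python
import numpy as np

def spec_augment(mel, num_freq_masks=2, freq_width=15,
                 num_time_masks=2, time_width=25, rng=None):
    """Zero out random frequency bands and time spans of a Mel-spectrogram
    (SpecAugment-style masking; widths are sampled in [1, max_width])."""
    rng = rng or np.random.default_rng()
    out = mel.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(1, freq_width + 1))
        f0 = int(rng.integers(0, n_mels - w + 1))
        out[f0:f0 + w, :] = 0.0          # mask a horizontal frequency band
    for _ in range(num_time_masks):
        w = int(rng.integers(1, time_width + 1))
        t0 = int(rng.integers(0, n_frames - w + 1))
        out[:, t0:t0 + w] = 0.0          # mask a vertical time span
    return out

def focal_loss(logits, targets, class_weights, gamma=2.0):
    """Class-weighted focal loss, mean of -alpha_c * (1 - p_c)^gamma * log(p_c).
    gamma > 0 down-weights easy (high-confidence) examples; class_weights
    (alpha) up-weights rare classes."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p_t = probs[np.arange(len(targets)), targets]         # prob of the true class
    alpha_t = class_weights[targets]                      # per-class weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))
```

With gamma = 0 and uniform class weights this reduces to ordinary cross-entropy; increasing gamma shrinks the loss contribution of samples the model already classifies confidently, which is what lets the training signal concentrate on the rarer, harder classes.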

Keywords

Environmental Sound Classification, Deep Learning, Convolutional Neural Networks, Data Augmentation, Focal Loss, Class Imbalance, ResNet