Spectral transcoder: using pretrained urban sound classifiers on undersampled spectral representations

Conference paper

Authors: Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre.

Conference: 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)

Publication date: 2023

Keywords: Convolutional Neural Network (CNN), Generative algorithm, Third-octave spectrogram, Mel spectrogram, Urban soundscape

Link to the HAL repository

Abstract

Slow and fast third-octave band representations (with one frame every 1 s and every 125 ms, respectively) have been a de facto standard in urban acoustics, used for example in long-term monitoring applications. They have the advantages of requiring little storage and of preserving privacy. Because most audio classification algorithms take Mel spectral representations with very fast time weighting (e.g., 10 ms) as input, very few studies have tackled classification tasks using other kinds of spectral representations of audio, such as slow or fast third-octave spectra. In this paper, we present a convolutional neural network architecture for transcoding fast third-octave spectrograms into Mel spectrograms, so that they can be used as input to robust pre-trained models such as YAMNet or PANN models. Compared to training a model that takes fast third-octave spectrograms directly as input, this approach is more effective and requires less training effort. Even though a fast third-octave spectrogram is less precise in both the time and frequency dimensions, experiments show that the proposed method still reaches a classification accuracy of 62.4% on UrbanSound8k and a macro AUPRC of 0.44 on SONYC-UST.
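
To make the input representation concrete, the following is a minimal sketch of how a fast (125 ms) third-octave spectrogram could be approximated from a waveform. It is not the authors' implementation: the function names, the Hann window, the band range (about 50 Hz to 12.7 kHz) and the FFT-bin summation are all illustrative assumptions, and a standard-compliant sound level meter would use a proper third-octave filter bank instead.

```python
import numpy as np


def third_octave_centers(n_low=-13, n_high=11):
    """Base-2 third-octave centre frequencies, f_c = 1000 * 2**(n / 3).

    The band index range is an illustrative assumption, not taken from the paper.
    """
    n = np.arange(n_low, n_high + 1)
    return 1000.0 * 2.0 ** (n / 3.0)


def fast_third_octave_spectrogram(x, sr, frame_s=0.125):
    """Approximate a fast third-octave spectrogram (one frame every 125 ms).

    Each non-overlapping frame is windowed, transformed with an FFT, and the
    power of the bins falling inside each band is summed. Returns levels in dB
    (arbitrary reference) with shape (n_frames, n_bands), plus the centres.
    """
    frame_len = int(round(frame_s * sr))
    n_frames = len(x) // frame_len
    frames = np.asarray(x[: n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)

    window = np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)

    centers = third_octave_centers()
    lower = centers * 2.0 ** (-1.0 / 6.0)  # lower band edge
    upper = centers * 2.0 ** (1.0 / 6.0)   # upper band edge

    bands = np.zeros((n_frames, len(centers)))
    for b, (lo, hi) in enumerate(zip(lower, upper)):
        mask = (freqs >= lo) & (freqs < hi)
        bands[:, b] = power[:, mask].sum(axis=1)

    # Small offset avoids log(0) for bands that received no energy.
    return 10.0 * np.log10(bands + 1e-12), centers


# Usage: one second of a 1 kHz tone sampled at 32 kHz.
sr = 32000
t = np.arange(sr) / sr
spec, centers = fast_third_octave_spectrogram(np.sin(2 * np.pi * 1000.0 * t), sr)
print(spec.shape, centers[spec[0].argmax()])
```

A representation like this yields only 8 frames per second over a few dozen bands, compared with roughly 100 frames per second for a Mel spectrogram at a 10 ms hop, which illustrates why it is cheap to store and why a transcoding step is needed before feeding Mel-based pre-trained classifiers.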