We are looking to recruit a postdoc as part of the ANR project on multi-resolution neural networks (MuReNN). The goal is to work towards more efficient and interpretable models for deep learning in audio.
Author: Vincent Lostanlen
Residual Hybrid Filterbanks @ IEEE SSP
A hybrid filterbank is a convolutional neural network (convnet) whose learnable filters operate over the subbands of a non-learnable filterbank designed from domain knowledge. While hybrid filterbanks have found successful applications in speech enhancement, our paper shows that they remain susceptible to large deviations of the energy response due to the randomness of convnet weights at initialization. To address this issue, we propose a variant of hybrid filterbanks inspired by residual neural networks (ResNets). The key idea is to introduce a shortcut connection at the output of each non-learnable filter, bypassing the convnet. We prove that the shortcut connection in a residual hybrid filterbank lowers the relative standard deviation of the energy response, while the pairwise cosine distances between non-learnable filters contribute to preventing duplicate features.
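To make the key idea concrete, here is a minimal PyTorch sketch of a residual hybrid filterbank: a frozen 1-D filterbank followed by a small learnable depthwise convnet, with the shortcut connection added at the output of each non-learnable filter. The module and parameter names (`ResidualHybridFilterbank`, `kernel_size`, the depthwise convnet) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ResidualHybridFilterbank(nn.Module):
    def __init__(self, fixed_filters: torch.Tensor, kernel_size: int = 9):
        """fixed_filters: (n_subbands, filter_length), designed from domain knowledge.
        Odd filter lengths are assumed so that 'same' padding preserves the time axis."""
        super().__init__()
        n_subbands, filter_length = fixed_filters.shape
        # Non-learnable filterbank: one fixed 1-D filter per subband, frozen weights.
        self.filterbank = nn.Conv1d(1, n_subbands, filter_length,
                                    padding=filter_length // 2, bias=False)
        self.filterbank.weight.data.copy_(fixed_filters.unsqueeze(1))
        self.filterbank.weight.requires_grad_(False)
        # Learnable convnet acting on each subband separately (depthwise convolution).
        self.convnet = nn.Conv1d(n_subbands, n_subbands, kernel_size,
                                 padding=kernel_size // 2, groups=n_subbands, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, 1, time) waveform. Returns (batch, n_subbands, time) subband responses."""
        subbands = self.filterbank(x)
        # Shortcut connection: the fixed subband output bypasses the convnet, so the
        # energy response stays close to that of the non-learnable filterbank even
        # when the convnet weights are randomly initialized.
        return subbands + self.convnet(subbands)
```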
Podcast “L’éco-acoustique” on Le Labo des Savoirs
What if literally listening to nature could tell us about the state of our biodiversity? A podcast by Sophie Podevin, with Jérôme Sueur, Flore Samaran, and Vincent Lostanlen.
Robust Deconvolution with Parseval Filterbanks @ IEEE SampTA
This article introduces two contributions: Multiband Robust Deconvolution (Multi-RDCP), a regularization approach for deconvolution in the presence of noise; and Subband-Normalized Adaptive Kernel Evaluation (SNAKE), a first-order iterative algorithm designed to efficiently solve the resulting optimization problem. Multi-RDCP resembles Group LASSO in that it promotes sparsity across the subband spectrum of the solution. We prove that SNAKE enjoys fast convergence rates, and numerical simulations illustrate its efficiency for deconvolving noisy oscillatory signals.
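The flavor of the problem can be sketched with a generic proximal-gradient (ISTA-like) loop that promotes Group-LASSO-style sparsity across subbands. This is not the SNAKE algorithm itself: the `analysis`/`synthesis` operators, step size, and iteration count below are placeholders, and the subband shrinkage is only a simple prox-style step assuming a Parseval filterbank.

```python
import numpy as np

def group_soft_threshold(subbands, tau):
    """Shrink each subband signal toward zero by its L2 norm (Group LASSO prox)."""
    out = np.zeros_like(subbands)
    for b, c in enumerate(subbands):
        norm = np.linalg.norm(c)
        if norm > tau:
            out[b] = (1.0 - tau / norm) * c
    return out

def deconvolve(y, h, analysis, synthesis, lam=0.1, step=0.5, n_iter=200):
    """Sketch of min_x 0.5*||h * x - y||^2 + lam * sum_b ||(analysis(x))_b||_2
    by proximal gradient. `analysis`/`synthesis` are assumed to form a Parseval
    filterbank, so shrinking in the subband domain and resynthesizing is used
    here as a simple (approximate) prox step. Odd-length h is assumed."""
    x = np.zeros_like(y)
    for _ in range(n_iter):
        residual = np.convolve(x, h, mode="same") - y        # data-fidelity residual
        grad = np.convolve(residual, h[::-1], mode="same")   # gradient of the quadratic term
        z = x - step * grad                                  # gradient step
        x = synthesis(group_soft_threshold(analysis(z), step * lam))  # subband shrinkage
    return x
```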
Human Auditory Ecology @ MNHN
Can we hear “ecological processes” underlying natural habitats and ecosystems (i.e., the processes responsible for the dynamics and functions of ecological systems at multiple spatial and temporal scales)? If so, how do we hear such ecological processes?
Streaming as infrastructure and as a way of life @ RNRM
The inquiry into the ecological impact of music streaming reveals two angles of analysis: one grounded in material infrastructure, the other in changing ways of life. At a time when choice architectures are increasingly locked around a small number of digital giants, the stakes of this inquiry lie in the complementarity between quantitative and qualitative methods, as well as in an interdisciplinarity spanning computer science, the humanities and social sciences, and Earth system science. In this context, criticizing the unsustainability of streaming does not mean counting on a technological innovation that could suddenly “green” the industry as a whole. Rather, it means denouncing and contesting the utopia of music that is fully available, to everyone, everywhere, instantly. To be credible, alternative scenarios to the status quo must define, in one and the same technocritical gesture, which way of life they promote and which infrastructure they will maintain.
Robust Multicomponent Tracking of Ultrasonic Vocalizations @ IEEE ICASSP
Ultrasonic vocalizations (USV) convey information about individual identity and arousal status in mice. We propose to track USV as ridges in the time-frequency domain via a variant of time-frequency reassignment (TFR). The key idea is to perform TFR with empirical Wiener shrinkage and multitapering to improve robustness to noise. Furthermore, we perform TFR over both the short-term Fourier transform and the constant-Q transform so as to detect both the fundamental frequency and its harmonic partial (if any). Experimental results show that our approach effectively estimates multicomponent ridges with high precision and low frequency deviation.
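As a minimal illustration of treating a vocalization as a time-frequency ridge, the sketch below extracts the dominant ridge from a reassigned spectrogram with a simple per-frame argmax. It relies on librosa's reassigned spectrogram; the empirical Wiener shrinkage, multitapering, and constant-Q analysis of the paper are not reproduced here, and the thresholds are arbitrary placeholders.

```python
import numpy as np
import librosa

def extract_ridge(y, sr, n_fft=1024, hop_length=256, db_threshold=-40.0):
    """Return (times, frequencies) of the dominant spectral ridge in y."""
    freqs, times, mags = librosa.reassigned_spectrogram(
        y, sr=sr, n_fft=n_fft, hop_length=hop_length, fill_nan=True)
    mags_db = librosa.amplitude_to_db(mags, ref=np.max)
    ridge_times, ridge_freqs = [], []
    for t in range(mags.shape[1]):
        k = np.argmax(mags_db[:, t])          # strongest bin in this frame
        if mags_db[k, t] > db_threshold:      # keep only frames with enough energy
            ridge_times.append(times[k, t])   # reassigned (not frame-center) time
            ridge_freqs.append(freqs[k, t])   # reassigned (not bin-center) frequency
    return np.array(ridge_times), np.array(ridge_freqs)
```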
Invited talk: Constance Douwes
In this seminar, we propose a shift towards energy-aware model evaluation. Using a Pareto-optimal framework, we advocate for balancing performance with energy efficiency through an extended analysis of deep generative models for speech synthesis. Furthermore, we refine energy consumption measurements by studying elementary neural network architectures, highlighting complex relationships between energy consumption, the number of operations, and hardware dependencies. Finally, as organizers of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we analyze the impact of introducing an energy criterion on the challenge results and explore the evolution of system complexity and energy consumption over the years.
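The Pareto-optimal framework can be illustrated with a small sketch: among candidate systems described by a (performance, energy) pair, keep only those that are not dominated by another system, i.e., no other system is at least as accurate while consuming no more energy. The candidate names, scores, and energy figures below are made-up placeholders.

```python
def pareto_front(systems):
    """systems: list of (name, score, energy_kwh); higher score and lower energy are better.
    Returns the non-dominated (Pareto-optimal) systems."""
    front = []
    for name, score, energy in systems:
        dominated = any(
            other_score >= score and other_energy <= energy
            and (other_score > score or other_energy < energy)
            for _, other_score, other_energy in systems)
        if not dominated:
            front.append((name, score, energy))
    return front

candidates = [("large", 0.92, 12.0), ("medium", 0.90, 4.0),
              ("small", 0.85, 1.5), ("tiny", 0.70, 1.6)]  # "tiny" is dominated by "small"
print(pareto_front(candidates))
```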
S-KEY: Self-Supervised Learning of Major and Minor Keys from Audio @ IEEE ICASSP
STONE, the current method in self-supervised learning for tonality estimation in music signals, cannot distinguish relative keys, such as C major versus A minor. In this article, we extend the neural network architecture and learning objective of STONE to perform self-supervised learning of major and minor keys (S-KEY). Our main contribution is an auxiliary pretext task to STONE, formulated using transposition-invariant chroma features as a source of pseudo-labels. S-KEY matches the supervised state of the art in tonality estimation on the FMAKv2 and GTZAN datasets while requiring no human annotation and having the same parameter budget as STONE. We build upon this result and expand the training set of S-KEY to a million songs, thus showing the potential of large-scale self-supervised learning in music information retrieval.
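To illustrate what a transposition-invariant chroma feature can look like, here is a hedged sketch of one standard construction: the magnitude of the discrete Fourier transform along the pitch-class axis of a chromagram, which is unchanged under circular shifts and hence under transpositions. This only illustrates the notion of transposition invariance; it is not the exact pseudo-label or pretext task used in S-KEY.

```python
import numpy as np
import librosa

def transposition_invariant_chroma(y, sr):
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)   # (12, n_frames) chromagram
    profile = chroma.mean(axis=1)                      # time-averaged pitch-class profile
    # DFT magnitude over the 12 pitch classes: invariant to circular shifts,
    # i.e., to transposing the excerpt to another tonic.
    return np.abs(np.fft.rfft(profile))                # (7,) transposition-invariant descriptor
```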
Premiere of “in earth we walk” @ Halle 6
A live performance for voice, live electronics, and double bass, created by Han Han.
In earth we walk is a fleeting moment where voices become agents for constructing nature-inspired landscapes: voices utter semantically charged words conveying vivid scenarios; voices supply raw sonic material that is treated as pure sound. The libretto is a six-stanza poem that unfolds a series of pictorial and psychological scenes, exploring themes of longing, awe, and the reckoning with impermanence. Together, vocal emulations of clouds, torrents, winds, tides, and sands weave into a sonic experience that evokes one’s multifaceted relationship with the many wonders and situations earth puts one in.