GdR IASIS workshop on audio synthesis at Ircam

As part of the "Audio, Vision, Perception" axis of the GdR IASIS (the CNRS special interest group on signal and image processing), we are organizing a one-day workshop dedicated to audio synthesis. The workshop will take place on Thursday, November 7th, 2024, at Ircam (STMS laboratory) in Paris.

On this occasion, we have invited four keynote speakers from French research institutions, both public and private:
– Fanny Roche, ARTURIA France
– Neil Zeghidour, Kyutai
– Judy Najnudel, UVI
– Philippe Esling, Ircam

If you wish to attend, note that registration is free but mandatory, as seating is limited.

Registration link for GdR IASIS members: http://intranet.gdr-isis.fr/
("Identification" panel. You may use this portal to request reimbursement of your travel expenses to Paris. Ask me directly if you have any questions regarding travel arrangements.)

Registration link for non-members of GdR IASIS:
http://intranet.gdr-isis.fr/index.php?page=inscription-individuelle-a-une-reunion-isis&idreunion=534

After these four presentations, the day will end with a poster session open to all researchers working on audio signal processing in general. This includes synthesis, but also other applications such as detection, classification, regression, transcription, temporal alignment, structure analysis, localization, spatialization, source separation, denoising, compression, visualization, and interaction design. We particularly encourage young researchers at the master's or doctoral level to present their work, even if it is still unpublished, under review, or in preparation.

If you wish to participate in the poster session, please write to me before October 10 at vincent.lostanlen (@) cnrs.fr

In preparation for the workshop, you will be asked to provide a title and an "audiocarnet": an audio file of about one minute presenting the sounds your poster is about. Please upload your audiocarnet to the HAL platform before November 7. For examples of previously published audiocarnets, see: https://hal.science/AUDIOCARNET

Programme

09:30 Welcome and coffee
10:00 Introduction

10:15 Fanny Roche: Music sound synthesis using machine learning

One of the major challenges for the synthesizer market, and for sound synthesis in general, is to propose new forms of synthesis that allow the creation of brand-new sonorities while offering musicians more intuitive and perceptually meaningful controls, helping them find the right sound more easily. Today's synthesizers are very powerful tools that offer musicians a wide range of possibilities for creating sound textures, but their parameter controls remain unintuitive and generally require expert knowledge. This presentation will focus on machine learning methods for sound synthesis that enable the generation of new, high-quality sounds while providing perceptually relevant control parameters.

In the first part of this talk, we will focus on the perceptual characterization of synthetic musical timbre by highlighting a set of verbal descriptors frequently and consensually used by musicians. Second, we will explore the use of machine learning algorithms for sound synthesis, in particular different models of the autoencoder type, for which we have carried out an in-depth comparative study on two different datasets. The talk will then turn to the perceptual regularization of the proposed model, based on the perceptual characterization of synthetic timbre presented in the first part, to enable (at least partially) perceptually relevant control of sound synthesis. Finally, we will briefly present some of the latest experiments we conducted with more recent neural synthesis models.
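
For readers less familiar with this family of models, here is a deliberately minimal sketch, assuming PyTorch, of an autoencoder on spectrogram frames with a perceptual regularization term that ties part of the latent code to descriptor ratings. All names, sizes, and weights are hypothetical and are not taken from the work presented in the talk.

```python
# Illustrative sketch only (hypothetical names and sizes), assuming PyTorch.
import torch
import torch.nn as nn

class PerceptuallyRegularizedAE(nn.Module):
    """Toy autoencoder whose first latent dimensions are tied to descriptor ratings."""
    def __init__(self, n_bins=128, n_latent=16, n_descriptors=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, 64), nn.ReLU(),
                                     nn.Linear(64, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_bins))
        self.n_descriptors = n_descriptors

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = PerceptuallyRegularizedAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 32 spectrogram frames plus ratings in [0, 1] for 4 hypothetical
# verbal descriptors (e.g. how "metallic" or "warm" each sound was judged).
frames = torch.rand(32, 128)
ratings = torch.rand(32, 4)

reconstruction, latent = model(frames)
rec_loss = nn.functional.mse_loss(reconstruction, frames)
# Perceptual regularization: ask the first latent coordinates to track the
# descriptor ratings, so that moving them at synthesis time has a predictable
# perceptual effect.
perc_loss = nn.functional.mse_loss(latent[:, :model.n_descriptors], ratings)
loss = rec_loss + 0.1 * perc_loss
loss.backward()
optimizer.step()
```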

11:15 Neil Zeghidour: Audio Language Models

Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterpart. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, text-to-music generation and even real-time spoken dialogue. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
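
As a rough illustration of this framing, and not a description of any specific system, the sketch below (assuming PyTorch) replaces the neural codec with random stand-in tokens and trains a small decoder-only Transformer on next-token prediction; positional encodings and all practical details are omitted.

```python
# Illustrative sketch only, assuming PyTorch; the codec is replaced by random
# stand-in tokens and positional encodings are omitted for brevity.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1024, 256, 150, 8

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
to_logits = nn.Linear(d_model, vocab_size)

# Stand-in for discrete codes produced by a neural audio codec
# (at roughly 75 tokens per second, seq_len=150 would be about 2 seconds).
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Causal mask: each position only attends to the past, as in text language models.
causal = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)
hidden = backbone(embed(tokens[:, :-1]), mask=causal)
logits = to_logits(hidden)

# Next-token prediction: targets are the same sequence shifted by one step.
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
```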

12:15 Lunch

14:00 Judy Najnudel: Grey-box modelling informed by physics: Application to commercial digital audio effects

In an ever-expanding and competitive market, commercial digital audio effects are subject to significant constraints. Their computational load must be low enough for real-time operation. They must be easily controllable through a small number of parameters that relate to clear features. They must also be robust and safe across large combinations of inputs and controls, so as to leave room for user creativity. Effects based on existing systems (acoustic or electronic devices) must, in addition, sound realistic and capture the expected idiosyncrasies.

For this last category of effects, a full physical model is not always available, or even desirable, as it may be too complex to run and to use efficiently. In this talk, we explore grey-box approaches that combine strong physics-based priors with identification from measurement data. The priors impose a model structure that preserves fundamental properties such as passivity and dissipativity, while the measurements bridge possible gaps in the model. This produces reduced, macroscopic, power-balanced models of complex physical systems that can be fitted to data and result in numerically stable simulations. The approach is illustrated on real electronic components and circuits, and the presentation concludes with audio demonstrations of the corresponding effects.
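
As a deliberately minimal illustration of the idea that the model structure encodes the physics while the data fixes the parameters, and far simpler than the power-balanced models discussed in the talk, the sketch below (assuming PyTorch, with made-up measurements) fits a one-port nonlinear resistor v = r1·i + r3·i³ whose coefficients are kept positive through a softplus, so that the fitted element dissipates power (v·i ≥ 0) for any data.

```python
# Minimal, hypothetical grey-box fit, assuming PyTorch; the data are made up.
import torch
import torch.nn.functional as F

# Hypothetical measurement of a nonlinear resistor: a current sweep and the
# corresponding noisy voltage readings.
i_meas = torch.linspace(-1.0, 1.0, 200)
v_meas = 1.5 * i_meas + 4.0 * i_meas**3 + 0.01 * torch.randn(200)

theta = torch.zeros(2, requires_grad=True)       # unconstrained parameters
optimizer = torch.optim.Adam([theta], lr=0.05)

for step in range(3000):
    r1, r3 = F.softplus(theta)                   # positivity => v * i >= 0 (passive)
    v_model = r1 * i_meas + r3 * i_meas**3
    loss = torch.mean((v_model - v_meas) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("fitted coefficients:", F.softplus(theta).detach())  # should approach (1.5, 4.0)
```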

15:00 Philippe Esling: AI in 64Kbps: Lightweight neural audio synthesis for embedded instruments

16:00 Poster spotlights
16:30 Poster session and coffee
17:30 Closing

Organizing committee:
– Mathieu Lagrange (LS2N, CNRS)
– Thomas Hélie (STMS, Ircam, CNRS)
– Vincent Lostanlen (LS2N, CNRS)

The event will be streamed on Zoom: https://us06web.zoom.us/j/84725252571?pwd=0LxXknkxb3NoKKpaUZaF7cNxfh9bUx.1

Please share this announcement with anyone interested, especially those of your students who might be willing to present a poster.

