Authors:
(1) Tamay Aykut, Sureel, Palo Alto, California, USA;
(2) Markus Hofbauer, Technical University of Munich, Germany;
(3) Christopher Kuhn, Technical University of Munich, Germany;
(4) Eckehard Steinbach, Technical University of Munich, Germany;
(5) Bernd Girod, Stanford University, Stanford, California, USA.
II. RELATED WORK
In this section, we summarize techniques for synthesizing acoustic audience feedback. Then, we give an overview of how audience feedback is currently integrated into virtual events.
A. Sound Synthesis
Traditional sound synthesis techniques can be separated into five categories: sample-based, physical modeling, signal modeling, abstract synthesis [9], and learning-based synthesis. More recently, deep-learning-based synthesis approaches have redefined the possibilities in sound synthesis [8].
In sample-based synthesis, audio recordings are cut and spliced together to produce new sounds. The most common example of this is granular synthesis [10]. A sound grain is a small element or component of a sound, typically between 10 and 200 ms in length. Concatenative synthesis is a subset of granular synthesis [11]; the goal is to select and recombine grains in a way that avoids perceivable discontinuities.
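As a rough illustration of the idea, the following Python sketch overlap-adds windowed grains drawn at random from a source recording. All function and parameter names are illustrative and not taken from any particular granular synthesis implementation; a concatenative system would select grains by similarity to a target rather than at random.

```python
import numpy as np

def granular_synthesis(source, sr=44100, grain_ms=80, hop_ms=40,
                       duration_s=2.0, rng=None):
    """Resynthesize a new sound by overlap-adding randomly picked grains.

    `source` is a 1-D numpy array of audio samples and is assumed to be
    longer than one grain.
    """
    rng = rng or np.random.default_rng()
    grain_len = int(sr * grain_ms / 1000)      # grains of roughly 10-200 ms
    hop = int(sr * hop_ms / 1000)              # 50 % overlap with these defaults
    window = np.hanning(grain_len)             # crossfade to hide the seams
    out = np.zeros(int(sr * duration_s) + grain_len)

    for start in range(0, len(out) - grain_len, hop):
        # pick a random grain from the recording (a concatenative system
        # would instead pick the grain that best matches a target)
        src_pos = rng.integers(0, len(source) - grain_len)
        grain = source[src_pos:src_pos + grain_len] * window
        out[start:start + grain_len] += grain   # overlap-add

    return out / (np.max(np.abs(out)) + 1e-9)   # normalize
```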
Instead of using prerecorded audio data, physical modeling synthesis aims to model the underlying physical process of a sound system. Physical models require solving partial differential equations for each sample [12]. The resulting models are computationally intensive, requiring significant GPU resources to run in real time [13], [14].
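The sketch below illustrates why such models are costly with a deliberately simplified example: an explicit finite-difference solution of the 1-D wave equation for an ideal string, which already updates every spatial point once per output sample. This is only a toy model under strong assumptions (no damping, stiffness, or excitation model); the instrument and voice models cited above are far more demanding.

```python
import numpy as np

def plucked_string(sr=44100, duration_s=0.5, n_points=100, courant=0.98):
    """Toy physical model: explicit finite-difference scheme for the 1-D
    wave equation of an ideal string with fixed ends, read out at one
    point per output sample."""
    n_samples = int(sr * duration_s)
    u_prev = np.zeros(n_points)                # displacement at step n-1
    u = np.zeros(n_points)                     # displacement at step n

    # initial condition: a triangular "pluck" near one end, zero velocity
    pluck = n_points // 4
    u[:pluck] = np.linspace(0, 1, pluck)
    u[pluck:] = np.linspace(1, 0, n_points - pluck)
    u_prev[:] = u

    c2 = courant ** 2                          # Courant number <= 1 for stability
    out = np.zeros(n_samples)
    for n in range(n_samples):
        u_next = np.zeros(n_points)            # endpoints stay clamped at zero
        u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                        + c2 * (u[2:] - 2 * u[1:-1] + u[:-2]))
        u_prev, u = u, u_next                  # advance one time step
        out[n] = u[n_points // 2]              # "pickup" at the string midpoint
    return out
```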
In signal modeling synthesis, sounds are created based on an analysis of real-world sounds. The analyzed waveform is then used for synthesis. The most common method of signal modeling synthesis is Spectral Modeling Synthesis (SMS) [15]. Spectral modeling can be approached by analyzing the original audio file, selecting a series of sine waves to be used for synthesis, and then combining them with a residual noise shape to approximate the original sound [16].
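The single-frame sketch below illustrates this sines-plus-noise decomposition under simplifying assumptions: the strongest spectral peaks are resynthesized as sinusoids and the residual is replaced by noise shaped with the residual magnitude spectrum. Function and parameter names are illustrative; a full SMS system would additionally track partials across frames.

```python
import numpy as np

def sms_frame(frame, sr=44100, n_partials=10):
    """Simplified sines-plus-noise analysis/synthesis of a single frame."""
    n = len(frame)
    window = np.hanning(n)
    spectrum = np.fft.rfft(frame * window)
    freqs = np.fft.rfftfreq(n, 1.0 / sr)

    # sinusoidal part: keep the n_partials largest-magnitude bins
    peak_bins = np.argsort(np.abs(spectrum))[-n_partials:]
    t = np.arange(n) / sr
    sines = sum(
        (2.0 / n) * np.abs(spectrum[b])
        * np.cos(2 * np.pi * freqs[b] * t + np.angle(spectrum[b]))
        for b in peak_bins
    )

    # residual part: whatever the sinusoids do not explain, as shaped noise
    residual_mag = np.abs(spectrum - np.fft.rfft(sines * window))
    noise_phase = np.exp(1j * np.random.uniform(0, 2 * np.pi, len(spectrum)))
    noise = np.fft.irfft(residual_mag * noise_phase, n)

    return sines + noise
```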
In abstract synthesis, sounds are obtained using abstract methods and algorithms, typically to create entirely new sounds. An example is Frequency Modulation (FM) synthesis [17], in which one sine wave (the modulator) varies the instantaneous frequency of another (the carrier), creating a more complex, richer sound that might not exist in the natural world. Early video game sounds were often based on FM synthesis. These sounds can be created and controlled in real time due to the low computational complexity of the required process.
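A basic two-operator FM voice takes only a few lines, which is what makes real-time control cheap; the parameter values below are arbitrary examples.

```python
import numpy as np

def fm_tone(sr=44100, duration_s=1.0, f_carrier=440.0, f_mod=110.0,
            mod_index=3.0):
    """Two-operator FM synthesis: the modulator sine varies the carrier's
    instantaneous phase, producing a rich sideband spectrum at very low
    computational cost."""
    t = np.arange(int(sr * duration_s)) / sr
    modulator = np.sin(2 * np.pi * f_mod * t)
    return np.sin(2 * np.pi * f_carrier * t + mod_index * modulator)
```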
Finally, in deep-learning-based synthesis, large amounts of recordings are used to obtain a sound synthesis model in a data-driven way [18], [19]. Autoencoders have shown great promise for this task [20], both for music [21] and speech synthesis [22]. Architectures such as WaveNet [3] make it possible to learn a generative synthesis model directly from real-world recordings, generating significantly more natural sounds than parametric systems. While such models are complex and computationally expensive, recent architectures have increased the inference speed [23], [24], [25]. In 2023, Meta released AudioCraft, which includes text-to-sound systems such as AudioGen [8] and MusicGen [26]. These models turn natural-language text into arbitrary sounds or music. For the proposed framework, this flexible language-based approach makes it easy to turn abstract audience feedback data into sound by first converting the data into a text prompt.
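The sketch below illustrates this idea by mapping hypothetical feedback counts to a text prompt and passing it to AudioGen through the AudioCraft Python API. The mapping function, prompt wording, and feedback variables are illustrative assumptions, not part of the cited systems or of the proposed framework.

```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

def feedback_to_prompt(n_clapping, n_laughing, n_total):
    """Illustrative mapping from abstract feedback counts to a text prompt."""
    parts = []
    if n_clapping > 0.5 * n_total:
        parts.append("a large crowd applauding enthusiastically")
    elif n_clapping > 0:
        parts.append("a few people clapping politely")
    if n_laughing > 0:
        parts.append("scattered laughter in an audience")
    return ", ".join(parts) or "a quiet audience murmuring"

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=4)          # seconds of generated audio
prompt = feedback_to_prompt(n_clapping=80, n_laughing=10, n_total=100)
wav = model.generate([prompt])                   # batch with one description
audio_write("audience_feedback", wav[0].cpu(), model.sample_rate,
            strategy="loudness")
```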
B. Acoustic Feedback Synthesis
Next, we address specific implementations of sound synthesis for creating the most common acoustic feedback sounds of clapping, whistling, booing, and laughter.
Since the physical mechanism of clapping is straightforward, synthesizing clapping sounds can be approached using physical modeling [27]. Whistling can be approached using abstract FM synthesis [28]. Booing can be generated using either abstract or sample-based synthesis [29].
The most complex and challenging sound to synthesize in a virtual audience is laughter. Since an individual laugh is already a complex sound that additionally varies significantly from person to person, the most promising approaches for laughter synthesis are based on deep learning. Mori et al. [30] used WaveNet [3] to generate synthetic laughter that outperformed laughter synthesized using Hidden Markov Models (HMMs). They used the Online Gaming Voice Chat Corpus [31] to condition WaveNet, allowing them to control the amplitude envelope of the synthesized laughter. Despite the improved naturalness, the resulting laughter was still largely perceived as noisy and echoic.
Another approach used transfer learning from models trained on speech data to circumvent the lack of laughter training data [6]. First, a text-to-speech model is trained; it is then fine-tuned with smiled-speech and laughter data. MelGAN [32] is used to obtain the output waveform.
Finally, generative artificial intelligence methods such as Generative Spoken Language Modeling [33] or AudioGen [8] can be used to generate laughter from a text prompt. Fine-tuning these models specifically for laughter is a promising direction for natural-sounding laughter synthesis.
This paper is available on arXiv under a CC 4.0 license.