
WaveNet

WaveNet is a deep generative model developed by Aäron van den Oord, Sander Dieleman, and colleagues at DeepMind for synthesizing raw audio waveforms, introduced in a 2016 research paper as a probabilistic autoregressive model capable of producing highly natural-sounding speech and other audio signals. It operates by predicting each audio sample based on all preceding ones, using a stack of dilated causal convolutional layers to capture long-range dependencies in the waveform at a rate of 16,000 samples per second, which allows it to model complex audio patterns without relying on traditional higher-level representations like spectrograms. This approach enables WaveNet to generate speech that mimics specific human voices when conditioned on text or speaker identity inputs, marking a significant advancement in text-to-speech (TTS) synthesis. Unlike conventional TTS systems that use parametric or concatenative methods, WaveNet directly generates raw audio, resulting in superior naturalness and expressiveness; in blind mean opinion score (MOS) tests with over 500 human ratings, it reduced the quality gap to human speech by more than 50% for US English and Mandarin Chinese, outperforming state-of-the-art systems at the time. Its versatility extends beyond speech to applications like music generation, where it can produce novel, often realistic-sounding piano performances, and to general audio modeling for diverse signals. WaveNet has been integrated into commercial products, notably powering Google Cloud Text-to-Speech since 2018, where it delivers lifelike, customizable voices supporting multiple languages and styles to enhance user interfaces and accessibility. Additionally, adaptations of the technology have facilitated voice restoration for individuals with speech impairments by cloning original voices from limited samples. Since its debut, WaveNet has served as a foundational influence on subsequent neural audio generation models, inspiring efficiency improvements such as Parallel WaveNet and WaveRNN.

History and Development

Origins in Audio Generation

Prior to 2016, text-to-speech (TTS) synthesis primarily relied on two dominant approaches: concatenative and statistical parametric methods. Concatenative synthesis assembled pre-recorded speech units from a database to form utterances, offering relatively high naturalness by preserving original acoustic characteristics, but it suffered from limitations such as unnatural prosody due to challenges in modifying pitch, duration, and intonation across units, as well as audible discontinuities at concatenation boundaries. Statistical parametric synthesis, on the other hand, used statistical models to estimate acoustic parameters such as spectral envelopes and fundamental frequency, often employing vocoders to reconstruct waveforms; however, these systems frequently produced robotic or buzzy sounds because of over-smoothing in parameter estimation and difficulties in capturing the full complexity of natural speech variations. Both methods struggled with flexibility, such as adapting to new speakers or contexts without extensive retraining or recording, limiting their ability to generate highly expressive and human-like audio. By the mid-2010s, the proliferation of virtual assistants like Apple's Siri (launched in 2011) and the Google Assistant (introduced in 2016) heightened the demand for more realistic TTS to enhance user interaction and accessibility in consumer devices. These assistants required speech output that conveyed natural intonation and emotion to improve engagement, but existing synthesizers often produced mechanical-sounding responses that hindered perceived intelligence and usability. WaveNet emerged in 2016 from research at DeepMind, an AI laboratory acquired by Google in 2014, as part of broader efforts in generative modeling. It built on autoregressive techniques from earlier works like PixelRNN, which modeled sequential data generation in images, adapting these principles to raw audio waveforms to address the shortcomings of traditional TTS. The foundational work was detailed in the paper "WaveNet: A Generative Model for Raw Audio" by Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, published on arXiv in September 2016. This publication introduced WaveNet as a probabilistic model capable of producing speech rated as more natural than prior parametric and concatenative systems in listener evaluations for English and Mandarin.

Key Publications and Milestones

WaveNet's development began with its seminal publication in September 2016, when researchers at DeepMind introduced the model in the paper "WaveNet: A Generative Model for Raw Audio." This work demonstrated WaveNet's capability to generate raw audio waveforms using a probabilistic autoregressive approach, achieving superior naturalness in text-to-speech (TTS) synthesis. In subjective evaluations, WaveNet attained a mean opinion score (MOS) of 4.21 for US English speech, outperforming parametric TTS systems (MOS 3.67) and concatenative methods (MOS 3.86), marking a significant advancement over prior models. In October 2017, Google announced the integration of an optimized WaveNet model into the Google Assistant for real-time TTS, initially supporting US English and Japanese across platforms. This deployment leveraged enhancements that accelerated inference by 1,000 times compared to the original model, enabling low-latency voice responses while preserving high-fidelity audio quality. The following month, in November 2017, DeepMind published "Parallel WaveNet: Fast High-Fidelity Speech Synthesis," introducing Probability Density Distillation, a technique to train a parallel feed-forward network from the autoregressive WaveNet teacher model. This method achieved a roughly 1,000-fold speedup (over 500,000 timesteps per second on an NVIDIA P100 GPU) without significant quality degradation, maintaining an MOS of 4.41 comparable to the original WaveNet. By June 2018, further refinements appeared in "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," developed by Google researchers in collaboration with DeepMind, which adapted WaveNet vocoders for multispeaker TTS using speaker embeddings transferred from speaker verification tasks. This approach dramatically reduced the data requirements for voice cloning, enabling natural-sounding synthesis for new speakers with as little as 5 seconds of audio, compared to the tens of hours typically needed previously. Complementing this, the ICLR 2019 paper "Sample Efficient Adaptive Text-to-Speech" extended WaveNet-based adaptation, allowing high-quality voice cloning from small amounts of target speech data through techniques that preserved speaker identity and prosody. Post-2019, WaveNet's core technology integrated deeply into Google's ecosystem, powering premium voices in Google Cloud Text-to-Speech since 2018, with ongoing expansions to over 380 voices across 75+ languages as of 2025. Its influence extended to hybrid systems like Tacotron 2 (2018), which conditioned WaveNet vocoders on mel-spectrogram predictions for end-to-end TTS, achieving MOS values above 4.0 and setting benchmarks for naturalness. By 2023-2025, WaveNet's waveform modeling principles informed multimodal audio capabilities in Google's Gemini models, enhancing generative audio outputs in text-image-audio pipelines. In June 2025, Google introduced native audio outputs in Gemini 2.5 models, leveraging neural synthesis techniques influenced by WaveNet for interactive applications. No major standalone WaveNet releases occurred after 2019, but its derivatives continued in DeepMind tools, such as WaveNetEQ (2020) for packet loss concealment in low-bandwidth video calls like Google Duo, realistically filling in missing audio segments to improve call quality. DeepMind's 2014 acquisition by Google facilitated these integrations, with WaveNet enabling more natural Assistant voices on phones and smart devices. As of 2024, DeepMind's SynthID embeds imperceptible watermarks in AI-generated audio (e.g., via Lyria) to detect synthetic content and track provenance.

Technical Architecture

Core Components and Waveform Modeling

WaveNet employs an autoregressive structure to generate raw audio waveforms sequentially, modeling the joint probability of the audio sequence x = (x_1, \dots, x_T) as p(x) = \prod_{t=1}^T p(x_t \mid x_{<t}), where each sample x_t is predicted conditioned only on the preceding samples x_{1:t-1}. This approach enables the model to capture temporal dependencies in audio signals directly at the sample level, without relying on intermediate representations like spectrograms. By predicting one sample at a time, WaveNet produces high-fidelity audio that exhibits natural prosody and timbre, distinguishing it from traditional synthesizers. The core of WaveNet's architecture consists of stacked dilated causal convolutional layers, which ensure causality by restricting each prediction to past inputs and use dilation to expand the receptive field exponentially with depth. Dilation rates typically double across layers, such as 1, 2, 4, 8, and up to 512 in deeper stacks; for instance, a configuration with three stacks of 10 layers each, with dilations up to 512, yields a receptive field covering approximately 3,000 samples (about 192 ms at 16 kHz), allowing the model to access long-range dependencies spanning thousands of samples. This design avoids the limitations of recurrent networks by leveraging parallelizable convolutions while maintaining the temporal context necessary for coherent waveform generation. Raw audio waveforms serve as the input to WaveNet, sampled at 16 kHz and preprocessed through μ-law companding to quantize the signal into 256 discrete bins, applying the transformation f(x_t) = \sgn(x_t) \frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)} with \mu = 255 to nonlinearly scale the amplitude and facilitate discrete modeling. At the output, the final layer applies a softmax over the 256 classes to produce a categorical distribution for the next sample, enabling direct sampling from the predicted distribution. In certain variants, a mixture of logistic distributions approximates continuous outputs for improved quality.
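
The μ-law companding step and the receptive-field arithmetic described above can be made concrete with a short sketch. The following is an illustrative Python example rather than DeepMind's implementation; the function names and the receptive-field helper are ours, and it assumes input waveforms normalized to the range [-1, 1].

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a waveform in [-1, 1] with the mu-law transform and
    quantize it into mu + 1 = 256 discrete class ids for the softmax."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(ids, mu=255):
    """Invert the quantization and the mu-law compression."""
    compressed = 2.0 * ids.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * ((1.0 + mu) ** np.abs(compressed) - 1.0) / mu

def receptive_field(n_stacks=3, layers_per_stack=10, kernel_size=2):
    """Receptive field of stacked dilated causal convolutions whose
    dilations double from 1 up to 2**(layers_per_stack - 1) in each stack."""
    per_stack = (kernel_size - 1) * sum(2 ** i for i in range(layers_per_stack))
    return n_stacks * per_stack + 1

print(receptive_field())  # 3070 samples, roughly 192 ms of audio at 16 kHz
```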

Training Process and Probability Estimation

WaveNet is trained to maximize the log-likelihood of the training data by modeling the joint probability of the waveform as an autoregressive product: p(\mathbf{x}) = \prod_{t=1}^T p(x_t \mid x_{<t}), where each conditional distribution is parameterized by the network. During training, teacher forcing is employed, feeding ground-truth previous samples to the model so that all timesteps can be computed in parallel, which enables efficient optimization with stochastic gradient descent. The loss function is the categorical cross-entropy applied to the softmax outputs over the quantized audio samples, with hyperparameters tuned on a validation set to minimize validation loss. Training requires large datasets of raw audio waveforms, sampled at rates such as 16 kHz; the original experiments drew on multi-speaker corpora and Google's internal single-speaker TTS datasets, while later production systems used substantially larger proprietary speech collections. To ensure numerical stability and reduce the output vocabulary, the audio is preprocessed using μ-law companding, which quantizes the continuous amplitude values into 256 discrete levels, facilitating a softmax probability distribution over possible next samples. At inference time, audio generation proceeds autoregressively: the model samples the next audio value from its predicted distribution, conditioned on all prior samples, and feeds it back as input for the subsequent prediction, resulting in a strictly sequential process without parallelization. This leads to slow inference in the original model, which required on the order of a minute of computation to generate a single second of speech on a GPU. Probability estimation within WaveNet relies on a series of dilated convolutional blocks that process the input history to produce features fed into a final softmax layer. Each block employs a gated activation function to control information flow, computed as \mathbf{z} = \tanh(\mathbf{W}_{f,k} * \mathbf{x}) \odot \sigma(\mathbf{W}_{g,k} * \mathbf{x}), where * denotes convolution, \odot is element-wise multiplication, and the tanh and sigmoid gates modulate the features; this gating mechanism outperforms simpler activations like ReLU. To facilitate training of deep architectures comprising up to 30 layers across multiple stacks, residual connections add each block's input to its output, while parameterized skip connections aggregate features from all blocks before the final layers, aiding gradient propagation and convergence. Model performance is evaluated using log-likelihood scores to measure predictive fit on held-out data, alongside perceptual metrics such as mean opinion score (MOS) for naturalness, where WaveNet achieved scores exceeding 4.0 in text-to-speech tasks, and ABX preference tests, in which WaveNet was preferred over baseline TTS systems in about 70% of cases.
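
The gated activation unit with residual and skip connections can be sketched in PyTorch as follows. This is a minimal illustration under assumed channel sizes, not the original DeepMind code; a full model stacks many such blocks with doubling dilations, sums the skip outputs, and applies a final softmax over the 256 quantized classes, trained with categorical cross-entropy against the ground-truth next sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    """One WaveNet-style block: dilated causal convolution, gated tanh/sigmoid
    activation, then residual and skip outputs (illustrative sketch)."""

    def __init__(self, channels, skip_channels, dilation, kernel_size=2):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilation = dilation
        # One convolution produces both the filter and gate pre-activations.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)
        self.res_proj = nn.Conv1d(channels, channels, 1)
        self.skip_proj = nn.Conv1d(channels, skip_channels, 1)

    def forward(self, x):                      # x: (batch, channels, time)
        pad = (self.kernel_size - 1) * self.dilation
        h = self.conv(F.pad(x, (pad, 0)))      # left-pad so the convolution is causal
        filt, gate = h.chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)       # gated activation unit
        return x + self.res_proj(z), self.skip_proj(z)   # residual out, skip out
```

Stacking thirty such blocks (three stacks with dilations 1 through 512) reproduces the receptive field discussed in the architecture section.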

Advancements and Extensions

Efficiency Optimizations

The original autoregressive structure of WaveNet, which generates audio samples sequentially, posed significant computational challenges for real-time applications, often requiring minutes of computation to produce seconds of speech. To overcome these bottlenecks, Probability Density Distillation was developed in 2017, training a compact student network to replicate the output distribution of a pretrained WaveNet teacher model. This approach minimizes the Kullback-Leibler divergence between the student's and teacher's output distributions, D_{\text{KL}}(P_S \parallel P_T) = H(P_S, P_T) - H(P_S), where H(P_S, P_T) and H(P_S) denote the cross-entropy and the entropy of the student distribution, respectively, while incorporating auxiliary losses for perceptual quality. The resulting model achieves over 1,000 times faster generation than the original WaveNet, producing up to 500,000 samples per second on a GPU without substantial quality loss, as evidenced by comparable mean opinion scores (MOS) in listening tests. Parallel WaveNet, introduced in the same work and presented at ICML 2018, enables non-sequential inference by employing an inverse autoregressive flow as the student, transforming random noise inputs into an entire waveform in parallel while the distillation objective keeps its outputs consistent with the teacher's conditional distributions. This inversion of the autoregressive process allows full GPU utilization, yielding synthesis rates exceeding 20 times real time on GPU hardware. Hardware acceleration has amplified these gains, with integration into Google's cloud infrastructure enabling end-to-end speech generation in as little as 50 milliseconds per second of audio, facilitating deployment in production systems such as the Google Assistant and Google Cloud Text-to-Speech. These optimizations maintain output quality, with WaveNet voices scoring an average of 4.1 on a 1-to-5 scale (over 20% higher than non-neural alternatives), while trading minimal quality loss for real-time feasibility.
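
Under the assumption that both student and teacher expose explicit per-timestep categorical logits, the distillation objective above can be written compactly as in the sketch below. In the actual Parallel WaveNet the student is an inverse autoregressive flow, the entropy terms are estimated by sampling from the student, and auxiliary power and perceptual losses are added, so this should be read as a simplified illustration only.

```python
import torch
import torch.nn.functional as F

def density_distillation_loss(student_logits, teacher_logits):
    """KL(P_S || P_T) = H(P_S, P_T) - H(P_S), averaged over timesteps.

    Both tensors are assumed to have shape (batch, time, 256), i.e. logits
    over the quantized audio classes (a hypothetical interface).
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    cross_entropy = -(p_s * log_p_t).sum(dim=-1)  # H(P_S, P_T)
    entropy = -(p_s * log_p_s).sum(dim=-1)        # H(P_S)
    return (cross_entropy - entropy).mean()
```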

Voice Adaptation Techniques

Voice cloning techniques in WaveNet enable the synthesis of speech in a target speaker's voice by conditioning the model on speaker-specific embeddings, such as x-vectors derived from speaker verification systems or global style tokens. These embeddings capture unique vocal characteristics and are integrated into the WaveNet architecture as conditioning inputs, allowing the model to adapt to new speakers with limited data. Through fine-tuning, where a pre-trained multi-speaker model is updated using only 10-50 minutes of target audio, voice cloning achieves high naturalness and speaker similarity, a marked reduction from the 20+ hours typically required in early systems. Content and voice swapping extends WaveNet's capabilities by disentangling linguistic content from speaker identity, preserving prosody and intonation during conversion. This is accomplished using variational autoencoders (VAEs) or adversarial training to separate representations, where the content representation (e.g., the phonetic sequence) remains fixed while the speaker embedding is replaced. For example, the Disentangled Sequential Autoencoder framework models sequential data like audio by splitting latent representations into a static component capturing speaker identity and dynamic components capturing time-varying content, enabling prosody-preserving swaps such as converting a male speaker to a female one without altering the spoken message. This approach, detailed in 2018 research, enables such voice conversions. To enhance data efficiency in voice adaptation, meta-learning methods train WaveNet-based text-to-speech (TTS) systems on diverse speakers, creating an adaptable base model that fine-tunes rapidly on minimal target data. The sample-efficient adaptive TTS technique, for instance, learns a shared multi-speaker model whose parameters and speaker embeddings serve as an initialization, allowing adaptation with as little as 5-10 minutes of audio while achieving mean opinion scores comparable to models trained on hours of data. This near-original quality is maintained through optimized fine-tuning of the speaker embeddings during adaptation. Voice adaptation in WaveNet also introduces ethical challenges, including the risk of audio deepfakes for impersonation, fraud, and disinformation, as cloned voices can convincingly mimic individuals without their consent. To address these risks, proposals include embedding imperceptible perturbations as watermarks in generated audio, which enable detection of synthetic speech without audible degradation. Such watermarking techniques, leveraging adversarial perturbations in the waveform, facilitate provenance tracking and have been advanced in recent research to counter misuse in voice cloning scenarios, for example AudioSeal, developed in 2024 for localized detection of AI-generated speech.
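
Speaker conditioning of the kind used for cloning is typically injected as a global bias inside the gated activation. The sketch below shows one plausible way to do this in PyTorch; the embedding dimension, the linear projection, and the exact injection point are assumptions consistent with the conditional WaveNet formulation rather than a documented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerConditionedGate(nn.Module):
    """Gated activation with a global speaker embedding added to the filter
    and gate pre-activations (illustrative sketch for voice adaptation)."""

    def __init__(self, channels, speaker_dim, dilation, kernel_size=2):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)
        self.speaker_proj = nn.Linear(speaker_dim, 2 * channels)

    def forward(self, x, speaker_embedding):
        # x: (batch, channels, time); speaker_embedding: (batch, speaker_dim)
        pad = (self.kernel_size - 1) * self.dilation
        h = self.conv(F.pad(x, (pad, 0)))
        # Broadcast the per-utterance speaker vector across every timestep.
        h = h + self.speaker_proj(speaker_embedding).unsqueeze(-1)
        filt, gate = h.chunk(2, dim=1)
        return torch.tanh(filt) * torch.sigmoid(gate)
```

Fine-tuning for a new voice then amounts to updating the speaker embedding, and optionally the convolutional weights, on the small amount of target audio described above.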

Applications and Impact

Integration in Speech Synthesis

WaveNet has played a pivotal role in advancing text-to-speech (TTS) systems by serving as a high-fidelity neural vocoder that generates raw audio waveforms from acoustic features, enabling more natural-sounding speech compared to traditional parametric methods. Initially integrated into Google's ecosystem in October 2017, an optimized version of WaveNet powered the Google Assistant voices in US English and Japanese, marking its first widespread deployment across platforms like Google Home. This integration extended to the Google Cloud Text-to-Speech API, which launched in alpha in November 2017 with initial support for select languages and was fully powered by WaveNet technology by March 2018, initially offering WaveNet voices in US English alongside standard voices in other languages. By 2020, the service had expanded to over 30 languages and variants, with dozens of additional Standard and WaveNet voices added in May 2020, while also powering WaveNet voices in the Google Assistant for multilingual synthesis. In end-to-end TTS pipelines, WaveNet is typically combined with acoustic models such as Tacotron 2, introduced in December 2017, where Tacotron 2 generates mel-scale spectrograms from text inputs and WaveNet acts as the vocoder that converts these into time-domain waveforms. This architecture allows for direct text-to-waveform synthesis, bypassing intermediate linguistic features and achieving high naturalness with a mean opinion score (MOS) of 4.53, nearly matching professional recordings at 4.58, through streamlined conditioning on compact acoustic representations. Such pipelines have become foundational in modern TTS, with WaveNet's autoregressive modeling ensuring prosodic fidelity in generated speech. WaveNet's integration significantly elevated TTS performance in formal evaluations, with WaveNet-based systems securing the highest scores for naturalness and speaker similarity in Blizzard Challenge evaluations against prior systems. In subsequent years, systems leveraging WaveNet continued to lead in blind listening tests, influencing industry standards and establishing it as a benchmark for premium TTS as of 2025. This impact extended to competitors, prompting Amazon to introduce neural TTS voices in Amazon Polly in 2019, in competition with systems like WaveNet for more lifelike output. In real-world applications, WaveNet optimizations reduced synthesis latency to approximately 50 milliseconds per second of audio, over 1,000 times faster than the original model, enabling sub-200ms end-to-end response times in the Google Assistant for seamless, conversational interactions. These improvements support natural dialogue flows in voice assistants, where low latency minimizes perceived delays in responses. For personalization, WaveNet-based techniques have facilitated voice cloning from short audio samples, allowing custom voices in Google Cloud TTS for applications like restoring speech for individuals with conditions such as ALS.
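
For developers, the WaveNet voices in Google Cloud Text-to-Speech are selected by name through the public client libraries. The snippet below is a usage sketch based on the Python client; the specific voice name shown and the exact client syntax depend on the library version and the voices currently offered, so it should be checked against the live documentation.

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="WaveNet generates raw audio waveforms.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D",  # a WaveNet voice; available names may change
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    sample_rate_hertz=24000,
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)
with open("wavenet_sample.wav", "wb") as out:
    out.write(response.audio_content)
```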

Broader Audio and AI Uses

WaveNet's generative capabilities have extended beyond speech to music and sound synthesis, notably through the NSynth project developed by Google's Magenta team in collaboration with DeepMind in 2017. NSynth employs a WaveNet-based autoencoder to synthesize musical notes across various instruments, enabling the creation of novel sounds by interpolating between timbres in a continuous latent space, and demonstrated superior qualitative and quantitative performance over spectral autoencoder baselines. This approach inspired subsequent models like OpenAI's Jukebox in 2020, which generates full tracks with rudimentary singing in raw audio, leveraging a multi-scale vector-quantized variational autoencoder (VQ-VAE) combined with autoregressive transformers to produce coherent multi-minute compositions in diverse genres. In communication and video applications, WaveNet facilitates bandwidth extension to improve audio quality in low-bandwidth scenarios. A 2019 study introduced a WaveNet model for extending narrowband speech (up to 4 kHz) to wideband (up to 8 kHz), achieving mean opinion scores approaching those of natural wideband recordings while remaining practical for resource-constrained communication nodes. Such techniques enhance clarity in video calls and streaming, contributing to broader efforts to improve speech quality over constrained networks. WaveNet has profoundly influenced the generative AI ecosystem, serving as a foundational autoregressive model for raw audio that paved the way for diffusion-based alternatives. For instance, DiffWave (2020) adopts a non-autoregressive diffusion process to synthesize speech, matching WaveNet's quality (MOS of 4.44) with orders of magnitude faster inference, highlighting WaveNet's role in establishing benchmarks for high-fidelity waveform generation. This legacy extends to models like Google's AudioLM (2022), which uses language modeling over discrete audio tokens for coherent long-form generation of speech and piano music from audio prompts alone. Similarly, Meta's MusicGen (2023) builds on these principles with a single language model operating on compressed music tokens to produce controllable tracks from text descriptions, achieving state-of-the-art controllability in genre and style. By 2025, WaveNet-derived techniques inform multimodal systems like Google's Gemini 2.5, enabling native audio-text generation for interactive applications such as dialogue synthesis and sound design. Research extensions of WaveNet have explored environmental sound modeling and accessibility tools. In environmental acoustics, WaveNet architectures have been adapted for anomalous sound event detection, using autoregressive prediction errors to identify deviations in waveforms such as machine faults or unexpected noises, outperforming traditional classifiers on datasets such as those from the DCASE challenges. For accessibility, WaveNet-powered text-to-speech integrates into screen readers and reading apps, providing natural voice feedback for visually impaired users, with Google's Cloud TTS leveraging WaveNet-family voices to support over 380 voices across more than 75 languages and variants as of 2025, including recent integrations like Gemini-TTS for enhanced controllability.
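
The anomalous-sound-detection use mentioned above reduces to scoring how surprised the autoregressive model is by each incoming sample. A schematic sketch, assuming per-sample log-likelihoods log p(x_t | x_<t) from a trained WaveNet-style model are already available, might smooth the negative log-likelihood and threshold it; the function names and window size here are illustrative.

```python
import numpy as np

def anomaly_scores(per_sample_log_probs, window=1024):
    """Turn per-sample log p(x_t | x_<t) into a smoothed anomaly score:
    segments the model finds unlikely receive high scores (illustrative sketch)."""
    nll = -np.asarray(per_sample_log_probs, dtype=np.float64)
    kernel = np.ones(window) / window
    return np.convolve(nll, kernel, mode="same")

def detect(per_sample_log_probs, threshold):
    """Flag timesteps whose smoothed surprise exceeds a chosen threshold."""
    return anomaly_scores(per_sample_log_probs) > threshold
```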
