Generation or Replication: Auscultating Audio Latent Diffusion Models

Audio examples from ICASSP 2024 submission

MERL Researchers: Gordon Wichern, François Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux (Speech & Audio).

Search MERL publications by keyword: Speech & Audio, acoustic similarity, audio synthesis


Identified partially replicated training examples from the full TANGO model.

For each generated example, we show the top match found in the training set by each of the two similarity methods explored in our paper: CLAP and mel. The generated sounds are not identical to the training data, but they share striking similarities in features such as event onsets, which appear to be replicated from the training data.
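As a rough illustration of what "top match" means here, the following minimal sketch retrieves the nearest training item for one generated sample. It assumes each clip has already been reduced to a fixed-length feature vector (for example a CLAP embedding or a flattened mel representation) and that matches are ranked by cosine similarity; this page does not specify either detail, and the function name top_match and the toy vectors are hypothetical.

```python
import numpy as np


def top_match(query, train_feats):
    """Return (index, cosine similarity) of the training row closest to query.

    Assumption: features are precomputed, non-zero, fixed-length vectors.
    """
    query = np.asarray(query, dtype=float)
    train_feats = np.asarray(train_feats, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q
    best = int(np.argmax(sims))
    return best, float(sims[best])


# Toy stand-ins for real features: three "training" vectors and one "generated" vector.
train = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
generated = np.array([0.5, 0.9, 0.0])

idx, score = top_match(generated, train)
print(idx, round(score, 3))
```

Running the same search once per feature type gives one top match per similarity method, which is how each example below is laid out.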

Generated Sample

Prompt: Something zooms by before exploding in the distance
Top Training Data Match: mel

Caption: Explosions occur multiple times
Top Training Data Match: CLAP

Caption: Multiple explosions
     
Generated Sample

Prompt: A man speaks and an audience gives applause
Top Training Data Match: mel

Caption: Person is speaking and people are cheering
Top Training Data Match: CLAP

Caption: Excitement and applause for a male speaker
     
Generated Sample

Prompt: A motor is accelerating and then slows, then accelerates again
Top Training Data Match: mel

Caption: A train moves then a horn is triggered and a bell rings
Top Training Data Match: CLAP

Caption: A racing vehicle engine revving up before accelerating and driving by
     
Generated Sample

Prompt: A person sneezing
Top Training Data Match: mel

Caption: A loud burp is made
Top Training Data Match: CLAP

Caption: A child sneezes
     
Generated Sample

Prompt: A person is snoring while sleeping
Top Training Data Match: mel

Caption: A person snoring
Top Training Data Match: CLAP

Caption: A person snores loudly nearby several times
     
Generated Sample

Prompt: A door opens then closes followed by thunder
Top Training Data Match: mel

Caption: Thunder sounds loudly nearby
Top Training Data Match: CLAP

Caption: A click followed by a loud, long bang
     
Generated Sample

Prompt: Silence followed by breathing, a sneeze then sniffling
Top Training Data Match: mel

Caption: A person sneezes loudly nearby
Top Training Data Match: CLAP

Caption: An adult female sneeze three times and sniffs
     
Generated Sample

Prompt: The gentle drone of a fan blows with an echo as a toilet flushes
Top Training Data Match: mel

Caption: A toilet is flushed
Top Training Data Match: CLAP

Caption: A toilet is flushed and the water gurgles loudly
     
Generated Sample

Prompt: A power tool is in use
Top Training Data Match: mel

Caption: High pitched drilling
Top Training Data Match: CLAP

Caption: Drill spinning rapidly and then getting stuck and stopping
     
Generated Sample

Prompt: A person snoring and breathing heavily
Top Training Data Match: mel

Caption: A person snoring
Top Training Data Match: CLAP

Caption: A person snoring
     
Generated Sample

Prompt: A man speaks, and then a toilet flushes, followed by the man continuing to speak
Top Training Data Match: mel

Caption: A winged insect is buzzing around
Top Training Data Match: CLAP

Caption: A man speaking followed by a toilet flushing
     
Generated Sample

Prompt: A man speaking followed by metal rattling then a motorcycle engine starting up and running idle
Top Training Data Match: mel

Caption: Birds chirp and something squeaks while leaves rustle
Top Training Data Match: CLAP

Caption: A man talks followed by a motorcycle engine starting
     
Generated Sample

Prompt: Silence followed by a man speaking and then a toilet flushing
Top Training Data Match: mel

Caption: A shifting sound accompanies a knock, followed by a toilet flushing
Top Training Data Match: CLAP

Caption: A man speaking followed by a toilet flushing
     
Generated Sample

Prompt: A heavy rain falls
Top Training Data Match: mel

Caption: Flushing of a toilet as bells ring
Top Training Data Match: CLAP

Caption: Thunder clap and rain
     
Generated Sample

Prompt: Two snaps occur
Top Training Data Match: mel

Caption: A machine runs and then a loud burst of air pops
Top Training Data Match: CLAP

Caption: A small gunshot rings
     


Identified duplicates in AudioCaps training set

Full list [audiocaps_duplicates.csv]

Selected examples - cluster 0

Selected examples - cluster 2

Selected examples - cluster 53

Selected examples - cluster 54



MERL Publications

  •  Bralios, D., Wichern, G., Germain, F.G., Pan, Z., Khurana, S., Hori, C., Le Roux, J., "Generation or Replication: Auscultating Audio Latent Diffusion Models", arXiv, October 2023.
    @article{Bralios2023oct,
      author = {Bralios, Dimitrios and Wichern, Gordon and Germain, François G and Pan, Zexu and Khurana, Sameer and Hori, Chiori and Le Roux, Jonathan},
      title = {Generation or Replication: Auscultating Audio Latent Diffusion Models},
      journal = {arXiv},
      year = 2023,
      month = oct,
      url = {https://arxiv.org/abs/2310.10604}
    }