Text-to-Audio

Also known as: Text-to-Audio Generation, TTA

A class of generative AI models that synthesise non-speech audio (environmental sounds, sound effects, music stems) from a text prompt, for example producing the sound of 'leaves rustling in wind' or 'church bells ringing'. Text-to-audio is distinct from text-to-speech, which produces spoken words. Models such as AudioGen and AudioLDM, along with the related image-conditioned model Im2Wav, are being explored for accessibility applications, including scene sonification for blind and low-vision users, accessible media, and educational audio. Current limitations include hallucinated or mismatched sounds, difficulty composing multi-object scenes, and no direct control over the number of sound events or whether they are rendered as discrete, separate events.

Category: Generative AI · Audio · Machine Learning · Accessibility Tools

Related: Generative AI · Sonification · Text-to-Speech · Hallucination

Sources

https://doi.org/10.1145/3772318.3791655