Text-to-Audio
Also known as: Text-to-Audio Generation, TTA
A class of generative AI models that synthesise non-speech audio (environmental sounds, sound effects, music stems) from a text prompt, for example producing the sound of 'leaves rustling in wind' or 'church bells ringing'. Distinct from text-to-speech, which produces spoken words. Models such as AudioGen and AudioLDM, along with the related image-conditioned model Im2Wav, are being explored for accessibility applications including scene sonification for blind and low-vision users, accessible media, and educational audio. Current limitations include hallucinated or mismatched sounds, difficulty composing multi-object scenes, and no direct control over how many sound events occur or whether they are rendered as discrete events.
Category: Generative AI · Audio · Machine Learning · Accessibility Tools
Related: Generative AI · Sonification · Text-to-Speech · Hallucination