TangoFlux: Ultra-high-speed sound effect generation model, generating 30 seconds of high-quality audio in about 3.7 seconds

Author: LoRA Time: 02 Jan 2025

In the field of artificial intelligence, text-to-audio generation is developing rapidly and has become an active research area. Researchers recently released a new model called TANGOFLUX, which has attracted widespread attention for its strong performance and fast generation.

TANGOFLUX is an efficient text-to-audio generation model with 515 million parameters. It can generate up to 30 seconds of high-quality audio at a 44.1kHz sampling rate in just 3.7 seconds. That figure was measured on a single A40 GPU, and it puts the model ahead of many existing models in efficiency.
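To put those numbers in perspective, the short calculation below works out how much faster than real time the quoted figures are and how many samples 30 seconds of 44.1kHz audio contains. It uses only the figures stated in the article; the function names are illustrative and not part of any TANGOFLUX code.

```python
# Figures quoted in the article.
AUDIO_SECONDS = 30.0        # length of the generated clip
GENERATION_SECONDS = 3.7    # wall-clock time on a single A40 GPU
SAMPLE_RATE_HZ = 44_100     # 44.1kHz sampling rate


def speedup_over_real_time(audio_seconds, generation_seconds):
    """Return how many seconds of audio are produced per second of compute."""
    return audio_seconds / generation_seconds


def total_samples(audio_seconds, sample_rate_hz):
    """Return the number of samples per channel in a clip of the given length."""
    return int(round(audio_seconds * sample_rate_hz))


speedup = speedup_over_real_time(AUDIO_SECONDS, GENERATION_SECONDS)
samples = total_samples(AUDIO_SECONDS, SAMPLE_RATE_HZ)

print(f"{speedup:.1f}x faster than real time")
print(f"{samples:,} samples per channel")
```

In other words, the model produces roughly eight seconds of audio for every second of compute, and each 30-second clip holds about 1.3 million samples per channel.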

One of the highlights of TANGOFLUX is its ability to generate many types of sound effects, including birdsong, whistles, and explosions. It also supports music generation, although its results there are weaker. When generating event-specific sounds, it reproduces the sequence of events and the audio details more clearly, and it also performs well in terms of sound quality.

A key challenge in aligning text-to-audio models is how to build preference pairs efficiently. Unlike large language models, text-to-audio generative models lack verifiable reward mechanisms or standard answers. To solve this problem, the research team proposed a framework called CLAP-Ranked Preference Optimization (CRPO). This framework improves model alignment by iteratively generating preference data and optimizing on it. According to the research, audio preference data generated with the CRPO framework outperforms existing methods in several respects.
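The article does not spell out how CRPO forms its pairs, so the following is only a minimal sketch of one plausible reading of "CLAP-ranked": generate several candidate clips for a prompt, score each by the similarity between the text embedding and the audio embedding, then treat the top-scoring clip as the preferred sample and the bottom-scoring clip as the rejected one. The cosine-similarity scoring, the best-versus-worst pairing rule, the toy two-dimensional embeddings, and every function name here are assumptions for illustration; no real CLAP model is called.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_preference_pair(text_embedding, candidate_embeddings):
    """Rank candidates by text-audio similarity and return (chosen, rejected, scores).

    Assumption: the highest-scoring candidate is the preferred sample and the
    lowest-scoring candidate is the rejected one.
    """
    scores = [cosine_similarity(text_embedding, c) for c in candidate_embeddings]
    chosen = max(range(len(scores)), key=scores.__getitem__)
    rejected = min(range(len(scores)), key=scores.__getitem__)
    return chosen, rejected, scores


# Toy stand-ins for embeddings that a CLAP-style encoder would produce.
prompt_embedding = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # closely matches the prompt
    [0.0, 1.0],   # unrelated to the prompt
    [0.5, 0.5],   # partially matches
]

chosen, rejected, scores = build_preference_pair(prompt_embedding, candidates)
print("chosen candidate:", chosen)
print("rejected candidate:", rejected)
```

In an iterative setup, pairs built this way would be used to fine-tune the model, and the updated model would then generate the next round of candidates.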

Thanks to this framework, TANGOFLUX has demonstrated leading performance on multiple objective and subjective benchmarks. In addition, the research team decided to open-source the model and all code so that researchers worldwide can advance the research and application of text-to-audio generation.

The audio generation quality of TANGOFLUX is reported to be significantly better than that of other models, especially in the clarity of event sounds, the reproduction of event sequences, and overall sound quality. Users can hear the difference for themselves in the published examples. The technology broadens the prospects for text-to-audio generation, with considerable application potential in film and television production, game sound effects, and other fields.

Project page: TANGOFLUX

Key points summary:

TANGOFLUX is an efficient text-to-audio generation model that can generate 30 seconds of high-quality audio in 3.7 seconds.
The team proposed the CLAP-Ranked Preference Optimization (CRPO) framework, which significantly improves model performance and audio generation quality.
All code and models have been open-sourced to promote the research and application of text-to-audio generation.
