Meet TANGO, an AI generating realistic human speakers

Researchers reveal TANGO, an AI system that generates realistic human speakers

Researchers have unveiled a new AI system called TANGO, which can generate realistic full-body talking videos of people, showing how far synthetic media creation has come.

A series of videos showcasing the tool has been published on its website and YouTube, including demonstrations of how body movements can be faked to match any audio recording.

In one example, 10 separate videos show different individuals repeating the same script, all with expressive hand movements that look natural.

The team behind the AI tool has added it to the community-focused Hugging Face, where people can try it out for themselves using nine demo videos.

“Given a few-minute, single-speaker reference video and target speech audio, TANGO produces high-fidelity videos with synchronized body gestures,” write the researchers in a paper that was submitted on October 5.

generating fullbody talking video with speech audio!💪 available on Hugging face Space. You can upload your audio, or clone it to process your custom character. https://t.co/RztXMghbWI, extension for dancing or sports is also coming! pic.twitter.com/2cYwnTRk9F

— Haiyang Liu (@HaiyangLiu3) October 13, 2024

How does TANGO AI work to generate realistic faux clips?

TANGO builds on Gesture Video Reenactment, which splits and retrieves video clips using a graph structure.
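To give a rough idea of what a graph of video clips can look like, here is a minimal sketch. It is not the researchers' code: the `build_motion_graph` function, the `start_pose` and `end_pose` fields and the `threshold` value are all hypothetical, and real systems use far richer pose data. Each clip is a node, and an edge means one clip can plausibly follow another.

```python
from math import dist

def build_motion_graph(clips, threshold):
    """Return {clip index: [indices of clips that may follow it]}.

    Hypothetical rule: a clip may be followed by the next clip from the
    source video, or by any clip whose first pose is close to its last pose.
    """
    graph = {i: [] for i in range(len(clips))}
    for i, current in enumerate(clips):
        for j, candidate in enumerate(clips):
            if i == j:
                continue
            is_next_in_source = j == i + 1
            poses_match = dist(current["end_pose"], candidate["start_pose"]) <= threshold
            if is_next_in_source or poses_match:
                graph[i].append(j)
    return graph

# Toy 2D "poses" standing in for full-body keypoints (made-up numbers).
clips = [
    {"start_pose": (0.0, 0.0), "end_pose": (1.0, 0.0)},
    {"start_pose": (1.0, 0.1), "end_pose": (2.0, 0.0)},
    {"start_pose": (5.0, 5.0), "end_pose": (0.0, 0.1)},
]
graph = build_motion_graph(clips, threshold=0.2)
print(graph)
```

In this toy data the last clip ends close to where the first one starts, so the graph gains an edge that loops back and lets playback continue beyond the original footage.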

The team addressed two limitations of that approach: audio-motion misalignment and visual artifacts in GAN-generated transition frames.

To tackle the misalignment, the team retrieves relevant gestures by comparing latent features, which improves cross-modal alignment. Those features are learned so that they model the relationship between speech audio and gesture motion, allowing realistic, audio-synchronized videos to be created.
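Retrieval by latent feature distance can be pictured as a nearest-neighbour lookup. The sketch below is only an illustration under stated assumptions: the embeddings are made-up three-number vectors, the clip names are invented, and cosine distance is chosen here for simplicity rather than taken from the paper.

```python
from math import sqrt

def cosine_distance(u, v):
    """1 minus cosine similarity; smaller means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def retrieve_gesture(audio_embedding, motion_embeddings):
    """Pick the clip whose motion embedding is closest to the audio embedding."""
    return min(
        motion_embeddings,
        key=lambda name: cosine_distance(audio_embedding, motion_embeddings[name]),
    )

# Hypothetical embeddings in a shared audio-motion space.
motion_embeddings = {
    "wave": (1.0, 0.0, 0.0),
    "point": (0.0, 1.0, 0.0),
    "rest": (0.0, 0.0, 1.0),
}
audio_embedding = (0.9, 0.1, 0.0)
best_clip = retrieve_gesture(audio_embedding, motion_embeddings)
print(best_clip)
```

The lookup only works if audio and motion are embedded in the same space, which is why the way those features are trained matters so much.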

The researchers believe this is the first work to present “CLIP-Like contrastive learning on audio and motion modalities, and it is the first open-source motion graph and audio-driven video generation pipeline.”
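For readers unfamiliar with CLIP-style contrastive learning, the sketch below shows the general idea applied to audio and motion pairs. It is a generic symmetric contrastive loss written in plain Python, not TANGO's implementation; the function names, the toy vectors and the `temperature` value of 0.07 are assumptions made for illustration.

```python
from math import exp, log, sqrt

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i's correct answer is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + log(sum(exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(audio_batch, motion_batch, temperature=0.07):
    """Symmetric contrastive loss: matching audio/motion pairs share an index."""
    audio = [normalize(v) for v in audio_batch]
    motion = [normalize(v) for v in motion_batch]
    logits = [
        [sum(a * m for a, m in zip(a_vec, m_vec)) / temperature for m_vec in motion]
        for a_vec in audio
    ]
    transposed = [list(column) for column in zip(*logits)]
    audio_to_motion = cross_entropy_on_diagonal(logits)
    motion_to_audio = cross_entropy_on_diagonal(transposed)
    return 0.5 * (audio_to_motion + motion_to_audio)

# Toy batch: pairs line up in the first case and are swapped in the second.
aligned_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

Training pushes this loss down, which pulls each audio clip's embedding towards its own gesture and away from the others in the batch.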

The team hopes to extend TANGO’s abilities in the future, so it can include dance, sports, and more.

The AI project comes amidst growing discourse about the use of AI in video content creation, as several video editing programs now include some form of generative AI.

YouTube, arguably the most popular video-focused platform, introduced a disclosure tool in the Creator Studio in March 2024.

Through this, creators are asked to disclose if their ‘realistic content’ has been made with altered or synthetic media, including generative AI.