Remember that late-night talk show bit where an image of a political figure is shown with someone else's mouth superimposed over it, to make them say dubious things? It always looked a little ropey, but that was part of the effect. Well, this new AI tool also takes still images of human subjects and animates the mouth and head movements, but this time the effect is surprisingly, almost worryingly convincing.
The tool is called EMO: Emote Portrait Alive, and it was developed by researchers at the Institute for Intelligent Computing, part of the Alibaba Group. EMO takes a single reference image, generates motion frames from it, and combines them with vocal audio through a diffusion process: the facial region is integrated with multi-frame noise samples and then progressively denoised, with generated imagery added in sync with the audio. The end result is a video of the subject not only lip-syncing, but also producing a range of facial expressions and head poses.
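For the curious, here's a rough sketch of what that kind of audio-conditioned denoising loop can look like in code. To be clear, this is not EMO's published implementation: the encode_face, encode_audio, and predict_noise callables are hypothetical stand-ins for the model's learned networks, and the DDPM-style noise schedule is a generic choice. Only the overall shape, a single reference image plus per-frame audio features steering a multi-frame denoising loop, follows the researchers' description.

```python
import torch

def generate_talking_head(ref_image, audio, encode_face, encode_audio,
                          predict_noise, steps=50, frames=16, size=64):
    """Illustrative sampling loop: start from multi-frame noise and
    iteratively denoise it into video frames, conditioned on identity
    features from one still image and per-frame audio features.
    The three callables are hypothetical placeholders, not EMO's API."""
    face_feat = encode_face(ref_image)        # identity features from one still
    audio_feat = encode_audio(audio, frames)  # one feature vector per frame

    # Multi-frame noise: one noisy latent per output video frame.
    x = torch.randn(frames, 3, size, size)

    betas = torch.linspace(1e-4, 0.02, steps)  # standard DDPM noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    for t in reversed(range(steps)):
        # The network sees all frames together, which is what keeps motion
        # coherent across the clip instead of flickering frame to frame.
        eps = predict_noise(x, t, face_feat, audio_feat)
        a, ab = alphas[t], alpha_bars[t]
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)

    return x  # (frames, 3, size, size): the generated video frames

# Toy usage with stub networks (purely illustrative, no learned weights):
if __name__ == "__main__":
    enc_face = lambda img: img.mean()                       # stub identity feature
    enc_audio = lambda wav, n: wav.reshape(n, -1).mean(-1)  # stub per-frame features
    pred = lambda x, t, f, a: torch.randn_like(x)           # stub noise predictor
    video = generate_talking_head(torch.rand(3, 64, 64), torch.rand(16 * 100),
                                  enc_face, enc_audio, pred)
    print(video.shape)  # torch.Size([16, 3, 64, 64])
```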
The technology is demonstrated using sample images of various figures, ranging from real-life celebrities to AI-generated people to the Mona Lisa, while the vocal audio includes a Dua Lipa track, pre-recorded interview clips, and Shakespearean monologues. Once the process has been applied, the generated avatar appears to come to life, mouthing and moving along to the chosen audio.
The effect is surprisingly accurate although, it has to be said, far from perfect. "Buh" sounds sometimes appear to come from open mouths rather than closed lips, and the occasional syllable emerges from behind clenched teeth, as if the avatar is resisting the AI's insistence on bringing it to life to sing and perform for the internet.