This thing we made is so brilliant, we can't risk releasing it to the general public. That, in essence, is what Microsoft says about its latest speech generator, VALL-E 2. So, does that reflect genuine concerns? Or is it a clever marketing ruse designed to get some viral traction and set online chins wagging?
If it is all completely genuine, what does it say about Microsoft that it's knowingly creating AI tools too dangerous to release? It's a conundrum, to be sure.
Anyway, here are the basic facts of the situation. Microsoft says in a recent blog post (via Extremetech) that its latest neural codec language model for speech synthesis, known as VALL-E 2, achieves "human parity for the first time".
More specifically, "VALL-E 2 can generate accurate, natural speech in the exact voice of the original speaker, comparable to human performance." Now, to some extent, this is nothing new. However, it's the incredible speed with which VALL-E 2 can achieve this, or to put it another way, the incredibly limited sample or prompt it needs to achieve this feat, that's remarkable.
VALL-E 2 can accurately mimic a specific person's voice based on a sample just a few seconds long. It pulls that trick off by using a huge training library that maps variations in pronunciation, intonation and cadence in the model to the sample, then spits out what appears to be totally convincing synthesised speech.
Microsoft's blog post has a range of example audio clips demonstrating how well VALL-E 2 (and indeed its predecessor, VALL-E) can turn a short sample of between three and 10 seconds into convincing synthesised speech that's often indistinguishable from a real human voice.
It's a process known as zero-shot text-to-speech synthesis, or zero-shot TTS for short. Again, the approach is nothing new; it's the accuracy and shortness of the sample audio that's novel.
Of course, the idea of weaponising such tools to create