Microsoft’s Voice Mimicking Achievement Takes Natural Language Generation to New Levels, Albeit Controversial

R. Bhattacharyya

Summary Bullets:

• Custom Neural Voice can be trained to generate natural language that sounds like a specific person.

• Microsoft has considered the implications of Custom Neural Voice and prioritizes responsible use of the technology, but the solution underscores the urgency of discussions around Responsible AI.

Microsoft recently announced the general availability, with limited access (use cases are subject to Microsoft approval), of Custom Neural Voice, a service that uses artificial intelligence to generate natural language, enabling computers to 'speak'. The achievement is impressive because of the level of customization it offers. Enabling computers to talk isn't new; what raises eyebrows is that Custom Neural Voice can be trained to generate natural-sounding speech that mimics a person. And not just a fictional character, but a specific individual.

The technology is already in use: at an AT&T retail store in Dallas, the voice of Bugs Bunny converses with customers, offering personalized greetings. In the future, the technology could be trained on specific actors' voices, opening applications in filmmaking and television. Should an actor stumble over words or forget lines in a script, the technology could recreate the appropriate dialogue, reducing the need to repeat and rerecord scenes.

For decades, the ability to converse with computers was the stuff of science fiction movies. Today, natural language generation technology is widely available: Google offers Cloud Text-to-Speech, Microsoft offers Azure Cognitive Services Text to Speech, IBM has Watson Text-to-Speech, and Baidu provides Text-to-Speech. The challenge is that computer-generated speech has often sounded robotic, making the user experience feel 'clunky'. Recent advances have made generated speech sound far more natural, and vendors typically offer a range of voices and dialects that customers can select to customize interactions. For example, Google's Cloud Text-to-Speech offers more than 180 voices across over 30 languages and variants.
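As a concrete illustration of the voice and dialect customization described above: the major cloud text-to-speech services (including Azure, Google Cloud, and IBM Watson) accept SSML (Speech Synthesis Markup Language), a W3C standard that lets a caller select a voice and adjust prosody such as speaking rate and pitch. The sketch below only builds the SSML request document; it does not call any service, and the voice name "en-US-JennyNeural" is an illustrative example, not a recommendation.

```python
# Minimal sketch: constructing an SSML document that selects a voice and
# tunes prosody (rate and pitch), as accepted by most cloud TTS services.
# The voice name below is illustrative; real deployments would pick one
# from their provider's published voice list.
import xml.etree.ElementTree as ET

def build_ssml(text, voice="en-US-JennyNeural", rate="medium", pitch="default"):
    """Wrap `text` in an SSML document selecting a voice and prosody."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    voice_el = ET.SubElement(speak, "voice", {"name": voice})
    prosody = ET.SubElement(voice_el, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")

ssml = build_ssml("Welcome to the store!", rate="slow", pitch="+5%")
print(ssml)
```

The resulting string would be sent as the request body to a synthesis endpoint; the same markup works across vendors precisely because SSML is a shared standard, which is part of why voice selection has become a routine customization point.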

However, the ability to imitate an individual's speech (each person has a unique prosody: the characteristic pitch, timing, and intonation of their voice) takes natural language generation to another level. It also underscores the urgency of discussions around Responsible AI. Imagine that a computer, given enough sample data, could be trained to sound like any individual and to say anything. Similarly, deepfake videos, which use machine learning to generate visual content, can depict individuals doing or saying just about anything, and their quality keeps improving. It doesn't take a highly creative individual to see how such technology could be used for fraudulent or malicious purposes. Microsoft says it has considered the implications of Custom Neural Voice and prioritizes responsible use of AI: the company must approve all applications of the technology and has built in safeguards to ensure speakers consent to the use of their voices. But if Microsoft can develop the technology, others likely aren't far behind, including companies and individuals that may not have the same strong Responsible AI principles or processes in place.

The new solution highlights an issue that has cast a shadow over AI: responsible and acceptable use. It's a difficult topic because what one culture considers acceptable use may not be acceptable to another; it can even vary from individual to individual. For example, facial recognition has been in the public eye over the past year, with many people in the US and Europe eager to limit its use by law enforcement. Elsewhere, adoption by law enforcement is seen as improving public safety and creating a greater sense of security. What happens if analysis of facial movements is used to judge whether a person is lying, paying attention in class, or engaged in a meeting?

Technology moves quickly, often faster than society or the regulatory environment. But advances in AI will continue, and current events underscore the need for a collective conversation on Responsible AI that includes policy makers, technology vendors, academics, and civil liberties organizations. Yesterday it was facial recognition, today it is natural language generation, and tomorrow it may be something entirely different. Now is the time to start having these difficult discussions and to address the need for regulations, domestic and international, that guide how the technology is used.


