

The Ethics of Voice Cloning

  • July 17, 2021
  • Clayton Rice, K.C.

Digital cloning is a technology that uses deep learning algorithms to manipulate audio, photographs and videos, making it difficult to distinguish what is fake from what is real. Although the technology has been used for beneficial educational purposes, such as synthesizing audio books without human readers, concerns continue to emerge about the potential for fraud, identity theft and data breaches. Digital cloning has the potential to alter the integrity of the historical record when the voices and images of public officials are manipulated to say and do things they never said or did. Recent advances in voice cloning software can reproduce accent, timbre and pace, and even portray emotion. The ethical questions associated with the technology have now bounced back into public discourse with the release of the documentary film about Anthony Bourdain, by the Oscar-winning filmmaker Morgan Neville, in which the late globetrotting chef speaks words he never said.

1. Introduction

The latest controversy was kick-started by Helen Rosner in an article titled A Haunting New Documentary About Anthony Bourdain dated July 15, 2021, published by The New Yorker. (here) Ms. Rosner interviewed Mr. Neville at the time of the documentary's premiere this week. For the movie Roadrunner: A Film About Anthony Bourdain, Mr. Neville and his team collected excerpts of Mr. Bourdain's narration from various sources, including television, podcasts and audio books. But there is a moment at the end of the film's second act when the artist David Choe reads an email he received from Mr. Bourdain. Ms. Rosner asked Mr. Neville how he found a recording of Mr. Bourdain reading his own email.

Mr. Neville explained there were three quotations he wanted to use in the film for which there were no known recordings. So he “created an AI model” of Mr. Bourdain’s voice by providing about twelve hours of recordings to an unnamed software company. Ms. Rosner described the seamless effect as eerie. “If you watch the film, other than that line you mentioned, you probably don’t know what the other lines are that were spoken by the A.I., and you’re not going to know,” Mr. Neville said. “We can have a documentary-ethics panel about it later.” Well, it wasn’t necessary to wait for a panel to be assembled. Online commentary was instantaneous.

2. The Ethical Dimension

The first record of "mimicking human voice" dates back to 1779, when Professor Christian Kratzenstein, working in his lab in St. Petersburg, Russia, built acoustic resonators that mimicked the human vocal tract when activated by vibrating reeds. Professor Kratzenstein's resonators functioned like wind instruments. Homer Dudley presented the first electrical speech synthesizer, VODER (Voice Operating Demonstrator), at the New York World's Fair in 1939. Then, in 1968, Noriko Umeda and colleagues built the first text-to-speech system for English. Technology based on "deep learning" or Generative Adversarial Networks (GANs) now leads the development of voice synthesis. Although voice synthesis may save considerable cost in commercial applications such as gaming and audio books, "the focus is on facilitating the use of several different sounding voices at scale." Virtual assistants like Amazon's Alexa and Apple's Siri require huge datasets for training in order to produce a single voice. (here)
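By way of illustration, the kind of neural text-to-speech described above is now available in open-source toolkits. The following is a minimal sketch using the Coqui TTS library – a stand-in of my choosing, not a system named in this article – in which the model names are published Coqui checkpoints and the file paths are illustrative assumptions:

```python
# Minimal sketch of neural text-to-speech with the open-source Coqui
# TTS toolkit (pip install TTS). The article names no toolkit; this
# library and these file paths are illustrative assumptions.
from TTS.api import TTS

# Plain synthesis with a single-speaker English model.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Synthesized narration from plain text.",
                file_path="narration.wav")

# Voice cloning: a multi-speaker model (YourTTS) conditioned on a
# short reference recording of the target speaker.
cloner = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
cloner.tts_to_file(text="Words the speaker never actually said.",
                   speaker_wav="reference_speaker.wav",  # hypothetical sample
                   language="en",
                   file_path="cloned.wav")
```

The second call is the crux of the ethical debate: the output voice is determined not by a studio recording session but by whatever reference audio the user supplies.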

The deep learning development company, Respeecher, founded in 2017, emphasizes on its website that creating “fake news” and deceiving people into thinking someone said something they didn’t is fraudulent and “just plain wrong”. (here) In its Ethical Voice Cloning Principles, Respeecher states that it “does not use voices without permission when this could impact the privacy of the subject or their ability to make a living.” Although Respeecher does not use the voice of a “private person” without permission, it will allow “non-deceptive use” of the voices of historical figures and politicians. (here) There are, then, two elements in Respeecher’s statement of principles that bear upon the ethical analysis – consent and disclosure.

With the release of the "emotionally searing" Bourdain documentary the debate has been renewed about the future of voice cloning, not only in film, but also in the commercial sector and the world of politics. (here) "Unapproved voice cloning is a slippery slope," said Andrew Mason, the CEO of Descript, in a blog post. "[A]s soon as you get into a world where you're making subjective judgment calls about whether specific cases can be ethical, it won't be long before anything goes." (here) In an interview for The Associated Press, Mr. Mason told journalists Matt O'Brien and Barbara Ortutay that Descript has repeatedly rejected requests to "bring back a voice" including from "people who have just lost someone and are grieving." (here)

The right of the individual to “own and control” the use of his or her digital voice, an aspect of the right to personal autonomy, is emphasized in Descript’s Ethics Statement. (here) Descript uses a process for training speech models that depends on “verbal consent verification” ensuring that customers can only create text-to-speech models that are authorized by the voice’s owner. Once the model is created, “the voice owner has control over when and how it is used.” As Mr. Mason said during his interview with Mr. O’Brien and Ms. Ortutay, there have to be “some bright lines” about what is ethical in generative media – the rapidly advancing field of research that relates to deep fakes and other forms of synthesized audio and video.
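Descript has not published the mechanics of its "verbal consent verification", but the underlying idea – confirming that the consent recording and the recordings submitted for training come from the same speaker – can be sketched with off-the-shelf speaker embeddings. The sketch below uses the open-source Resemblyzer library and an arbitrary similarity threshold; it is an assumption about how such a check might work, not Descript's actual system:

```python
# Illustrative consent check: compare a speaker embedding of the
# recorded consent statement against a sample submitted for training.
# This is NOT Descript's system; the library choice, file names and
# threshold are assumptions for illustration (pip install resemblyzer).
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

consent = encoder.embed_utterance(preprocess_wav("consent_statement.wav"))
training = encoder.embed_utterance(preprocess_wav("training_sample.wav"))

# Resemblyzer embeddings are L2-normalized, so a dot product gives
# the cosine similarity between the two voices.
similarity = float(np.dot(consent, training))

THRESHOLD = 0.80  # arbitrary illustrative cut-off, not a published value
if similarity >= THRESHOLD:
    print(f"Same speaker (similarity {similarity:.2f}): consent accepted.")
else:
    print(f"Different speaker (similarity {similarity:.2f}): consent rejected.")
```

A production system would need more than a similarity score – liveness checks, a scripted consent statement and an audit trail, for example – but the principle of binding the model to the voice owner's explicit authorization is the same.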

In a follow-up piece for The New Yorker titled The Ethics of a Deepfake Anthony Bourdain Voice dated July 17, 2021, Ms. Rosner emphasized the "troubling connotations" associated with deep fakes and other computer-generated synthetic media that make it "natural for viewers, and filmmakers, to question the boundaries of its responsible use." Mr. Neville's cavalier comment about having "a documentary-ethics panel" did not reassure readers that he "took these matters seriously." (here) Ms. Rosner subsequently interviewed two well-qualified people for "Neville's hypothetical ethics panel" on the core principles of consent and disclosure. The first was Sam Gregory, a former filmmaker and program director of Witness, a human rights organization that focuses on ethical applications of video and technology. The second was Karen Hao, an editor at the MIT Technology Review, who focuses on artificial intelligence.

After a discussion about the confusion that developed publicly over whether Mr. Neville had the consent of "Bourdain's inner circle", Mr. Gregory addressed the problem of how to "show manipulation in a way that's responsible to the audience and doesn't deceive them." He used the example of a Ken Burns documentary, which doesn't state "reconstruction" at the bottom of every photograph that is animated. Mr. Gregory went on to suggest that the discomfort about Roadrunner may be attributable to the novelty of the technology. It's causing us to think about "how this will play out, in terms of our norms of what's acceptable, our expectations of media." It may be that we will become comfortable with it in the same way we are with a narrator reading a poem by Walt Whitman or a letter composed by a Civil War soldier.

Ms. Hao speculated that a particularly unsettling aspect of the Bourdain voice clone may be that it's a hybrid of reality and unreality. "It's not clearly faked, nor is it clearly real, and the fact that it was his actual words just muddles that even more," she said. Ms. Hao highlighted the disclosure problem in that the synthetic Bourdain voice was undetected until Mr. Neville pointed it out. The artificial voice may also disturb people because of the close connection many feel with Mr. Bourdain – what psychologists call a parasocial relationship. There is a profound visceral reaction to the manipulation of our understanding of Anthony Bourdain "without his consent and without our knowing". However, Ms. Hao urged people to "give the guy some slack". This is new ground and Mr. Neville crossed "a boundary that didn't previously exist."

3. The Cybersecurity Risk

The advances in artificial intelligence and text-to-speech synthesis now enable researchers to create an unnervingly exact voice clone from a five-second recording of a person's voice. (here) And the Canadian firm Resemble AI can turn cloned English voices into fifteen other languages. (here and here) The exposure to malicious uses of the technology is steadily increasing. Numerous people have fallen prey to "grandparent scams", in which an elderly person receives a phone call supposedly from a grandchild in distress who needs money, and to impersonation scams in which an employee is contacted by a supposed boss who directs him to immediately wire funds to a vendor. The second example is drawn from a case in which a U.K.-based CEO was tricked into transferring US$243,000 to impersonators using a cloned voice of his German boss, reported by Catherine Stupp in an article titled Fraudsters Used AI to Mimic CEO's Voice in Unusual Cybercrime Case published in The Wall Street Journal on August 30, 2019. The Federal Trade Commission in the United States has identified voice cloning as one of the technologies making it more difficult for consumers to identify these sorts of "social engineering scams".
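Because a cloned voice can pass a casual listen, the practical defence against these scams is procedural rather than acoustic: confirm the request over a second, trusted channel before any money moves. The sketch below illustrates such an out-of-band check with a one-time code; the workflow is an assumption offered for illustration, not a control prescribed by the Federal Trade Commission:

```python
# Illustrative out-of-band verification for a voice-initiated payment
# request. The premise: never trust the voice alone; require a one-time
# code delivered over a separately registered channel. The workflow is
# a hypothetical illustration, not a prescribed or vendor-specific control.
import secrets

def issue_challenge() -> str:
    """Generate a six-digit one-time code to be sent to the requester's
    pre-registered device over a channel the caller does not control."""
    return f"{secrets.randbelow(10**6):06d}"

def verify(expected: str, supplied: str) -> bool:
    """Compare codes in constant time to avoid timing side channels."""
    return secrets.compare_digest(expected, supplied)

# Usage: finance issues a code, texts it to the executive's registered
# phone, and proceeds only if the caller can read the same code back.
code = issue_challenge()
print("Proceed with transfer:", verify(code, code))  # True only on a match
```

The point is not the code itself but the channel: a fraudster with a perfect voice clone still cannot read back a secret that was never sent to him.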

4. Conclusion

Digital cloning is an ominous development in artificial intelligence. The dangers inherent in voice cloning are somewhat more immediate than those of deep fake videos, which require large amounts of source imagery in order to produce a realistic product. A believable voice clone can be created from a much smaller sample. The lack of public awareness and the absence of a regulatory regime, combined with the relative ease of production, make voice cloning a development that merits the attention of lawmakers. Identity theft is difficult to redress because the malicious use of deep fakes may cause psychological or emotional harm beyond any financial consequences. And the threat that voice cloning poses to privacy, cybersecurity and democratic processes means it is not enough to easily forgive those engaged in developing and disseminating the technology simply because this is new ground. The controversy that emerged with the release of Roadrunner is not unlike the one that circulated about the potential for manipulation of the historical record following release of the award-winning movie Forrest Gump in 1994. No, Virginia. Forrest never visited President Kennedy at the White House.
