For the last couple of years, I have periodically heard the term “deepfake videos,” but prior to completing the research for this post, I didn’t know much about them. In fact, my knowledge of deepfake videos was limited to a few key claims that I’d heard repeated on the news and the internet: the average person can’t tell the difference between a real video and a deepfake video, anyone with a computer can make one, they will soon be everywhere, and they will definitely destabilize democracy.
Some scary stuff – but what are deepfake videos and how are they made? And, more importantly, how concerned should we actually be? I decided to find out.
What is a Deepfake?
The word “deepfake” is a portmanteau of “deep learning” and “fake.” Deepfake videos use deep learning algorithms (AI) to manipulate or generate visual and audio content with a high potential to deceive.
An encoder reduces an image of a face to a compact representation in “latent space,” and a decoder then rebuilds the face onto another image. The end result is that a person in an existing image or video is replaced with someone else’s likeness.
Altered videos are not a new invention; expert artists and technicians in the film industry had been creating convincing “fakes” for years prior to the development of deepfakes. However, deepfake videos have now brought this technology to the masses.
How it Works
There are two main software programs that most people use to make deepfake videos: DeepFaceLab and FaceSwap. Both of these programs use an autoencoder to compress a face into a very compact representation, and then decode it back to the original.
To create a deepfake, you need to feed the program two main pieces of data:
- Source: Video that includes the face that you want to swap in
- Destination: Video that includes the face that you want to swap onto
To create a believable deepfake video, the program needs a lot of high-quality video of both the source and the destination to feed the autoencoder.
Once you have your source and destination videos, you can then proceed through the following steps:
The first step is extraction. The program runs through both the source and destination videos and extracts frame-by-frame screenshots of all of the faces. The faces must then be cleaned up manually: someone has to scroll through the thousands of extracted face images and remove any that are not relevant, such as images of other people or objects. This is a tedious, hours-long process that must be completed for every deepfake video created.
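To get a feel for why extraction produces “thousands” of images, here is a rough back-of-the-envelope sketch. The clip lengths and the 30 frames-per-second rate are assumptions picked purely for illustration, and `estimate_frame_count` is a made-up helper, not part of DeepFaceLab or FaceSwap.

```python
def estimate_frame_count(duration_seconds: float, frames_per_second: float) -> int:
    """Return the approximate number of frames in a video clip."""
    return round(duration_seconds * frames_per_second)


# Hypothetical clips: a 5-minute source and a 2-minute destination, both at 30 fps.
source_frames = estimate_frame_count(duration_seconds=5 * 60, frames_per_second=30)
destination_frames = estimate_frame_count(duration_seconds=2 * 60, frames_per_second=30)

print(source_frames)                       # 9000
print(source_frames + destination_frames)  # 12600
```

Even with short clips, that is well over ten thousand frames that may each contain a face to review by hand.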
Encoding and Decoding
Next comes encoding and decoding. The final selection of images is run through the autoencoder over and over in order to “train” it. Encoding extracts data from the image and compresses it into a very compact representation in what is called “latent space.”
The encoding focuses on the parts of the face that change as the person changes their facial expressions. For example, it doesn’t focus on features like eye color, which remain relatively constant. Instead, it focuses on things like eyebrow location, which changes drastically as you speak.
The compact form of the image is then “decoded” back into the face, and checked against the original. This process is repeated numerous times until the autoencoder is trained on that face. This process of learning through repetition represents the artificial intelligence (AI) part of the deepfake.
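The encode, decode, and compare loop described above can be sketched with a toy example. This is only a minimal sketch and not how DeepFaceLab or FaceSwap are implemented: it assumes each “face” is just four numbers, it uses a plain linear encoder and decoder rather than a deep network, and every name in it (`train_epoch`, `reconstruction_error`, `LATENT_SIZE`, and so on) is made up for illustration.

```python
import random

FACE_SIZE = 4    # numbers per toy "face" (a real image has many thousands of pixels)
LATENT_SIZE = 2  # size of the compact latent representation


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random starting weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def reconstruction_error(encoder, decoder, faces):
    """Average squared difference between each face and its rebuilt version."""
    total = 0.0
    for face in faces:
        rebuilt = matvec(decoder, matvec(encoder, face))
        total += sum((r - f) ** 2 for r, f in zip(rebuilt, face))
    return total / len(faces)


def train_epoch(encoder, decoder, faces, learning_rate=0.05):
    """One pass over the faces: encode, decode, compare, nudge the weights."""
    for face in faces:
        latent = matvec(encoder, face)                  # encode
        rebuilt = matvec(decoder, latent)               # decode
        error = [r - f for r, f in zip(rebuilt, face)]  # check against the original
        # How much each latent value contributed to the error.
        latent_grad = [
            sum(error[i] * decoder[i][j] for i in range(FACE_SIZE))
            for j in range(LATENT_SIZE)
        ]
        for i in range(FACE_SIZE):
            for j in range(LATENT_SIZE):
                decoder[i][j] -= learning_rate * error[i] * latent[j]
        for j in range(LATENT_SIZE):
            for k in range(FACE_SIZE):
                encoder[j][k] -= learning_rate * latent_grad[j] * face[k]


rng = random.Random(0)
# Toy "faces" that really only vary in two ways, so two latent numbers suffice.
faces = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.5, 0.0],
]
encoder = random_matrix(LATENT_SIZE, FACE_SIZE, rng)
decoder = random_matrix(FACE_SIZE, LATENT_SIZE, rng)

initial_error = reconstruction_error(encoder, decoder, faces)
for _ in range(500):  # "repeated numerous times"
    train_epoch(encoder, decoder, faces)
final_error = reconstruction_error(encoder, decoder, faces)

print(f"error before training: {initial_error:.4f}")
print(f"error after training:  {final_error:.4f}")
```

The point of the sketch is the shape of the loop: the error after training should be lower than the error before, because each repetition nudges the encoder and decoder toward rebuilding the original more closely.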
How It Becomes a Deepfake
The deepfake program runs the encode/decode operations on the source and destination at the same time. During training, the same encoder is used for both the source and destination, while each video gets its own decoder.
Once the source face is encoded, it can then be decoded by its own decoder OR by the destination’s decoder. Because the encoder uses the same data points for the faces in the source and destination, either decoder can “read” the face data from either video.
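The shared-encoder, two-decoder arrangement can be sketched in the same toy style. Again, this is an illustration under heavy assumptions rather than the real programs’ code: faces are four made-up numbers, the encoder and decoders are simple linear maps, and all names (`shared_encoder`, `train_on_face`, `error_for`) are hypothetical. With data this simple the swapped output is not a meaningful face; the sketch only shows how the pieces are wired together.

```python
import random

FACE_SIZE = 4
LATENT_SIZE = 2


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def train_on_face(encoder, decoder, face, learning_rate=0.05):
    """Update the shared encoder and ONE decoder so this face is rebuilt better."""
    latent = matvec(encoder, face)
    rebuilt = matvec(decoder, latent)
    error = [r - f for r, f in zip(rebuilt, face)]
    latent_grad = [
        sum(error[i] * decoder[i][j] for i in range(FACE_SIZE))
        for j in range(LATENT_SIZE)
    ]
    for i in range(FACE_SIZE):
        for j in range(LATENT_SIZE):
            decoder[i][j] -= learning_rate * error[i] * latent[j]
    for j in range(LATENT_SIZE):
        for k in range(FACE_SIZE):
            encoder[j][k] -= learning_rate * latent_grad[j] * face[k]


def error_for(encoder, decoder, faces):
    """Average squared reconstruction error over a set of faces."""
    total = 0.0
    for face in faces:
        rebuilt = matvec(decoder, matvec(encoder, face))
        total += sum((r - f) ** 2 for r, f in zip(rebuilt, face))
    return total / len(faces)


rng = random.Random(0)
source_faces = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
destination_faces = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5], [1.0, 1.0, 0.5, 0.5]]

shared_encoder = random_matrix(LATENT_SIZE, FACE_SIZE, rng)      # one encoder for both
source_decoder = random_matrix(FACE_SIZE, LATENT_SIZE, rng)      # decoder for the source
destination_decoder = random_matrix(FACE_SIZE, LATENT_SIZE, rng) # decoder for the destination

error_before = (error_for(shared_encoder, source_decoder, source_faces)
                + error_for(shared_encoder, destination_decoder, destination_faces))

for _ in range(500):
    for face in source_faces:
        train_on_face(shared_encoder, source_decoder, face)
    for face in destination_faces:
        train_on_face(shared_encoder, destination_decoder, face)

error_after = (error_for(shared_encoder, source_decoder, source_faces)
               + error_for(shared_encoder, destination_decoder, destination_faces))

# The swap: encode a source face, then decode it with the OTHER video's decoder.
latent = matvec(shared_encoder, source_faces[0])
swapped = matvec(destination_decoder, latent)

print(f"combined error before: {error_before:.4f}, after: {error_after:.4f}")
print("swapped face values:", [round(value, 2) for value in swapped])
```

Because both decoders learn to read the same latent representation, nothing stops you from handing one video’s encoded face to the other video’s decoder, which is the whole trick.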
Who Can Make a Deepfake?
Technically, anyone can make a deepfake. However, to make a successful deepfake, you have to have the right tools, including:
- Access to a lot of high-quality video for your source and destination videos (one reason you see deepfakes done of celebrities is that there is already a lot of readily available high-quality video of them)
- A computer with a powerful graphics processing unit (GPU). The more GPU power you have, the faster the encoding/decoding training will go.
- Ample time. Depending on the processing speed, the encoding/decoding training can take days or even weeks to complete.
How Convincing Are Deepfakes?
The success of a deepfake video is dependent on:
- The quality of the source and destination videos.
- The amount of encoding/decoding processing time the video is given.
The most convincing deepfake videos still seem to be done by professional videographers and graphics professionals. For the most part, fully convincing deepfakes still seem to be out of reach for the average user.
For example, this is a screenshot of a deepfake video that was created by a journalist who was creating a deepfake video for the first time:
This video took over a week to create. As you can see, though he was able to create a deepfake, it doesn’t look entirely convincing.
Believable deepfake videos do exist. For example, this deepfake video of former President Obama is incredibly realistic:
This deepfake video was created by Jordan Peele with the help of graphics professionals. Thanks to the video quality and time they had, they were able to make a successful deepfake video.
What Are Deepfakes Used For?
Since their creation in 2017, deepfake videos have been used for a few key things:
- Pornography: A 2019 report estimated that 96% of all deepfakes online were pornographic in nature.
- Other Entertainment, Satire, and Art: Though not as common, deepfakes have been used across a number of artistic mediums.
- Fraud: Several cases of online fraud have been discovered that appear to have used deepfake technology. There’s also been at least one case of deepfake audio fraud, where a CEO’s voice was spoofed to convince an employee to transfer money to an online account.
Surprisingly, though there has been lots of media attention about the potential for deepfake videos to upend democracy by creating fake incriminating videos of public figures, no deepfake videos have come anywhere near accomplishing these ends. The closest cases we have seen have been cases in which politicians claim an incriminating video was a deepfake when, in fact, it was authentic.
Old-fashioned video manipulations, also known as “shallowfakes,” are actually much more prevalent in this arena. Shallowfakes are easier and quicker to produce than deepfake videos, and tend to be more realistic. For example, in a recent video of Nancy Pelosi, the video stream was slowed down to make her appear drunk. This simple video manipulation took a fraction of the time that a deepfake video would take and was just as effective.
Deepfake audio uses a cloned voice to produce synthetic speech. It is created by processing many audio samples and breaking the speech down into its component parts. Deepfake audio needs considerably less source material and processing time to create a voice match, making it easier and less time-intensive to produce than deepfake video.
Creating a deepfake voice with a voice-cloning program requires about 20 minutes of voice recording and 45 minutes of processing. When I created my own deepfake audio using Resemble.ai, I was asked to read and record about 50 sentences. Once the files were processed, the deepfake voice was able to say anything I typed.
While mine didn’t necessarily sound exactly like me (or even like a convincing human voice), it did sound far less robotic than a typical automated voice. Like deepfake video, the more recording and processing time the program is given, the closer the deepfake audio will sound to the source.
Unlike its often-nefarious counterpart, deepfake audio has a lot of very legitimate uses, including:
- Audio File Editing: Overdub by Descript allows you to fix audio files by typing instead of re-recording. This feature is used most commonly to edit podcasts.
- Interactive Game Speech: Deepfake audio is beginning to replace voice actors for speech in interactive games, and to add speech to parts of games that were formerly text-only.
- Text to Speech (TTS) software: TTS is used to create voices for people with vocal impairments. Deepfake audio is allowing TTS to replace robotic voices with realistic human voices.
What Are We Doing About Deepfakes?
In 2020, Facebook and TikTok banned deepfakes and will now take them down if they are detected. Twitter still allows deepfake videos to stay on its site, but it applies a “manipulated media” label to any deepfake videos identified.
The Defense Advanced Research Projects Agency (DARPA) and other government agencies are working on ways to detect deepfakes using AI. They are training their own neural nets in a similar way by giving them a dataset of real videos and a dataset of deepfake videos and training them to spot the deepfakes.
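The idea of training a detector on one set of real videos and one set of deepfakes can be sketched as a tiny classifier. To be clear about what is assumed here: the source describes neural nets trained on actual video, while this sketch swaps in a single-neuron logistic regression trained on two invented “artifact score” features with randomly generated values. The features, the numbers, and the function names are all hypothetical and do not describe DARPA’s or anyone else’s actual detector.

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def make_examples(rng, count):
    """Return (features, label) pairs; label 1 means deepfake, 0 means real.

    The two features are invented scores (say, blending artifacts and
    blink irregularity) that we ASSUME run higher for deepfakes.
    """
    examples = []
    for _ in range(count):
        examples.append(([rng.gauss(0.2, 0.1), rng.gauss(0.3, 0.1)], 0))  # real
        examples.append(([rng.gauss(0.8, 0.1), rng.gauss(0.7, 0.1)], 1))  # deepfake
    return examples


def predict(weights, bias, features):
    """Estimated probability that a video is a deepfake."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(examples, epochs=50, learning_rate=0.5):
    """Fit the weights by repeatedly correcting the detector's mistakes."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            gap = predict(weights, bias, features) - label
            weights = [w - learning_rate * gap * x for w, x in zip(weights, features)]
            bias -= learning_rate * gap
    return weights, bias


def accuracy(weights, bias, examples):
    """Fraction of examples the detector labels correctly."""
    correct = sum(
        1 for features, label in examples
        if (predict(weights, bias, features) >= 0.5) == (label == 1)
    )
    return correct / len(examples)


rng = random.Random(42)
training_set = make_examples(rng, 100)  # labeled real and deepfake examples
held_out_set = make_examples(rng, 50)   # examples the detector has never seen

weights, bias = train(training_set)
print(f"held-out accuracy: {accuracy(weights, bias, held_out_set):.2f}")
```

Real detection is far harder than this, since the telltale features are not handed to the detector and deepfake makers actively work to remove them.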
Deepfake videos and deepfake identification technology are symbiotically evolving. As the deepfake videos improve, the detection technology also improves, which in turn pushes the deepfake videos to keep getting better.
How Concerned Should I Be?
After completing my research on deepfake videos, I concluded that I’m not too concerned about them – at least not right now. At this point, the average person cannot make a realistic deepfake video. However, it is easy to see where the technology is going and that it will improve in the future.
Deepfakes were predicted to run rampant in the 2020 election; they never materialized, but that doesn’t mean they can’t appear in future elections. As deepfake technology improves and becomes less resource intensive, researchers expect that deepfake videos will continue to become more prevalent. It will therefore be essential for the deepfake identification technology to keep up.
In my opinion, the biggest threat that deepfake videos pose is that they add yet another layer of distrust to legitimate video and news. Simply knowing that deepfakes exist is somewhat destabilizing, and this is evident in the politicians already claiming that authentic videos are deepfakes created to discredit them.
Luckily, you can determine if videos are deepfakes the same way you check anything you read, hear, or see on the internet:
- Confirm that it is from a verifiable, trusted source
- Check for corroborating witnesses or other corroborating reports
- Critically think about whether it seems reasonable
- Fact check
As long as we continue to stay vigilant, as we do with other fake news sources and frauds, the threat of deepfake videos will remain small. And, as a result, our democracy seems to be safely stable for another day.