Editing is an art. Bringing a story to life calls for tons of patience and hard work. But what if we told you video editing could be as simple as text editing? Stanford University computer scientist Maneesh Agrawala and his team have developed video editing software that works this way.
Agrawala is the Forest Baskett Professor of Computer Science and Director of the Brown Institute for Media Innovation at Stanford University. Previously, he served as a professor of electrical engineering and computer science at the University of California, Berkeley, for over a decade. He specialises in computer graphics, human-computer interaction and visualisation.
In an exclusive interaction with Analytics India Magazine, Agrawala explained the tech behind video editing tools, applications and challenges.
AIM: Tell us about the technology behind your video editing software. How is it different from existing tools in the market?
Maneesh Agrawala: Today, most tools force you as an editor to work at the frame level, so you have to find the exact frame where you want to cut the video. It isn’t easy to think in terms of the frame at which things are happening and still produce the desired results.
Our software converts the video file into a high-quality transcript and aligns each word with the point in the video where it is spoken. Once the transcript is aligned, the user can click on a word to jump directly to that part of the video. Reading is much faster than scrubbing through the entire video, so the transcript can be used for navigation.
We have taken this further: the user can cut, copy and paste text in the transcript to apply the same edit to the underlying video. You can edit your video much like you would edit a text document.
Initially, we only allowed cut, copy and paste options. Recently, we have introduced a new feature, where the user can type new words into the transcript, and the person in the video will say those words.
Coming to how the model works, we write the code ourselves. The data comes in as a video (in any format). Once it is fed into the system, we take the audio portion and run speech-to-text to get the transcript.
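The idea of editing video through its transcript can be illustrated with a minimal sketch. This is not the team's implementation; it only assumes that each transcript word carries start and end timestamps, and the names `Word`, `seek_time` and `segments_to_keep` are hypothetical. Deleting words from the text yields the time ranges of video that should remain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcript word with its (assumed) aligned time range in seconds."""
    text: str
    start: float
    end: float


def seek_time(words, index):
    """Clicking a word jumps playback to the moment that word starts."""
    return words[index].start


def segments_to_keep(words, deleted_indices):
    """Return merged (start, end) ranges covering the words that were not deleted."""
    deleted = set(deleted_indices)
    segments = []
    prev_kept = None
    for i, word in enumerate(words):
        if i in deleted:
            continue
        if segments and prev_kept == i - 1:
            # The previous word was also kept, so extend the current segment.
            segments[-1] = (segments[-1][0], word.end)
        else:
            # A deletion came before this word, so start a new segment.
            segments.append((word.start, word.end))
        prev_kept = i
    return segments


transcript = [
    Word("video", 0.0, 0.4),
    Word("editing", 0.4, 0.9),
    Word("is", 0.9, 1.0),
    Word("really", 1.0, 1.4),
    Word("simple", 1.4, 2.0),
]

# Deleting the word "really" in the text removes 1.0s to 1.4s from the video.
cut_list = segments_to_keep(transcript, deleted_indices=[3])
```

A video tool could then concatenate the clips in `cut_list`; how the real software renders the result is not described in the interview.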
We have built a pipeline for taking the input text and have developed a model of the human face of the person saying those words. The neural model reconstructs the face and the lip motions that correspond to the new sounds and texts.
In terms of the SOTA algorithms, we have used convolutional neural networks throughout, integrated with face-tracking tools, which also come with an underlying 3D head model.
AIM: What are the use cases for your video editing software?
Maneesh Agrawala: We are starting to explore healthcare, where we want to give people their voice back after they have had throat surgery or a laryngectomy. In most cases, people can no longer speak with their natural voice after the surgery. They have to use an electrolarynx, a device that is held up against the throat, and its vibration generates a robotic-sounding voice.
Here, we plan to record a patient’s voice before the surgery and then use that pre-surgery recording to convert their electrolarynx voice back into their pre-surgery voice.
Besides this, we are interested in making tools that allow people to express themselves and create stories. Storytelling is a significant component of human culture, and the kind of tools that we talked about earlier support the creation of these kinds of video stories.
AIM: Today, deep fakes are becoming next to impossible to detect. How can we put an end to this menace?
Maneesh Agrawala: From a user standpoint, when you are using tools to manipulate videos, two things need to be considered: the audience should be made aware that the video has been manipulated, and the consent of the actor in the manipulated video is a must.
The tools to detect deep fakes are not perfect. But we have algorithms and techniques to find those imperfections.
For example, we have developed a tool that considers lip movement around certain phonemes. We focus on the visemes associated with words containing the sounds M (mama), B (baba) or P (papa), where the mouth must close completely to pronounce the phoneme. The motions that force the mouth to close are often not well reproduced by some tools. Such checks can work to some extent, but they are not foolproof.
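A toy version of this kind of check might look like the sketch below. It is not the tool described in the interview; it assumes that a phoneme alignment and a per-frame lip-gap measurement (0 meaning fully closed) are already available from some speech aligner and face tracker, and the function name, input formats and threshold are all hypothetical.

```python
# Phonemes whose visemes require the lips to close completely.
CLOSED_MOUTH_PHONEMES = {"M", "B", "P"}


def suspicious_phonemes(phonemes, lip_gap, fps, closed_threshold=0.05):
    """Flag M/B/P phonemes during which the mouth never closes.

    phonemes: list of (label, start_seconds, end_seconds) tuples (assumed input).
    lip_gap:  per-frame normalised lip opening, 0.0 = closed (assumed input).
    fps:      frame rate used to map times to frame indices.
    """
    flagged = []
    for label, start, end in phonemes:
        if label not in CLOSED_MOUTH_PHONEMES:
            continue
        first = int(start * fps)
        last = max(first + 1, int(end * fps) + 1)
        window = lip_gap[first:last]
        # If the smallest lip gap in the window stays above the threshold,
        # the mouth never closed while the phoneme was spoken.
        if window and min(window) > closed_threshold:
            flagged.append((label, start, end))
    return flagged


# Ten frames at 10 fps: the mouth closes only around frame 2.
gaps = [0.3, 0.3, 0.02, 0.3, 0.3, 0.25, 0.3, 0.3, 0.3, 0.3]
aligned = [("M", 0.2, 0.3), ("AA", 0.3, 0.5), ("P", 0.5, 0.6)]
flags = suspicious_phonemes(aligned, gaps, fps=10)
```

Here the "M" passes because the lips close during it, while the "P" is flagged because the mouth stays open. A flag is only a hint that a clip may have been manipulated, not proof.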
In the long run, I do not believe that detection tools will work in any reasonable way.
In the future, the tools for creating these deep fakes are just going to get better and better. They will get so good that it will be impossible to detect whether this was an original or a fake.
Unfortunately, we do not have any technology that can determine if someone is lying. That said, we need to think about ways to mitigate the problem. We also need to think about other societal biases to prevent misuse and misinformation.
AIM: Currently, there is limited open-source code and few libraries for video editing tools. How do you deal with this challenge in terms of collaboration within the ecosystem?
Maneesh Agrawala: Ideally, we look at research papers, check whether they are sound and assess the results they can produce.
I think some people do release their code today. But we have chosen not to release ours because of the potential harm. However, if researchers ask us, we are open to sharing the tools under certain limitations and understandings, as we want to be careful about the implications.