When AI Translates Videos into Other Languages
Article in partnership with Day Translations.
If you ever listen to Hard Fork, The New York Times podcast about technology trends, you may have heard the experiments the hosts ran by digitising and manipulating their voices with AI tools.
One of these tools is Hey Gen, a website that converts short spoken audio and video files from one language to another. While I haven't personally tried Hey Gen or set up a profile, I have seen several TikToks showing how the app works and some examples of it in action.
AI Videos Still Look Unnatural, for Now
Videos generated by Hey Gen and similar AI tools still look quite odd. If you watch a person's face while they are speaking, you tend to notice that the eyes can be quite expressionless and that the mouth, while synchronised with the speech, moves in a rather cartoonish way. In fact, the mouth shapes look closer to stop motion or claymation than to how a person actually looks when they speak. Think of claymation films such as Wallace and Gromit, whose characters' mouths move as if they are chewing bubblegum.
Even though the words match the movements of the speaker's mouth, when you watch a video whose speech has been translated it is hard to fully engage with what the person is saying, because the overall appearance is distracting.
This is likely to change as the technology becomes more sophisticated. In the meantime, these types of automatically generated videos should surely come with a written warning, perhaps a watermark, to indicate they were created with the assistance of AI.
What Are the Limitations of AI-Generated Video Translations?
Apps like Hey Gen have a number of limitations, as discussed in this article by 99marketingtools.com. At present, Hey Gen offers customisable avatars with multiple options for ethnicity, gender and age, drawn from a database of about 130 avatars; however, the range of facial expressions is minimal and you cannot add gestures. The app generates talking videos from written text.
This means the app itself does not translate text; it generates a video of a talking person based on the language of the input, so within the same video you could have the same person delivering a short message in multiple languages. The database currently covers about 40 languages, and if you don't like the existing avatars you can upload your own picture.
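To make that workflow concrete, here is a minimal, hypothetical sketch of how a text-to-video service of this kind might be driven programmatically. The endpoint, field names and avatar identifier below are illustrative assumptions rather than Hey Gen's actual API; the point is simply that the input is written text plus an avatar and a language code, and the output, after a rendering wait, is a short video.

```python
import time
import requests

# Hypothetical text-to-video endpoint and credentials; the URL, field names
# and avatar IDs are illustrative assumptions, not Hey Gen's real API.
API_URL = "https://api.example-video-service.com/v1/videos"
API_KEY = "YOUR_API_KEY"


def generate_talking_video(script_text: str, language: str, avatar_id: str) -> str:
    """Submit written text and poll until the rendered video is ready."""
    headers = {"Authorization": f"Bearer {API_KEY}"}

    # The service turns the text into speech and a lip-synced avatar video;
    # it does not translate, so the script must already be in the target language.
    job = requests.post(
        API_URL,
        json={"text": script_text, "language": language, "avatar_id": avatar_id},
        headers=headers,
        timeout=30,
    ).json()

    # Rendering is not instant (several minutes for a one-minute clip),
    # so poll the job until a download URL appears.
    while True:
        status = requests.get(f"{API_URL}/{job['id']}", headers=headers, timeout=30).json()
        if status["state"] == "done":
            return status["video_url"]
        time.sleep(15)


# The same short message, same avatar, rendered in two languages:
# english_url = generate_talking_video("Welcome to our channel!", "en", "avatar_042")
# spanish_url = generate_talking_video("¡Bienvenidos a nuestro canal!", "es", "avatar_042")
```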
There are also strict limits on video length. The monthly paid packages cap how long a generated video can be, so you need to break the information you want to translate into short chunks.
You also need to factor in rendering times, which can run to around four minutes, while a typical generated video is only about one minute long.
Other AI video apps such as Yepic and Deepbrain have slightly less static avatars, which come with basic hand gestures to highlight key points in their presentations. Deepbrain has a database of over 80 languages and can generate a video from text in four minutes.
Yepic offers more than 60 languages in its AI video creation platform, and it also provides live dubbing of video calls into five other languages.
Some Considerations
These and other available apps produce videos that can be used for different purposes, mainly social media and marketing, but you do wonder how much engagement these audiovisual products can generate, given that they lack the warmth of a real person. At present, these videos seem gimmicky, more of a novelty item than a credible way to communicate corporate messages. Their main selling point is the cost saving compared with hiring professional actors, studios and equipment, not to mention the cost of translating the original text into multiple languages.
As in many other scenarios where AI is deployed for language translation, you need to exercise discretion about the actual information being shared. Confidential and sensitive information, such as medical data, requires strict policies and high levels of quality assurance and precision. That level of data handling is best left to language professionals who hold the relevant qualifications.