Be ready for the next generation of AI

Does anyone else have vertigo? While the AI community had only just gotten a feel for the staggering progress of text-to-image systems, we’re already on to the next step: text-to-video systems.

Indeed, last week Meta unveiled “Make-a-Video”, an AI that generates five-second video clips from simple text instructions.


Trained on open-source datasets, “Make-a-Video” lets you type a prompt such as “A dog wearing a superhero outfit with a red cape flies through the sky,” and it then generates a five-second video clip that, while fairly accurate, has the aesthetics of an old amateur video.

This development constitutes a step forward in generative AI but at the same time raises some complex ethical questions. Creating videos from text instructions is much more complicated and expensive than generating images. The fact that Meta got there so quickly makes the feat all the more impressive. However, as the technology develops, there is concern that it will become a powerful tool for creating and spreading misinformation.

Immediately unveiled, immediately out of date? Only a few days after being presented, Meta’s system already looks basic. Indeed, “Make-a-Video” is just one of many text-to-video AI systems featured in research papers at one of the leading artificial intelligence conferences, the International Conference on Learning Representations.

Another such system, dubbed “Phenaki”, is even more advanced than “Make-a-Video”.

It can generate a video from a still image and a short text, rather than text instructions alone. “Phenaki” can also make clips much longer than Meta’s model: users can create multi-minute clips from several different sets of words that form the storyline of the video. (For example: “A photorealistic teddy bear swims in the ocean in San Francisco. The teddy bear goes underwater. The teddy bear continues to swim underwater with colorful fish. A panda swims under water”.)

Technology like this could revolutionize the film industry and the world of animation. Frankly, it’s amazing how quickly this has happened. “DALL-E” was launched only last year. It’s both hugely exciting and slightly scary to imagine where we’ll be a year from now.

At the conference, researchers working for Google also presented a paper on their new model, “DreamFusion”, which generates 3D models from text instructions. These models can be viewed from any angle, the lighting can be changed, and the model can be placed in any 3D environment.


Don’t expect to be able to experience these models anytime soon. At this time, Meta does not make “Make-a-Video” available to the public. This is a good thing. Meta’s model is trained on the same open-source image dataset that powers Stable Diffusion. The company says it has filters in place to block toxic language and NSFW imagery, but with datasets of several million samples, there is no guarantee that these filters have caught every nuance of humanity’s unpleasant side. Also, to put it mildly, Meta cannot claim an excellent track record when it comes to limiting the damage caused by the systems it builds.

In their paper, the creators of “Phenaki” write that while the quality of the videos their model produces is not yet indistinguishable from that of real videos, such quality “is within the realm of the possible, even today”. Before releasing their model, they say they want to better understand the data, the text prompts, and the filtering of outputs, as well as measure biases, in order to mitigate harm.

Online, it’s going to get harder and harder to tell real from fake. Video AI brings a set of potential dangers beyond those we already feared with audio and images, such as the prospect of supercharged deepfakes. Platforms like TikTok and Instagram already distort our perception of reality through filters. AI-generated video could be a powerful disinformation tool because, according to Penn State University researchers, people are more likely to believe and share fake videos than fake audio or text covering the same content.

To conclude, we are far from having figured out how to deal with toxic elements in language models. We’ve only just begun to examine the harm that text-to-image AI systems could cause. As for video? Good luck.

Article by Melissa Heikkilä, translated from English by Kozi Pastakia.
