LTX-2, a promising open source video model

LTX-2 generates video with native audio from a text or image prompt. It is a quick model with quality outputs, and a distilled version is available for even faster generation.

Some notes after testing it out:

In addition to the default negative prompt on fal.ai, I added: “Camera Zoom. Captions.”
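
For reference, here is roughly how that looks through the fal.ai Python client (`fal-client`). The endpoint id and the argument names are my assumptions, not the documented schema, so check the LTX-2 model page on fal.ai before running it:

```python
import fal_client  # pip install fal-client; requires FAL_KEY in the environment

# My additions on top of the playground's default negative prompt
# (the default text itself is not reproduced here).
negative_prompt = "Camera Zoom. Captions."

result = fal_client.subscribe(
    "fal-ai/ltx-2",  # hypothetical endpoint id -- verify on fal.ai
    arguments={      # argument names are assumed; check the model's schema
        "prompt": "A newscaster at a desk delivers a rapid weather update to camera.",
        "negative_prompt": negative_prompt,
    },
    with_logs=True,
)
print(result)
```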

The characters spoke slowly, even when I prompted them to talk fast. They seem to stretch their speech to fill the length of the clip. When I gave them more to say, they spoke at a more natural pace.

Prompts of around 80 words seem to work best. With longer prompts, the model did not seem able to follow all of the instructions.

I was able to get a person to speak 40 words of text. A 16-second clip generated in about 5 minutes 30 seconds.

To help with audio quality, I added “Studio audio – high quality.” or “Clear and present vocal recording, no echo.” to the prompt.
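
Putting these notes together, here is a rough sketch of how I might assemble a prompt before sending it to the model. The helper, the ~80-word cap, and the example text are my own illustration of the heuristics above, not anything enforced by LTX-2:

```python
def build_prompt(scene: str, dialogue: str, max_words: int = 80) -> str:
    """Combine a scene description, dialogue, and an audio-quality hint.

    Heuristics from testing: keep the whole prompt near 80 words, and give the
    character enough dialogue to fill the clip at a natural speaking pace.
    """
    audio_hint = "Clear and present vocal recording, no echo."
    prompt = f'{scene} The person says: "{dialogue}" {audio_hint}'
    word_count = len(prompt.split())
    if word_count > max_words:
        # Instructions past roughly 80 words tended to get ignored in my tests.
        print(f"warning: prompt is {word_count} words, over the ~{max_words}-word sweet spot")
    return prompt


print(build_prompt(
    "A newscaster at a brightly lit desk, medium shot, speaking quickly to camera.",
    "Good evening. A fast-moving storm arrives tonight with heavy rain, gusty "
    "winds, and possible flooding in low-lying areas by morning. Stay safe out there.",
))
```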