Generate video with native audio from text or an image. The model is fast and produces good-quality output, and there is also a distilled version for even faster generation.
Some notes after testing it out:
I added the following to the default negative prompt on fal.ai: Camera Zoom. Captions.
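A minimal sketch of appending extra terms to an existing negative prompt. `extend_negative_prompt` is a hypothetical helper, `DEFAULT_NEGATIVE` is a placeholder and not the actual fal.ai default, and the comma-separated format is an assumption.

```python
def extend_negative_prompt(default: str, extras: list[str]) -> str:
    """Append extra terms to a comma-separated negative prompt.

    Terms already present (compared case-insensitively) are not added twice.
    """
    parts = [p.strip() for p in default.split(",") if p.strip()]
    seen = {p.lower() for p in parts}
    for term in extras:
        cleaned = term.strip()
        if cleaned and cleaned.lower() not in seen:
            parts.append(cleaned)
            seen.add(cleaned.lower())
    return ", ".join(parts)


# Placeholder only: substitute the real default negative prompt shown on fal.ai.
DEFAULT_NEGATIVE = "blurry, low quality"

negative_prompt = extend_negative_prompt(DEFAULT_NEGATIVE, ["Camera Zoom", "Captions"])
print(negative_prompt)
```

Keeping the default terms and appending to them, rather than replacing them, preserves whatever the default already suppresses.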
The characters spoke slowly, even when I prompted them to talk fast. They seemed to stretch their speech to fill the length of the video clip. When I gave them more to say, they spoke at a more natural pace.
Prompts of around 80 words seem to work best. With longer prompts, the model did not seem able to follow the instructions.
I was able to get a person to speak 40 words of text. A 16-second clip took 5:30 to generate.
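The figures above imply a speaking rate that can be used to size dialogue for other clip lengths. This is a rough sketch: the rate comes from the single 40-word, 16-second data point in these notes, not from any documented property of the model, and both function names are hypothetical.

```python
def words_per_second(words: int, seconds: float) -> float:
    """Speaking rate implied by a clip with a known word count and duration."""
    return words / seconds


def dialogue_word_budget(clip_seconds: float, rate: float) -> int:
    """Approximate number of spoken words needed to fill a clip at a given rate."""
    return round(clip_seconds * rate)


# 40 spoken words in a 16-second clip, as observed in the notes.
rate = words_per_second(40, 16)
print(f"{rate} words/second, {rate * 60:.0f} words/minute")

# Estimated word count for an 8-second clip at the same pace.
print(dialogue_word_budget(8, rate))
```

Writing roughly this many words of dialogue for a given clip length may help avoid the slow, stretched-out delivery described earlier.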
To improve the audio, I added “Studio audio – high quality.” or “Clear and present vocal recording, no echo.”
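A small sketch of appending one of these audio cues to a prompt while keeping an eye on the roughly 80-word length mentioned earlier. The scene text is an invented example, the helper names are hypothetical, and `TARGET_WORDS` is a rule of thumb from these notes rather than a hard limit of the model.

```python
TARGET_WORDS = 80  # rough sweet spot from the notes, not a hard limit


def with_audio_cue(prompt: str, cue: str) -> str:
    """Append an audio-quality cue to the prompt unless it is already present."""
    prompt = prompt.strip()
    if cue.lower() in prompt.lower():
        return prompt
    return f"{prompt} {cue}"


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


cue = "Clear and present vocal recording, no echo."

# Invented example scene, for illustration only.
scene = (
    "A woman in a red coat stands on a train platform at night and says: "
    '"The last train left an hour ago."'
)

prompt = with_audio_cue(scene, cue)
count = word_count(prompt)
print(f"{count} words")
if count > TARGET_WORDS:
    print("Prompt is longer than the target; consider trimming it.")
```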