Wan 2.2 LoRA Workflow

This walkthrough generates images with a LoRA, designs a custom voice, and combines the two into a talking avatar video, all using AI models available on fal.ai.

Using a Wan 2.2 LoRA

InstaGirlv2 went viral on X.com with its realistic images.

The official LoRA is available on Civitai, but I found it easier to use from Hugging Face:

https://huggingface.co/gokaygokay/InstaGirlv2/tree/main

Click a safetensors file, then “Copy download link”.

There is a hinoise file and a lownoise file, and we will use both.

On fal we use the Text to Image LoRA endpoint:
https://fal.ai/models/fal-ai/wan/v2.2-a14b/text-to-image/lora

For each file, click “Add Item” to add a LoRA, paste the download link as the Path (URL), and set the Transformer to high or low to match the file.

Set the low-noise LoRA to 0.3 and the high-noise LoRA to 0.9.
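The same settings can also be sent from code instead of the playground. The following is a minimal sketch, not the official schema: the endpoint id comes from the model URL above, but the field names `loras`, `path`, `scale` and `transformer` are my assumptions based on the form labels, and the two file URLs are placeholders you pass in yourself. The call uses the `fal-client` package and expects a `FAL_KEY` environment variable.

```python
from typing import Any

# Endpoint id taken from the model URL above.
ENDPOINT = "fal-ai/wan/v2.2-a14b/text-to-image/lora"


def build_lora_arguments(
    prompt: str, high_noise_url: str, low_noise_url: str
) -> dict[str, Any]:
    """Build a request body with one LoRA entry per safetensors file."""
    return {
        "prompt": prompt,
        # Assumed field names: "path", "scale", "transformer".
        # Check them against the endpoint's API tab before relying on them.
        "loras": [
            {"path": high_noise_url, "scale": 0.9, "transformer": "high"},
            {"path": low_noise_url, "scale": 0.3, "transformer": "low"},
        ],
    }


def generate_image(arguments: dict[str, Any]) -> Any:
    """Send the request. Requires `pip install fal-client` and FAL_KEY."""
    import fal_client  # imported lazily so the builder works without it

    return fal_client.subscribe(ENDPOINT, arguments=arguments)
```

Keeping the payload builder separate from the network call makes it easy to print and inspect the arguments before spending credits.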

This is the prompt and settings:

https://fal.ai/models/fal-ai/wan/v2.2-a14b/text-to-image/lora/playground?share=a7b32547-284c-4ada-90a0-fa759082bae9

In the prompt, be sure to use the trigger word, in this case “Instagirl”.
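If you build prompts in code, it is easy to forget the trigger word. Here is a small hypothetical helper that guards against that. Prepending the word with a comma is my own choice for illustration, not something the LoRA requires.

```python
def with_trigger_word(prompt: str, trigger: str) -> str:
    """Return the prompt with the trigger word present.

    The check is case-insensitive. If the trigger is missing,
    it is prepended, followed by a comma.
    """
    if trigger.lower() in prompt.lower():
        return prompt
    return f"{trigger}, {prompt}"
```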

More info about the Instagirl LoRA: https://instagirl.org/ (unofficial site)

Civitai: https://civitai.com/models/1822984/instagirl-wan-22

I went on to create a voice and video.

MiniMax Voice Design

With MiniMax I used the Voice Design feature https://fal.ai/models/fal-ai/minimax/voice-design

I gave a chatbot the generated image and asked for a voice description. This is the prompt I sent along with the image:

I am using this voice design feature on fal. “Design a personalized voice from a text description, and generate speech from text prompts using the MiniMax model, which leverages advanced AI techniques to create high-quality text-to-speech.” Give me a prompt (“Voice description prompt for generating a personalized voice”) to match this young lady's voice.

This gave me a description which I used to create the custom voice.

https://fal.ai/models/fal-ai/minimax/voice-design

I kept iterating to refine the voice to match the image.
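Iterating on voice descriptions is quicker from a script. This is a rough sketch under a few assumptions: the endpoint id is taken from the model URL above, while the input field names `prompt` and `preview_text` are my guess at the schema. The `custom_voice_id` key matches what the playground returned for me. It uses the `fal-client` package with `FAL_KEY` set.

```python
from typing import Any

# Endpoint id taken from the model URL above.
VOICE_DESIGN_ENDPOINT = "fal-ai/minimax/voice-design"


def build_voice_design_arguments(description: str, preview_text: str) -> dict[str, str]:
    """Pair a voice description with the line the preview should speak."""
    # Assumed field names: "prompt" and "preview_text".
    return {"prompt": description, "preview_text": preview_text}


def pick_voice_id(result: dict[str, Any]) -> str:
    """Pull the reusable voice id out of a voice-design result."""
    return result["custom_voice_id"]


def design_voice(arguments: dict[str, str]) -> Any:
    """Send the request. Requires `pip install fal-client` and FAL_KEY."""
    import fal_client  # imported lazily so the helpers work without it

    return fal_client.subscribe(VOICE_DESIGN_ENDPOINT, arguments=arguments)
```

Calling `design_voice` once per candidate description and listening to each preview mirrors the trial-and-error loop I went through by hand.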

First try:

A warm and friendly female voice in her mid-20s. The tone is relaxed and effortlessly smooth, with a hint of a gentle, breathy quality that makes it feel approachable and sincere. She speaks with a natural, conversational pace, like she’s sharing a story with a close friend on a sunny afternoon. There’s a subtle, underlying confidence in her voice, yet it remains soft and inviting. The pitch is pleasant and melodious, with a slight touch of vocal fry that adds a modern, authentic feel.

custom_voice_id: ttv-voice-2025081522532825-s21Mx1f5
example: https://v3.fal.media/files/panda/IZRhVaJuaIUaWddUcTWJ3_preview.mp3

Second try:

A warm and friendly female voice in her mid-20s. The recording should sound natural and unpolished, as if captured on a smartphone in a quiet room rather than a professional studio. There should be a hint of natural room ambience, avoiding a perfectly silent, soundproof quality. Her delivery is relaxed and conversational, with occasional slight pauses and a pace that feels spontaneous. The tone is light and approachable, with a gentle, breathy quality and a touch of modern vocal fry, steering clear of any deep, resonant, or overly-enunciated ‘announcer’ sound.

custom_voice_id: ttv-voice-2025081523052525-aF35MkGg
example: https://v3.fal.media/files/panda/Gg-_sEjz4sbstBryov4dC_preview.mp3

Third try:

Generate a female voice for an influencer in her early 20s, capturing a fun, candid moment for a social media story. The voice must have a higher, brighter vocal pitch that sounds light, energetic, and completely avoids any deep or low tones. Her delivery is upbeat and bubbly, with a fast, natural pace as if she’s excited about what she’s sharing. The recording quality should sound like it was captured on a smartphone, with outdoor ambience.

custom_voice_id: ttv-voice-2025081523102625-JRHbRujZ
example: https://v3.fal.media/files/koala/rh2dFIhWws4r774KCTml-_preview.mp3

I ended up using this third iteration.

Now we can generate the audio: “My only job today was to relax. And I am, like, so stressed about it.”

ByteDance Omnihuman

With the image and the audio file in hand, I combined them using the Omnihuman model. The way the subject moves still gives the final video away as AI-generated, but it was an excellent result, and for now it is the best quality I have found for combining an image with audio. Things change quickly, though.
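For completeness, here is how the combining step might look from code. Treat this as a sketch only: the endpoint id `fal-ai/bytedance/omnihuman` and the field names `image_url` and `audio_url` are assumptions on my part, so check the model page before using them. Both inputs need to be publicly reachable URLs, which is why the builder rejects anything else.

```python
from typing import Any

# Assumed endpoint id; confirm it on the Omnihuman model page.
OMNIHUMAN_ENDPOINT = "fal-ai/bytedance/omnihuman"


def build_avatar_arguments(image_url: str, audio_url: str) -> dict[str, str]:
    """Pair the generated image with the generated audio clip."""
    for name, url in (("image_url", image_url), ("audio_url", audio_url)):
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"{name} must be an http(s) URL, got {url!r}")
    # Assumed field names: "image_url" and "audio_url".
    return {"image_url": image_url, "audio_url": audio_url}


def generate_avatar_video(arguments: dict[str, str]) -> Any:
    """Send the request. Requires `pip install fal-client` and FAL_KEY."""
    import fal_client  # imported lazily so the builder works without it

    return fal_client.subscribe(OMNIHUMAN_ENDPOINT, arguments=arguments)
```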

The final result can be found here: