It's usually "generate a few, one of them is not terrible, none are exactly what I wanted," then you modify the prompt, wait an hour or so...
The workflow reminds me of programming 30 years ago - you did something, waited for the compile, saw if it worked, tried something else...
All you've got are a few crude tools and a bit of grit and patience.
On the i2v tools I've found that if I modify the input to make the contrast sharper, the shapes more discrete, and the objects easier to segment, I get better results. I wonder if there are hacks like that here.
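For illustration, here's a minimal sketch of that kind of input massaging using Pillow. The enhancement factors are guesses you'd tune by eye, and `some_i2v_pipeline` is a hypothetical stand-in for whatever tool the frame gets fed into:

```python
# Sketch: boost contrast and edge sharpness before handing a frame to an
# i2v model. Factor values are guesses to tune by eye, not documented settings.
from PIL import Image, ImageEnhance, ImageFilter

def preprocess_for_i2v(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")
    img = ImageEnhance.Contrast(img).enhance(1.4)   # punch up contrast
    img = ImageEnhance.Color(img).enhance(1.2)      # stronger colors -> more separable shapes
    # Unsharp mask for crisper edges, so objects are easier to segment.
    return img.filter(ImageFilter.UnsharpMask(radius=2, percent=120))

# frame = preprocess_for_i2v("input.png")
# clip = some_i2v_pipeline(frame, prompt="...")  # hypothetical model call
```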
- Tennis clip => the ball sound is noticeably out of sync with the hit (a rough way to measure that lag is sketched after this list)
- Dark, moody beach video with no one on screen => very upbeat audio, lots of laughter, as if it were summer on a busy beach
- Music inpainting completely switches the style of the audio (e.g. on the siren)
- "Electronic music with some buildup": the generation just turns the volume up?
I guess we still have some road to cover, but it feels like early image generation with its mangled hands and off visual features. At least the generations are not nonsensical at all.
The tennis video, as others commented, is good, but there is a noticeable delay between the action and the sound. And the "loving couple holding AI hands and then dancing", well, the input is already cringe enough.
For all these diffusion models, it looks like we are 90% there; now we just need the final 90%.