Kandinsky 4.0: A family of diffusion models for Video generation
In this repository, we provide a family of diffusion models to generate a video from a textual prompt or an image (coming soon), a distilled model for faster generation, and a video-to-audio generation model.
Project Updates
- 🔥 2024/12/13: We have open-sourced Kandinsky 4.0 T2V Flash, a distilled version of the Kandinsky 4.0 T2V text-to-video generation model.
- 🔥 2024/12/13: We have open-sourced Kandinsky 4.0 V2A, a video-to-audio generation model.
Table of contents
- Kandinsky 4.0 T2V: a text-to-video model (coming soon)
- Kandinsky 4.0 T2V Flash: a distilled version of Kandinsky 4.0 T2V at 480p
- Kandinsky 4.0 I2V: an image-to-video model (coming soon)
- Kandinsky 4.0 V2A: a video-to-audio model
Kandinsky 4.0 T2V
Coming Soon 🤗
Examples:
Kandinsky 4.0 T2V Flash
Kandinsky 4.0 is a text-to-video generation model that leverages latent diffusion to produce videos at both 480p and HD resolutions. We also introduce Kandinsky 4.0 T2V Flash, a distilled version of the model capable of generating a 12-second 480p video in just 11 seconds on a single NVIDIA H100 GPU. The pipeline integrates a 3D causal CogVideoX VAE, the T5-V1.1-XXL text embedder, and our custom-trained MMDiT-like transformer. Kandinsky 4.0 T2V Flash was trained with Latent Adversarial Diffusion Distillation (LADD), an approach originally proposed by Stability AI for distilling image generation models.
The following scheme describes the overall generation pipeline:
Inference
Please refer to the examples.ipynb notebook for usage details.
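A minimal single-GPU sketch of how the pipeline is typically driven is shown below. The get_T2V_pipeline helper, its device_map argument, and the call signature are assumptions modeled on the notebook, so treat examples.ipynb as the authoritative reference.

```python
import torch

# Assumed helper; the exact import path and signature may differ —
# see examples.ipynb for the canonical setup.
from kandinsky import get_T2V_pipeline

# Place every sub-module on a single GPU; the DiT, VAE, and T5 text
# embedder can also be spread across devices if memory is tight.
device_map = {
    "dit": torch.device("cuda:0"),
    "vae": torch.device("cuda:0"),
    "text_embedder": torch.device("cuda:0"),
}

pipe = get_T2V_pipeline(device_map)

# Generate a 12-second 480p clip from a text prompt.
pipe(
    seed=42,
    time_length=12,  # seconds of video
    width=672,
    height=384,
    text="A red hot air balloon drifting over snowy mountains at sunrise",
    save_path="./output.mp4",
)
```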
Distributed Inference
For faster inference, we also provide the capability to run inference in a distributed fashion:
```bash
NUMBER_OF_NODES=1
NUMBER_OF_DEVICES_PER_NODE=8
python -m torch.distributed.launch --nnodes $NUMBER_OF_NODES --nproc-per-node $NUMBER_OF_DEVICES_PER_NODE run_inference_distil.py
```
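Note that torch.distributed.launch is deprecated in recent PyTorch releases; the equivalent invocation is `torchrun --nnodes $NUMBER_OF_NODES --nproc-per-node $NUMBER_OF_DEVICES_PER_NODE run_inference_distil.py`.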
Kandinsky 4.0 I2V
Coming Soon 🤗
Examples:
Examples T2I + I2V:
Kandinsky 4.0 V2A
The video-to-audio pipeline consists of a visual encoder, a text encoder, a UNet diffusion model that generates a spectrogram, and the Griffin-Lim algorithm that converts the spectrogram into audio. The visual and text encoders share the same multimodal visual-language decoder (cogvlm2-video-llama3-chat).
Our UNet diffusion model is a fine-tune of the riffusion music generation model. We modified the architecture to condition on video frames and to improve synchronization between video and audio, and we also replaced the text encoder with the decoder of cogvlm2-video-llama3-chat.
Inference
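The sketch below illustrates how such a video-to-audio pipeline would be invoked; the get_V2A_pipeline entry point and its arguments are hypothetical names introduced for illustration, not the repository's confirmed API.

```python
import torch

# Hypothetical entry point; check the repository for the actual module
# and function names before use.
from kandinsky import get_V2A_pipeline

device = torch.device("cuda:0")
pipe = get_V2A_pipeline(device)

# Generate a soundtrack for an existing silent clip: frames are encoded
# via the cogvlm2-video-llama3-chat decoder, the UNet denoises a
# spectrogram, and Griffin-Lim converts it to a waveform.
pipe(
    video_path="./input.mp4",
    prompt="footsteps on gravel, birdsong in the background",
    save_path="./output_with_audio.mp4",
)
```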
Examples:
Authors
Project Leader: Denis Dimitrov.
Scientific Advisors: Andrey Kuznetsov, Sergey Markov.
Training Pipeline & Model Pretrain & Model Distillation: Vladimir Arkhipkin, Lev Novitskiy, Maria Kovaleva.
Model Architecture: Vladimir Arkhipkin, Maria Kovaleva, Zein Shaheen, Arsen Kuzhamuratov, Nikolay Gerasimenko, Mikhail Zhirnov, Alexander Gambashidze, Konstantin Sobolev.
Data Pipeline: Ivan Kirillov, Andrei Shutkin, Kirill Chernishev, Julia Agafonova, Elizaveta Dakhova, Denis Parkhomenko.
Video-to-audio model: Zein Shaheen, Arseniy Shakhmatov, Denis Parkhomenko.
Quality Assessment: Nikolay Gerasimenko, Anna Averchenkova, Victor Panshin, Vladislav Veselov, Pavel Perminov, Vladislav Rodionov, Sergey Skachkov, Stepan Ponomarev.
Other Contributors: Viacheslav Vasilev, Andrei Filatov, Gregory Leleytner.