AtomoVideo: High Fidelity Image-to-Video Generation


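AtomoVideo: AIGC-powered motion generation for e-commerce video. That article shares Alimama's exploration and practice of using video AIGC (AtomoVideo and others) to power video ad creatives. By combining diffusion-based video generation with controllable generation techniques, static e-commerce images are brought to life, putting video AIGC into production in the e-commerce domain. https://mp.weixin.qq.com/s/7QRSmAJKQPZVB-T_1CwNfg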

1.Introduction

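To improve consistency with the given image, some methods encode the image as an image prompt and inject it into the model through cross-attention. Such methods struggle to achieve fine-grained consistency, because relying only on high-level semantics loses detail. Another simple idea is to append extra channels at the input; this adds finer-grained information, but it is harder to converge and the resulting videos are less stable. Some methods use both of these for image information injection and, at inference, start from a noise prior instead of pure noise to compensate for artifacts caused by model instability. Because the noise prior contains information about the given image (e.g. the inversion of the reference latent), it can significantly enhance the fidelity of fine-grained details. (Note that although inversion also produces a latent, it is not a simple pass through the VAE; it is the to-be-decoded latent obtained via VAE + text embedding + UNet.) However, this approach significantly reduces motion intensity: every frame's noise contains exactly the same given image, so the random component of the initial noise decreases.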

AtomoVideo does not rely on a noise prior. It concatenates the given image at the input while injecting high-level semantic cues through cross-attention to improve the consistency of the generated video with the given image. Training adopts zero terminal Signal-to-Noise Ratio (SNR) and the v-prediction strategy. The T2I model is kept fixed during training; only the added temporal-layer and input-layer parameters are trained.
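To make the zero terminal SNR and v-prediction terms concrete, here is a minimal sketch in plain Python. It assumes a simple linear beta schedule with illustrative values and uses the commonly cited shift-and-scale recipe for rescaling; the function names are made up for this note, and none of it is taken from the AtomoVideo code.

```python
import math

def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Plain linear beta schedule (illustrative values, not from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def sqrt_alphas_cumprod(betas):
    """sqrt(alpha_bar_t) for every step, where alpha_bar_t = prod(1 - beta_i)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(math.sqrt(prod))
    return out

def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so that the last step has SNR exactly 0."""
    s = sqrt_alphas_cumprod(betas)
    first, last = s[0], s[-1]
    # Shift so the last value is 0, then scale so the first value is unchanged.
    s = [(x - last) * first / (first - last) for x in s]
    alpha_bar = [x * x for x in s]
    # Recover per-step betas from ratios of consecutive alpha_bar values.
    new_betas = [1.0 - alpha_bar[0]]
    for prev, cur in zip(alpha_bar, alpha_bar[1:]):
        new_betas.append(1.0 - cur / prev)
    return new_betas

def v_target(x0, eps, alpha_bar_t):
    """v-prediction target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [a * e - s * x for x, e in zip(x0, eps)]

betas = linear_betas()
rescaled = rescale_zero_terminal_snr(betas)
print(sqrt_alphas_cumprod(betas)[-1])     # small but non-zero: signal leaks at t = T
print(sqrt_alphas_cumprod(rescaled)[-1])  # 0.0 after rescaling
```

At the terminal step the input is pure noise, so predicting the noise itself carries no learning signal, whereas the v target reduces to `-x0` there. That is the usual reason the two techniques are paired.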

3.Method

3.1 Overall pipeline

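Starting from a pretrained T2I model, 1D temporal convolution and temporal attention modules are added after every spatial convolution and attention layer. Only the added layers are trained; the T2I parameters stay fixed. To inject the image, the input is changed to 9 channels, which include the image condition latent and a binary mask. The image information concatenated at the input is encoded only by the VAE and represents low-level information, while high-level semantic information is injected into the network via cross-attention.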

3.2 Image information injection

In the I2V task, preserving consistency with the given image and keeping the video motion coherent usually trade off against each other.

Xt: Gaussian noise; Fm: the image after the VAE; Fi: the input-frame mask.
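The 9-channel input can be sketched with NumPy as below. The 4 + 4 + 1 split follows from the 4-channel SD 1.5 VAE latent; placing the given image at frame 0 with zeros elsewhere is an assumption made here for illustration, and the helper names are hypothetical.

```python
import numpy as np

LATENT_CHANNELS = 4  # channel count of the SD 1.5 VAE latent

def make_image_condition(first_frame_latent, num_frames):
    """Build the image condition latent and binary mask for one clip.

    Assumed layout: the given image sits at frame 0 (mask = 1) and all
    other frames are zero-filled (mask = 0).
    """
    c, h, w = first_frame_latent.shape
    image_latent = np.zeros((num_frames, c, h, w), dtype=np.float32)
    frame_mask = np.zeros((num_frames, 1, h, w), dtype=np.float32)
    image_latent[0] = first_frame_latent
    frame_mask[0] = 1.0
    return image_latent, frame_mask

def build_unet_input(noisy_latent, image_latent, frame_mask):
    """Concatenate along the channel axis: 4 (noise) + 4 (image) + 1 (mask) = 9."""
    return np.concatenate([noisy_latent, image_latent, frame_mask], axis=1)

rng = np.random.default_rng(0)
num_frames, h, w = 24, 64, 64  # 512x512 pixels -> 64x64 latent (8x VAE downsampling)
noisy_latent = rng.standard_normal((num_frames, LATENT_CHANNELS, h, w)).astype(np.float32)
first_frame_latent = rng.standard_normal((LATENT_CHANNELS, h, w)).astype(np.float32)

image_latent, frame_mask = make_image_condition(first_frame_latent, num_frames)
unet_input = build_unet_input(noisy_latent, image_latent, frame_mask)
print(unet_input.shape)  # (24, 9, 64, 64)
```

Only the first convolution of the UNet has to change to accept the wider input, which fits the notes' point that just the input layer and the temporal layers are trained.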

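IP-Adapter is also used to implement the cross-attention injection. This is very similar to I2V-Adapter, which additionally has an attention adapter that operates on the first frame in self-attention; it is worth a careful read.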

3.3 Video Frames Prediction

Long video generation is done iteratively: given the preceding frames, the model predicts the subsequent frames.
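The iterative scheme can be sketched as a simple loop. Everything here is hypothetical scaffolding: `predict_next` stands in for one call to the video model, `context` (how many trailing frames are fed back as conditions) is a made-up parameter, and the toy predictor just counts upward so the control flow is easy to follow.

```python
def generate_long_video(first_frames, predict_next, total_frames,
                        context=8, clip_len=24):
    """Extend a video clip by clip, conditioning each clip on the latest frames.

    predict_next(cond_frames, num_new) is a placeholder for the model call;
    it must return `num_new` frames that follow `cond_frames`.
    """
    if not first_frames:
        raise ValueError("need at least one starting frame")
    if context >= clip_len:
        raise ValueError("context must be shorter than clip_len")
    video = list(first_frames)
    while len(video) < total_frames:
        cond = video[-context:]               # most recent frames as conditions
        num_new = clip_len - len(cond)        # remaining slots in this clip
        video.extend(predict_next(cond, num_new))
    return video[:total_frames]

def toy_predict(cond_frames, num_new):
    """Toy stand-in for the model: frames are integers that keep counting up."""
    last = cond_frames[-1]
    return [last + i + 1 for i in range(num_new)]

video = generate_long_video([0], toy_predict, total_frames=50)
print(len(video))  # 50
```

With a single starting image the first pass is ordinary I2V; later passes reuse the tail of the previous clip, so errors can accumulate as the video grows.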

3.4 Training and Inference

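SD 1.5 is the base model, and the temporal attention layers are initialized from AnimateDiff. Training uses 15M internal data samples, with each video about 10-30 s long, and adopts zero terminal SNR and v-prediction. The model input is 512x512 with 24 frames. Inference applies classifier-free guidance (CFG) with both image and text injected.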
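For reference, the standard CFG combination is sketched below. It assumes the image and text conditions are dropped together for the unconditional branch; the notes do not say whether AtomoVideo guides them jointly or separately, so treat this as the generic formula rather than the paper's exact procedure.

```python
def cfg_combine(out_uncond, out_cond, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise.

    out_uncond: model output with the conditions dropped
    out_cond:   model output with image and text conditions (assumed joint)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(out_uncond, out_cond)]

# scale = 1 reproduces the conditional output; larger values push further
# away from the unconditional output.
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 1.0))  # [1.0, 3.0]
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))  # [2.0, 5.0]
```

The same combination applies whether the network outputs noise or v, since it only mixes two outputs of the same kind.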

4.Experiments
