Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer

We present a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints. Our framework combines the strengths of U-Nets for accurate condition injection and diffusion transformers for capturing global correlations across viewpoints and time. The core is a cascaded 4D transformer architecture that factorizes attention across the view, temporal, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we curate a multi-dimensional dataset spanning images, videos, multi-view data, and 3D/4D scans, along with a multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on GANs or U-Net diffusion models, which struggle with complex motions and viewpoint changes. Through extensive experiments, we demonstrate our method's ability to synthesize realistic, coherent, and free-view human videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation. Our project website is this https URL.
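The abstract's central idea is attention factorized across the view, temporal, and spatial axes rather than full attention over every 4D token at once. Below is a minimal PyTorch sketch of one such factorized block; the module names, tensor layout (batch, views, frames, spatial patches, channels), and the view-then-time-then-space ordering are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of factorized 4D attention (illustrative, not the paper's code).
import torch
import torch.nn as nn


class AxisAttention(nn.Module):
    """Self-attention along one axis of a (B, V, T, S, C) token grid."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, axis: int) -> torch.Tensor:
        # x: (B, V, T, S, C). Move the chosen axis next to the channels,
        # fold all other axes into the batch, and attend along that axis only.
        B, V, T, S, C = x.shape
        perm = [0, 1, 2, 3]
        perm.remove(axis)
        perm = perm + [axis]           # e.g. axis=1 (views) -> (B, T, S, V)
        x_p = x.permute(*perm, 4)      # (..., L_axis, C)
        shape = x_p.shape
        tokens = x_p.reshape(-1, shape[-2], C)
        h = self.norm(tokens)          # pre-norm
        out, _ = self.attn(h, h, h)
        tokens = tokens + out          # residual connection
        # Restore the original (B, V, T, S, C) layout.
        x_p = tokens.reshape(shape)
        inv = [0] * 5
        for i, p in enumerate(perm + [4]):
            inv[p] = i
        return x_p.permute(*inv)


class Factorized4DBlock(nn.Module):
    """One transformer block: view, then temporal, then spatial attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.view_attn = AxisAttention(dim)
        self.time_attn = AxisAttention(dim)
        self.space_attn = AxisAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.view_attn(x, axis=1)   # correlate viewpoints
        x = self.time_attn(x, axis=2)   # correlate frames over time
        x = self.space_attn(x, axis=3)  # correlate patches within a frame
        return x


if __name__ == "__main__":
    # 2 videos, 4 views, 6 frames, 16 spatial patches, 64 channels
    x = torch.randn(2, 4, 6, 16, 64)
    print(Factorized4DBlock(64)(x).shape)  # torch.Size([2, 4, 6, 16, 64])
```

The point of the factorization is cost: full attention over all V·T·S tokens scales quadratically in that product, whereas each axis-wise call above is quadratic only in one axis at a time, with the other axes folded into the batch.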

https://arxiv.org/abs/2405.17405

https://arxiv.org/pdf/2405.17405.pdf
