Joint Talking Audio-Video Generation with
Autoregressive Diffusion Modeling
Joint audio-video generation models, such as dual-branch diffusion transformers, have demonstrated that generating audio and video within a unified model yields stronger cross-modal coherence than cascaded pipelines. However, existing joint audio-video diffusion models typically couple the two modalities throughout the entire denoising process via pervasive attention mechanisms, treating high-level semantics and low-level signal details in a fully entangled manner. We argue that such uniform coupling is suboptimal for talking-head synthesis. While audio and facial motion are strongly correlated at the semantic and temporal levels, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels can therefore introduce unnecessary entanglement and reduce modeling efficiency. Instead, we posit that joint modeling should primarily target high-level semantic and temporal structure, where cross-modal alignment is essential, while low-level refinement should be handled by modality-specific decoders.
Building on this observation, we propose Talker-T2AV, an autoregressive diffusion framework that realizes this principle: high-level cross-modal modeling in a shared autoregressive backbone, with low-level refinement delegated to modality-specific decoders. A shared autoregressive language model serves as a high-level temporal planner, jointly reasoning over audio and video through left-to-right generation in a unified patch-level token space. Two lightweight diffusion transformer heads then decode the language model's hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show that Talker-T2AV outperforms dual-branch diffusion transformer baselines in lip-sync accuracy, video quality, and audio quality, and achieves stronger cross-modal consistency than cascaded pipelines, highlighting the complementary benefits of joint modeling.
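As a concrete illustration of this two-stage design, the minimal PyTorch sketch below wires a causal transformer planner to two small diffusion heads. All module names, sizes, layer counts, and the diffusion-head interface are our own assumptions chosen for exposition, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class DiffusionHead(nn.Module):
    """Lightweight transformer head that denoises frame-level latents,
    conditioned on the planner's hidden states (dimensions are assumed)."""

    def __init__(self, latent_dim, cond_dim, hidden_dim=512, num_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, hidden_dim)
        self.cond_proj = nn.Linear(cond_dim, hidden_dim)
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_dim, latent_dim)

    def forward(self, noisy_latents, cond, t):
        # noisy_latents: (B, T, latent_dim); cond: (B, T, cond_dim); t: (B, 1)
        h = self.in_proj(noisy_latents) + self.cond_proj(cond)
        h = h + self.time_embed(t).unsqueeze(1)  # broadcast timestep embedding over T
        return self.out_proj(self.blocks(h))     # predicted noise for this modality


class TalkerT2AVSketch(nn.Module):
    """Shared autoregressive planner + modality-specific diffusion heads
    (a sketch of the described architecture, not the official code)."""

    def __init__(self, d_model=1024, audio_latent_dim=64, video_latent_dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        # Decoder-only planner: causal self-attention over interleaved AV patch tokens.
        self.planner = nn.TransformerEncoder(layer, num_layers=12)
        self.audio_head = DiffusionHead(audio_latent_dim, d_model)
        self.video_head = DiffusionHead(video_latent_dim, d_model)

    def forward(self, av_tokens, noisy_audio, noisy_video, t):
        # av_tokens: (B, T, d_model) interleaved audio/video patch embeddings
        # (one token per frame-level patch, for simplicity of this sketch).
        T = av_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(av_tokens.device)
        hidden = self.planner(av_tokens, mask=causal)  # left-to-right temporal planning
        # Each head refines its own modality's frame-level latents from the shared plan.
        audio_pred = self.audio_head(noisy_audio, hidden, t)
        video_pred = self.video_head(noisy_video, hidden, t)
        return audio_pred, video_pred
```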
Talker-T2AV decouples joint audio-video generation into high-level cross-modal planning and low-level modality-specific refinement.
Architecture overview: a shared high-level temporal planner jointly reasons over audio and video in a unified patch-level token space; two lightweight diffusion transformer heads then decode audio and video latents, respectively, yielding lip-sync accurate audio-video output.
Given a text transcript, a speaker embedding, and an identity image as input, our model jointly generates temporally aligned speech and talking-head video within a single autoregressive diffusion framework.
Given a short reference audio-video clip as a multimodal in-context prompt, our model clones the speaker's joint audio-visual talking style—capturing vocal characteristics and facial dynamics in a unified manner—and synthesizes new clips with arbitrary text content, requiring no speaker-specific fine-tuning.
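A hypothetical usage sketch of the two conditioning modes described above is given below; the encoders, tensor shapes, and prompt layout are stand-ins chosen for illustration, and the full autoregressive sampling loop is omitted.

```python
import torch

model = TalkerT2AVSketch()            # sketch module from the previous listing
d_model = 1024                        # assumed planner width

# Mode 1: text transcript + speaker embedding + identity image form the conditioning prefix.
# Real inputs would pass through pretrained text/speaker/image encoders; random tensors
# stand in for their token embeddings here.
text_tokens    = torch.randn(1, 24, d_model)   # encoded transcript
speaker_token  = torch.randn(1, 1, d_model)    # speaker embedding projected to token space
identity_token = torch.randn(1, 1, d_model)    # identity image summary token
prefix = torch.cat([text_tokens, speaker_token, identity_token], dim=1)

# Mode 2: a short reference audio-video clip acts as a multimodal in-context prompt,
# so its patch tokens are simply prepended to the text prefix (no speaker fine-tuning).
ref_av_tokens = torch.randn(1, 48, d_model)
cloning_prefix = torch.cat([ref_av_tokens, prefix], dim=1)

# One denoising call conditioned on the prefix; a real sampler would iterate this over
# diffusion timesteps and autoregressively append newly generated AV patch tokens.
noisy_audio = torch.randn(1, cloning_prefix.size(1), 64)
noisy_video = torch.randn(1, cloning_prefix.size(1), 256)
t = torch.rand(1, 1)
audio_pred, video_pred = model(cloning_prefix, noisy_audio, noisy_video, t)
```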