We present the first method that controls speaker identity and supports movement editing while achieving high-quality co-speech gesture generation.
Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, which in particular hinder the generation of high-fidelity and comprehensive gestures. In addition, these methods lack effective control over speaker identity and temporal editing of the generated gestures.
Focusing on capturing temporal latent information and enabling practical control, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key components in C2G2: a speaker-specific decoder that generates speaker-related, real-length skeletons, and a repainting strategy for flexible gesture generation and editing.
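The repainting strategy can be pictured as inpainting in the diffusion latent space: frames marked as known are kept, and only the masked-out frames are re-sampled. Below is a minimal sketch of such a repainting loop, assuming a DDIM-style sampler, a generic `denoiser(x, t)` noise predictor, and a `(T, D)` latent gesture layout; all names, signatures, and shapes are illustrative assumptions rather than the exact C2G2 implementation.

```python
import torch

def repaint_sample(denoiser, known_latents, known_mask, timesteps, alphas_cumprod):
    """Repainting-style sampling sketch: fixed frames are re-noised to the
    current step and overwrite the corresponding generated frames.

    known_latents: (T, D) latent gesture sequence holding the frames to keep.
    known_mask:    (T, 1) tensor, 1 where frames are fixed, 0 where generated.
    """
    x = torch.randn_like(known_latents)  # start the unknown frames from pure noise
    for t in reversed(range(timesteps)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

        # One deterministic DDIM reverse step on the whole sequence.
        eps = denoiser(x, torch.tensor([t]))
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps

        # Re-noise the known frames to the same noise level and overwrite them,
        # so fixed and generated frames stay mutually consistent.
        noise = torch.randn_like(known_latents)
        known_prev = a_prev.sqrt() * known_latents + (1.0 - a_prev).sqrt() * noise
        x = known_mask * known_prev + (1.0 - known_mask) * x
    return x
```

For the in-the-middle editing shown in the demo below, `known_mask` would be zero over the edited segment and one elsewhere, so the edited frames are regenerated while the surrounding motion is preserved.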
C2G2: Model Structure
Using C2G2, you can generate high-fidelity gestures with in-the-middle movement editing and no mismatch (frames 70-94 are edited).
Visual effect of localizing the generated results onto actual frames. We can generate reliable long-term results conditioned on 4 pre-frames and one random frame (a sketch of this conditioning scheme follows below).
Start Frame
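The long-term conditioning described in the caption above can be expressed with the same repainting primitive: each window fixes a few leading latent frames and regenerates the rest. Below is a minimal sketch, assuming the hypothetical `repaint_sample` from earlier and a window length and latent size chosen only for illustration.

```python
import torch

def generate_long_sequence(denoiser, pre_frames, num_windows, alphas_cumprod,
                           window_len=34, latent_dim=128, timesteps=50):
    """Long-term generation sketch: condition each window on 4 pre-frames
    (plus one random frame for the first window) via repainting.

    pre_frames: (4, latent_dim) latents of the 4 conditioning pre-frames.
    """
    outputs = []
    # The first window is anchored by the 4 pre-frames plus one random frame.
    cond = torch.cat([pre_frames, torch.randn(1, latent_dim)], dim=0)
    for _ in range(num_windows):
        known = torch.zeros(window_len, latent_dim)
        mask = torch.zeros(window_len, 1)
        known[: cond.shape[0]] = cond
        mask[: cond.shape[0]] = 1.0  # fix only the conditioning frames
        window = repaint_sample(denoiser, known, mask, timesteps, alphas_cumprod)
        outputs.append(window[cond.shape[0]:])  # keep the newly generated frames
        cond = window[-4:]                      # next window reuses the last 4 frames
    return torch.cat(outputs, dim=0)
```

Reusing the last 4 generated frames as the condition for the next window keeps consecutive windows overlapping, which is what gives the stitched sequence its temporal continuity in this sketch.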