C2G2: Controllable Co-speech Gesture Generation

1Xi'an Jiaotong-Liverpool University, 2ByteDance Research

Abstract

We present the first method that controls speaker identity and allows movement editing while producing high-quality co-speech gestures.

Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, particularly in generating high-fidelity and comprehensive gestures. Additionally, these methods lack effective control over speaker identity and temporal editing of the generated gestures.

Focusing on capturing temporal latent information and enabling practical control, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key features to C2G2: a speaker-specific decoder that generates speaker-related real-length skeletons, and a repainting strategy for flexible gesture generation and editing.
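The released code and exact architecture are not reproduced here; the sketch below is only a minimal illustration of the two-stage idea stated in the abstract, assuming a generic latent-diffusion formulation: a temporal autoencoder compresses gesture clips into latents, a denoiser conditioned on audio features runs DDPM-style sampling in that latent space, and a speaker embedding steers the decoder back to skeletons. All module names (GestureAutoencoder, LatentDenoiser, sample_gestures), dimensions, and the noise schedule are illustrative assumptions, not C2G2's actual implementation.

```python
# Minimal two-stage sketch (illustrative only, not the released C2G2 code).
import torch
import torch.nn as nn

class GestureAutoencoder(nn.Module):
    """Stage 1 (assumed): compress a gesture clip (T x joints) into temporal latents."""
    def __init__(self, joint_dim=42, latent_dim=64):
        super().__init__()
        self.enc = nn.GRU(joint_dim, latent_dim, batch_first=True)
        self.dec = nn.GRU(latent_dim, joint_dim, batch_first=True)

    def encode(self, x):               # x: (B, T, joint_dim)
        z, _ = self.enc(x)             # z: (B, T, latent_dim)
        return z

    def decode(self, z, speaker_emb):  # speaker_emb: (B, latent_dim), speaker-specific decoding
        out, _ = self.dec(z + speaker_emb.unsqueeze(1))
        return out                     # (B, T, joint_dim) real-length skeletons

class LatentDenoiser(nn.Module):
    """Stage 2 (assumed): predict the noise added to gesture latents, conditioned on audio."""
    def __init__(self, latent_dim=64, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, audio_feat, t):          # t: (B,) diffusion step
        t_emb = t.float().view(-1, 1, 1).expand(-1, z_t.size(1), 1)
        return self.net(torch.cat([z_t, audio_feat, t_emb], dim=-1))

@torch.no_grad()
def sample_gestures(denoiser, autoencoder, audio_feat, speaker_emb, steps=50):
    """Simplified DDPM ancestral sampling in the gesture latent space."""
    B, T, _ = audio_feat.shape
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(B, T, 64)                       # latent_dim assumed to match the autoencoder
    for i in reversed(range(steps)):
        t = torch.full((B,), i)
        eps = denoiser(z, audio_feat, t)
        # Posterior mean of the reverse step; extra noise only while i > 0.
        z = (z - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            z = z + betas[i].sqrt() * torch.randn_like(z)
    return autoencoder.decode(z, speaker_emb)       # speaker-related skeletons
```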

Model


C2G2: Model Structure

Short clips

Long clips

Middle clips (In-betweening Edit)

The first 4 frames are given as a pre-condition; frames 150-173 are ground truth, and frames 144-150 use the ground truth for repaint sampling.

In-the-Middle Editing

Using C2G2, you can generate high-fidelity gestures with in-the-middle movement editing and no mismatch at the boundaries (frames 70-94 are edited).
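The in-betweening results above rely on the repainting strategy mentioned in the abstract. The sketch below shows one common way such masked sampling is implemented (in the spirit of RePaint-style inpainting), reusing the hypothetical denoiser and noise schedule from the earlier sketch: at every reverse step, frames marked as known are replaced by the reference latents forward-noised to the matching step, so only the masked span is regenerated. The mask layout and noising details are assumptions, not necessarily the paper's exact procedure.

```python
@torch.no_grad()
def repaint_edit(denoiser, audio_feat, z_known, known_mask, steps=50):
    """Repaint-style in-betweening: keep known latent frames, regenerate the rest.

    z_known:    (B, T, D) latents of the reference clip (e.g. ground-truth frames).
    known_mask: (B, T, 1) with 1 where frames are fixed, 0 where they are edited.
    """
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn_like(z_known)
    for i in reversed(range(steps)):
        t = torch.full((z.size(0),), i)
        # Reverse step over the whole sequence.
        eps = denoiser(z, audio_feat, t)
        z = (z - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            z = z + betas[i].sqrt() * torch.randn_like(z)
            # Forward-noise the known frames to the matching step and paste them in,
            # so the edited span stays consistent with its fixed neighbours.
            z_ref = (alpha_bars[i - 1].sqrt() * z_known
                     + (1 - alpha_bars[i - 1]).sqrt() * torch.randn_like(z_known))
            z = known_mask * z_ref + (1 - known_mask) * z
        else:
            z = known_mask * z_known + (1 - known_mask) * z
    return z
```

For the clips shown here, known_mask would be 1 on the pre-condition and anchor frames and 0 on the edited span (e.g. frames 70-94).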

Speaker-Related Generation

Speaker 1

Visual effect of localizing the generated results onto actual video frames. We can generate reliable long-term results conditioned on 4 pre-frames and one random start frame.
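Purely as an illustration of this long-clip conditioning (not the released pipeline), the sketch below rolls the hypothetical repaint_edit sampler forward chunk by chunk: the seed latents (encoded from the 4 pre-frames plus the random start frame) are fixed at the start of the first window, and every later window fixes the last n_pre frames of what has been generated so far. chunk_len, the overlap scheme, and all names are assumptions.

```python
@torch.no_grad()
def generate_long_clip(denoiser, autoencoder, audio_feat, seed_frames, speaker_emb,
                       chunk_len=34, n_pre=4):
    """Chunked long-clip sampling sketch: each window fixes its first n_pre latent
    frames to the tail of what has already been generated (repaint-style), then the
    newly generated frames are appended. seed_frames: (B, n_seed >= n_pre, joint_dim)."""
    B, T, _ = audio_feat.shape
    z_out = autoencoder.encode(seed_frames)        # seed latents: 4 pre-frames + random frame
    D = z_out.size(-1)
    while z_out.size(1) < T:
        start = z_out.size(1) - n_pre              # overlap by the conditioning frames
        end = min(start + chunk_len, T)
        a = audio_feat[:, start:end]
        cur = a.size(1)
        z_known = torch.zeros(B, cur, D)
        mask = torch.zeros(B, cur, 1)
        z_known[:, :n_pre] = z_out[:, -n_pre:]     # fix the conditioning frames
        mask[:, :n_pre] = 1.0
        z_chunk = repaint_edit(denoiser, a, z_known, mask)
        z_out = torch.cat([z_out, z_chunk[:, n_pre:]], dim=1)
    return autoencoder.decode(z_out[:, :T], speaker_emb)
```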


Start Frame


Speaker 2



Start Frame

Speaker 3


Start Frame