GaussianEmoTalker: Real-Time Emotional Talking Head Synthesis with Audio-Driven and Blendshape-Based 3D Gaussian Splatting

Haijie Yang1, Zhenyu Zhang2*, Yixuan Dong3, Jianjun Qian1, Jian Yang1*
1-PCALab of Nanjing University of Science and Technology
2-Nanjing University 3-Tsientang Institute for Advanced Study
Under Submission
Code arXiv

Abstract

Audio-driven talking head synthesis has achieved impressive progress in lip synchronization and visual quality, yet generating expressive emotional avatars with controllable intensity remains challenging, especially under real-time constraints. In this paper, we present GaussianEmoTalker, an audio-driven framework for real-time emotional talking head synthesis based on 3D Gaussian Splatting. Instead of directly predicting the final emotional avatar from speech, we formulate emotional animation as a neutral-to-emotional residual deformation problem. GaussianEmoTalker first constructs an identity-specific neutral talking space with GaussianBlendshapes, which provides high-fidelity Gaussian attributes and phoneme-synchronized neutral motion. It then predicts an emotion-conditioned residual deformation by combining mesh displacement cues, audio features, emotion categories, and intensity encodings. To fuse these heterogeneous signals, we introduce a spatial-audio-emotion attention module that estimates the offsets of Gaussian attributes for expressive and temporally stable rendering. Extensive experiments demonstrate that GaussianEmoTalker achieves competitive video quality, accurate lip synchronization, controllable emotional expression, and real-time rendering compared with recent emotional talking head methods.

Methodology

pipeline

GaussianEmoTalker uses the expression basis of GaussianBlendshapes to construct the neutral state space of the talking head (Stage 1). A pre-trained audio-to-expression model then predicts two sets of coefficients: neutral expression coefficients and emotional expression coefficients. The neutral coefficients are paired with the corresponding mesh and Gaussian attributes obtained in Stage 1, which serve as the initialization. The emotional coefficients drive LBS (Linear Blend Skinning) in the emotion state space constructed with 3D Gaussians, producing a deformed mesh. The difference between the two meshes, together with the audio features and the emotion and intensity labels, is fused by cross-attention to predict the Gaussian attribute offsets for the specified emotion, from which the final emotional talking head is rendered (Stage 2).
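
The Stage 2 data flow described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. All function names, tensor shapes, the label encoding, and the random weights are assumptions made for illustration. It also assumes one Gaussian per mesh vertex, a single attention head, and offsets for Gaussian positions only, whereas the method predicts offsets for Gaussian attributes more generally.

```python
import numpy as np

EMOTIONS = ["sad", "happy", "fear", "surprised", "angry", "disgusted", "contempt"]
NUM_LEVELS = 3  # intensity levels 1..3, matching the results gallery


def blend_mesh(base, basis, coeffs):
    """Linear blendshape mesh: base (N,3) + sum_k coeffs[k] * basis[k], basis is (K,N,3)."""
    return base + np.tensordot(coeffs, basis, axes=1)


def condition_token(emotion, level, dim):
    """Hypothetical label encoding: one-hot emotion scaled by normalized intensity."""
    token = np.zeros(dim)
    token[EMOTIONS.index(emotion)] = level / NUM_LEVELS
    return token


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v


def emotional_positions(neutral_xyz, base, basis, c_neutral, c_emotional,
                        audio_tokens, emotion, level, params):
    """Neutral Gaussian positions plus a predicted emotion-conditioned residual."""
    # Mesh displacement cue: emotional mesh minus neutral mesh, shape (N, 3).
    displacement = (blend_mesh(base, basis, c_emotional)
                    - blend_mesh(base, basis, c_neutral))
    # Context = audio feature tokens plus one emotion/intensity token, shape (T+1, D).
    cond = condition_token(emotion, level, audio_tokens.shape[1])
    context = np.vstack([audio_tokens, cond])
    # Per-vertex displacements query the audio/emotion context.
    attended = cross_attention(displacement, context,
                               params["w_q"], params["w_k"], params["w_v"])
    offsets = attended @ params["w_out"]  # (N, 3) residual position offsets
    return neutral_xyz + offsets


# Toy usage with random stand-ins for learned weights and real inputs.
rng = np.random.default_rng(0)
N, K, T, D, H = 5, 4, 6, 16, 8  # vertices, blendshapes, audio tokens, feature dim, hidden dim
base = rng.normal(size=(N, 3))
basis = rng.normal(size=(K, N, 3))
params = {
    "w_q": rng.normal(size=(3, H)),
    "w_k": rng.normal(size=(D, H)),
    "w_v": rng.normal(size=(D, H)),
    "w_out": rng.normal(size=(H, 3)),
}
c_neutral = rng.normal(size=K)
c_emotional = rng.normal(size=K)
audio_tokens = rng.normal(size=(T, D))
neutral_xyz = blend_mesh(base, basis, c_neutral)
emo_xyz = emotional_positions(neutral_xyz, base, basis, c_neutral, c_emotional,
                              audio_tokens, "happy", 2, params)
print(emo_xyz.shape)  # (5, 3)
```

The point of the residual formulation is visible in the last line of `emotional_positions`: the neutral result stays untouched, and the emotion only contributes an additive offset. Setting the output projection to zero therefore reproduces the neutral positions exactly.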

The Results of Our Method

sad level 1

sad level 2

sad level 3

happy level 1

happy level 2

happy level 3

fear level 1

fear level 2

fear level 3

surprised level 1

surprised level 2

surprised level 3

angry level 1

angry level 2

angry level 3

disgusted level 1

disgusted level 2

disgusted level 3

contempt level 1

contempt level 2

contempt level 3

+30 view

+60 view

-30 view

-60 view