ExpDiff: Generating High-fidelity Dynamic Facial Expressions with BRDF Textures via Diffusion Model

Yuhao Cheng, Xuanchen Li, Xingyu Ren, Zhuo Chen, Xiaokang Yang, Yichao Yan
(Corresponding author)
Teaser

An example of our dynamic facial expression generation method. Given a neutral-expression mesh and texture maps as input, our framework generates FACS-compliant expression meshes with pore-level dynamic BRDF textures from expression text prompts, enabling physically based, photorealistic rendering.

Abstract

3D facial generation is a critical task for immersive multimedia applications, where the key challenge lies in synthesizing vivid expression meshes with corresponding dynamic textures. Current approaches still fall short in geometric-textural coherence and dynamic reflectance generation. To address these challenges, we present ExpDiff, a framework that generates expression meshes and dynamic BRDF textures from a single neutral-expression face. To achieve effective generation, we propose an attention-based diffusion model that learns the relationships among different expressions. To ensure correspondence between geometry and texture, we introduce a unified representation that explicitly models geometric-textural interaction, and encode it into a latent space with models trained on large-scale datasets to preserve generalization. To achieve semantically coherent and physically consistent generation, we guide the denoising direction with specially designed textual prompts. We further construct two novel dynamic expression datasets to train the models, setting new standards for asset quality (J-Reflectance) and identity diversity (FFHQ-BRDFExp); both are publicly released to advance the community. Extensive experiments demonstrate our method's superior performance in photorealistic facial animation synthesis.
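
The abstract outlines the generation flow; below is a minimal PyTorch-style sketch of that flow for illustration. All names (ExpDiffPipeline, denoise_step, and the injected encoder/decoder/diffusion/text-encoder modules) are hypothetical placeholders, not the released implementation.

import torch

class ExpDiffPipeline:
    """Hypothetical inference sketch of the generation flow described above."""

    def __init__(self, encoder, decoder, diffusion, text_encoder):
        self.encoder = encoder            # frozen encoder trained on large-scale data
        self.decoder = decoder            # maps latents back to an expression mesh + BRDF maps
        self.diffusion = diffusion        # attention-based diffusion model
        self.text_encoder = text_encoder  # CLIP text encoder for expression prompts

    @torch.no_grad()
    def generate(self, neutral_geometry, neutral_textures, prompt, steps=50):
        # Pack geometry and BRDF textures into the unified representation
        # (assumed here to share a common UV layout so they can be concatenated).
        unified = torch.cat([neutral_geometry, neutral_textures], dim=1)
        # Project the neutral asset into latent space with the frozen encoder.
        z_neutral = self.encoder(unified)
        # Encode the expression prompt (e.g. "a wide, joyful smile") with CLIP.
        text_emb = self.text_encoder(prompt)
        # Denoise from Gaussian noise, conditioned on the neutral latent and the prompt.
        z = torch.randn_like(z_neutral)
        for t in reversed(range(steps)):
            z = self.diffusion.denoise_step(z, t, cond=z_neutral, text=text_emb)
        # Decode the latent into an expression mesh and dynamic BRDF textures.
        return self.decoder(z)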

Pipeline

Overview of our proposed ExpDiff. Given a neutral model and expression models, we first convert them into the unified representation and project them into the latent space through a frozen encoder. We then propose an attention-based diffusion model for expression asset generation. Textual prompts are encoded by CLIP to obtain semantic information, which guides the denoising direction of the diffusion model.
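
As a concrete illustration of the prompt conditioning in this overview, the sketch below shows one common way CLIP text embeddings can steer a denoising network through cross-attention. The block structure and tensor shapes are assumptions for illustration, not the authors' architecture.

import torch
import torch.nn as nn

class TextCrossAttentionBlock(nn.Module):
    """Latent tokens attend to CLIP prompt tokens (queries from latents, keys/values from text)."""

    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(embed_dim=latent_dim, num_heads=heads,
                                          kdim=text_dim, vdim=text_dim, batch_first=True)

    def forward(self, latent_tokens, text_tokens):
        # latent_tokens: (B, N, latent_dim) noisy latents of the unified representation
        # text_tokens:   (B, T, text_dim) CLIP embeddings of the expression prompt
        h = self.norm(latent_tokens)
        attended, _ = self.attn(query=h, key=text_tokens, value=text_tokens)
        return latent_tokens + attended  # residual connection preserves identity content

# Example with random stand-ins for real latents and a CLIP-encoded prompt:
block = TextCrossAttentionBlock()
latents = torch.randn(2, 256, 320)   # flattened latent tokens
clip_emb = torch.randn(2, 77, 768)   # 77 CLIP text tokens of dimension 768 (ViT-L/14)
print(block(latents, clip_emb).shape)  # torch.Size([2, 256, 320])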

J-Reflectance Dataset

Overview of our capture pipeline and datasets. (a) We utilize a Light Stage system to capture high-resolution facial images and employ skilled artists to meticulously process the data into film-quality assets. (b) We showcase one identity's expression results, with muscle-level geometry and BRDF textures exhibiting pore-level details, which integrate seamlessly into relighting applications.

FFHQ-BRDFExp Dataset

An example of BRDF textures and geometry in the FFHQ-BRDFExp dataset.

BibTeX


@inproceedings{cheng2025expdiff,
  title={ExpDiff: Generating High-fidelity Dynamic Facial Expressions with BRDF Textures via Diffusion Model},
  author={Cheng, Yuhao and Li, Xuanchen and Ren, Xingyu and Chen, Zhuo and Yang, Xiaokang and Yan, Yichao},
  year={2025}
}