Spatially and Temporally Optimized Audio-Driven Talking Face Generation

Dong, Biao; Ma, Bo-Yao; Zhang, Lei

Files

cgf15228.pdf (8.96 MB)

Date

2024

Authors

Dong, Biao
Ma, Bo-Yao
Zhang, Lei

Publisher

The Eurographics Association and John Wiley & Sons Ltd.

Abstract

Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which affects lip sync accuracy. In addition, the loss of facial details during image reconstruction often results in visual artifacts in the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new form of spatio-temporal consistency. Specifically, temporal consistency is achieved through the consecutive frames of each phoneme, which form temporal modules that exhibit similar changes in lip appearance. This allows adaptive adjustment of the lip movement for accurate sync. Spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method generates more natural talking faces than previous state-of-the-art methods, with both more accurate lip sync and more realistic facial details.

CCS Concepts: Animation → Facial Animation; Imaging & Video → Image/Video Editing

@article{10.1111:cgf.15228,
  journal = {Computer Graphics Forum},
  title = {{Spatially and Temporally Optimized Audio-Driven Talking Face Generation}},
  author = {Dong, Biao and Ma, Bo-Yao and Zhang, Lei},
  year = {2024},
  publisher = {The Eurographics Association and John Wiley & Sons Ltd.},
  ISSN = {1467-8659},
  DOI = {10.1111/cgf.15228}
}