Spatially and Temporally Optimized Audio-Driven Talking Face Generation

No Thumbnail Available
Date
2024
Journal Title
Journal ISSN
Volume Title
Publisher
The Eurographics Association and John Wiley & Sons Ltd.
Abstract
Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which affects lip sync accuracy. And the loss of facial details during image reconstruction often results in visual artifacts in the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new spatio-temporal consistency. Specifically, the temporal consistency is achieved through consecutive frames of the each phoneme, which form temporal modules that exhibit similar lip appearance changes. This allows for adaptive adjustment in the lip movement for accurate sync. The spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method can generate more natural talking faces than previous state-of-the-art methods in both accurate lip sync and realistic facial details.
Description

CCS Concepts: Animation → Facial Animation; Imaging & Video → Image/Video Editing

        
@article{
10.1111:cgf.15228
, journal = {Computer Graphics Forum}, title = {{
Spatially and Temporally Optimized Audio-Driven Talking Face Generation
}}, author = {
Dong, Biao
and
Ma, Bo-Yao
and
Zhang, Lei
}, year = {
2024
}, publisher = {
The Eurographics Association and John Wiley & Sons Ltd.
}, ISSN = {
1467-8659
}, DOI = {
10.1111/cgf.15228
} }
Citation
Collections