The lip reading method based on Adaptive Pooling Attention Transformer

YAO Yun; HU Zhenxiao; DENG Tao; WANG Xiao


Updated: 2025-05-07

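- Chinese Journal of Intelligent Science and Technology, 2025
- Author affiliations:
  1. School of Artificial Intelligence, Anhui University, Hefei, Anhui Province, China
  2. School of Artificial Intelligence, Anhui University, 230031
- Author biographies:
  YAO Yun (1996- ), male, master's student at the School of Artificial Intelligence, Anhui University. His main research interests are lip reading, generative artificial intelligence, and computer vision.
  HU Zhenxiao (2000- ), male, graduate student at the School of Artificial Intelligence, Anhui University. His main research interests are abnormal behavior recognition, computer vision, and deep learning.
  WANG Xiao (1988- ), female, professor at the School of Artificial Intelligence, Anhui University. Her main research interests are social computing, group behavior modeling, and unmanned autonomous systems and their parallel testing.
- Received: 2025-02-24; revised: 2025-03-31; accepted: 2025-04-28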
YAO Yun, HU Zhenxiao, DENG Tao, et al. The lip reading method based on Adaptive Pooling Attention Transformer[J/OL]. Chinese Journal of Intelligent Science and Technology, 2025.

Abstract

Lip reading analyzes a sequence of consecutive lip images and maps lip-movement features to specific language characters, thereby recognizing semantic information. Existing methods rely mainly on recurrent neural networks to model temporal features across video frames, but they suffer from significant information loss. In particular, when the video is incomplete or noisy, the model often confuses lip movements at different time points, and recognition accuracy drops significantly. To address this issue, a lip reading method based on the Adaptive Pooling Attention Transformer (APAT-LR) is proposed. The method introduces an Adaptive Pooling Module before the Multi-Head Self-Attention (MHSA) mechanism of the standard Transformer, concatenating the outputs of max pooling and average pooling. This module suppresses irrelevant information and enhances the representation of key features, which improves temporal feature modeling. Experiments show that APAT-LR achieves character error rates of 28.4% on the CMLR dataset and 1.9% on the GRID dataset, both lower than those of existing methods, verifying its effectiveness for lip reading.
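The abstract does not describe the internals of the Adaptive Pooling Module, so the following is only a minimal sketch of the general idea: each time step is summarized by max pooling and average pooling, the two summaries are concatenated, and a sigmoid gate computed from them rescales that time step before it would be passed to MHSA. The function names, the per-frame gating scheme, and the parameters `w_max`, `w_avg`, and `bias` are assumptions made for illustration and are not taken from the paper; in a trained model the gate parameters would be learned.

```python
import math


def pooled_descriptor(frame):
    """Concatenate the max-pooled and average-pooled summary of one feature vector."""
    return [max(frame), sum(frame) / len(frame)]


def adaptive_pooling_gate(frames, w_max=1.0, w_avg=1.0, bias=0.0):
    """Rescale each time step by a sigmoid gate computed from its pooled descriptor.

    frames: list of T feature vectors (lists of floats), one per video frame.
    w_max, w_avg, bias: hypothetical gate parameters (assumed, not from the paper).
    """
    gated = []
    for frame in frames:
        mx, avg = pooled_descriptor(frame)
        # Linear combination of the concatenated [max, avg] descriptor, then sigmoid.
        gate = 1.0 / (1.0 + math.exp(-(w_max * mx + w_avg * avg + bias)))
        gated.append([gate * x for x in frame])
    return gated


# A frame with weak activations gets a smaller gate than one with strong activations.
sequence = [[0.0, 0.0, 0.0], [2.0, 1.0, 3.0]]
gated_sequence = adaptive_pooling_gate(sequence)
```

In this sketch the gated sequence keeps the same shape as the input, so it could be fed to a standard self-attention layer unchanged; whether the paper's module pools over time, over channels, or both is not stated in the abstract.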


