School of Artificial Intelligence, Anhui University, Hefei 230031, China
[ "姚云(1996- ),男,安徽大学人工智能学院硕士生,主要研究方向为唇语识别、生成式人工智能、计算机视觉。" ]
[ "胡振虓(2000- ),男,安徽大学人工智能学院硕士生,主要研究方向为异常行为识别、计算机视觉、深度学习。" ]
[ "邓涛(2003- ),男,安徽大学人工智能学院本科生,主要研究方向为计算机视觉、自动驾驶、智能自主系统。" ]
[ "王晓(1988- ),女,安徽大学人工智能学院教授,主要研究方向为社会计算、群体行为建模、无人自主系统及其平行测试。" ]
Received: 2025-02-24; Revised: 2025-04-28; Published in print: 2025-06-15
YAO Yun, HU Zhenxiao, DENG Tao, et al. A lip reading method based on adaptive pooling attention Transformer[J]. Chinese Journal of Intelligent Science and Technology, 2025, 7(2): 211-220. DOI: 10.11959/j.issn.2096-6652.202515.
Lip reading technology recognizes semantic information by analyzing sequences of consecutive lip images and establishing a mapping between lip-movement features and specific language text. Existing methods rely mainly on recurrent neural networks to model the temporal structure of sequential video frames, but they suffer from significant information loss: when the video is incomplete or corrupted by noise, the model often confuses lip movements at different time points, causing a marked drop in recognition accuracy. To address this problem, a lip reading method based on an adaptive pooling attention Transformer (APAT-LR) is proposed. Before the multi-head self-attention (MHSA) mechanism of the standard Transformer, the method introduces an adaptive pooling module that concatenates the outputs of max pooling and average pooling, effectively suppressing irrelevant information and enhancing the representation of key features, thereby strengthening temporal feature modeling. Experimental results show that APAT-LR achieves error rates of 28.4% on the CMLR dataset and 1.9% on the GRID dataset, both lower than those of existing methods, verifying its effectiveness for lip reading.
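The core idea of the adaptive pooling module, concatenating max-pooled and average-pooled features before they reach the attention layers, can be illustrated with a minimal plain-Python sketch. This is only an illustrative assumption of the mechanism described in the abstract (the function name, the sliding temporal window, and the channel-wise concatenation are the author of this sketch's choices, not the paper's implementation): max pooling keeps the strongest activation in each window, average pooling keeps the smooth trend, and concatenating the two doubles the feature dimension per time step.

```python
def pool_and_concat(frames, window=3):
    """Sketch of max/avg pooling concatenation over a temporal window.

    frames: list of feature vectors (list of lists of floats), one per video frame.
    Returns one vector per time step: [max-pooled features] + [avg-pooled features].
    """
    T, D = len(frames), len(frames[0])
    half = window // 2
    out = []
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)  # clip window at sequence edges
        win = frames[lo:hi]
        max_pooled = [max(f[d] for f in win) for d in range(D)]
        avg_pooled = [sum(f[d] for f in win) / len(win) for d in range(D)]
        out.append(max_pooled + avg_pooled)  # channel-wise concatenation
    return out

features = pool_and_concat([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# each output vector has 2*D entries; e.g. features[1] == [5.0, 6.0, 3.0, 4.0]
```

In a full model, the concatenated sequence would then be projected back to the model dimension and fed to the MHSA layers; that projection is omitted here for brevity.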