
1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China
2. School of Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China
MA Fei (1978- ), male, Ph.D., associate professor at the School of Electronic and Information Engineering, Liaoning Technical University. His main research interests include multimodal data fusion, affective computing, educational artificial intelligence, and remote sensing image processing.
LI Shuzhi (1999- ), male, master's student at the School of Electronic and Information Engineering, Liaoning Technical University. His main research interests include multimodal data fusion, emotion recognition, and deep learning.
YANG Feixia (1979- ), female, Ph.D., associate professor at the School of Electrical and Control Engineering, Liaoning Technical University. Her main research interests include multimodal data processing, image processing and pattern recognition, deep learning, and optimization.
XU Guangxian (1977- ), male, Ph.D., professor at the School of Electronic and Information Engineering, Liaoning Technical University. His main research interests include multimodal data processing, image processing and pattern recognition, data processing and network coding, and machine vision.
Received: 2025-01-20
Revised: 2025-03-19
Published in print: 2025-06-15
MA Fei,LI Shuzhi,YANG Feixia,et al.Dual-stage gated segmented multimodal emotion recognition method[J].Chinese Journal of Intelligent Science and Technology,2025,07(02):257-267. DOI: 10.11959/j.issn.2096-6652.202514.
Multimodal emotion recognition is widely used in mental health detection and machine affective analysis. However, most existing methods rely on either global or local features and neglect their joint modeling, which limits recognition performance. To address this, a Transformer-based dual-stage gated segmented multimodal emotion recognition method (DGM) was proposed. DGM adopts a segmented fusion architecture consisting of an interaction stage and a dual-stage gating stage. In the interaction stage, the OAGL fusion strategy models global-local cross-modal interactions and improves the efficiency of feature fusion; the dual-stage gating stage then integrates local and global features to fully exploit emotional information. In addition, to resolve the misalignment of local temporal features across modalities, a scaled dot-product-based sequence alignment method was designed to improve fusion accuracy. Experiments on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and CH-SIMS) show that DGM outperforms existing algorithms on most datasets, validating its ability to capture fine-grained emotional cues and its generalization capability.
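The abstract does not provide implementation details of DGM or the OAGL strategy. As a rough, illustrative sketch only, the PyTorch code below shows how scaled dot-product attention can be used to align a source modality sequence (e.g., audio frames) to the time steps of a target modality (e.g., text tokens), and how a simple gate can blend a local and a global representation. All class names, dimensions, and the specific gating form are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAlignment(nn.Module):
    """Resample a source modality sequence to the temporal length of a
    target modality via scaled dot-product attention (illustrative only)."""

    def __init__(self, tgt_dim: int, src_dim: int, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(tgt_dim, d_model)
        self.k_proj = nn.Linear(src_dim, d_model)
        self.v_proj = nn.Linear(src_dim, d_model)

    def forward(self, tgt: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
        # tgt: (B, T_t, tgt_dim), src: (B, T_s, src_dim)
        q = self.q_proj(tgt)                                   # (B, T_t, d)
        k = self.k_proj(src)                                   # (B, T_s, d)
        v = self.v_proj(src)                                   # (B, T_s, d)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)                       # (B, T_t, T_s)
        return torch.matmul(attn, v)                           # source aligned to T_t steps


class GatedFusion(nn.Module):
    """Blend a local (fine-grained) and a global (utterance-level) feature
    with a learned sigmoid gate (a generic gating form, not DGM's)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([local_feat, global_feat], dim=-1)))
        return g * local_feat + (1 - g) * global_feat


# Hypothetical usage: dimensions roughly follow BERT text features (768-d)
# and COVAREP acoustic features (74-d) mentioned in the references.
align = ScaledDotProductAlignment(tgt_dim=768, src_dim=74, d_model=128)
text = torch.randn(8, 50, 768)      # (batch, text steps, text dim)
audio = torch.randn(8, 375, 74)     # (batch, audio frames, audio dim)
audio_aligned = align(text, audio)  # (8, 50, 128): audio re-indexed by text steps
```

The point of the sketch is the general mechanism: attention weights computed between the two sequences act as a soft alignment, so unaligned modalities can be fused at a common temporal resolution before gating combines local and global evidence.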
PAN J H, HE Z P, LI Z N, et al. A review of multimodal emotion recognition[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 633-645.
PILLALAMARRI R, SHANMUGAM U. A review on EEG-based multimodal learning for emotion recognition[J]. Artificial Intelligence Review, 2025, 58(5): 131.
HOSSAIN M R, HOQUE M M, DEWAN M A A, et al. AuthorNet: leveraging attention-based early fusion of transformers for low-resource authorship attribution[J]. Expert Systems with Applications, 2025, 262: 125643.
GAN Y, YOU Y N, HUANG J J, et al. Multi-view clustering via multi-stage fusion[J]. IEEE Transactions on Multimedia, 2025(99): 1-13.
FU Z W, LIU F, XU Q, et al. LMR-CBT: learning modality-fused representations with CB-Transformer for multimodal emotion recognition from unaligned multimodal sequences[J]. Frontiers of Computer Science, 2023, 18(4): 184314.
LIU F, FU Z W, WANG Y L, et al. TACFN: Transformer-based adaptive cross-modal fusion network for multimodal emotion recognition[J]. CAAI Artificial Intelligence Research, 2023, 2: 9150019.
YANG Z H, HE Q, DU N S, et al. Temporal text-guided feedback-based progressive fusion network for multimodal sentiment analysis[J]. Alexandria Engineering Journal, 2025, 116: 699-709.
ZHU L N, ZHAO H Y, ZHU Z C, et al. Multimodal sentiment analysis with unimodal label generation and modality decomposition[J]. Information Fusion, 2025, 116: 102787.
WANG R Q, YANG Q M, TIAN S W, et al. Transformer-based correlation mining network with self-supervised label generation for multimodal sentiment analysis[J]. Neurocomputing, 2025, 618: 129163.
FU Z W, LIU F, XU Q, et al. NHFNET: a non-homogeneous fusion network for multimodal sentiment analysis[C]//Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME). Piscataway: IEEE Press, 2022: 1-6.
HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
LIU F, SHEN S Y, FU Z W, et al. LGCCT: a light gated and crossed complementation transformer for multimodal speech emotion recognition[J]. Entropy (Basel), 2022, 24(7): 1010.
PHAM H, LIANG P P, MANZINI T, et al. Found in translation: learning robust joint representations by cyclic translations between modalities[J]. Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019, 33(1): 6892-6899.
TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL Press, 2019: 6558-6569.
YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(12): 10790-10797.
SUN L C, LIAN Z, LIU B, et al. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2024, 15(1): 309-325.
DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Stroudsburg: ACL Press, 2019: 4171-4186.
JIAO Y T, LI R M, WANG J. Intelligent testing method for railway CTC interface data based on fuzzy natural language processing[J]. Chinese Journal of Intelligent Science and Technology, 2024, 6(2): 201-209.
SONG M, LIU Y L. Application and optimization of BERT in sentiment classification of Weibo short text[J]. Journal of Chinese Computer Systems, 2021, 42(4): 714-718.
HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: Curran Associates Inc., 2017: 6000-6010.
LYU K, LI Z, ARORA S. Understanding the generalization benefit of normalization layers: Sharpness reduction[J]. Advances in Neural Information Processing Systems, 2022, 35: 34689-34708.
YUAN Z Q, LI W, XU H, et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 4400-4407.
ZADEH A, ZELLERS R, PINCUS E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages[J]. IEEE Intelligent Systems, 2016, 31(6): 82-88.
ZADEH A A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL Press, 2018: 2236-2246.
YU W M, XU H, MENG F Y, et al. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL Press, 2020: 3718-3727.
WANG P, ZHOU Q, WU Y, et al. DLF: disentangled-language-focused multimodal sentiment analysis[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2025, 39(20): 21180-21188.
HUANG C, LIN Z, HAN Z, et al. PAMoE-MSA: polarity-aware mixture of experts network for multimodal sentiment analysis[J]. International Journal of Multimedia Information Retrieval, 2025, 14(1): 1-16.
MAO H S, YUAN Z Q, XU H, et al. M-SENA: an integrated platform for multimodal sentiment analysis[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Stroudsburg: ACL Press, 2022: 204-213.
DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies[C]//Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2014: 960-964.
MCFEE B, RAFFEL C, LIANG D, et al. librosa: audio and music signal analysis in Python[C]//Proceedings of the 14th Python in Science Conference, 2015: 18-24.
SUN Y, MAI S J, HU H F. Learning to learn better unimodal representations via adaptive multimodal meta-learning[J]. IEEE Transactions on Affective Computing, 2023, 14(3): 2209-2223.
DU J, JIN J H, ZHUANG J, et al. Hierarchical graph contrastive learning of local and global presentation for multimodal sentiment analysis[J]. Scientific Reports, 2024, 14: 5335.
WANG P C, LIU S X, CHEN J Y. CCDA: a novel method to explore the cross-correlation in dual-attention for multimodal sentiment analysis[J]. Applied Sciences, 2024, 14(5): 1934.
VAN DER MAATEN L, HINTON G. Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9(86): 2579-2605.
KE S J, NIE C Y, WANG Y M, et al. Multimodal individual emotion recognition with joint labeling based on integrated learning and clustering[J]. Chinese Journal of Intelligent Science and Technology, 2024, 6(1): 76-87.