Ren Fuji, Yu Manli, Hu Min, Li Yanqiu (Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, School of Computer and Information, Hefei University of Technology, Hefei 230009, China; Graduate School of Advanced Technology and Science, University of Tokushima, Tokushima 7708502, Japan)
Objective Most current video emotion recognition methods rely solely on facial expressions and ignore the emotional information carried by the physiological signals latent in facial videos. To address this, we propose a dual-modality video emotion recognition method based on facial expressions and the blood volume pulse (BVP) physiological signal. Method First, the video is preprocessed to obtain the facial video. Two spatio-temporal expression features, LBP-TOP and HOG-TOP, are then extracted from the facial video, and video color magnification is applied to recover the BVP physiological signal, from which emotional features are extracted. The two kinds of features are fed into BP classifiers to train classification models, and the fuzzy integral is finally used for decision-level fusion to produce the emotion recognition result. Result In experiments on a self-built facial-video emotion database, the average recognition rates of the expression modality and the physiological-signal modality are 80% and 63.75%, respectively, while the fused result reaches 83.33%, higher than either single modality, which demonstrates the effectiveness of the proposed dual-modality fusion. Conclusion The proposed dual-modality spatio-temporal feature fusion method exploits the emotional information in videos more fully and effectively improves the classification performance of video emotion recognition; comparative experiments with similar video emotion recognition algorithms verify its superiority. In addition, the fuzzy-integral decision-level fusion algorithm effectively suppresses the interference of unreliable decision information, yielding higher recognition accuracy.
Dual-modality video emotion recognition based on facial expression and BVP physiological signal
Ren Fuji, Yu Manli, Hu Min, Li Yanqiu (Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, School of Computer and Information, Hefei University of Technology, Hefei 230009, China; Graduate School of Advanced Technology and Science, University of Tokushima, Tokushima 7708502, Japan)
Objective With the continuous development of artificial intelligence, researchers from many fields have become increasingly interested in giving computers the capability to understand the emotions conveyed by human beings and to interact with them naturally. Emotion recognition has therefore become one of the key research topics for achieving harmonious human-computer interaction. The performance of video emotion recognition algorithms critically depends on the quality of the extracted emotion information. Previous research has shown that facial expression is the most direct way to convey emotional information, so current work usually relies on facial expressions alone to complete emotion recognition. Moreover, most feature extraction methods for facial expression images operate on gray-scale images. However, when color images are converted into gray-scale images, the physiological signals latent in the color information of facial videos, which carry discriminative information for emotion recognition, are lost. In this study, a novel dual-modality video emotion recognition method with fusion at the decision level, which combines facial expressions with the blood volume pulse (BVP) physiological signal that can be extracted from facial videos, is introduced to overcome this problem. Method First, the video is preprocessed (including face detection and normalization) to acquire a sequence of video frames that contain only the face region. The LBP-TOP feature is an effective local texture descriptor, whereas the HOG-TOP feature is a gradient-based local shape descriptor that compensates for the weakness of LBP-TOP in capturing image edge and orientation information. Thus, in this study, we extract the LBP-TOP and HOG-TOP features from the video frames and fuse the two facial expression features.
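To illustrate the idea behind spatio-temporal texture features such as LBP-TOP, the following is a minimal sketch, not the authors' implementation: it computes a basic 8-neighbor LBP code map on the three orthogonal planes (XY, XT, YT) of a video volume and concatenates the normalized histograms. As a simplification, only the central slice of each plane is used, whereas full LBP-TOP aggregates histograms over all slices; all function names are hypothetical.

```python
import numpy as np

def lbp_2d(img):
    """Basic 8-neighbour LBP code map for a 2-D array (borders excluded)."""
    c = img[1:-1, 1:-1]
    code = np.zeros_like(c, dtype=np.uint8)
    # offsets of the 8 neighbours, clockwise from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    return code

def lbp_top(volume):
    """Concatenate LBP histograms from the XY, XT and YT planes of a T*H*W volume."""
    t, h, w = volume.shape
    planes = [volume[t // 2],         # XY plane: the middle frame
              volume[:, h // 2, :],   # XT plane: one row tracked over time
              volume[:, :, w // 2]]   # YT plane: one column tracked over time
    hists = [np.bincount(lbp_2d(p).ravel(), minlength=256) for p in planes]
    hists = [hh / max(hh.sum(), 1) for hh in hists]  # normalise each plane's histogram
    return np.concatenate(hists)     # 3 * 256 = 768-dimensional descriptor

rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(16, 32, 32)).astype(np.int16)  # synthetic face clip
feat = lbp_top(clip)
print(feat.shape)  # (768,)
```

The XT and YT histograms are what distinguish LBP-TOP from plain LBP: they encode how texture patterns evolve over time, which is where the dynamics of a facial expression live.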
We use video color amplification technology to process the original video and extract the BVP physiological signal from the processed video. Emotional features are then extracted from the BVP signal. Afterward, the two kinds of features are fed into BP classifiers to train the classification models. Finally, the fuzzy integral is used to fuse the posterior probabilities produced by the two classifiers and obtain the final emotion recognition result. Result Because the commonly used video emotion databases cannot satisfy the requirements for extracting the BVP signal, we conduct experimental verification on a self-built facial expression video database. Each group of experiments is cross-validated, and the final results are averaged to increase the credibility of the experiments. The average recognition rates of the single modalities, i.e., facial expression and physiological signal, are 80% and 63.75%, respectively, whereas the emotion recognition rate after fusing the two modalities reaches 83.33%, higher than that of either single modality. This finding indicates that the fusion decision algorithm combining facial expression and the BVP physiological signal is effective for emotion recognition. The results of other fusion methods, namely, the D-S evidence theory and the maximum value rule, are 71% and 80%, respectively, both lower than that of the fuzzy integral method. In addition, the recognition rate of our method is 2% and 2.5% higher than those of two existing video emotion recognition methods. Conclusion The dual-modality spatio-temporal feature fusion method proposed in this study characterizes the emotion information contained in facial videos from two aspects, i.e., facial expression and physiological signals, to make full use of the emotional information of the video.
The experimental results show that this algorithm makes full use of the emotion information in the video and effectively improves the classification performance of video emotion recognition; comparative experiments with similar video emotion recognition algorithms verify the effectiveness of the proposed method. In addition, the fuzzy integral is used to fuse the two modalities at the decision level. Unlike the D-S evidence theory and the maximum value rule, it accounts for the reliability of each classifier during fusion, which effectively reduces the influence of unreliable decision information on the fused result. Consequently, the proposed fusion method achieves higher recognition accuracy, and the contrast experiments with the other fusion methods confirm its superiority.
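The decision-level fusion described above can be sketched as follows. This is an illustrative Sugeno fuzzy-integral fusion of two classifiers' class posteriors, not the authors' exact formulation: the fuzzy densities g1 and g2 (each classifier's assumed reliability), the posterior vectors, and the four-class setup are all hypothetical values chosen for the example. The lambda parameter of the Sugeno lambda-fuzzy measure is obtained from its defining equation, which for two sources has a closed form.

```python
import numpy as np

def lambda_measure(g1, g2):
    """Solve (1 + lam) = (1 + lam*g1)(1 + lam*g2) for the Sugeno lambda-measure.
    For two sources the non-trivial root is lam = (1 - g1 - g2) / (g1 * g2)."""
    if abs(g1 + g2 - 1.0) < 1e-12:
        return 0.0  # densities already sum to 1: the measure is additive
    return (1.0 - g1 - g2) / (g1 * g2)

def sugeno_fuse(p1, p2, g1, g2):
    """Fuse two classifiers' per-class posteriors with the Sugeno fuzzy integral."""
    lam = lambda_measure(g1, g2)
    fused = []
    for h1, h2 in zip(p1, p2):
        # sort the two confidences in descending order, keeping each source's density
        (ha, ga), (hb, _) = sorted([(h1, g1), (h2, g2)], key=lambda x: -x[0])
        g_top = ga                                   # measure of {top source}
        other = g1 + g2 - ga                         # density of the remaining source
        g_all = min(1.0, ga + other + lam * ga * other)  # measure of {both} (= 1)
        # Sugeno integral: max over nested sets of min(confidence, measure)
        fused.append(max(min(ha, g_top), min(hb, g_all)))
    return np.array(fused)

# hypothetical posteriors for 4 emotion classes from the two modalities
p_expr = np.array([0.70, 0.10, 0.15, 0.05])  # facial-expression classifier
p_bvp  = np.array([0.40, 0.35, 0.15, 0.10])  # BVP-signal classifier
scores = sugeno_fuse(p_expr, p_bvp, g1=0.8, g2=0.5)  # expression deemed more reliable
print(int(np.argmax(scores)))  # class 0 wins
```

The key property this sketch demonstrates is that a source with a low density cannot dominate the fused score on its own: its confidence is clipped by the measure of the set it belongs to, which is how unreliable decision information is down-weighted.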