Objective: Compared with posed expressions, spontaneous expressions better reveal a person's true emotions and have great application potential in fields such as national security and medical care. Because spontaneous expressions are hard to induce and their samples are difficult to collect, available data are scarce. To discriminate the categories of spontaneous expressions, this paper combines neural-network learning methods, now widely applied in more and more scenarios, and proposes a discrimination method based on deep transfer networks. Method: To preserve the characteristics of the original spontaneous-expression images, no data augmentation is used even on the small samples, and three-channel optical-flow images serve as comparison samples. The samples are fed into different transfer-network models for training; trained networks of the same structure are then combined into an isomorphic network whose output discriminates the category of spontaneous expression. Result: Experiments show that the proposed method exhibits excellent classification performance on different databases. On the public spontaneous-expression databases CASME, CASME II, and CAS(ME)^2, the average test accuracies reach 94.3%, 97.3%, and 97.2%, respectively, 7% higher than the best previously reported results. Conclusion: This paper applies transfer learning to the discrimination of spontaneous-expression categories, compares different network models and different kinds of samples, and achieves the best average accuracy reported so far for this task.
Classification of small spontaneous expression databases based on a deep transfer learning network
Fu Xiaofeng, Wu Jun, Niu Li (School of Computer Science and Technology)
Objective: Expression plays an important role in human-computer interaction. As a special kind of expression, spontaneous expression has a shorter duration and weaker intensity than posed expressions, yet it can reveal a person's true emotions and has great potential applications in detection, anti-detection, medical diagnosis, and so on. Identifying the categories of spontaneous expression can therefore make human-computer interaction smoother and fundamentally change the relationship between people and computers. Because spontaneous expression is difficult to induce and to collect, each spontaneous-expression database contains only a small number of samples, far too few to train a new deep neural network from scratch. Convolutional neural networks show excellent performance and are widely used in more and more scenarios; for instance, they outperform traditional feature-extraction methods in discriminating the categories of spontaneous expression. Method: This paper proposes a method based on different deep transfer-network models for discriminating the categories of spontaneous expression. To preserve the characteristics of the original spontaneous-expression images, data augmentation is not used, which also avoids introducing samples that could hinder convergence. At the same time, training on three-channel samples, composed of optical-flow images and a grayscale image, is compared with training on the original RGB images. The three-channel image contains both spatial information and temporal displacement information. We compare three network models on these different samples. The first model is based on AlexNet and changes only the number of output-layer neurons, setting it equal to the number of spontaneous-expression categories.
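The three-channel samples described above could be assembled along the following lines. This is a minimal NumPy sketch, not the authors' implementation: the channel ordering and normalization are our assumptions, and the dense flow field would in practice come from an optical-flow routine such as OpenCV's `calcOpticalFlowFarneback`.

```python
import numpy as np

def make_of_plus_sample(next_gray, flow):
    """Stack horizontal flow, vertical flow, and a grayscale frame into one
    three-channel (H, W, 3) image, so it can be fed to a network that
    expects three input channels, like an RGB image.

    next_gray: (H, W) uint8 grayscale frame.
    flow:      (H, W, 2) float array of per-pixel (dx, dy) displacements,
               e.g. produced by a dense optical-flow algorithm.
    """
    def to_uint8(channel):
        # Normalize an arbitrary-range channel to the 0..255 uint8 range.
        channel = channel.astype(np.float32)
        span = channel.max() - channel.min()
        if span == 0:
            return np.zeros(channel.shape, dtype=np.uint8)
        return ((channel - channel.min()) / span * 255).astype(np.uint8)

    dx, dy = flow[..., 0], flow[..., 1]
    # Assumed channel order: [flow-x, flow-y, grayscale frame].
    return np.dstack([to_uint8(dx), to_uint8(dy), next_gray]).astype(np.uint8)
```

The resulting array has the same shape and dtype as an RGB image, so the same transfer-network input pipeline can be reused for both sample types.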
The network is then fine-tuned, freezing the parameters of different layers over multiple runs, to obtain the best training and testing results. The second model is based on InceptionV3. Two fully connected layers, with 512 neurons and as many neurons as there are spontaneous-expression categories, respectively, are added to produce the output, so only the parameters of these two layers need fine-tuning. It is worth noting that the depth of this network increases while its number of parameters decreases, because 3×3 convolution kernels replace the 7×7 kernel. The third model is based on Inception-ResNet-v2; as with the first model, we change only the number of output-layer neurons. Finally, an isomorphic network model is proposed to identify the categories of spontaneous expression. It is composed of two transfer-learning networks of the same type trained on different samples, and it takes the maximum of their outputs as the final prediction. The isomorphic network makes decisions with high accuracy: when its two sub-networks agree, the shared output is very likely correct, and from a probabilistic point of view, taking the maximum of the different outputs as the prediction is a sound choice. Result: Experimental results show that the proposed method exhibits excellent classification performance on different samples. The single-network outputs make clear that features extracted from RGB images are as effective as those extracted from optical-flow three-channel images, indicating that the spatiotemporal features computed by the optical-flow method can be replaced by features extracted by a deep neural network. They also show that features extracted by the network can, to a certain degree, compensate for missing information, such as the temporal features absent from RGB images or the color features absent from OF+ images. The high average accuracy of each single network indicates good testing performance on every dataset.
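The isomorphic network's max-fusion rule can be sketched as follows. This is a toy NumPy illustration under the assumption that each sub-network emits one softmax probability per class; the function name and the example probability vectors are ours, not from the paper.

```python
import numpy as np

def isomorphic_predict(probs_a, probs_b):
    """Fuse two same-architecture networks trained on different samples
    (e.g. one on RGB inputs, one on optical-flow three-channel inputs):
    for each class take the maximum of the two predicted probabilities,
    then choose the class with the largest fused score."""
    fused = np.maximum(probs_a, probs_b)  # element-wise max per class
    return int(np.argmax(fused))

# Hypothetical softmax outputs of the two sub-networks for one sample.
p_rgb = np.array([0.10, 0.75, 0.15])
p_of = np.array([0.05, 0.60, 0.35])
predicted_class = isomorphic_predict(p_rgb, p_of)  # fused = [0.10, 0.75, 0.35]
```

Taking the per-class maximum means the ensemble follows whichever sub-network is more confident about each category, which matches the paper's rationale that a confident shared output is very likely correct.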
Networks with more complex structures perform better, because the spontaneous-expression samples are sufficient to train a deep transfer-learning network well. The proposed models achieve state-of-the-art performance, with an average accuracy above 96%. Analysis of the isomorphic network model's results shows that it does not always outperform a single network: because a single network already discriminates the categories of spontaneous expression with high confidence, the isomorphic network has little room to improve the average accuracy. Conclusion: Compared with posed expressions, spontaneous expressions change subtly and their features are difficult to extract. In this paper, different transfer-learning networks are applied to discriminating the categories of spontaneous expression, and the testing accuracies of different networks trained on different kinds of samples are compared. The experimental results show that, in contrast to traditional methods, deep learning has obvious advantages in spontaneous-expression feature extraction. They also show that deep networks can extract complete features from spontaneous expressions and are robust across databases, as confirmed by good testing results on all of them. In the future, we will work on extracting spontaneous expressions directly from video and on identifying their categories with higher accuracy by removing distractors such as blinking.