Ling Peipei, Qiu Song, Cai Mingming, Xu Wei, Feng Ying (Shanghai Key Laboratory of Multidimensional Information Processing, College of Information Science and Technology, East China Normal University, Shanghai 200241, China; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China)
Objective Traditional 2D feature extraction methods can hardly capture human joint positions accurately from video, which caps the achievable recognition rate. 3D feature extraction based on depth information improves the recognition rate, but the high computational complexity of operating in high-dimensional space makes real-time recognition difficult and restricts the application scenarios. To overcome these difficulties, this paper proposes a human action recognition method based on 3D privileged learning, which introduces 3D information into the traditional 2D action recognition process as privileged information. Method Motion-boundary-histogram dense optical flow features, MoSIFT (motion SIFT) features, and hybrid features combining several descriptors serve as the basic 2D features. Human joint positions are estimated from the depth information captured by a Kinect sensor and processed with a Lie-group algorithm to obtain 3D features, which serve as the privileged information; under a classical support vector machine, the privileged information alone outperforms the basic 2D features. The training data contain both the basic 2D features and the 3D privileged information, whereas the test data contain only the basic 2D features. A support vector machine incorporating privileged information (SVM+) is learned from the training samples and then used to classify the test samples, yielding the action recognition results. Result Experiments are conducted on two human action datasets, UTKinect-Action and Florence3D-Action. With the privileged information, the recognition rate improves over traditional 2D recognition by 2% on average and by up to 9%. The SVM+ classifier is also less sensitive to its parameters than SVM. Conclusion The experimental results show that, compared with previous methods, the proposed approach improves recognition accuracy while reducing the classifier's sensitivity to parameters. Both the basic 2D features and the 3D privileged information are needed only during training; at test time, no depth-sensing device is required to extract 3D privileged features. The method learns quickly, has low computational complexity, and is well suited to low-cost, real-time human action recognition applications.
Human action recognition based on privileged information
Ling Peipei, Qiu Song, Cai Mingming, Xu Wei, Feng Ying (Shanghai Key Laboratory of Multidimensional Information Processing, College of Information Science and Technology, East China Normal University, Shanghai 200241, China; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China)
Objective The study of human action recognition has important academic and application value. It is widely applied in intelligent surveillance, video retrieval, human-computer interaction, live entertainment, virtual reality, and health care. In human learning, a teacher can provide students with information hidden in examples, explanations, comments, and comparisons. However, such teacher-supplied information is seldom exploited in human action recognition. This study treats 3D depth features as privileged information to help solve human action recognition problems and to demonstrate the superiority of the new learning paradigm over the classical one. This paper reports on the details of the new paradigm and its corresponding algorithms. Method The human body can be represented as an articulated system of rigid segments connected by joints, and human motion can be regarded as a continuous evolution of the spatial configuration of these segments. With the recent release of depth cameras, an increasing number of studies have extracted the 3D positions of tracked joints to represent human activities; these studies have achieved relatively good performance. However, the related 3D algorithms face many practical limits owing to inconvenient equipment and costly computation, and extracting joints from RGB video sequences is difficult, which caps the recognition performance. This study applies 3D depth features as privileged information to overcome these challenges. In particular, we adopt a skeletal representation that explicitly models the 3D geometric relationships among body parts using rotations and translations in 3D space, expressed in a Lie group. For the basic 2D features that the privileged information complements, we use several descriptors, including motion scale-invariant feature transform (MoSIFT), motion boundary histograms, and combinations thereof.
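The Lie-group skeletal representation encodes the relative 3D geometry between pairs of body parts as rotations and translations. As a minimal sketch of one ingredient (not the paper's implementation; the bone vectors below are made up for illustration), the rotation aligning one body-part direction with another can be obtained from Rodrigues' formula:

```python
# Hypothetical sketch: relative rotation between two body-part (bone)
# direction vectors, as used in Lie-group skeletal representations.
import numpy as np

def rotation_between(u, v):
    """Rotation matrix R such that R @ (u/|u|) = v/|v| (Rodrigues' formula)."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    axis = np.cross(u, v)
    s = np.linalg.norm(axis)        # sin(theta)
    c = np.dot(u, v)                # cos(theta)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)        # already aligned
        raise ValueError("antiparallel bones: rotation axis is ambiguous")
    k = axis / s                    # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

# Made-up bone vectors (joint coordinates from a skeleton would be used here).
upper_arm = np.array([0.0, -0.3, 0.05])
forearm = np.array([0.1, -0.25, 0.1])
R = rotation_between(upper_arm, forearm)
```

In the full representation, such relative rotations and translations for all body-part pairs are mapped to the Lie algebra (via the matrix logarithm) to obtain a vector-space feature for each frame.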
Privileged information is available in the training stage but not in the testing stage. Similar to the traditional classification problem, the new algorithm focuses on learning a new classifier, support vector machine plus (SVM+). The SVM+ algorithm, which considers both privileged and unprivileged information, is highly similar to SVM in how it determines solutions within the classical pattern recognition framework: it finds the optimal separating hyperplane, which incurs few training errors and exhibits a large margin. However, SVM+ is computationally costlier than SVM. This study applies the new algorithm to human action recognition, which brings convenience at test time because the 3D information is required only in the training set. Result We evaluate our method on two challenging datasets, UTKinect-Action and Florence3D-Action, with three different 2D features. The SVM+ algorithm considers both the 2D basic features and the 3D privileged information, whereas SVM uses only the 2D basic features. Results show that the proposed SVM+ outperforms SVM. Moreover, SVM+ is less sensitive to the relevant parameters than SVM. This paper reports the recognition performance under varying numbers of training samples and different parameters, together with the confusion matrices of SVM and SVM+ on the two datasets. Privileged information helps suppress the noise in the original 2D basic features and increases the robustness of human action recognition. Conclusion The role of a teacher who provides remarks, explanations, and analogies is highly important. This study proposes a new human action recognition method based on privileged information; experiments on the two datasets show its effectiveness. The proposed method needs to extract the 3D privileged information only during training.
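For reference, the standard SVM+ optimization problem (Vapnik's formulation of learning using privileged information, stated here from the general literature rather than quoted from this paper) replaces the slack variables of the ordinary SVM with a correcting function learned in the privileged space. Writing $x_i$ for the basic 2D features and $x_i^{*}$ for the 3D privileged features:

```latex
\min_{w,\, w^{*},\, b,\, b^{*}} \;
  \frac{1}{2}\|w\|^{2} + \frac{\gamma}{2}\|w^{*}\|^{2}
  + C \sum_{i=1}^{n} \bigl[\langle w^{*}, x_i^{*}\rangle + b^{*}\bigr]
\quad \text{s.t.} \quad
  y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \bigl[\langle w^{*}, x_i^{*}\rangle + b^{*}\bigr],
  \qquad
  \langle w^{*}, x_i^{*}\rangle + b^{*} \ge 0,\; i = 1,\dots,n .
```

The decision function $f(x) = \operatorname{sign}(\langle w, x\rangle + b)$ depends only on the 2D features, which is why no privileged information is needed at test time.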
A depth information acquisition device is not required during testing. The method learns quickly, has low computational complexity, and can be widely used in low-cost, real-time human action recognition.
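The train/test asymmetry described above, with privileged 3D features available only during training, can be sketched as follows. This is a simplified, distillation-style stand-in for SVM+ (which is not provided by scikit-learn, and whose exact quadratic program is not reproduced here): a "teacher" trained on the privileged features weights the training samples for the 2D "student". All data and feature dimensions are synthetic.

```python
# Hypothetical sketch of the privileged-information train/test asymmetry.
# NOT the paper's SVM+ solver: a teacher on 3D privileged features guides
# a 2D student via per-sample weights (a distillation-style approximation).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_split(n):
    """Synthetic stand-ins: noisy 2D basic features, cleaner 3D privileged ones."""
    y = rng.choice([-1, 1], size=n)
    x2d = y[:, None] * 0.5 + rng.normal(size=(n, 10))  # weak signal
    x3d = y[:, None] * 2.0 + rng.normal(size=(n, 5))   # strong signal
    return x2d, x3d, y

X2d, X3d, y = make_split(200)           # training: both feature sets available
X2d_test, _, y_test = make_split(100)   # testing: only 2D features are used

# Teacher sees the privileged 3D features (training time only).
teacher = SVC(kernel="rbf").fit(X3d, y)
margins = np.abs(teacher.decision_function(X3d))

# Samples the teacher separates confidently receive higher weight.
weights = 1.0 / (1.0 + np.exp(-margins))

# Student is trained on 2D features only, guided by the teacher's weights.
student = SVC(kernel="rbf").fit(X2d, y, sample_weight=weights)

# At test time, no privileged (depth) information is needed.
acc = float((student.predict(X2d_test) == y_test).mean())
```

The key point mirrored from the paper is structural: the privileged features enter only the training stage, so deployment needs no depth sensor.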