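Objective Indoor multi-object detection tends to suffer from low accuracy and poor real-time performance under variations in illumination, shooting angle, object count, and object size. To address this, we propose an object detection method based on paired color and depth images that combines stepwise superpixel merging with multimodal information fusion. Method In the object proposal stage, following the theory that the human eye, when observing a salient object, attends first to its color and then judges its spatial depth, the image is first segmented into superpixels. The resulting blocks are then merged step by step, using color and depth information, with multi-threshold, scale-adaptive superpixel merging, which yields object proposal regions that are consistent in color and space. In the recognition stage, to fully express the different kinds of object information, multiple kernel learning fuses the extracted color, texture, contour, and depth features, and the fused kernel is fed to a multi-class support vector machine for training and classification. Result Experiments on the University of Washington standard RGB-D dataset and on a real-scene dataset compare the proposed method with current mainstream algorithms. Overall detection accuracy improves by 4.7% over the current mainstream algorithms, and running time improves substantially. The stepwise superpixel merging localizes objects better than current mainstream object proposal methods and, at the same recall, needs roughly four times fewer sampling windows than other algorithms. Multi-information fusion outperforms both single features and simple color-depth fusion in the recognition stage. Conclusion The results show that, in multi-feature object detection, the proposed method makes effective use of color and depth information for object localization and recognition, and plays an important role in improving detection accuracy and efficiency.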
Objective With the development of artificial intelligence, a growing number of researchers are studying object detection in computer vision. Work is no longer confined to RGB images, and detection methods based on depth images have become a hot topic. However, the accuracy and real-time performance of indoor multi-class object detection are sensitive to illumination changes, shooting angle, the number of objects, and object size. To improve accuracy, some studies have turned to deep learning. Although deep learning can effectively extract the underlying features of objects at different levels, its need for large sample sets and long training times prevents immediate widespread application. To improve efficiency, many researchers have tried to find all regions likely to contain objects from the objects' edge information, thereby reducing the number of detection windows; later work used deep learning to preselect these regions. To address these problems, this paper proposes a two-stage method that operates on RGB-D images. The first stage is object proposal by stepwise superpixel merging; the second is object classification by multimodal data fusion. Method In the object proposal stage, the method first segments the image into superpixels and then merges them step by step with a self-adaptive multi-threshold scale based on color and depth information, following the theory that the eye attends first to a salient object's color and then to its depth. Specifically, the image is segmented with Simple Linear Iterative Clustering (SLIC), and the superpixels are merged in two steps that compute region similarity from color information and from depth information, respectively.
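The two-step merging can be illustrated with a minimal sketch. It is not the implementation used in the paper: it assumes that each superpixel has already been reduced to a mean color and a mean depth, that adjacency between superpixels is given as an edge list, and that similarity is a Euclidean distance compared against a single threshold per step. The names `two_step_merge`, `merge_pass`, `color_thr`, and `depth_thr` are hypothetical.

```python
from math import dist

def find(parent, i):
    # Path-halving find for a union-find forest.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def region_means(values, parent):
    # Mean feature vector of each current region, keyed by its root id.
    sums, counts = {}, {}
    for i, v in enumerate(values):
        r = find(parent, i)
        acc = sums.setdefault(r, [0.0] * len(v))
        for k, x in enumerate(v):
            acc[k] += x
        counts[r] = counts.get(r, 0) + 1
    return {r: [s / counts[r] for s in acc] for r, acc in sums.items()}

def merge_pass(values, edges, parent, threshold):
    # Merge adjacent regions whose mean features are closer than threshold.
    # Means are computed once at the start of the pass (a simplification).
    means = region_means(values, parent)
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb and dist(means[ra], means[rb]) < threshold:
            parent[rb] = ra

def two_step_merge(colors, depths, edges, color_thr, depth_thr):
    # Step 1 merges by color; step 2 merges the resulting regions by depth.
    parent = list(range(len(colors)))
    merge_pass(colors, edges, parent, color_thr)
    merge_pass(depths, edges, parent, depth_thr)
    return [find(parent, i) for i in range(len(colors))]

# Toy data: four superpixels in a chain 0-1-2-3 (illustrative values only).
colors = [(10, 10, 10), (12, 10, 10), (200, 0, 0), (11, 10, 10)]
depths = [(1.0,), (1.05,), (1.02,), (3.0,)]
edges = [(0, 1), (1, 2), (2, 3)]

labels = two_step_merge(colors, depths, edges, color_thr=5.0, depth_thr=0.1)
print(labels)  # [0, 0, 0, 3]
```

In this toy case superpixels 0 and 1 merge on color, superpixel 2 joins them on depth despite its different color, and superpixel 3 stays separate because it lies much farther away. Calling `two_step_merge` with several threshold pairs would give regions at several scales, which is one possible reading of the multi-threshold scheme.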
In this way, detection windows with similar color and depth information are extracted; their number is then reduced by filtering on area and by applying non-maximum suppression to overlapping detections. At the end of this process, the number of windows is far smaller than with a sliding-window scan, and each region may contain an object or part of an object. In the object recognition stage, the proposed method uses multiple kernel learning to fuse multimodal features (color, texture, contour, and depth) extracted from the RGB-D images. Because objects are so varied, they are easily confused when identified from a single feature; for example, it is difficult to distinguish a real apple from one painted in a picture. Multimodal data fusion captures richer object characteristics in RGB-D images than a single feature or a simple fusion of two features. Finally, the fused feature kernel is fed to the SVM classifier, which completes the detection procedure. Result The proposed method is compared with current mainstream algorithms under different settings of the threshold segmentation interval and of the Gaussian kernel parameters used in multiple kernel learning, and it shows an advantage in overall object detection performance. In comparative experiments on the standard RGB-D dataset from the University of Washington and on a real-scene dataset captured with a Kinect sensor, the detection rate of the method is 4.7% higher than that of the state of the art. Meanwhile, the stepwise superpixel merging outperforms current mainstream object proposal methods in object localization, and at the same recall it needs roughly four times fewer sampling windows than other algorithms.
A comparison of recognition accuracy between individual features and fused features shows that multi-feature fusion achieves much higher overall detection accuracy than either individual features or the fusion of two features, and that it also performs well on object categories that appear in different poses. Conclusion The experimental results show that the proposed method makes full use of both color and depth information in object localization and classification, and that it contributes to higher accuracy and better real-time performance. In addition, the object proposal method based on stepwise superpixel merging can also be applied to object detection based on deep learning.
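The kernel fusion used in the recognition stage can be sketched as a weighted sum of per-modality Gaussian kernels. This is only an illustration under stated assumptions: multiple kernel learning would normally learn the weights from training data, whereas here they are fixed by hand, and the feature vectors, `gamma` values, and the function names `gaussian_kernel` and `fuse_kernels` are hypothetical. Only two modalities are shown for brevity.

```python
from math import exp

def gaussian_kernel(X, gamma):
    # K[i][j] = exp(-gamma * ||x_i - x_j||^2) for feature vectors in X.
    n = len(X)
    return [[exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def fuse_kernels(kernels, weights):
    # Convex combination of kernel matrices (weights normalized to sum to 1).
    # Assumption: weights are fixed here rather than learned as in MKL.
    total = sum(weights)
    n = len(kernels[0])
    return [[sum(w / total * K[i][j] for K, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]

# Toy features for three samples (illustrative values only).
color_feats = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
depth_feats = [[1.0], [1.1], [3.0]]

K_color = gaussian_kernel(color_feats, gamma=1.0)
K_depth = gaussian_kernel(depth_feats, gamma=0.5)
K_fused = fuse_kernels([K_color, K_depth], weights=[0.6, 0.4])

# Samples 0 and 1 are similar in both modalities, so their fused
# similarity is higher than that of samples 0 and 2.
print(K_fused[0][1] > K_fused[0][2])  # True
```

A fused matrix of this kind could then be passed to an SVM that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. A convex combination of valid kernels is itself a valid kernel, which is why the fused matrix can be used in this way.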