Abstract: Objective In object segmentation from sequence images or multi-view images, traditional co-segmentation algorithms are not robust on complex multi-image sets, while existing deep learning algorithms tend to produce erroneous and inconsistent segmentations when the foreground and background are highly ambiguous. Method This paper proposes a multi-image segmentation algorithm based on deep features. To help the model learn the detailed features of multi-view images of complex scenes, PSPNet-50 is first improved: the features output by the fifth part of the network are upsampled and concatenated with the features output by the first part. By fusing the high-resolution details of the shallow layers, the improved model reduces the impact that the spatial information lost in deeper layers has on segmentation edge details. The improved network is then pre-trained on the ADE20k scene parsing dataset, so that the model, trained on a large amount of data, gains strong robustness and generalization ability. Next, prior segmentations of one or two images are obtained through an interactive segmentation algorithm, and this small amount of prior segmentation is fused into the new model; re-training the network resolves foreground/background ambiguity and enforces segmentation consistency across the images. From multiple groups of experiments, a line chart of re-training iterations versus segmentation accuracy is drawn to find a near-optimal number of iterations. Finally, a fully connected conditional random field (CRF) model is constructed, coupling the recognition capability of the deep convolutional neural network with the precise localization of the fully connected CRF, which better handles boundary localization and further refines the edges of the segmented objects. Result Segmentation tests were conducted on multi-image sets from public datasets, containing both outdoor buildings and common indoor objects. The experimental results show that the algorithm not only segments objects pre-trained on large amounts of data well, but also effectively avoids ambiguous region segmentation for object classes that were never pre-trained. For both simple image sets with clearly distinct foreground and background and complex sets with similar foreground and background colors, the average pixel accuracy (PA) and intersection over union (IOU) both exceed 95%. Conclusion The proposed algorithm is robust for multi-image segmentation in various scenes; by incorporating a small amount of prior information, the model distinguishes the object from the background more effectively and achieves consistent segmentation across the images.
Multi-image object segmentation based on deep learning
Liao Xuan, Miao Jun, Chu Jun, Zhang Guimei (Nanchang Hangkong University)
Abstract: Objective Object segmentation from multiple images is the task of locating the positions and extents of the common objects of interest in the scene depicted by a sequence image set or a set of multi-view images. It underpins a variety of computer vision applications, such as object detection and tracking, scene understanding, and 3D reconstruction. Early approaches treated object segmentation as histogram matching of color values and applied only to pairs of images containing the same or similar objects. Later, object co-segmentation methods were introduced. Most of these take an MRF model as the basic framework and establish a cost function consisting of the energy within each image and the energy between images, using features computed from the gray or color values of pixels; the cost function is minimized to obtain a consistent segmentation. However, when the foreground and background colors in the images are similar, co-segmentation struggles to achieve object segmentation with consistent regions. In recent years, with the development of deep learning, a number of methods based on various deep models have been proposed. Some, such as the fully convolutional network, adopt convolutional neural networks to extract the high-level semantic features of the image and attain end-to-end, pixel-level image classification. These methods obtain better precision than the traditional ones and, unlike them, learn appropriate features automatically for the individual classes without manual feature selection and tuning. In fact, exactly segmenting even a single image requires combining spatial information at multiple levels.
Hence, multi-image segmentation not only demands fine-grained accuracy in local regions, as single-image segmentation does, but also requires a balance of local and global information across the images. When ambiguous regions exist between foreground and background, or when sufficient prior information about the object is unavailable, most deep learning methods also tend to produce erroneous and inconsistent segmentations of a sequence image set or multi-view images. Method In this paper, we propose a multi-image segmentation method based on deep feature extraction. First, similar to the PSPNet-50 model, a residual network is used to extract features in the first 50 layers. These features feed a pyramid pooling module with pooling filters of different sizes, and the features of different levels are fused. After a convolutional layer and an up-convolutional layer, the initial end-to-end outputs are obtained. To help the model learn more detailed features from multi-view images of complex scenes, we upsample the output features of the fifth part of the network and concatenate them with the output features of the first part. The PSPNet-50 model is thereby improved by integrating the high-resolution details of the shallow layers, which reduces the impact of spatial information loss on segmentation edge details as the network deepens. In the training phase, the improved network is first pre-trained on the ADE20k dataset, so that the model, trained on a large amount of data, gains strong robustness and generalization ability. After that, prior segmentations of one or two images of the object are obtained with an interactive segmentation approach, and these small amounts of prior segmentation are fused into the new model.
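The shallow/deep feature fusion described above can be sketched as follows. This is a schematic numpy illustration, not the trained model: nearest-neighbor upsampling stands in for the network's learned upsampling, and the tensor shapes are illustrative placeholders.

```python
import numpy as np

def upsample_nearest(feat, scale):
    """Nearest-neighbor upsampling of an (H, W, C) feature map."""
    return feat.repeat(scale, axis=0).repeat(scale, axis=1)

def fuse_shallow_deep(shallow, deep):
    """Upsample the deep (fifth-part) features to the resolution of the
    shallow (first-part) features and concatenate along the channel axis,
    as in the improved PSPNet-50 skip connection."""
    scale = shallow.shape[0] // deep.shape[0]
    deep_up = upsample_nearest(deep, scale)
    return np.concatenate([shallow, deep_up], axis=-1)

# Illustrative shapes: high-resolution shallow features, low-resolution deep features.
shallow = np.zeros((8, 8, 4))
deep = np.ones((2, 2, 16))
fused = fuse_shallow_deep(shallow, deep)   # shape (8, 8, 20)
```

The fused map keeps the shallow layer's spatial detail while carrying the deep layer's semantic channels, which is what mitigates the edge-detail loss as the network deepens.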
Then the network is re-trained to resolve the ambiguous segmentation between foreground and background and the inconsistent segmentation among the multiple images. We analyze the relationship between the number of re-training iterations and segmentation accuracy over a large set of experiments to find the optimal number of iterations. Finally, by constructing a fully connected conditional random field, the recognition ability of the deep convolutional neural network and the accurate localization of the fully connected CRF are coupled together, so that the object region is located more precisely and the object edges are detected more cleanly. Result We evaluate our method on multi-image sets from several public datasets, containing both outdoor buildings and indoor objects. We also compare our results with recent deep learning methods, namely Fully Convolutional Networks (FCN) and the Pyramid Scene Parsing Network (PSPNet). Experiments on the multi-view datasets "Valbonne" and "Box" show that our algorithm not only exactly segments the object region for re-trained classes, but also effectively avoids ambiguous region segmentation for untrained object classes. For quantitative evaluation, the commonly used accuracy measures, average pixel accuracy (PA) and intersection over union (IOU), are computed. The results show that our algorithm attains better scores not only on complex image sets with similar foreground and background context, but also on simple sets with obvious differences between foreground and background. For example, on the "Valbonne" set, the PA and IOU of our result are 0.9683 and 0.9469 respectively, whereas FCN obtains 0.7027 and 0.6942 and PSPNet obtains 0.8509 and 0.8240; our method scores more than 10 percentage points higher than PSPNet and more than 20 points higher than FCN.
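The fully connected CRF refinement can be illustrated with a minimal mean-field sketch. This is a naive O(N^2) toy version with only a bilateral (position + color) kernel and a Potts label compatibility; the actual method uses efficient high-dimensional filtering, and all parameter values below (`theta_pos`, `theta_rgb`, `compat`) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dense_crf_mean_field(unary, image, n_iters=5,
                         theta_pos=3.0, theta_rgb=10.0, compat=1.0):
    """Naive mean-field inference for a fully connected CRF.
    unary: (H, W, L) negative log-probabilities from the network.
    image: (H, W, 3) float RGB used in the bilateral pairwise kernel."""
    H, W, L = unary.shape
    N = H * W
    ys, xs = np.mgrid[0:H, 0:W]
    pos = np.stack([ys, xs], axis=-1).reshape(N, 2) / theta_pos
    col = image.reshape(N, 3) / theta_rgb
    feats = np.concatenate([pos, col], axis=1)
    # Gaussian kernel between every pair of pixels (no self-message).
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2)
    np.fill_diagonal(K, 0.0)
    u = unary.reshape(N, L)
    Q = softmax(-u)                       # initialize from the unary potentials
    for _ in range(n_iters):
        msg = K @ Q                       # message passing from all other pixels
        # Potts compatibility: penalize mass assigned to the *other* labels.
        Q = softmax(-u - compat * (msg.sum(axis=1, keepdims=True) - msg))
    return Q.reshape(H, W, L)
```

Because the bilateral kernel couples only pixels that are close in both position and color, a stray misclassified pixel inside a color-homogeneous region is overruled by its neighbors, which is exactly the boundary/region cleanup role the CRF plays after the network's prediction.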
On the "Box" set, our method achieves a PA of 0.9946 and an IOU of 0.9577, whereas both FCN and PSPNet fail to find the real region of the object because the "Box" class is not among their training classes. Similar improvements appear on the other datasets; the average PA and IOU of our method exceed 0.95. Conclusion Experimental results demonstrate that our algorithm is robust in various scenes and achieves more consistent segmentation across multi-view images. A small amount of prior information helps to predict the object region accurately at the pixel level and enables the model to distinguish the object region from the background more effectively. Compared with previous methods, our approach consistently outperforms competing ones, for both trained and untrained object classes.
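The two evaluation measures reported above are standard and can be computed directly from binary label maps; the small 4x4 example below is purely illustrative.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """PA: fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())

def foreground_iou(pred, gt):
    """IOU: intersection over union of the foreground (label 1) regions."""
    inter = np.logical_and(pred == 1, gt == 1).sum()
    union = np.logical_or(pred == 1, gt == 1).sum()
    return float(inter) / float(union)

# Toy example: a 2x2 ground-truth object and a prediction shifted one column right.
gt = np.zeros((4, 4), dtype=int)
gt[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 2:4] = 1
pa = pixel_accuracy(pred, gt)     # 12 of 16 pixels agree -> 0.75
iou = foreground_iou(pred, gt)    # overlap 2, union 6 -> 1/3
```

PA rewards overall pixel agreement (including background), while IOU isolates how well the predicted object region overlaps the true one, which is why both are reported together.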