Fang Zheng, Cao Tieyong, Hong Shizhan, Xiang Shengkai (Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China)
Objective Saliency detection is a fundamental problem in image processing and vision. Traditional models preserve the boundaries of salient objects well but assign them insufficient confidence, yielding low recall, whereas deep learning models assign salient objects high confidence but produce coarse boundaries and lower precision. To exploit the complementary strengths of the two kinds of models while suppressing their respective weaknesses, a composite saliency model is proposed. Method First, the recent dense convolutional network is adapted and a fully convolutional network (FCN) saliency model is trained on it; an existing superpixel-based saliency regression model is also selected. After the saliency maps of both models are obtained, a proposed fusion algorithm combines them into the final optimized result via the Hadamard product of the saliency maps and a one-to-one nonlinear mapping of per-pixel saliency values. Result Experiments compare the proposed model with 10 state-of-the-art methods on four datasets. On HKU-IS, the F-measure improves by 2.6% over the second-best model; on MSRA, the F-measure improves by 2.2% and the MAE decreases by 5.6%; on DUT-OMRON, the F-measure improves by 5.6% and the MAE decreases by 17.4%. Comparative experiments on MSRA further verify that the fusion algorithm improves saliency detection. Conclusion The proposed saliency model combines the advantages of traditional and deep learning models and yields more accurate saliency detection results.
Saliency detection via fusion of deep model and traditional model
Fang Zheng, Cao Tieyong, Hong Shizhan, Xiang Shengkai (Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China)
Objective Saliency detection is a fundamental problem in computer vision and image processing that aims to identify the most conspicuous objects or regions in an image. It has been widely used in many visual applications, including object retargeting, scene classification, visual tracking, image retrieval, and semantic segmentation. Most traditional approaches derive salient objects from features extracted from pixels or regions, and the final saliency map consists of these regions with their saliency scores. The performance of such models relies on the segmentation method and the choice of features, so they cannot produce satisfactory results on images with multiple salient objects or low-contrast content. Traditional approaches preserve object boundaries well but assign salient objects insufficient confidence, yielding low recall. Convolutional neural networks (CNNs) have been introduced into pixel-wise prediction problems such as saliency detection because of their outstanding performance in image classification. CNNs recast saliency as a labeling problem in which the features separating salient from non-salient objects are learned automatically through gradient descent. A classification CNN cannot be used directly to train a saliency model; one workaround extracts a square patch around each pixel and uses the patch to predict the center pixel's class, with patches often taken from several resolutions of the input image to capture global information. Another approach adds up-sampling layers to the CNN; such a modified CNN is called a fully convolutional network (FCN) and was first proposed for semantic segmentation. Most CNN-based saliency models use an FCN to capture both global and local information.
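The patch-based strategy described above can be sketched in Python. This is a hedged illustration, not the paper's implementation: patch size, zero-padding at image borders, and the scale factors are all assumptions made for the sketch.

```python
import numpy as np

def extract_patch(image, row, col, size):
    """Extract the square patch centered on pixel (row, col); the image is
    zero-padded so that border pixels also get a full-size patch."""
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)),
                    mode="constant")
    # After padding, pixel (row, col) sits at (row + half, col + half),
    # so the patch starting at (row, col) is centered on it.
    return padded[row:row + size, col:col + size]

def multiscale_patches(image, row, col, size, scales=(1, 2, 4)):
    """Patches of growing spatial extent around one pixel, capturing
    increasingly global context. A real pipeline would resize each crop
    to the CNN input resolution; here we return the raw crops."""
    return [extract_patch(image, row, col, size * s) for s in scales]
```

Predicting every pixel this way requires one forward pass per pixel, which is why FCN-style dense prediction, discussed next, is preferred in practice.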
FCN is a popular architecture that adapts the CNN to dense prediction problems by replacing the softmax and fully connected layers with convolution and deconvolution layers. Compared with traditional methods, FCNs locate salient objects accurately and assign them high confidence. However, the down-sampling structure in FCNs leaves the boundaries of salient objects coarse, so their precision is lower than that of traditional approaches. To address the limitations of these two kinds of saliency models, we propose a novel composite saliency model that combines their advantages and restrains their drawbacks. Method In this study, a new FCN based on the dense convolutional network (DenseNet) is built. For saliency detection, we replace the fully connected layer and the final pooling layer with a convolution layer with 1×1 kernels and a deconvolution layer, and a sigmoid layer is applied to obtain the saliency maps. During training, the network ends with a squared Euclidean loss layer for saliency regression. We fine-tune the pre-trained DenseNet-161 to train our saliency model. The training set consists of 3,900 images randomly selected from five public saliency datasets, namely, ECSSD, SOD, HKU-IS, MSRA, and ICOSEG. The network is implemented in the Caffe toolbox. The input images and ground-truth maps are resized to 500×500 for training; the momentum is set to 0.99, the learning rate to 10^-10, and the weight decay to 0.0005. The SGD learning procedure is accelerated on an NVIDIA GTX TITAN X GPU and takes approximately one day for 200,000 iterations. We also adopt a traditional saliency model. The selected model uses multi-level segmentation to produce several segmentations of an image, in which each superpixel is represented by a feature vector containing several kinds of image features; a random forest trained on these feature vectors derives the saliency maps.
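The replaced network head (1×1 convolution, deconvolution, sigmoid, squared Euclidean loss) can be sketched functionally in numpy. This is a hedged illustration only: nearest-neighbour upsampling stands in for the learned deconvolution layer, and all shapes and names are assumptions, not taken from the paper.

```python
import numpy as np

def conv1x1(features, weights, bias):
    """A 1x1 convolution is an independent linear map at every spatial
    position. features: (H, W, C_in); weights: (C_in, C_out); bias: (C_out,)."""
    return features @ weights + bias

def upsample(x, factor):
    """Nearest-neighbour upsampling, a crude stand-in for the learned
    deconvolution layer that restores the input resolution."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def saliency_head(features, weights, bias, factor):
    """Backbone features -> per-pixel saliency values in (0, 1)."""
    logits = conv1x1(features, weights, bias)   # (H, W, 1)
    logits = upsample(logits, factor)           # (H*factor, W*factor, 1)
    return sigmoid(logits[..., 0])

def euclidean_loss(pred, gt):
    """Squared Euclidean loss used for saliency regression."""
    return 0.5 * np.sum((pred - gt) ** 2)
```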
On the basis of these two models, we propose a fusion algorithm that combines the advantages of traditional approaches and deep learning methods. For an input image, 15 segmentations are produced, and the saliency maps of all segmentations are derived by the random forest. The FCN then produces another type of saliency map of the image. The fusion algorithm applies the Hadamard product to the two types of saliency maps, and the initial fusion result is obtained by averaging the Hadamard products. An adaptive threshold is then used to fuse the initial result and the FCN result through a pixel-to-pixel mapping, yielding the final fusion result. Result We compared our model with 10 state-of-the-art saliency models, including traditional approaches and deep learning methods, on four public datasets, namely, DUT-OMRON, ECSSD, HKU-IS, and MSRA. The quantitative evaluation metrics are the F-measure, the mean absolute error (MAE), and PR curves, and we provide several saliency maps of each method for visual comparison. The experimental results show that our model outperforms all other methods on the HKU-IS, MSRA, and DUT-OMRON datasets, and the saliency maps show that our model produces refined results. We compared the random forest, FCN, and final fusion results to verify the effectiveness of our fusion algorithm; the comparative experiments demonstrate that the fusion algorithm improves saliency detection. Compared with the random forest results on ECSSD, HKU-IS, MSRA, and DUT-OMRON, the F-measure (higher is better) increases by 6.2%, 15.6%, 5.7%, and 16.6%, respectively, and the MAE (lower is better) decreases by 17.4%, 43.9%, 33.3%, and 24.5%, respectively. Compared with the FCN results on ECSSD, HKU-IS, MSRA, and DUT-OMRON, the F-measure increases by 2.2%, 4.1%, 5.7%, and 11.3%, respectively, and the MAE decreases by 0.6%, 10.7%, and 18.4% on ECSSD, MSRA, and DUT-OMRON, respectively.
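The fusion step described in the Method part above can be sketched in numpy. This is a hedged illustration: the abstract does not specify the exact adaptive threshold or the form of the pixel-to-pixel nonlinear mapping, so the concrete choices below (threshold at twice the mean; thresholded pixels take the stronger of the two responses) are assumptions, not the paper's formulation.

```python
import numpy as np

def fuse(traditional_maps, fcn_map, tau=None):
    """traditional_maps: list of (H, W) maps, one per multi-level segmentation
    (15 in the paper); fcn_map: (H, W) FCN output; all values in [0, 1]."""
    # Hadamard (element-wise) product suppresses pixels that only one
    # of the two models considers salient.
    products = [m * fcn_map for m in traditional_maps]
    # Initial fusion: average the Hadamard products.
    initial = np.mean(products, axis=0)
    # Adaptive threshold (assumed here: twice the mean saliency).
    if tau is None:
        tau = 2.0 * float(initial.mean())
    # Pixel-to-pixel mapping (assumed): confident pixels recover the
    # stronger response, the rest keep the suppressed initial value.
    return np.where(initial >= tau, np.maximum(initial, fcn_map), initial)
```

Because the Hadamard product can only shrink saliency values, the mapping restores confidence where both models agree, which matches the stated goal of keeping the FCN's high confidence while letting the traditional maps sharpen boundaries.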
In addition, we conducted a series of comparative experiments on MSRA to show the effectiveness of the individual steps of the fusion algorithm. Conclusion In this study, we proposed a composite saliency model consisting of an FCN, a traditional model, and a fusion algorithm that fuses the two kinds of saliency maps. The experimental results show that our model outperforms several state-of-the-art saliency approaches and that the fusion algorithm improves the performance.