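Huang Long, Yang Yuan, Wang Qingjun, Guo Fei, Gao Yong (School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048; CRRC Xi'an Yonge Electric Co., Ltd., Xi'an 710018)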
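Objective A visual prosthesis produces phosphenes by implanting electrodes in a blind person's body to stimulate the optic nerve. The blind person perceives only the rough outline of an object, so the object recognition rate is low. Targeting the characteristics of indoor application scenarios for visual prostheses, a fast convolutional neural network image segmentation method is proposed to segment indoor scene images; the segmentation displays the approximate position and outline of objects to help the blind recognize them. Method An FFCN (fast fully convolutional networks) network for indoor scene image segmentation was constructed, in which inter-layer fusion avoids the loss of image feature information caused by successive convolutions. To verify the effectiveness of the network, a dataset of basic household items in indoor environments (hereafter the XAUT dataset) was created. The category of each item was marked in grayscale on the original image, and a color table was attached to map the grayscale image to a pseudo-color image that serves as the semantic label. The FFCN network was trained on the XAUT dataset under the Caffe (convolutional architecture for fast feature embedding) framework to obtain an indoor scene segmentation model suited to visual prostheses for the blind. For comparison, the conventional multi-scale fusion methods FCN-8s, FCN-16s, FCN-32s, and others were structurally fine-tuned and trained on the same dataset to obtain corresponding models for indoor scene segmentation. Results The pixel recognition accuracy of every network exceeded 85%, and the mean intersection over union (MIU) exceeded 60% in all cases. The FCN-8s at-once network had the highest MIU, 70.4%, but its segmentation speed was only 1/5 that of FFCN. With the other metrics differing only slightly, the FFCN fast segmentation convolutional neural network reached an average segmentation speed of 40 frames/s. Conclusion The proposed FFCN convolutional neural network can effectively use multi-layer convolution to extract image information and avoid the influence of low-level information such as brightness, color, and texture. Scale fusion effectively avoids the loss of image feature information during convolution and pooling. Compared with other FCN networks, FFCN is faster, which helps improve the real-time performance of image preprocessing.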
Indoor scene segmentation based on fully convolutional neural networks
Huang Long, Yang Yuan, Wang Qingjun, Guo Fei, Gao Yong (Xi'an University of Technology, Xi'an 710048, China; CRRC Xi'an Yonge Electric Co., Ltd., Xi'an 710018, China)
Objective Vision is one of the most important ways by which humans obtain information. A visual prosthesis implants electrodes in a blind person's body to stimulate the optic nerve so that the person perceives phosphenes. The perceived image has low resolution and poor linearity, so the blind person senses only the general outline of an object and, in some cases, can hardly distinguish the phosphenes at all. Image segmentation is therefore applied before electrode stimulation to display the general position and outline of objects and help blind people recognize familiar objects. A fast convolutional neural network segmentation method is proposed for indoor scenes, based on the characteristics of this visual prosthesis application. Method To meet the real-time image processing demands of a visual prosthesis, the fast fully convolutional network (FFCN) structure proposed in this paper builds on the AlexNet classification network. AlexNet reduced the top-5 error rate on the ImageNet dataset to 16.4%, well below the 26.2% of the second-place entry. AlexNet uses convolution layers to extract deep feature information, adds overlapping pooling layers to reduce the number of parameters that must be learned, and uses the ReLU activation function to avoid the vanishing gradients that the sigmoid function causes in deeper networks. Compared with other networks, it is lightweight and fast to train. First, the FFCN for indoor scene image segmentation was constructed. It is composed of five convolution layers and one deconvolution layer. Scale fusion avoids the loss of image feature information caused by successive convolutions. To verify the effectiveness of the network, a dataset of basic items that the blind can touch in an indoor environment was created (the XAUT dataset). The dataset is divided into nine categories and includes 664 items, such as beds, seats, lamps, televisions, cupboards, cups, and people. The type of each item was marked in grayscale on the original image, and a color table was added to map the grayscale image to a pseudo-color image that serves as the semantic label. The FFCN network was trained on the XAUT dataset under the Caffe framework; image features were extracted through the deep feature learning and scale fusion of the convolutional neural network to obtain an indoor scene segmentation model suited to visual prostheses for the blind. To assess the validity of the model, the conventional models FCN-8s, FCN-16s, FCN-32s, and FCN-8s at-once were structurally fine-tuned and trained on the same dataset to obtain corresponding indoor scene segmentation models. Results A comparative experiment was conducted on an Amax server running Ubuntu 16.04. Model training took 13 h, and a model snapshot was saved every 4 000 iterations. The models were tested at 4 000, 12 000, 36 000, and 80 000 iterations. The pixel recognition accuracy of every network exceeded 85%, and the mean IU was above 60%. The FCN-8s at-once network had the highest mean IU (70.4%), but its segmentation speed was only one-fifth that of the FFCN. With the other indicators differing only slightly, the FFCN reached an average segmentation speed of 40 frames/s. Conclusion The FFCN can effectively use multi-layer convolution to extract image information and avoid the influence of low-level information such as brightness, color, and texture. Moreover, scale fusion avoids the loss of image feature information during convolution and pooling. Compared with other FCN networks, the FFCN is faster and can improve the real-time performance of image preprocessing.
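The abstract reports pixel recognition accuracy and mean IU without stating how they are computed. The sketch below assumes the definitions commonly used for FCN-style evaluation: pixel accuracy is the fraction of correctly labeled pixels, and mean IU averages, over classes, the ratio of correctly labeled pixels to the union of ground-truth and predicted pixels for that class. The function names, the flattened integer label lists, and the choice to skip classes absent from both images are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def confusion_matrix(truth: Sequence[int], pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must contain the same number of pixels")
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def pixel_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of all pixels whose predicted class equals the true class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def mean_iu(matrix: List[List[int]]) -> float:
    """Average per-class intersection over union.

    Assumption: classes that appear in neither the ground truth nor the
    prediction are skipped rather than counted as zero.
    """
    n = len(matrix)
    scores = []
    for i in range(n):
        intersection = matrix[i][i]
        truth_pixels = sum(matrix[i])                   # pixels truly of class i
        pred_pixels = sum(matrix[j][i] for j in range(n))  # pixels predicted as i
        union = truth_pixels + pred_pixels - intersection
        if union > 0:
            scores.append(intersection / union)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example: six pixels, three classes (hypothetical labels).
    truth = [0, 0, 1, 1, 2, 2]
    pred = [0, 1, 1, 1, 2, 0]
    m = confusion_matrix(truth, pred, num_classes=3)
    print("pixel accuracy:", pixel_accuracy(m))
    print("mean IU:", mean_iu(m))
```

Accumulating a single confusion matrix over every test image before calling the two metric functions gives dataset-level scores, which is how figures such as 85% pixel accuracy or 70.4% mean IU would typically be obtained under these assumed definitions.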