Chen Peng, Tang Yiping, Wang Liran, He Xia (School of Information Engineering, Zhejiang University of Technology, Hangzhou 310023)
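Objective Crowd counting and crowd density estimation are of great practical value in video surveillance, intelligent transportation, and public safety. Existing techniques still leave considerable room for improvement when estimating the density of large crowds in complex environments. For the visual detection of crowds that are dense, unevenly distributed, and heavily occluded, this paper therefore proposes a crowd density estimation method based on a multi-level feature fusion network. Method First, a multi-level feature fusion network extracts and fuses crowd features and generates a crowd density map. The density map is then integrated to obtain the corresponding number of people. Finally, the spatial location information of the crowd on the density map is restored and combined with the estimated count to give a quantitative judgment of the degree of crowding. Result The proposed method reduces the mean absolute error (MAE) to 2.35 on the Mall dataset and to 20.73 and 104.86, respectively, on the ShanghaiTech dataset. Compared with existing methods, estimation accuracy is considerably improved, especially in scenes with complex environments and large numbers of people. Conclusion The proposed multi-level feature fusion method for crowd density estimation effectively extracts features at different scales. It is subject to few scene constraints, estimates crowd numbers accurately, and assesses the degree of crowding simply and reliably. Comparative experiments verify the effectiveness of the method.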
Crowd density estimation based on multi-level feature fusion
Chen Peng, Tang Yiping, Wang Liran, He Xia (School of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China)
Objective As populations grow, large-scale collective activities have become increasingly frequent, and in recent years the social problems caused by overcrowding have become more prominent. Accidents are especially frequent in densely populated places such as scenic spots, railway stations, and shopping malls. Crowd analysis has therefore become an important research topic in intelligent video surveillance, and crowd density estimation has become a focus of research on crowd safety control and management. Crowd density estimation gives staff statistics on the current crowd, which helps them optimize management, prevent overcrowding, and detect potential safety issues. However, several of the available techniques are applicable only to small numbers of people in relatively static scenes. Aiming at the visual detection of crowds that are dense, unevenly distributed, and heavily occluded, this study proposes a crowd density estimation method based on a multi-level feature fusion network. Method First, the feature map of each level is generated through the convolution and pooling layers of the network. After five of the eight convolution layers, a 128-dimensional feature map at 1/32 of the original size is produced, and three deconvolution operations are then performed on it. Thereafter, the features of the earlier convolutional layers are fused with the deconvolution outputs. Finally, the fused features are convolved with a 1×1 kernel to form a density map at 1/4 of the original size. Each convolution operation abstracts the image features of the previous layer, so different depths correspond to different levels of semantic features. Shallow layers have high resolution and retain more image detail.
Deep layers, by contrast, have low resolution but learn deep semantic information and key features. Low-level features are well suited to extracting small targets, whereas high-level features are suited to extracting large targets. Combining the feature information of different layers addresses the problem of inconsistent scales within an image. Second, we use manual annotation to generate the corresponding density label maps for the public datasets and then train the network to predict the density map of a test image on its own. Finally, the predicted density map is integrated to obtain the number of people, and, on the basis of the generated density map, we propose a method for quantifying the degree of crowding, which is calculated by restoring the spatial information of the crowd on the density map and combining it with the estimated count. Result The proposed method reduces the mean absolute error (MAE) to 2.35 on the Mall dataset and to 20.73 and 104.86 on the ShanghaiTech dataset. Compared with existing methods, the estimation accuracy is improved, most noticeably in scenes with complex environments and large numbers of people. In addition, experiments with different network structures show that adding the deconvolution layers improves the test results compared with a pure convolutional network. On the complex scenes of the ShanghaiTech dataset, feature fusion improves performance further, and fusing the features of layers 1 and 2 has the most prominent effect. Fusing the features of layer 3 as well brings essentially no improvement, mainly because that level is too high and contains additional details, and the redundant information impairs the generalization ability of the network. For the Mall dataset, whose scenes are more standard, the effect of the network improvements is likewise not noticeable, although a pure convolutional network alone already gives a noticeable result.
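The label-generation, integration, and evaluation steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a fixed-width Gaussian kernel (a common choice for density labels that the abstract does not specify), and the function names, the `sigma` value, and the sample head positions are all hypothetical.

```python
import math

def gaussian_density_map(heads, height, width, sigma=1.5):
    """Place one normalized 2D Gaussian per annotated head position.

    Each kernel is renormalized over the image grid so that it sums to 1,
    which makes the whole map sum to the number of annotated people.
    The fixed sigma is an assumption, not a value taken from the paper.
    """
    density = [[0.0] * width for _ in range(height)]
    for (row, col) in heads:
        kernel = [[math.exp(-((r - row) ** 2 + (c - col) ** 2) / (2 * sigma ** 2))
                   for c in range(width)] for r in range(height)]
        total = sum(map(sum, kernel))
        for r in range(height):
            for c in range(width):
                density[r][c] += kernel[r][c] / total
    return density

def count_from_density(density):
    """Integrate (sum) the density map to estimate the number of people."""
    return sum(map(sum, density))

def mean_absolute_error(predicted, actual):
    """MAE between predicted and ground-truth counts over a test set."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

heads = [(2, 3), (5, 5), (10, 12)]   # hypothetical (row, col) head annotations
label = gaussian_density_map(heads, height=16, width=16)
print(round(count_from_density(label), 6))         # approximately 3
print(mean_absolute_error([3.2, 10.5], [3, 12]))   # about 0.85
```

Because every kernel is normalized on the grid, summing the label map recovers the annotated count, which is the property that lets a predicted density map be integrated into a count and scored with MAE.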
Conclusion This study proposes a crowd density estimation method based on a multi-level feature fusion network. By extracting and fusing the features of different semantic layers, the network captures people at different scales and sizes, which effectively improves the robustness of the algorithm. Using the complete image as input better preserves overall image information, and the spatial location of features is taken into account during network training. Combining the predicted density map with spatial information makes the estimation of the number of people and the degree of congestion more principled and efficient. The algorithm is also subject to few scene constraints, estimates crowd numbers accurately, and assesses crowd congestion simply and reliably. Experiments verify the effectiveness of the proposed multi-level feature fusion network and crowd congestion evaluation method.
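To make the multi-level fusion concrete, the sketch below traces feature-map sizes through the network as summarized in the Method: five downsampling stages to reach 1/32 of the input, then three deconvolutions to reach 1/4. It assumes that each pooling stage halves and each deconvolution doubles the spatial size, and that each deconvolution output is fused with the encoder stage of matching resolution. The abstract does not state these details, and `trace_feature_sizes` is a hypothetical helper.

```python
def trace_feature_sizes(height, width, num_pool=5, num_deconv=3):
    """Trace feature-map sizes through stride-2 pooling and stride-2 deconvolution.

    Assumes every pooling stage halves and every deconvolution doubles the
    spatial size; input sides should be divisible by 2 ** num_pool.
    """
    encoder = [(height, width)]           # index 0 is the input image
    for _ in range(num_pool):
        h, w = encoder[-1]
        encoder.append((h // 2, w // 2))
    decoder = [encoder[-1]]               # start from the coarsest feature map
    fused_with = []
    for _ in range(num_deconv):
        h, w = decoder[-1]
        up = (h * 2, w * 2)
        decoder.append(up)
        # The encoder stage with the same resolution is the assumed fusion partner.
        fused_with.append(encoder.index(up))
    return encoder, decoder, fused_with

encoder, decoder, fused_with = trace_feature_sizes(512, 768)
print(encoder[-1])   # (16, 24): 1/32 of the input
print(decoder[-1])   # (128, 192): 1/4 of the input
print(fused_with)    # [4, 3, 2]: encoder stages at 1/16, 1/8, and 1/4 scale
```

Under these assumptions, the final 1×1 convolution would operate on a map at 1/4 of the input resolution, which matches the density-map size given in the Method.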