王晗,俞璜悦,滑蕊,邹玲(北京林业大学信息学院, 北京 100083;北京电影学院数字媒体学院, 北京 100088)
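Objective Video highlight extraction is an active research topic in areas such as video content annotation and content-based video retrieval. Existing approaches extract highlights mainly from low-level video features and ignore the influence of user interest on the result, so the extracted segments may not match what the user expects. In addition, semantic modeling based on user interest needs a large number of labeled training videos to obtain a reasonably robust semantic classifier, and labeling that many samples is time consuming and laborious. Because the Internet offers abundant, easily obtained images, transferring the knowledge in Internet images to the semantic model of video segments can greatly reduce the video labeling effort. We therefore propose a video highlight extraction framework that models user interest with Internet images. Method A large number of Internet images are used to model the semantics of user interest. Knowledge obtained from the Internet is diverse and noisy, and using it indiscriminately would degrade extraction quality. Images are therefore grouped by semantic similarity, and images that are semantically similar but retrieved with different keywords are called a synonym image group. On this basis, we propose a joint group weight model over the synonym groups, which assigns different weights to different image groups according to their semantic relevance to the video. First, image sets related to the user's interest are retrieved from an Internet image search engine and serve as the knowledge source for interest-based highlight extraction. Then, joint group weight learning over the synonym image groups transfers the knowledge learned from images to videos. Finally, the semantic model learned from the image sets is used to extract highlights from the candidate segments. Result The proposed method is validated on videos from the CCV database and compared with several existing video keyframe extraction algorithms. The experiments show that the average precision of the proposed algorithm reaches 46.54, which is 21.6% higher than the other algorithms, with no increase in running time. In addition, to examine how the different trade-off parameters in the optimization affect the final result and to further validate the method, the regularization terms are removed from the algorithm one at a time to measure the contribution of each term to the framework. Accuracy drops noticeably after any one term is removed, which indicates that the proposed joint group weight model is effective for extracting the video segments that interest users. Conclusion This paper proposes a video highlight extraction method driven by the semantics of user interest, which extracts the segments each user is interested in according to that user's focus of attention.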
Video highlight extraction based on the interests of users
Wang Han, Yu Huangyue, Hua Rui, Zou Ling (School of Information Science & Technology, Beijing Forestry University, Beijing 100083, China; School of Digital Media, Beijing Film Academy, Beijing 100088, China)
Objective Video highlight extraction is of interest in video summarization, organization, browsing, and indexing. Current research mainly extracts highlights by optimizing the low-level feature diversity or representativeness of video frames and ignores the interests of users, which leads to results that are inconsistent with user expectations. However, collecting the large number of labeled videos required to model different user interest concepts for different videos is time consuming and labor intensive. Method To alleviate the labeling burden, we propose to learn models for user interest concepts on different videos by leveraging numerous Web images, which cover many roughly annotated concepts and are often captured in a maximally informative manner. However, knowledge from the Web is noisy and diverse, so brute-force knowledge transfer may adversely affect highlight extraction performance. In this study, we propose a novel user-oriented keyframe extraction framework for online videos that leverages a large number of Web images queried by synonyms from image search engines. Our work is based on the observation that users may be interested in different frames when browsing the same video. By using user interest-related words as keywords, we can easily collect weakly labeled image data for training interest concept models. Because different users may describe the same interest concept differently, we treat different descriptions with similar semantic meanings as synonyms. When querying images from the Web, we use synonyms as keywords to avoid semantic one-sidedness. The image set returned for a synonym is considered a synonym group. Different synonym groups are weighted according to their relevance to the video frames. Moreover, the group weights and classifiers are learned simultaneously by solving a joint synonym group optimization problem so that they benefit each other.
We also exploit unlabeled online videos to optimize the group weights and classifiers for building the target classifier. Specifically, new data-dependent regularizers are introduced to enhance the generalization capability and adaptiveness of the target classifier. Result Our method achieved a mean average precision (mAP) of 46.54, a 21.6% improvement over the state of the art, without taking much longer to run. Experimental results on several challenging video datasets show that using grouped knowledge obtained from Web images for video highlight extraction is effective and provides comprehensive results. Conclusion We presented a new framework for video highlight extraction that leverages a large number of loosely labeled Web images. Specifically, we exploited synonym groups to learn more sophisticated representations of source domain Web images. The group classifiers and weights are jointly learned in a unified optimization algorithm to build the target domain classifiers. We also introduced two new data-dependent regularizers based on the unlabeled target domain consumer videos to enhance the generalization capability of the target classifier.
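As an illustration of the group weighting idea described in the Method, the following minimal Python sketch combines per-group classifier scores into a single frame score and picks the highest-scoring frames. It is not the authors' algorithm: the abstract learns the weights and classifiers jointly with data-dependent regularizers, whereas this sketch assumes the relevance values and classifier scores are already given. All function names and numbers here are hypothetical.

```python
def normalize_weights(relevance):
    """Turn non-negative relevance values into group weights that sum to 1."""
    total = sum(relevance)
    if total <= 0:
        raise ValueError("at least one group must have positive relevance")
    return [r / total for r in relevance]


def score_frames(group_scores, weights):
    """Weighted sum of per-group classifier scores for every frame.

    group_scores[g][i] is the score that group g's classifier gives frame i.
    """
    n_frames = len(group_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, group_scores))
        for i in range(n_frames)
    ]


def top_k_frames(frame_scores, k):
    """Indices of the k highest-scoring frames, best first."""
    order = sorted(
        range(len(frame_scores)),
        key=lambda i: frame_scores[i],
        reverse=True,
    )
    return order[:k]


# Hypothetical inputs: 3 synonym groups scoring 4 video frames.
relevance = [3.0, 1.0, 0.0]  # assumed group-to-video relevance, not learned
group_scores = [
    [0.9, 0.1, 0.4, 0.2],  # classifier trained on synonym group 0
    [0.2, 0.8, 0.4, 0.1],  # classifier trained on synonym group 1
    [0.0, 0.0, 1.0, 1.0],  # noisy group, receives zero weight here
]

weights = normalize_weights(relevance)
frame_scores = score_frames(group_scores, weights)
highlights = top_k_frames(frame_scores, 2)
print(highlights)
```

In this toy setting the third group is given zero relevance, so its scores do not influence the ranking, which mirrors the stated motivation of down-weighting image groups that are weakly related to the video.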