Qi Jinwei, Peng Yuxin, Yuan Yuxin (Institute of Computer Science and Technology, Peking University, Beijing 100080, China)
Objective Cross-media retrieval aims to take data of any media type as a query and retrieve relevant data of other media types, enabling semantic interoperation and cross retrieval between different media such as image and text. However, the "heterogeneity gap" causes the feature representations of different media types to be inconsistent, which makes semantic correlation hard to establish and poses a great challenge to cross-media retrieval. Meanwhile, data of different media types that describe the same semantics exhibit semantic consistency, and each datum contains abundant fine-grained information, which provides key clues for cross-media correlation learning. Existing methods consider only the pairwise correlation between different media types and ignore the context information among the fine-grained patches within each datum, so they cannot fully exploit the cross-media correlation. To address this problem, a cross-media retrieval method based on a hierarchical recurrent attention network is proposed. Method First, an intra-media/inter-media two-level recurrent neural network is proposed: the bottom level models the fine-grained context information within each media type, and the top level exploits the inter-media context correlation through weight sharing. Second, an attention-based cross-media joint loss function is proposed, which learns inter-media joint attention to capture a more precise fine-grained cross-media correlation and exploits semantic category information to enhance semantic discrimination during correlation learning, thereby improving retrieval accuracy. Result On two widely used cross-media datasets, the proposed method is compared with 10 existing methods, with mean average precision (MAP) as the evaluation metric. Experimental results show that the proposed method achieves MAP scores of 0.469 and 0.575 on the two datasets, outperforming all compared methods. Conclusion By exploiting the fine-grained information of images and texts, the proposed hierarchical recurrent attention network fully learns the precise cross-media correlation between images and texts and effectively improves the accuracy of cross-media retrieval.
Cross-media retrieval with hierarchical recurrent attention network
Qi Jinwei, Peng Yuxin, Yuan Yuxin (Institute of Computer Science and Technology, Peking University, Beijing 100080, China)
Objective Cross-media retrieval aims to retrieve relevant data of other media types given a query of any media type, providing a flexible and useful retrieval experience that matches current user demands. However, the "heterogeneity gap" leads to inconsistent representations of different media types, which makes constructing cross-media correlation and realizing cross-media retrieval challenging. Nevertheless, data of different media types that describe the same semantics are naturally consistent in meaning, and their fine-grained patches contain abundant information, which provides key clues for cross-media correlation learning. Existing methods mostly model the pairwise correlation between media types with the same semantics but ignore the context information among the fine-grained patches, and thus cannot fully capture the cross-media correlation. To address this problem, a cross-media hierarchical recurrent attention network (CHRAN) is proposed to fully consider the intra- and inter-media fine-grained context information. Method First, we propose a hierarchical recurrent network to fully exploit the cross-media fine-grained context information. Specifically, the hierarchical recurrent network consists of two levels, both implemented with long short-term memory (LSTM) networks. We extract features from the fine-grained patches of each media type, organize them into sequences, and feed these sequences into the hierarchical network. The bottom level models the intra-media fine-grained context information, whereas the top level adopts a weight-sharing constraint to fully exploit the inter-media context correlation and share the knowledge learned from different media types. The hierarchical recurrent network thus provides intra- and inter-media fine-grained hints that boost cross-media correlation learning. Second, we propose an attention-based cross-media joint embedding loss to learn the cross-media correlation.
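The two-level recurrence with a weight-shared top level can be sketched as follows. This is an illustrative numpy sketch under stated assumptions, not the authors' implementation: a simple tanh recurrent cell stands in for the LSTM units, and all dimensions, sequence lengths, and names are hypothetical.

```python
import numpy as np

def rnn_step(W, U, b, h, x):
    """One step of a simple tanh recurrent cell (a stand-in for an LSTM unit)."""
    return np.tanh(W @ x + U @ h + b)

def run_rnn(params, seq):
    """Run the cell over a sequence of patch features; return all hidden states."""
    W, U, b = params
    h = np.zeros(U.shape[0])
    states = []
    for x in seq:
        h = rnn_step(W, U, b, h, x)
        states.append(h)
    return np.stack(states)

def init_params(rng, d_in, d_h):
    s = 1.0 / np.sqrt(d_h)
    return (rng.uniform(-s, s, (d_h, d_in)),
            rng.uniform(-s, s, (d_h, d_h)),
            np.zeros(d_h))

rng = np.random.default_rng(0)
d_feat, d_h = 8, 6  # assumed feature and hidden sizes

# Bottom level: separate recurrent networks model the intra-media context
# over the fine-grained patch sequences of each media type.
img_bottom = init_params(rng, d_feat, d_h)
txt_bottom = init_params(rng, d_feat, d_h)

# Top level: ONE shared parameter set processes both media types'
# bottom-level outputs, which is the weight-sharing constraint that
# transfers knowledge across media types.
top_shared = init_params(rng, d_h, d_h)

image_patches = rng.normal(size=(5, d_feat))  # e.g., 5 image region features
text_words    = rng.normal(size=(7, d_feat))  # e.g., 7 word features

img_ctx = run_rnn(top_shared, run_rnn(img_bottom, image_patches))
txt_ctx = run_rnn(top_shared, run_rnn(txt_bottom, text_words))
print(img_ctx.shape, txt_ctx.shape)  # (5, 6) (7, 6)
```

Because `top_shared` is the same object for both media types, gradients from image and text sequences would update the same top-level weights, which is the intended effect of the weight-sharing constraint.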
We utilize an attention mechanism so that the model focuses on the informative fine-grained patches within each media type, which allows the inter-media co-attention to be explored. Furthermore, we jointly consider matched and mismatched cross-media pairs to preserve the relative similarity ranking information, and we adopt a semantic constraint to preserve the semantically discriminative capability during correlation learning. A precise fine-grained cross-media correlation can therefore be captured to improve retrieval accuracy. Result We conduct experiments on two widely used cross-media datasets, namely, the Wikipedia and Pascal Sentence datasets, and compare with 10 state-of-the-art methods to verify the effectiveness of the proposed CHRAN approach. We perform two cross-media retrieval tasks, retrieving text by image and retrieving image by text, and adopt the mean average precision (MAP) score as the evaluation metric. We also conduct baseline experiments to verify the contributions of the weight-sharing constraint and the cross-media attention modeling. Experimental results show that the proposed approach achieves MAP scores of 0.469 and 0.575 on the two datasets, outperforming the state-of-the-art methods. Conclusion The proposed approach can effectively and precisely learn the fine-grained cross-media correlation. Compared with existing methods that mainly model the pairwise correlation and ignore the fine-grained context information, the proposed hierarchical recurrent network fully captures the intra- and inter-media fine-grained context information, and the cross-media co-attention mechanism further promotes the accuracy of cross-media retrieval.
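The attention-based joint embedding loss described above combines co-attention pooling with a margin-based ranking term over matched and mismatched pairs. The following is a minimal numpy sketch of that idea, not the paper's exact formulation: the query construction, distance, margin, and all names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(states, query):
    """Co-attention pooling: weight each fine-grained hidden state by its
    relevance to a query vector derived from the other media type."""
    weights = softmax(states @ query)
    return weights @ states

def ranking_loss(anchor, positive, negative, margin=0.2):
    """Keep the matched pair closer than the mismatched pair by a margin,
    preserving relative similarity ranking information."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(1)
d = 6
img_states = rng.normal(size=(5, d))  # top-level image hidden states
txt_states = rng.normal(size=(7, d))  # matched text hidden states
neg_states = rng.normal(size=(7, d))  # mismatched text hidden states

# Each media type attends using the other's mean state as the query,
# so the two attention distributions are learned jointly (co-attention).
img_emb = attend(img_states, txt_states.mean(axis=0))
txt_emb = attend(txt_states, img_states.mean(axis=0))
neg_emb = attend(neg_states, img_states.mean(axis=0))

loss = ranking_loss(img_emb, txt_emb, neg_emb)
print(loss)
```

In the full method this ranking term would be combined with a semantic classification constraint on the embeddings; that term is omitted here for brevity.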
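The MAP evaluation metric used above can be computed as follows; this is a generic sketch of the standard definition with toy data, not the datasets or rankings from the experiments.

```python
def average_precision(relevant, ranking):
    """AP of one query: mean of the precision values at each relevant hit."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(len(relevant), 1)

def mean_average_precision(queries):
    """MAP: average of per-query AP over all queries."""
    return sum(average_precision(rel, rk) for rel, rk in queries) / len(queries)

# Toy example: two queries with known relevant sets and retrieved rankings.
queries = [
    ({"a", "b"}, ["a", "x", "b", "y"]),  # AP = (1/1 + 2/3) / 2 = 5/6
    ({"c"},      ["x", "c", "y"]),       # AP = 1/2
]
print(round(mean_average_precision(queries), 4))  # 0.6667
```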