Objective Cross-media retrieval aims to retrieve relevant data of other media types with a query of any media type, enabling semantic interoperation and cross retrieval among different media such as images and text. However, the "heterogeneity gap" makes the feature representations of different media types inconsistent and hard to correlate semantically, which poses a major challenge for cross-media retrieval. Meanwhile, data of different media types that describe the same semantics are semantically consistent, and the data internally contain rich fine-grained information, which provides an important basis for cross-media correlation learning. Existing methods only consider the pairwise correlation between different media types while ignoring the context information among the fine-grained patches within the data, and thus cannot fully exploit cross-media correlation. To address this problem, this paper proposes a cross-media retrieval method based on a hierarchical recurrent attention network. Method First, an intra-media/inter-media two-level recurrent neural network is proposed, in which the bottom level models the fine-grained context information within each media type separately, and the top level mines the contextual correlation between different media types through weight sharing. Then an attention-based cross-media joint loss function is proposed, which learns inter-media joint attention to mine more precise fine-grained cross-media correlation, and exploits semantic category information to enhance semantic discrimination during correlation learning, thereby improving cross-media retrieval accuracy. Result The proposed method is compared with 10 existing methods on 2 widely used cross-media datasets, with mean average precision (MAP) as the evaluation metric. Experimental results show that the proposed method achieves MAP scores of 0.469 and 0.575 on the two datasets, outperforming all compared methods. Conclusion By mining the fine-grained information of images and text, the proposed hierarchical recurrent attention network can fully learn the precise cross-media correlation between images and text, and effectively improves cross-media retrieval accuracy.
Cross-media retrieval with hierarchical recurrent attention network (NCIG2018)
Qi Jinwei, Peng Yuxin, Yuan Yuxin (Peking University)
Objective Cross-media retrieval aims to retrieve the data of different media types by a query of any media type, which provides a more flexible and useful retrieval experience that is in great user demand nowadays. However, the "heterogeneity gap" leads to inconsistent representations of different media types, which makes it quite challenging to construct correlation between them and realize cross-media retrieval. Meanwhile, data of different media types that describe the same semantics are naturally consistent, and their fine-grained patches contain rich information, which provides key clues for cross-media correlation learning. Existing methods mostly consider the pairwise correlation of different media types with the same semantics, but they ignore the context information among the fine-grained patches and thus cannot fully capture the cross-media correlation. To address the above problem, this paper proposes the cross-media hierarchical recurrent attention network (CHRAN) to fully consider both intra-media and inter-media fine-grained context information. Method First, we propose to construct a hierarchical recurrent network to fully exploit the cross-media fine-grained context information. Specifically, the hierarchical recurrent network consists of two levels, both implemented by long short-term memory (LSTM) networks; we extract features from the fine-grained patches of each media type and organize them into sequences, which are taken as the inputs of the hierarchical network. The bottom level models the intra-media fine-grained context information for each media type separately, while the top level adopts a weight-sharing constraint to fully exploit inter-media context correlation, which aims to share the knowledge learned from different media types. Thus, the hierarchical recurrent network can provide both intra-media and inter-media fine-grained hints for boosting cross-media correlation learning.
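The two-level structure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a plain tanh RNN stands in for the LSTM cells, the dimensions and parameter initialization are arbitrary, and all names (`rnn_layer`, `make_params`, etc.) are hypothetical. The key point it demonstrates is that the bottom level keeps separate parameters per media type, while the top level applies one shared parameter set to both sequences.

```python
import numpy as np

def rnn_layer(seq, Wx, Wh, b):
    # Plain tanh RNN over seq of shape (T, d_in); returns hidden states (T, d_h).
    # Stands in for the LSTM used in the paper, for brevity.
    h = np.zeros(b.shape[0])
    states = []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return np.stack(states)

def make_params(d_in, d_h, rng):
    return (0.1 * rng.normal(size=(d_h, d_in)),
            0.1 * rng.normal(size=(d_h, d_h)),
            np.zeros(d_h))

rng = np.random.default_rng(0)
d_img, d_txt, d_h = 8, 6, 4

# Bottom level: separate parameters per media type (intra-media context).
img_bottom = make_params(d_img, d_h, rng)
txt_bottom = make_params(d_txt, d_h, rng)

# Top level: ONE shared parameter set applied to both media types,
# so knowledge learned from one media type is shared with the other.
top_shared = make_params(d_h, d_h, rng)

img_patches = rng.normal(size=(5, d_img))  # e.g. 5 image region features
txt_words   = rng.normal(size=(7, d_txt))  # e.g. 7 word embeddings

img_ctx = rnn_layer(rnn_layer(img_patches, *img_bottom), *top_shared)
txt_ctx = rnn_layer(rnn_layer(txt_words,   *txt_bottom), *top_shared)
print(img_ctx.shape, txt_ctx.shape)  # (5, 4) (7, 4)
```

Because the top level is shared, both media types are projected into hidden states of the same dimensionality, which is what makes the subsequent cross-media correlation learning possible.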
Second, we propose an attention-based cross-media joint embedding loss to learn cross-media correlation. We utilize an attention mechanism to allow the model to focus on the necessary fine-grained patches within each media type, which explores inter-media co-attention. Furthermore, we jointly consider both matched and mismatched cross-media pairs to preserve the relative similarity ranking information. Besides, we employ a semantic constraint to preserve semantic discriminative ability during the correlation learning process. Therefore, more precise fine-grained cross-media correlation can be captured to improve retrieval accuracy. Result We conduct experiments on 2 widely used cross-media datasets, namely the Wikipedia and Pascal Sentence datasets, and compare with 10 state-of-the-art methods to verify the effectiveness of our proposed CHRAN approach. We perform cross-media retrieval with two kinds of retrieval tasks, retrieving text by image and retrieving image by text, and adopt the mean average precision (MAP) score as the evaluation metric. Besides, we conduct baseline experiments to verify the contributions of the weight-sharing constraint and the cross-media attention modeling. The experimental results show that our proposed approach achieves the best MAP scores of 0.469 and 0.575 on the two datasets, outperforming the state-of-the-art methods. Conclusion The proposed approach can effectively learn fine-grained cross-media correlation more precisely. Compared with existing methods that mainly model the pairwise correlation and ignore the fine-grained context information, our proposed hierarchical recurrent network fully captures both intra-media and inter-media fine-grained context information with a cross-media co-attention mechanism, which further promotes the accuracy of cross-media retrieval.
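The attention pooling and the ranking part of the joint loss can be sketched as below. This is a simplified illustration under assumptions, not the paper's loss: co-attention is approximated by querying each media's patch states with the other media's mean summary, the distance is plain Euclidean, the margin value is arbitrary, and the semantic (category) constraint term is omitted. All function names are hypothetical.

```python
import numpy as np

def attend(states, query):
    # Soft attention: score each fine-grained state (row of states, shape (T, d))
    # against a query vector (d,), normalize with softmax, and pool.
    # Using the OTHER media's summary as the query gives a simple co-attention.
    scores = states @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ states

def ranking_loss(anchor, positive, negative, margin=0.2):
    # A matched cross-media pair should be closer than a mismatched
    # one by at least the margin (relative similarity ranking).
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(1)
img_states = rng.normal(size=(5, 4))   # image patch context vectors
txt_states = rng.normal(size=(7, 4))   # word context vectors

# Co-attention pooling: each media attended under the other's summary.
img_vec = attend(img_states, txt_states.mean(axis=0))
txt_vec = attend(txt_states, img_states.mean(axis=0))

# A mismatched text, pooled the same way, serves as the negative sample.
txt_neg = attend(rng.normal(size=(7, 4)), img_states.mean(axis=0))
loss = ranking_loss(img_vec, txt_vec, txt_neg)
```

In the full method, this ranking term would be combined with a semantic classification term so that embeddings of different categories stay discriminative while matched pairs are pulled together.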