ASAM: Asynchronous Self-Attention Model for Visual Question Answering

Han Liu1, Dezhi Han1, Shukai Zhang1, Jingya Shi1, Huafeng Wu2, Yachao Zhou3 and Kuan-Ching Li4

  1. College of Information Engineering, Shanghai Maritime University
    Shanghai 201306, China
    liuhanshmtu@163.com, dzhan@shmtu.edu.cn, zhang shukai11@163.com, jingyashi00@163.com
  2. Merchant Marine College, Shanghai Maritime University
    Shanghai 201306, China
    hfwu@shmtu.edu.cn
  3. Shanghai Anheng Times Information Technology Co., Ltd.
    Shanghai 200131, China
    anna.zhou@dbappsecurity.com.cn
  4. Dept of Computer Science and Information Engineering, Providence University
    Taiwan, China
    kuancli@pu.edu.tw

Abstract

Visual Question Answering (VQA) is an emerging field of deep learning that combines image and question features and fuses them into collaborative feature representations for classification. To enhance model effectiveness, it is crucial to fully exploit the semantic information of both text and vision. Some researchers have improved training accuracy by adding new features or by strengthening the model's ability to extract more detailed information; however, these methods make experimentation more challenging and expensive. We propose the Asynchronous Self-Attention Model (ASAM), which uses an asynchronous self-attention component and a controller to effectively integrate the asynchronous self-attention mechanism with a collaborative attention mechanism and thereby leverage the rich semantic information of the underlying visuals. ASAM realizes an end-to-end training framework that extracts and exploits the rich representational information of the underlying visual images while performing coordinated attention with text features; rather than over-emphasizing fine-grained detail, it strikes a balance, allowing the model to learn more valuable information. Extensive ablation experiments on the VQA v2 dataset verify the effectiveness of the proposed ASAM. The experimental results demonstrate that the proposed model outperforms other state-of-the-art models without increasing model complexity or the number of parameters.
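To make the architecture described above more concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): it combines self-attention over image region features with question-guided co-attention and uses a learned gate as a simple stand-in for the "controller" that balances the two streams. All class and parameter names here are illustrative assumptions.

```python
# Hypothetical sketch only: self-attention over image regions plus
# question-guided co-attention, mixed by a learned gate ("controller").
import torch
import torch.nn as nn


class AsyncSelfAttentionBlock(nn.Module):
    """Illustrative block: visual self-attention + question-guided co-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Self-attention over image region features (visual stream).
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Co-attention: image regions attend to question tokens (cross stream).
        self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate acting as a simple controller that mixes the two outputs.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, img: torch.Tensor, qst: torch.Tensor) -> torch.Tensor:
        # img: (batch, num_regions, dim); qst: (batch, num_tokens, dim)
        v_self, _ = self.self_attn(img, img, img)   # visual self-attention
        v_cross, _ = self.co_attn(img, qst, qst)    # question-guided co-attention
        g = self.gate(torch.cat([v_self, v_cross], dim=-1))
        # Gated mixture with a residual connection, then layer normalization.
        return self.norm(img + g * v_self + (1.0 - g) * v_cross)


if __name__ == "__main__":
    block = AsyncSelfAttentionBlock()
    img = torch.randn(2, 36, 512)   # e.g., 36 detected image regions
    qst = torch.randn(2, 14, 512)   # e.g., 14 question tokens
    print(block(img, qst).shape)    # torch.Size([2, 36, 512])
```

The gated residual mixture is one plausible way to balance fine-grained visual self-attention against text-guided attention; the paper's actual asynchronous scheduling and controller design should be taken from the full text.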

Key words

Visual Question Answering, Asynchronous Self-Attention, Deep Collaborative, Controller

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS240321003L

Publication information

Volume 22, Issue 1 (January 2025)
Year of Publication: 2025
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

Full text

Available in PDF

How to cite

Liu, H., Han, D., Zhang, S., Shi, J., Wu, H., Zhou, Y., Li, K.-C.: ASAM: Asynchronous Self-Attention Model for Visual Question Answering. Computer Science and Information Systems, Vol. 22, No. 1, 199–217. (2025), https://doi.org/10.2298/CSIS240321003L