ASAM: Asynchronous Self-Attention Model for Visual Question Answering
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
  liuhanshmtu@163.com, dzhan@shmtu.edu.cn, zhang shukai11@163.com, jingyashi00@163.com
- Merchant Marine College, Shanghai Maritime University, Shanghai 201306, China
  hfwu@shmtu.edu.cn
- Shanghai Anheng Times Information Technology Co., Ltd., Shanghai 200131, China
  anna.zhou@dbappsecurity.com.cn
- Dept. of Computer Science and Information Engineering, Providence University, Taiwan, China
  kuancli@pu.edu.tw
Abstract
Visual Question Answering (VQA) is an emerging area of deep learning that fuses image and question features into a joint representation used for answer classification. To make such models effective, the semantic information in both the text and the visual input must be fully exploited. Some researchers have improved model accuracy by adding new features or by strengthening the model's ability to extract fine-grained information, but these approaches make experiments more complex and expensive. We propose the Asynchronous Self-Attention Model (ASAM), which combines an asynchronous self-attention component with a controller, integrating the asynchronous self-attention mechanism and the collaborative attention mechanism to exploit the rich semantic information of the underlying visual features. ASAM is an end-to-end training framework that extracts and exploits the rich representational information of the underlying visual images while performing coordinated attention with the text features; rather than over-emphasizing fine-grained details, it strikes a balance between them, allowing the model to learn more valuable information. Extensive ablation experiments on the VQA v2 dataset verify the effectiveness of the proposed ASAM. The results show that the proposed model outperforms other state-of-the-art models without increasing model complexity or the number of parameters.
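The abstract describes the architecture only at a high level; the sketch below is a minimal, hypothetical illustration of how an asynchronous self-attention path and a question-guided co-attention path could be balanced by a learned controller gate. The module and parameter names (AsyncSelfAttentionBlock, gate, dimensions) are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch only; not the paper's exact ASAM implementation.
import torch
import torch.nn as nn


class AsyncSelfAttentionBlock(nn.Module):
    """Hypothetical block: image regions self-attend, then attend to the
    question tokens (co-attention); a learned controller gate mixes the two."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.co_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # "Controller": a per-feature gate balancing the two attention outputs.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # img: (B, R, dim) visual region features; txt: (B, T, dim) question features
        v_self, _ = self.self_attn(img, img, img)   # visual self-attention path
        v_co, _ = self.co_attn(img, txt, txt)       # question-guided co-attention path
        g = self.gate(torch.cat([v_self, v_co], dim=-1))
        return self.norm(img + g * v_self + (1 - g) * v_co)


if __name__ == "__main__":
    block = AsyncSelfAttentionBlock()
    img = torch.randn(2, 36, 512)   # e.g., 36 detected image regions
    txt = torch.randn(2, 14, 512)   # e.g., 14 question tokens
    print(block(img, txt).shape)    # torch.Size([2, 36, 512])
```

In this sketch the gate plays the "controller" role described in the abstract, deciding per feature how much weight to give the self-attended visual signal versus the text-guided signal, so neither path dominates.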
Key words
Visual Question Answering, Asynchronous Self-Attention, Deep Collaborative, Controller
Digital Object Identifier (DOI)
https://doi.org/10.2298/CSIS240321003L
Publication information
Volume 22, Issue 1 (January 2025)
Year of Publication: 2025
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium
Full text
Available in PDF
How to cite
Liu, H., Han, D., Zhang, S., Shi, J., Wu, H., Zhou, Y., Li, K.-C.: ASAM: Asynchronous Self-Attention Model for Visual Question Answering. Computer Science and Information Systems, Vol. 22, No. 1, 199–217. (2025), https://doi.org/10.2298/CSIS240321003L