ASAM: Asynchronous Self-Attention Model for Visual Question Answering

Han Liu1, Dezhi Han1, Shukai Zhang1, Jingya Shi1, Huafeng Wu2, Yachao Zhou3 and Kuan-Ching Li4

  1. College of Information Engineering, Shanghai Maritime University
    Shanghai 201306, China
    liuhanshmtu@163.com, dzhan@shmtu.edu.cn, zhang shukai11@163.com, jingyashi00@163.com
  2. Merchant Marine College, Shanghai Maritime University
    Shanghai 201306, China
    hfwu@shmtu.edu.cn
  3. Shanghai Anheng Times Information Technology Co., Ltd.
    Shanghai 200131, China
    anna.zhou@dbappsecurity.com.cn
  4. Dept of Computer Science and Information Engineering, Providence University
    Taiwan, China
    kuancli@pu.edu.tw

Abstract

Visual Question Answering (VQA) is an emerging field of deep learning that combines image and question features and fuses them into collaborative feature representations for classification. To enhance model effectiveness, it is crucial to fully exploit the semantic information of both text and vision. Some researchers have improved training accuracy by adding new features or by strengthening the model's ability to extract more detailed information; however, these methods make experimentation more challenging and expensive. We propose the Asynchronous Self-Attention Model (ASAM), which uses an asynchronous self-attention component and a controller to effectively integrate the asynchronous self-attention mechanism with a collaborative attention mechanism and thereby leverage the rich semantic information of the underlying visuals. ASAM realizes an end-to-end training framework that extracts and exploits the rich representational information of the underlying visual images while performing coordinated attention with text features; rather than over-emphasizing fine-grained detail, it strikes a balance, allowing the model to learn more valuable information. Extensive ablation experiments on the VQA v2 dataset verify the effectiveness of the proposed ASAM. The experimental results demonstrate that the proposed model outperforms other state-of-the-art models without increasing model complexity or the number of parameters.
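To make the architecture described above more concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): it combines self-attention over image region features with question-guided co-attention and uses a learned gate as a simple stand-in for the "controller" that balances the two streams. All class and parameter names here are illustrative assumptions.

```python
# Hypothetical sketch only: self-attention over image regions plus
# question-guided co-attention, mixed by a learned gate ("controller").
import torch
import torch.nn as nn


class AsyncSelfAttentionBlock(nn.Module):
    """Illustrative block: visual self-attention + question-guided co-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Self-attention over image region features (visual stream).
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Co-attention: image regions attend to question tokens (cross stream).
        self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate acting as a simple controller that mixes the two outputs.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, img: torch.Tensor, qst: torch.Tensor) -> torch.Tensor:
        # img: (batch, num_regions, dim); qst: (batch, num_tokens, dim)
        v_self, _ = self.self_attn(img, img, img)   # visual self-attention
        v_cross, _ = self.co_attn(img, qst, qst)    # question-guided co-attention
        g = self.gate(torch.cat([v_self, v_cross], dim=-1))
        # Gated mixture with a residual connection, then layer normalization.
        return self.norm(img + g * v_self + (1.0 - g) * v_cross)


if __name__ == "__main__":
    block = AsyncSelfAttentionBlock()
    img = torch.randn(2, 36, 512)   # e.g., 36 detected image regions
    qst = torch.randn(2, 14, 512)   # e.g., 14 question tokens
    print(block(img, qst).shape)    # torch.Size([2, 36, 512])
```

The gated residual mixture is one plausible way to balance fine-grained visual self-attention against text-guided attention; the paper's actual asynchronous scheduling and controller design should be taken from the full text.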

Key words

Visual Question Answering, Asynchronous Self-Attention, Deep Collaborative, Controller

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS240321003L

Publication information

Volume 22, Issue 1 (January 2025)
Year of Publication: 2025
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

Full text

Available in PDF

How to cite

Liu, H., Han, D., Zhang, S., Shi, J., Wu, H., Zhou, Y., Li, K.-C.: ASAM: Asynchronous Self-Attention Model for Visual Question Answering. Computer Science and Information Systems, Vol. 22, No. 1, 199–217. (2025), https://doi.org/10.2298/CSIS240321003L