Multimodal Deep Learning-based Feature Fusion for Object Detection in Remote Sensing Images
- College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China (wangliguo@hrbeu.edu.cn)
- College of Surveying and Geo-informatics, Tongji University
- College of Information and Communications Engineering, Dalian Minzu University, Dalian 116600, China
- Faculty of Sciences, University of Novi Sad, Serbia (mira@dmi.uns.ac.rs)
- Software College, Shenyang Normal University, Shenyang 110034, China (yslin@synu.edu.cn, lihang@synu.edu.cn)
Abstract
Object detection is an important computer vision task that evolved from image classification. Rather than assigning a single category label to an entire image, it must simultaneously classify and localize the multiple objects that may appear in an image: classification assigns a category label to each object, and localization determines the corner coordinates of its bounding box. Object detection is therefore more challenging and has broader application prospects, such as autonomous driving, face recognition, pedestrian detection, and medical detection. It also serves as the research basis for more complex computer vision tasks such as image segmentation, image captioning, object tracking, and action recognition. Traditional object detection methods exploit features poorly and are easily affected by environmental factors. Hence, this paper proposes a multimodal deep learning-based feature fusion method for object detection in remote sensing images. In the new model, Cascade R-CNN is the backbone network, and parallel Cascade R-CNN branches are fused to enhance feature expression ability. To handle objects of different shapes and sizes, the central part of the network adopts cascaded dilated (atrous) convolutions with multiple dilation rates, obtaining multi-receptive-field features without pooling and thus preserving image information. Meanwhile, an improved self-attention strategy combined with the receptive-field design captures both low-level features carrying edge details and high-level features carrying global semantics. Finally, we conduct ablation and comparison experiments on the DOTA dataset. The experimental results show that the mean Average Precision (mAP) and other metrics are greatly improved, and the method outperforms state-of-the-art detection algorithms. It thus has good application prospects in remote sensing image object detection.
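The motivation for stacking dilated convolutions instead of pooling can be made concrete with the standard receptive-field recurrence. The sketch below is illustrative only: the dilation rates and kernel sizes are assumptions, not the paper's exact configuration.

```python
# Hedged sketch: how cascaded dilated ("atrous") convolutions enlarge the
# receptive field without pooling. Rates [1, 2, 4] are illustrative
# assumptions, not the settings used in the paper.

def receptive_field(kernel_sizes, dilations, strides=None):
    """Overall 1-D receptive field of a stack of convolution layers."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, d, s in zip(kernel_sizes, dilations, strides):
        k_eff = d * (k - 1) + 1   # effective kernel size under dilation
        rf += (k_eff - 1) * jump  # each layer widens the field
        jump *= s                 # stride scales later contributions
    return rf

# Three cascaded 3x3 convolutions, stride 1 throughout:
print(receptive_field([3, 3, 3], [1, 1, 1]))  # plain convs -> 7
print(receptive_field([3, 3, 3], [1, 2, 4]))  # dilated     -> 15
```

With dilation the receptive field roughly doubles at the same parameter count and full spatial resolution, which is the property the multi-receptive-field branch relies on.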
Key words
Object detection, remote sensing image, multimodal deep learning, feature fusion
Digital Object Identifier (DOI)
https://doi.org/10.2298/CSIS241110011Y
Publication information
Volume 22, Issue 1 (January 2025)
Year of Publication: 2025
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium
Full text
Available in PDF
How to cite
Yin, S., Wang, Q., Wang, L., Ivanović, M., Li, H.: Multimodal Deep Learning-based Feature Fusion for Object Detection in Remote Sensing Images. Computer Science and Information Systems, Vol. 22, No. 1, 327–344. (2025), https://doi.org/10.2298/CSIS241110011Y