3D Convolutional Long Short-Term Encoder-Decoder Network for Moving Object Segmentation

Anil Turker¹ and Ender M. Eksioglu²

  1. ASELSAN Inc.
    Ankara, Turkey
  2. Electronics and Communication Engineering Department, Istanbul Technical University
    Istanbul, Turkey


Moving object segmentation (MOS) is an important and well-studied computer vision task used in a variety of applications, such as video surveillance systems, human tracking, self-driving cars, and video compression. While traditional approaches to MOS rely on hand-crafted features or background modeling, deep learning methods using Convolutional Neural Networks (CNNs) have been shown to be more effective at extracting features and to achieve better accuracy. However, most deep learning-based methods for MOS offer scene-dependent solutions, leading to reduced performance when tested on previously unseen video content. Because spatial features alone are insufficient to represent motion information, spatial and temporal features should be used together to succeed on unseen videos. To address this issue, we propose the MOS-Net deep framework, an encoder-decoder network whose variants combine spatial and temporal features using the flux tensor algorithm, 3D CNNs, and ConvLSTM. MOS-Net 2.0 is an enhanced version of the base MOS-Net structure, in which additional ConvLSTM modules are added to the 3D CNNs to extract long-term spatiotemporal features. In the final stage of the framework, the output of the encoder-decoder network, the foreground probability map, is thresholded to produce a binary mask in which moving objects form the foreground and the rest forms the background. In addition, an ablation study has been conducted to evaluate different combinations as inputs to the proposed network, using the ChangeDetection2014 (CDnet2014) dataset, which includes challenging videos such as those with dynamic backgrounds, bad weather, and illumination changes. In most approaches, the training and test strategies are not reported, making it difficult to compare algorithm results. The proposed method can be evaluated in two ways, as video-optimized or video-agnostic. In video-optimized approaches, the training and test sets are drawn randomly and separated from the overall dataset. The results of the proposed method are compared with competitive methods from the literature using the same evaluation strategy. It has been observed that the introduced MOS networks give highly competitive results on the CDnet2014 dataset. The source code for the simulations provided in this work is available online.
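
The final thresholding stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `threshold_probability_map`, the default threshold of 0.5, and the toy probability values are all assumptions made for illustration, since the abstract does not state the threshold used.

```python
from typing import List


def threshold_probability_map(
    prob_map: List[List[float]], threshold: float = 0.5
) -> List[List[int]]:
    """Convert a foreground probability map into a binary mask.

    Pixels with probability >= threshold become 1 (foreground, i.e. moving
    object); all other pixels become 0 (background). The default of 0.5 is an
    assumed value, not one taken from the paper.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


# Hypothetical 3x4 probability map standing in for the network output.
example_map = [
    [0.02, 0.10, 0.81, 0.95],
    [0.05, 0.49, 0.50, 0.88],
    [0.01, 0.03, 0.07, 0.60],
]

example_mask = threshold_probability_map(example_map)
for row in example_mask:
    print(row)
```

In practice the map would be an array produced by the encoder-decoder network, and the threshold would typically be chosen on validation data rather than fixed in advance.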

Key words

Moving object segmentation, flux tensor, deep learning, spatiotemporal, change detection, foreground segmentation, background subtraction

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS230129044T

Publication information

Volume 21, Issue 1 (January 2024)
Year of Publication: 2024
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Turker, A., Eksioglu, E. M.: 3D Convolutional Long Short-Term Encoder-Decoder Network for Moving Object Segmentation. Computer Science and Information Systems, Vol. 21, No. 1, 363–378. (2024), https://doi.org/10.2298/CSIS230129044T