Learn to Human-level Control in Dynamic Environment Using Incremental Batch Interrupting Temporal Abstraction

Yuchen Fu1, Zhipeng Xu1, Fei Zhu1,2,3,4, Quan Liu1,3 and Xiaoke Zhou5

  1. School of Computer Science and Technology, Soochow University, Shizi
    Street No.1 Box 158, Suzhou, China, 215006
    yuchenfu@suda.edu.cn, 20134227052@stu.suda.edu.cn, {zhufei, quanliu}@suda.edu.cn
  2. Provincial Key Laboratory for Computer Information Processing Technology
    Soochow University
  3. Collaborative Innovation Center of Novel Software Technology and Industrialization
  4. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
    Changchun, 130012, P.R. China
  5. University of the Basque Country
    Spain
    xzhou001@ikasle.ehu.eus

Abstract

The temporal world is characterized by dynamics and variability. Many machine learning algorithms are difficult to apply directly to practical control problems, whereas hierarchical reinforcement learning can handle them. It is also common to have partial solutions available, called options, which are learned from knowledge or predefined by the system and which solve sub-tasks of the problem. Options can be reused when determining a control policy. Many traditional semi-Markov decision process methods take advantage of options, but most of them treat an option as a primitive object. Because the environment is uncertain and variable, such methods cannot deal with real-world control problems effectively. Based on the idea of interrupting options in a dynamic environment, a Q-learning control method that uses temporal abstraction, named I-QOption, is introduced. The I-QOption approach combines interruption with the characteristics of the dynamic environment so that it can learn and improve a control policy in that environment. The Q-learning framework helps the method learn from interaction with raw data and achieve human-level control. The I-QOption algorithm is applied to grid world, a benchmark for evaluation in dynamic environments. The experimental results show that the proposed algorithm can learn and improve its policy effectively in a dynamic environment.

Key words

hierarchical reinforcement learning, option, reinforcement learning, online learning, dynamic environment

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS160210015F

Publication information

Volume 13, Issue 2 (June 2016)
Year of Publication: 2016
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Fu, Y., Xu, Z., Zhu, F., Liu, Q., Zhou, X.: Learn to Human-level Control in Dynamic Environment Using Incremental Batch Interrupting Temporal Abstraction. Computer Science and Information Systems, Vol. 13, No. 2, 561–577. (2016), https://doi.org/10.2298/CSIS160210015F