Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning


Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method alone results in an agent that consistently learns intelligent behavior across environments. In an effort to find a single entropy-based method that encourages emergent behaviors in any environment, we propose an agent that can adapt its objective online depending on the entropy conditions, framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit, which captures the agent's ability to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes, and can learn skillful behaviors in benchmark tasks. Videos of the trained agents and summarized findings can be found on our project page.
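To make the bandit framing concrete, here is a minimal sketch of how an agent might choose between the two intrinsic objectives. The abstract only states that the choice is framed as a multi-armed bandit with a feedback signal capturing the agent's ability to control entropy; the use of UCB1 and the specific relative-deviation feedback below are illustrative assumptions, not the paper's actual algorithm, and `entropy_control_feedback`, `agent_entropy`, and `baseline_entropy` are hypothetical names.

```python
import numpy as np

class ObjectiveBandit:
    """Two-armed bandit choosing between the surprise-minimizing (arm 0)
    and surprise-maximizing (arm 1) intrinsic objectives.

    A minimal UCB1 sketch; the paper's exact bandit algorithm may differ.
    """
    def __init__(self, n_arms=2, c=2.0):
        self.counts = np.zeros(n_arms)  # pulls per arm
        self.values = np.zeros(n_arms)  # running mean feedback per arm
        self.c = c                      # exploration coefficient
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Play each arm once before applying the UCB rule.
        for a in range(len(self.counts)):
            if self.counts[a] == 0:
                return a
        ucb = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, feedback):
        self.counts[arm] += 1
        # Incremental update of the mean feedback for this arm.
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]


def entropy_control_feedback(agent_entropy, baseline_entropy, eps=1e-8):
    """Hypothetical feedback signal: how far the agent pushed episode
    entropy away from a baseline (e.g., a random policy's entropy).
    The paper defines its own signal; this form is an assumption."""
    return abs(agent_entropy - baseline_entropy) / (baseline_entropy + eps)
```

A sketch of how this could sit in a training loop: the bandit picks an objective at the start of an episode, the chosen sign is applied to the per-step surprise reward, and the bandit is updated from the episode-level entropy feedback.

```python
bandit = ObjectiveBandit()
arm = bandit.select_arm()            # 0: minimize surprise, 1: maximize surprise
sign = -1.0 if arm == 0 else 1.0
# per step: intrinsic_reward = sign * surprise(state)
# at episode end (ep_entropy / rand_entropy are hypothetical measurements):
# bandit.update(arm, entropy_control_feedback(ep_entropy, rand_entropy))
```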

https://arxiv.org/abs/2405.17243

https://arxiv.org/pdf/2405.17243.pdf
