Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation

Robot manipulation policies show unsatisfactory performance when confronted with novel tasks or object instances. Hence, the capability to automatically detect and self-correct failed actions is essential for a practical robotic system. Recently, Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following and demonstrated strong reasoning abilities across a variety of tasks. To unleash general MLLMs as end-to-end robotic agents, we introduce a Self-Corrected (SC)-MLLM, equipping our model not only to predict end-effector poses but also to autonomously recognize and correct failed actions. Specifically, we first conduct parameter-efficient fine-tuning to empower the MLLM with pose prediction ability, reframed as a language modeling problem. When facing execution failures, our model learns to identify the causes of low-level action errors (i.e., position and rotation errors) and adaptively seeks prompt feedback from experts. Based on the feedback, SC-MLLM rethinks the current failure scene and generates corrected actions. Furthermore, we design a continuous policy learning method for successfully corrected samples, enhancing the model's adaptability to the current scene configuration and reducing the frequency of expert intervention. To evaluate SC-MLLM, we conduct extensive experiments in both simulation and real-world settings. The SC-MLLM agent significantly improves manipulation accuracy compared with the previous state-of-the-art robotic MLLM (ManipLLM), increasing from 57% to 79% on seen object categories and from 47% to 69% on unseen novel categories.
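
As a rough mental model of the pipeline the abstract describes (predict a pose, execute it, classify the failure cause, query an expert for prompt feedback, regenerate the action, and continually fine-tune on corrected samples), here is a minimal Python sketch. Every name in it (MLLMStub, ExpertStub, EnvStub, PoseAction, self_corrected_episode) is a hypothetical placeholder for exposition, not the authors' actual interface or training code.

```python
# Illustrative sketch only: stubs stand in for the MLLM, the expert, and the
# environment described in the abstract. Nothing here is the paper's real API.
from dataclasses import dataclass
from typing import List, Optional
import random


@dataclass
class PoseAction:
    position: tuple   # end-effector xyz
    rotation: tuple   # end-effector orientation (e.g., Euler angles)


class MLLMStub:
    """Stands in for the fine-tuned MLLM: pose prediction is framed as
    language modeling, and the model can also name the low-level cause
    of a failed action (position vs. rotation error)."""

    def predict_pose(self, image, instruction: str,
                     feedback: Optional[str] = None) -> PoseAction:
        # With expert feedback the model "rethinks" the scene; here we just
        # tweak a dummy pose to keep the sketch runnable.
        jitter = 0.0 if feedback else 0.1
        return PoseAction(position=(0.4 + jitter, 0.0, 0.2), rotation=(0.0, 90.0, 0.0))

    def classify_failure(self, image, action: PoseAction) -> str:
        return random.choice(["position", "rotation"])

    def continual_update(self, samples: List[dict]) -> None:
        # Placeholder for parameter-efficient fine-tuning on corrected samples.
        pass


class ExpertStub:
    """Stands in for the expert that returns prompt feedback on request."""

    def feedback(self, error_cause: str, action: PoseAction) -> str:
        return f"The {error_cause} of the predicted pose is off; adjust it."


class EnvStub:
    """Stands in for the simulator or real robot."""

    def observe(self):
        return None  # would return an RGB(-D) observation

    def execute(self, action: PoseAction) -> bool:
        return random.random() > 0.5  # success flag from the environment


def self_corrected_episode(mllm: MLLMStub, expert: ExpertStub, env: EnvStub,
                           instruction: str, max_retries: int = 3) -> bool:
    """Predict a pose, execute it, and on failure: identify the error cause,
    request expert feedback, and regenerate a corrected action. Successfully
    corrected samples feed continuous policy learning, which is meant to
    reduce how often the expert is queried later."""
    corrected: List[dict] = []
    image = env.observe()
    action = mllm.predict_pose(image, instruction)
    success = env.execute(action)

    for _ in range(max_retries):
        if success:
            break
        cause = mllm.classify_failure(image, action)   # "position" or "rotation"
        hint = expert.feedback(cause, action)           # adaptive expert prompt
        image = env.observe()
        action = mllm.predict_pose(image, instruction, feedback=hint)
        success = env.execute(action)
        if success:
            corrected.append({"image": image, "instruction": instruction,
                              "action": action})

    if corrected:
        mllm.continual_update(corrected)
    return success


if __name__ == "__main__":
    print(self_corrected_episode(MLLMStub(), ExpertStub(), EnvStub(),
                                 "open the drawer"))
```

The design point this sketch tries to convey is that correction is conditioned on an explicit, low-level error cause rather than a generic retry, and that corrected trajectories are recycled as training data so expert intervention becomes less frequent over time.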

https://arxiv.org/abs/2405.17418

https://arxiv.org/pdf/2405.17418.pdf
