
In real-world scenarios, many robotic manipulation tasks are hindered by occlusions and limited fields of view, posing significant challenges for passive observation-based models that rely on fixed or wrist-mounted cameras.
In this paper, we investigate robotic manipulation under limited visual observation and propose a task-driven asynchronous active vision-action model. Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best-Pose (NBP) policy and trains them jointly in a sensor-motor coordination framework using few-shot reinforcement learning. This enables the agent to reposition a third-person camera to actively observe the environment according to the task goal and then determine the appropriate manipulation actions.
Our model adopts a dual-agent structure: the NBV agent selects informative viewpoints, and the NBP agent predicts gripper actions from the observations captured at those viewpoints. The two agents alternate between sensor and motor actions, allowing the robot to actively interact with the environment and accomplish the task goal, as sketched below.
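To make the alternation concrete, the following minimal sketch outlines one episode of the observe-then-act loop. The NBVAgent, NBPAgent, and environment interfaces here are hypothetical placeholders for illustration, not the released implementation.

```python
# Minimal sketch of the asynchronous observe-then-act loop.
# NBVAgent, NBPAgent, and the `env` interface are hypothetical placeholders;
# the released implementation may differ.

def run_episode(env, nbv_agent, nbp_agent, max_steps=10):
    """Alternate camera (sensor) actions and gripper (motor) actions."""
    obs = env.reset()                      # initial (possibly occluded) observation
    for _ in range(max_steps):
        # 1) Sensor action: choose the next-best camera viewpoint for the task goal.
        viewpoint = nbv_agent.select_viewpoint(obs)
        obs = env.move_camera(viewpoint)   # re-render from the new third-person view

        # 2) Motor action: predict the next-best gripper pose from the new view.
        gripper_pose = nbp_agent.select_pose(obs)
        obs, reward, done = env.step(gripper_pose)

        if done:
            break
    return reward
```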
We trained and evaluated our model on RLBench, a widely used public benchmark for robotic manipulation. We selected 8 tasks from RLBench according to two criteria: (1) each task includes at least one RLBench default viewpoint with significant visual occlusion, and (2) no single fixed or wrist-mounted camera provides consistently low occlusion across all selected tasks. We compared our model against three categories of baselines:
For the first category, we compared our model with C2FARM (C2F) under static front, overhead, wrist, and front + wrist (f+w) camera settings to evaluate the effectiveness of our active vision-action model on tasks with limited observability. For the Oracle baseline, we disabled NBV viewpoint selection, since full camera views were always accessible. The statistical results on the selected tasks are shown in the table below.
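As an illustration of how a static-camera baseline can be configured, the sketch below enables only the front and wrist RGB-D cameras (the f+w setting) through RLBench's ObservationConfig. It assumes the standard RLBench Python API (module paths depend on the RLBench version); the exact configuration used in our experiments may differ.

```python
# Sketch: enabling only the front and wrist RGB-D cameras (the "f+w" setting)
# for a static-camera baseline. Assumes the standard RLBench Python API;
# the exact configuration used in our experiments may differ.
from rlbench.environment import Environment
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import EndEffectorPoseViaPlanning
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.observation_config import ObservationConfig
from rlbench.tasks import OpenDrawer

obs_config = ObservationConfig()
obs_config.set_all(False)                 # disable all observation streams first
obs_config.front_camera.set_all(True)     # front RGB-D
obs_config.wrist_camera.set_all(True)     # wrist RGB-D

env = Environment(
    action_mode=MoveArmThenGripper(EndEffectorPoseViaPlanning(), Discrete()),
    obs_config=obs_config,
    headless=True,
)
env.launch()
task = env.get_task(OpenDrawer)           # one of the 8 selected occlusion-heavy tasks
descriptions, obs = task.reset()
```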
The figure below illustrates the learned camera viewpoints for various tasks, comparing demonstration viewpoints (red squares), viewpoints learned by Behavior Cloning (BC) (green circles), and viewpoints learned by our model (blue triangles). Our model produces more task-relevant and adaptive viewpoint distributions: tasks such as basketball_in_hoop and put_rubbish_in_bin show tight clustering, while tasks such as meat_on_grill and open_drawer show broader coverage, reflecting the generalization gained through interaction-driven learning. This indicates that our model learns beyond imitation and achieves higher task success than BC.
Qualitative results: for each of the eight selected tasks, we show three success cases, one failure case, and the corresponding active camera views.
@article{wang2024observe,
title={Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation},
author={Wang, Guokang and Li, Hang and Zhang, Shuyuan and Liu, Yanhong and Liu, Huaping},
journal={arXiv preprint arXiv:2409.14891},
year={2024}
}