
In real-world scenarios, many robotic manipulation tasks are hindered by occlusions and limited fields of view, posing significant challenges for passive observation-based models that rely on fixed or wrist-mounted cameras.
In this paper, we investigate robotic manipulation under limited visual observation and propose a task-driven asynchronous active vision-action model. Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best-Pose (NBP) policy and trains them jointly in a sensor-motor coordination framework using few-shot reinforcement learning. This enables the agent to reposition a third-person camera to actively observe the environment according to the task goal and then determine the appropriate manipulation actions.
Our model adopts a dual-agent structure: the NBV agent selects informative viewpoints, and the NBP agent predicts gripper actions from the observations captured at those viewpoints. The two agents alternate between sensor and motor actions, allowing the robot to actively interact with the environment and accomplish the task goal, as sketched below.
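To make the alternation concrete, the following minimal sketch outlines one episode of the observe-then-act loop. The NBVAgent, NBPAgent, and environment interfaces here are hypothetical placeholders for illustration, not the released implementation.

```python
# Minimal sketch of the asynchronous observe-then-act loop.
# NBVAgent, NBPAgent, and the `env` interface are hypothetical placeholders;
# the released implementation may differ.

def run_episode(env, nbv_agent, nbp_agent, max_steps=10):
    """Alternate camera (sensor) actions and gripper (motor) actions."""
    obs = env.reset()                      # initial (possibly occluded) observation
    for _ in range(max_steps):
        # 1) Sensor action: choose the next-best camera viewpoint for the task goal.
        viewpoint = nbv_agent.select_viewpoint(obs)
        obs = env.move_camera(viewpoint)   # re-render from the new third-person view

        # 2) Motor action: predict the next-best gripper pose from the new view.
        gripper_pose = nbp_agent.select_pose(obs)
        obs, reward, done = env.step(gripper_pose)

        if done:
            break
    return reward
```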
We trained and evaluated our model on RLBench, a widely used public benchmark for robotic manipulation. We selected 8 tasks from RLBench according to two criteria: (1) each task includes at least one RLBench default viewpoint with significant visual occlusion, and (2) no single fixed or wrist-mounted camera provides consistently low occlusion across all selected tasks. We compared our model against three categories of baselines:
For the first category, we compared our model with C2FARM (C2F) under static front, overhead, wrist, and front + wrist (f+w) camera settings to evaluate the effectiveness of our active vision-action model on tasks with limited observability. For the Oracle baseline, we disabled NBV viewpoint selection, since full camera views were always accessible. The statistical results on the selected tasks are shown in the table below.
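As an illustration of how a static-camera baseline can be configured, the sketch below enables only the front and wrist RGB-D cameras (the f+w setting) through RLBench's ObservationConfig. It assumes the standard RLBench Python API (module paths depend on the RLBench version); the exact configuration used in our experiments may differ.

```python
# Sketch: enabling only the front and wrist RGB-D cameras (the "f+w" setting)
# for a static-camera baseline. Assumes the standard RLBench Python API;
# the exact configuration used in our experiments may differ.
from rlbench.environment import Environment
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import EndEffectorPoseViaPlanning
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.observation_config import ObservationConfig
from rlbench.tasks import OpenDrawer

obs_config = ObservationConfig()
obs_config.set_all(False)                 # disable all observation streams first
obs_config.front_camera.set_all(True)     # front RGB-D
obs_config.wrist_camera.set_all(True)     # wrist RGB-D

env = Environment(
    action_mode=MoveArmThenGripper(EndEffectorPoseViaPlanning(), Discrete()),
    obs_config=obs_config,
    headless=True,
)
env.launch()
task = env.get_task(OpenDrawer)           # one of the 8 selected occlusion-heavy tasks
descriptions, obs = task.reset()
```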
The figure below illustrates the learned camera viewpoints for various tasks, comparing demonstration viewpoints (red squares), viewpoints learned by Behavior Cloning (BC) (green circles), and viewpoints learned by our model (blue triangles). Our model produces more task-relevant and adaptive viewpoint distributions: tasks such as basketball_in_hoop and put_rubbish_in_bin show tight clustering, while tasks such as meat_on_grill and open_drawer show broader coverage, reflecting the generalization gained through interaction-driven learning. This indicates that our model learns beyond imitation and achieves higher task success than BC.
Qualitative results: for each of the eight selected tasks, we show three success cases, one failure case, and the corresponding active camera views.
@article{wang2024observe,
title={Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation},
author={Wang, Guokang and Li, Hang and Zhang, Shuyuan and Liu, Yanhong and Liu, Huaping},
journal={arXiv preprint arXiv:2409.14891},
year={2024}
}