ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots

FOCUS Laboratory and RoboPI Laboratory, Dept. of ECE, University of Florida


Overview


Illustration of an ongoing exploration and target discovery task by the proposed ClipRover system.

Vision-language navigation (VLN) has emerged as a promising paradigm, enabling mobile robots to perform zero-shot inference and execute tasks without specific pre-programming. However, current systems often separate map exploration and path planning, with exploration relying on inefficient algorithms due to limited (partially observed) environmental information.

In this project, we present a novel navigation pipeline named ClipRover for simultaneous exploration and target discovery in unknown environments, leveraging the capabilities of a vision-language model (VLM) named CLIP. Our approach requires only monocular vision and operates without any prior map or knowledge about the target. The VLM framework adopts a modular architecture organized into three key stages: perception, correlation, and decision. This modular design enables the decomposition of complex system-level challenges into smaller, manageable sub-problems, allowing for iterative improvements to each stage. For comprehensive evaluations, we design the functional prototype of a UGV (unmanned ground vehicle) system named Rover Master, a customized platform for general-purpose VLN tasks. We integrate and deploy the ClipRover pipeline on Rover Master to evaluate its throughput, obstacle avoidance capability, and trajectory performance across various real-world scenarios.
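For intuition, here is a minimal sketch of how such a perception-correlation-decision loop could be assembled around an off-the-shelf CLIP model (via the Hugging Face transformers library). The sector-based framing of the camera image, the text prompt, and the argmax decision rule are simplifying assumptions made for illustration only; please refer to the paper for the actual ClipRover pipeline.

```python
# Illustrative perception -> correlation -> decision loop (not the exact ClipRover code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def perceive(frame: Image.Image, n_sectors: int = 5):
    """Perception: split the monocular frame into vertical sectors (candidate headings)."""
    w, h = frame.size
    step = w // n_sectors
    return [frame.crop((i * step, 0, (i + 1) * step, h)) for i in range(n_sectors)]

def correlate(sectors, target_prompt: str):
    """Correlation: score each sector against the target description with CLIP."""
    inputs = processor(text=[target_prompt], images=sectors, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1)  # cosine similarity per sector

def decide(scores: torch.Tensor) -> int:
    """Decision: steer toward the most promising sector (middle index ~ go straight)."""
    return int(torch.argmax(scores))

# Hypothetical usage with a single camera frame:
# frame = Image.open("frame.jpg")
# heading_index = decide(correlate(perceive(frame), "a small toy bear"))
```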


Rover Master: The Novel Robotic Platform


We develop Rover Master, a novel low-power, low-cost UGV platform that is portable and scalable for 2D navigation tasks. The major sensors and actuation components are shown in the figure above. The platform includes a monocular RGB camera and a 2D LiDAR for exteroceptive perception. Four independent wheel assemblies are responsible for actuation; each wheel assembly is modular and self-contained, i.e., it consists of a gearbox, a brushless DC motor, and a suspension system. The platform also includes a pseudo odometer that uses telemetry data from the electronic speed controller (ESC) driving the motors. Additionally, the driver stack consists of a flight controller that reports onboard sensory data for planning and navigation.
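As a rough illustration of the pseudo-odometry idea, the sketch below converts per-wheel speed telemetry from the ESC into a traveled-distance estimate. The pole-pair count, gear ratio, and wheel radius are placeholder values assumed for illustration, not the actual Rover Master parameters.

```python
# Sketch: deriving a pseudo-odometry signal from ESC telemetry.
# Pole-pair count, gear ratio, and wheel radius are assumed placeholder values.
import math

MOTOR_POLE_PAIRS = 7    # electrical-to-mechanical RPM divisor (assumed)
GEAR_RATIO = 5.0        # gearbox reduction (assumed)
WHEEL_RADIUS_M = 0.05   # wheel radius in meters (assumed)

def wheel_speed_from_erpm(erpm: float) -> float:
    """Convert ESC-reported electrical RPM to wheel surface speed in m/s."""
    mech_rpm = erpm / MOTOR_POLE_PAIRS / GEAR_RATIO
    return mech_rpm / 60.0 * 2.0 * math.pi * WHEEL_RADIUS_M

def integrate_distance(erpm_samples, dt: float) -> float:
    """Accumulate traveled distance from a stream of 4-tuple per-wheel eRPM samples."""
    distance = 0.0
    for sample in erpm_samples:
        # Average the four wheels as a crude body-speed estimate.
        body_speed = sum(wheel_speed_from_erpm(v) for v in sample) / len(sample)
        distance += abs(body_speed) * dt
    return distance
```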

In our setup, we configured the platform to: (i) deliver sufficient computational power to handle a general-purpose vision-language model and other computationally intensive tasks; (ii) have enough mechanical stability to hold the camera in an elevated position without excess vibration; and (iii) support 3-DOF motion (forward/backward, sideways, and rotation).
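To illustrate requirement (iii), the sketch below maps a 3-DOF body velocity command (vx, vy, wz) to four wheel speeds using standard mecanum-wheel inverse kinematics. The omnidirectional wheel type and the geometry constants are assumptions made here for illustration and may differ from the actual Rover Master drivetrain.

```python
# Sketch: 3-DOF body command (vx, vy, wz) -> four wheel speeds,
# assuming a mecanum-style omnidirectional drive (an assumption for illustration).
WHEEL_RADIUS_M = 0.05   # assumed wheel radius
HALF_LENGTH_M = 0.15    # assumed half wheelbase (front-back)
HALF_WIDTH_M = 0.12     # assumed half track width (left-right)

def mecanum_inverse_kinematics(vx: float, vy: float, wz: float):
    """Return wheel angular velocities (rad/s) as [front-left, front-right, rear-left, rear-right]."""
    k = HALF_LENGTH_M + HALF_WIDTH_M
    return [
        (vx - vy - k * wz) / WHEEL_RADIUS_M,  # front-left
        (vx + vy + k * wz) / WHEEL_RADIUS_M,  # front-right
        (vx + vy - k * wz) / WHEEL_RADIUS_M,  # rear-left
        (vx - vy + k * wz) / WHEEL_RADIUS_M,  # rear-right
    ]

# Example: pure sideways motion at 0.3 m/s -> mecanum_inverse_kinematics(0.0, 0.3, 0.0)
```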


Experimental Analyses


Setup. Real-world experiments were conducted in an indoor workspace, as depicted in the testbed figure. This environment was selected for its cluttered layout and complex details, which include numerous obstacles, potential traps, and loops, providing a challenging setting for comprehensive analysis. As shown in this figure, the robot was tasked with exploring the lab space while searching for a designated target: a toy bear approximately 20 cm tall and 10 cm wide, chosen for its distinctive appearance within the test scene. For each task, the source (robot's starting position) and destination (target location) were selected from five predefined regions (SW, C, NW, NE, SE).

Evaluation criteria. Our experimental trials are designed to evaluate the efficiency of exploring an area and the success rate of finding a target with no prior map or knowledge about the target. To quantify these, we use the total distance traveled (trajectory length) before approaching the target as the core metric.
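This metric is straightforward to compute from logged waypoints; a minimal sketch follows. For aggregation across source-target pairs, the trajectory length is normalized by the baseline travel distance Dbaseline (see the results below); how Dbaseline is defined for each pair follows the paper, and it is simply taken as an input here.

```python
# Sketch: trajectory-length metric and its normalization by the baseline distance.
import math

def path_length(waypoints) -> float:
    """Total distance traveled along a sequence of (x, y) waypoints."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))

def normalized_path_length(waypoints, d_baseline: float) -> float:
    """Trajectory length expressed as a multiple of the baseline distance Dbaseline."""
    return path_length(waypoints) / d_baseline

# Example: a trial whose trajectory is 2.66x the baseline contributes 2.66 to the statistics.
```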

Benchmark comparison. We compare the performance of the proposed ClipRover system with six widely used map-traversal and path-finding algorithms. For fair comparisons, a novel 2D simulation framework, RoboSim2D, is developed. It utilizes 2D LiDAR maps generated from recorded scans of the same environment used in real-world experiments. We consider three map traversal algorithms (Random Walk, Wall Bounce, and Wave Front) and three Bug algorithms (Bug0, Bug1, and Bug2) for comparison. Please refer to the paper for their implementation details.
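For intuition, the toy sketch below steps a point robot through a 2D occupancy grid under the two simplest traversal baselines: Random Walk re-samples a fresh heading whenever the robot is blocked, while Wall Bounce turns by a fixed bounce angle. The grid representation, step size, and bounce angle are illustrative assumptions and do not reflect the RoboSim2D implementation.

```python
# Toy 2D traversal sketch (not the RoboSim2D implementation).
import math
import random

def is_free(grid, x: float, y: float) -> bool:
    """True if (x, y) falls inside the map and on a free cell (0 = free, 1 = occupied)."""
    xi, yi = int(round(x)), int(round(y))
    return 0 <= yi < len(grid) and 0 <= xi < len(grid[0]) and grid[yi][xi] == 0

def traverse(grid, start, heading: float, steps: int, mode: str = "random_walk", step: float = 0.5):
    """Step a point robot through the grid and return the resulting trajectory."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        nx, ny = x + step * math.cos(heading), y + step * math.sin(heading)
        if is_free(grid, nx, ny):
            x, y = nx, ny
            path.append((x, y))
        elif mode == "random_walk":
            heading = random.uniform(0.0, 2.0 * math.pi)   # pick a fresh random heading
        else:  # "wall_bounce": rotate by a fixed bounce angle until the way is clear
            heading += math.radians(120.0)
    return path

# Example usage on a tiny map:
# grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
# trajectory = traverse(grid, start=(0.0, 0.0), heading=0.0, steps=50, mode="wall_bounce")
```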

The testbed used for real-world experiments.

Results & Comparison


ClipRover

We conduct extensive real-world experiments with over 60 trials spanning various combinations of source and target locations. For the proposed ClipRover system, three trials were performed for each source-target pair. In comparison, 200 trials were recorded for each random walk experiment, while 180 evenly distributed initial headings were recorded for each source-target combination in the wall bounce runs.

Samples of the trajectories traversed by ClipRover and other algorithms are illustrated below, overlaid on a 2D map of the environment. Under normal navigation mode, ClipRover exhibits a human-like motion strategy, navigating near the center of open spaces, which enhances exploration efficiency and reduces the likelihood of becoming trapped by obstacles. In contrast, path-finding algorithms often follow the contours of obstacles due to their lack of semantic scene understanding. Additionally, our algorithm demonstrates a preference for unexplored (unfamiliar) areas over previously visited (familiar) regions, contributing to improved efficiency compared to traditional map-traversal algorithms.

Out of all 60 trials, 3 failure cases were recorded, yielding a 95% overall success rate for ClipRover. Two of these were caused by the trajectory length exceeding the upper limit, while the third was caused by the robot getting jammed against a reflective metal tank that was detected by neither the VLN pipeline nor the LiDAR proximity switch. To ensure comparability across different source-target pairs, the travel distance in each case is normalized by the baseline travel distance (Dbaseline) and then aggregated into the following table.


Sample trajectories of the benchmark algorithms (Bug0, Bug1, Bug2, Random Walk, Wall Bounce, and Wave Front) overlaid on the environment map.

Quantitative performance comparison of ClipRover and other algorithms.
Method         Success Rate   Mean Path Length (× Dbaseline)   Additional Requirements
ClipRover      95.00%         2.66 ± 1.73                      None
Random Walk    50.00%         10.77 ± 6.26                     None
               80.00%         20.56 ± 14.65
Wall Bounce    50.00%         9.58 ± 5.58                      None
               80.00%         18.71 ± 13.66
Wave Front     50.00%         13.86 ± 5.31                     None
               80.00%         27.30 ± 10.11
Bug0           45.00%         1.14 ± 0.08                      Target location
Bug1           100.00%        6.34 ± 3.98                      Target location, memory of trajectory
Bug2           80.00%         1.90 ± 1.51                      Target location, m-line detection

We observe that ClipRover travels 2.66 × Dbaseline on average before reaching the target, while the next best score, from Wall Bounce, is 9.58 × Dbaseline at a significantly lower success rate (50%). Note that ClipRover achieves these gains despite having the same prior information as the map traversal algorithms. On the other hand, although the Bug* algorithms use additional information such as the target location, ClipRover (95%) still outperforms Bug1 (6.34 × Dbaseline) in trajectory length and offers significantly higher success rates than the Bug0 (45%) and Bug2 (80%) algorithms.

As summarized in the table, despite the inherent disadvantage of operating without prior target knowledge or precise localization, the system achieves performance comparable to path-finding algorithms. Notably, in many scenarios, it surpasses path-finding algorithms in either trajectory efficiency or success rates. These results highlight the effectiveness of the proposed pipeline for simultaneous exploration and target discovery tasks.


Acknowledgments


This work is supported in part by the National Science Foundation (NSF) grants #2330416, #19244; the Office of Naval Research (ONR) grants #N000142312429, #N000142312363; and the University of Florida (UF) ROSF research grant #132763.