ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots

FOCUS Laboratory and RoboPI Laboratory, Dept. of ECE, University of Florida


Overview


Illustration of an ongoing exploration and target discovery task by the proposed ClipRover system.

Vision-language navigation (VLN) has emerged as a promising paradigm, enabling mobile robots to perform zero-shot inference and execute tasks without specific pre-programming. However, current systems often separate map exploration and path planning, with exploration relying on inefficient algorithms due to limited (partially observed) environmental information.

In this project, we present a novel navigation pipeline named ClipRover for simultaneous exploration and target discovery in unknown environments, leveraging the capabilities of a vision-language model (VLM) named CLIP. Our approach requires only monocular vision and operates without any prior map or knowledge about the target. The VLM framework adopts a modular architecture organized into three key stages: perception, correlation, and decision. This modular design enables the decomposition of complex system-level challenges into smaller, manageable sub-problems, allowing for iterative improvements to each stage. For comprehensive evaluations, we design the functional prototype of a UGV (unmanned ground vehicle) system named Rover Master, a customized platform for general-purpose VLN tasks. We integrate and deploy the ClipRover pipeline on Rover Master to evaluate its throughput, obstacle avoidance capability, and trajectory performance across various real-world scenarios.
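For intuition, here is a minimal sketch of how such a perception-correlation-decision loop could be assembled around an off-the-shelf CLIP model (via the Hugging Face transformers library). The sector-based framing of the camera image, the text prompt, and the argmax decision rule are simplifying assumptions made for illustration only; please refer to the paper for the actual ClipRover pipeline.

```python
# Illustrative perception -> correlation -> decision loop (not the exact ClipRover code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def perceive(frame: Image.Image, n_sectors: int = 5):
    """Perception: split the monocular frame into vertical sectors (candidate headings)."""
    w, h = frame.size
    step = w // n_sectors
    return [frame.crop((i * step, 0, (i + 1) * step, h)) for i in range(n_sectors)]

def correlate(sectors, target_prompt: str):
    """Correlation: score each sector against the target description with CLIP."""
    inputs = processor(text=[target_prompt], images=sectors, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1)  # cosine similarity per sector

def decide(scores: torch.Tensor) -> int:
    """Decision: steer toward the most promising sector (middle index ~ go straight)."""
    return int(torch.argmax(scores))

# Hypothetical usage with a single camera frame:
# frame = Image.open("frame.jpg")
# heading_index = decide(correlate(perceive(frame), "a small toy bear"))
```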


Rover Master: The Novel Robotic Platform


We develop Rover Master, a novel low-power, low-cost UGV platform that is portable and scalable for 2D navigation tasks. The major sensors and actuation components are shown in the figure above. The platform includes a monocular RGB camera and a 2D LiDAR for exteroceptive perception. Four independent wheel assemblies are responsible for actuation; each wheel assembly is modular and self-contained, i.e., it consists of a gearbox, a brushless DC motor, and a suspension system. The platform also includes a pseudo odometer that uses telemetry data from the electronic speed controller (ESC) driving the motors. Additionally, the driver stack consists of a flight controller that reports onboard sensory data for planning and navigation.
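As a rough illustration of the pseudo-odometry idea, the sketch below converts per-wheel speed telemetry from the ESC into a traveled-distance estimate. The pole-pair count, gear ratio, and wheel radius are placeholder values assumed for illustration, not the actual Rover Master parameters.

```python
# Sketch: deriving a pseudo-odometry signal from ESC telemetry.
# Pole-pair count, gear ratio, and wheel radius are assumed placeholder values.
import math

MOTOR_POLE_PAIRS = 7    # electrical-to-mechanical RPM divisor (assumed)
GEAR_RATIO = 5.0        # gearbox reduction (assumed)
WHEEL_RADIUS_M = 0.05   # wheel radius in meters (assumed)

def wheel_speed_from_erpm(erpm: float) -> float:
    """Convert ESC-reported electrical RPM to wheel surface speed in m/s."""
    mech_rpm = erpm / MOTOR_POLE_PAIRS / GEAR_RATIO
    return mech_rpm / 60.0 * 2.0 * math.pi * WHEEL_RADIUS_M

def integrate_distance(erpm_samples, dt: float) -> float:
    """Accumulate traveled distance from a stream of 4-tuple per-wheel eRPM samples."""
    distance = 0.0
    for sample in erpm_samples:
        # Average the four wheels as a crude body-speed estimate.
        body_speed = sum(wheel_speed_from_erpm(v) for v in sample) / len(sample)
        distance += abs(body_speed) * dt
    return distance
```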

In our setup, we configured the platform to: (i) deliver sufficient computational power to handle a general-purpose vision-language model and other computationally intensive tasks; (ii) have enough mechanical stability to hold the camera in an elevated position without excess vibration; and (iii) support 3-DOF motion (forward/backward, sideways, and rotation).
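To illustrate requirement (iii), the sketch below maps a 3-DOF body velocity command (vx, vy, wz) to four wheel speeds using standard mecanum-wheel inverse kinematics. The omnidirectional wheel type and the geometry constants are assumptions made here for illustration and may differ from the actual Rover Master drivetrain.

```python
# Sketch: 3-DOF body command (vx, vy, wz) -> four wheel speeds,
# assuming a mecanum-style omnidirectional drive (an assumption for illustration).
WHEEL_RADIUS_M = 0.05   # assumed wheel radius
HALF_LENGTH_M = 0.15    # assumed half wheelbase (front-back)
HALF_WIDTH_M = 0.12     # assumed half track width (left-right)

def mecanum_inverse_kinematics(vx: float, vy: float, wz: float):
    """Return wheel angular velocities (rad/s) as [front-left, front-right, rear-left, rear-right]."""
    k = HALF_LENGTH_M + HALF_WIDTH_M
    return [
        (vx - vy - k * wz) / WHEEL_RADIUS_M,  # front-left
        (vx + vy + k * wz) / WHEEL_RADIUS_M,  # front-right
        (vx + vy - k * wz) / WHEEL_RADIUS_M,  # rear-left
        (vx - vy + k * wz) / WHEEL_RADIUS_M,  # rear-right
    ]

# Example: pure sideways motion at 0.3 m/s -> mecanum_inverse_kinematics(0.0, 0.3, 0.0)
```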


Experimental Analyses


Setup. Real-world experiments were conducted in an indoor workspace, as depicted in the testbed figure. This environment was selected for its cluttered layout and complex details, which include numerous obstacles, potential traps, and loops, providing a challenging setting for comprehensive analysis. As shown in this figure, the robot was tasked with exploring the lab space while searching for a designated target: a toy bear approximately 20 cm tall and 10 cm wide, chosen for its distinctive appearance within the test scene. For each task, the source (robot's starting position) and destination (target location) were selected from five predefined regions (SW, C, NW, NE, SE).

Evaluation criteria. Our experimental trials are designed to evaluate the efficiency of exploring an area and the success rate of finding a target with no prior map or knowledge about the target. To quantify these, we use the total distance traveled (trajectory length) before approaching the target as the core metric.
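This metric is straightforward to compute from logged waypoints; a minimal sketch follows. For aggregation across source-target pairs, the trajectory length is normalized by the baseline travel distance Dbaseline (see the results below); how Dbaseline is defined for each pair follows the paper, and it is simply taken as an input here.

```python
# Sketch: trajectory-length metric and its normalization by the baseline distance.
import math

def path_length(waypoints) -> float:
    """Total distance traveled along a sequence of (x, y) waypoints."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))

def normalized_path_length(waypoints, d_baseline: float) -> float:
    """Trajectory length expressed as a multiple of the baseline distance Dbaseline."""
    return path_length(waypoints) / d_baseline

# Example: a trial whose trajectory is 2.66x the baseline contributes 2.66 to the statistics.
```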

Benchmark comparison. We compare the performance of the proposed ClipRover system with six widely used map-traversal and path-finding algorithms. For fair comparisons, a novel 2D simulation framework, RoboSim2D, is developed. It utilizes 2D LiDAR maps generated from recorded scans of the same environment used in real-world experiments. We consider three map traversal algorithms (Random Walk, Wall Bounce, and Wave Front) and three Bug algorithms (Bug0, Bug1, and Bug2) for comparison. Please refer to the paper for their implementation details.
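For intuition, the toy sketch below steps a point robot through a 2D occupancy grid under the two simplest traversal baselines: Random Walk re-samples a fresh heading whenever the robot is blocked, while Wall Bounce turns by a fixed bounce angle. The grid representation, step size, and bounce angle are illustrative assumptions and do not reflect the RoboSim2D implementation.

```python
# Toy 2D traversal sketch (not the RoboSim2D implementation).
import math
import random

def is_free(grid, x: float, y: float) -> bool:
    """True if (x, y) falls inside the map and on a free cell (0 = free, 1 = occupied)."""
    xi, yi = int(round(x)), int(round(y))
    return 0 <= yi < len(grid) and 0 <= xi < len(grid[0]) and grid[yi][xi] == 0

def traverse(grid, start, heading: float, steps: int, mode: str = "random_walk", step: float = 0.5):
    """Step a point robot through the grid and return the resulting trajectory."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        nx, ny = x + step * math.cos(heading), y + step * math.sin(heading)
        if is_free(grid, nx, ny):
            x, y = nx, ny
            path.append((x, y))
        elif mode == "random_walk":
            heading = random.uniform(0.0, 2.0 * math.pi)   # pick a fresh random heading
        else:  # "wall_bounce": rotate by a fixed bounce angle until the way is clear
            heading += math.radians(120.0)
    return path

# Example usage on a tiny map:
# grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
# trajectory = traverse(grid, start=(0.0, 0.0), heading=0.0, steps=50, mode="wall_bounce")
```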

The testbed used for real-world experiments.

Results & Comparison


ClipRover

We conduct extensive real-world experiments with over 60 trials spanning various combinations of source and target locations. For the proposed ClipRover system, three trials were performed for each source-target pair. In comparison, 200 trials were recorded for each random walk experiment, while 180 evenly distributed initial headings were recorded for each source-target combination in the wall bounce runs.

Samples of the trajectories traversed by ClipRover and other algorithms are illustrated below, overlaid on a 2D map of the environment. Under normal navigation mode, ClipRover exhibits a human-like motion strategy, navigating near the center of open spaces, which enhances exploration efficiency and reduces the likelihood of becoming trapped by obstacles. In contrast, path-finding algorithms often follow the contours of obstacles due to their lack of semantic scene understanding. Additionally, our algorithm demonstrates a preference for unexplored (unfamiliar) areas over previously visited (familiar) regions, contributing to improved efficiency compared to traditional map-traversal algorithms.

Out of all 60 trials, 3 failure cases were recorded, yielding a 95% overall success rate for ClipRover. Two of these were caused by the trajectory length exceeding the upper limit, while the third was caused by the robot getting jammed against a reflective metal tank that was detected by neither the VLN pipeline nor the LiDAR proximity switch. To ensure comparability across different source-target pairs, the travel distance in each case is normalized by the baseline travel distance (Dbaseline) and then aggregated into the following table.


Sample trajectories of the benchmark algorithms (Bug0, Bug1, Bug2, Random Walk, Wall Bounce, and Wave Front) overlaid on the environment map.

Quantitative performance comparison of ClipRover and other algorithms.
Method         Success Rate   Mean Path Length (× Dbaseline)   Additional Requirements
ClipRover      95.00%         2.66 ± 1.73                      None
Random Walk    50.00%         10.77 ± 6.26                     None
               80.00%         20.56 ± 14.65
Wall Bounce    50.00%         9.58 ± 5.58                      None
               80.00%         18.71 ± 13.66
Wave Front     50.00%         13.86 ± 5.31                     None
               80.00%         27.30 ± 10.11
Bug0           45.00%         1.14 ± 0.08                      Target location
Bug1           100.00%        6.34 ± 3.98                      Target location, memory of trajectory
Bug2           80.00%         1.90 ± 1.51                      Target location, m-line detection

We observe that ClipRover travels 2.66 × Dbaseline on average before reaching the target, while the next best score, from Wall Bounce, is 9.58 × Dbaseline at a significantly lower success rate (50%). Note that ClipRover achieves these gains despite having the same prior information as the map traversal algorithms. On the other hand, although the Bug* algorithms use additional information such as the target location, ClipRover (95%) still outperforms Bug1 (6.34 × Dbaseline) in trajectory length and offers significantly higher success rates than the Bug0 (45%) and Bug2 (80%) algorithms.

As summarized in the table, despite the inherent disadvantage of operating without prior target knowledge or precise localization, the system achieves performance comparable to path-finding algorithms. Notably, in many scenarios, it surpasses path-finding algorithms in either trajectory efficiency or success rates. These results highlight the effectiveness of the proposed pipeline for simultaneous exploration and target discovery tasks.


Acknowledgments


This work is supported in part by the National Science Foundation (NSF) grants #2330416, #19244; the Office of Naval Research (ONR) grants #N000142312429, #N000142312363; and the University of Florida (UF) ROSF research grant #132763.