Word2Wave: Language Driven Mission Programming for Efficient Subsea Deployments of Marine Robots

Ruo Chen, David Blow, Adnan Abdullah, Md Jahidul Islam
2025 IEEE International Conference on Robotics & Automation (ICRA 2025), Atlanta, USA


Overview

Our team introduces Word2Wave (W2W), a Small Language Model (SLM)-based framework for real-time AUV programming in subsea missions. It provides an interactive Human Machine Interface (HMI) that uses natural human speech patterns to generate subsea mission plans for autonomous remote operations. Existing subsea robotics technologies offer predefined planners that require manual configuration of mission parameters in a complex software interface. Programming complex missions spontaneously is challenging and tedious, even for skilled technicians, especially on an undulating vessel when time is of the essence. Natural language-based interfaces can address these limitations by making mission programming more user-friendly and efficient. The W2W framework includes:

A set of novel language rules and command structures for efficient language-to-mission mapping
A GPT-based prompt engineering module for training data generation
A small language model-based sequence-to-sequence learning pipeline for mission command generation
A novel user interface for 2D mission map visualization and HMI

With these features, the proposed W2W framework takes a step toward our overarching goal of integrating human-machine dialogue for embodied reasoning and hands-free mission programming of marine robots.

W2W Language Design

Language Commands and Rules

Designing mission programming languages for subsea robots involves unique considerations due to the particular requirements of marine robotics applications. In Word2Wave, our primary objective is to enable spontaneous human-machine interaction through natural language. We also want to ensure that the high-level abstractions in the W2W language integrate with existing industrial interfaces and simulation environments for seamless adaptation.

We design the language rules of Word2Wave with seven atomic commands to support a wide range of subsea mission plans, particularly focusing on subsea surveying, mapping, and inspection tasks.

Their intended use cases are as follows:

Start/End (S/E): starts or ends a mission at a given latitude and longitude, expressed with respect to true North and West, respectively.
Move (Mv): moves the AUV a specified distance (meters) at a given bearing (w.r.t. North) and speed (m/s).
Track (Tr): plans a set of parallel lines given a direction orthogonal to the current bearing, the spacing between lines, and an ending distance.
Adjust (Az): generates a waypoint to adjust the AUV's target depth or altitude at its current location.
Circle (Cr): creates a waypoint commanding the AUV to circle around its position for a number of turns at a given radius in a clockwise or counterclockwise direction.
Spiral (Sp): starts at a central point and expands outwards over a series of turns to a specific radius, in a clockwise or counterclockwise direction.
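
To make the command set concrete, the following is a minimal sketch of how these atomic commands might be held in memory once generated. It is not the W2W implementation: the token abbreviations come from the list above, but splitting S/E into two tokens, every parameter name, and the well-formedness rule are assumptions made for illustration.

```python
from dataclasses import dataclass

# Token abbreviations follow the command list above. Splitting S/E into two
# tokens and all parameter names are assumptions made for this sketch.
COMMAND_PARAMS = {
    "S": ("lat_deg", "lon_deg"),
    "E": ("lat_deg", "lon_deg"),
    "Mv": ("bearing_deg", "distance_m", "speed_mps"),
    "Tr": ("direction", "spacing_m", "end_distance_m"),
    "Az": ("mode", "value_m"),  # mode: "depth" or "altitude"
    "Cr": ("turns", "radius_m", "direction"),
    "Sp": ("turns", "radius_m", "direction"),
}


@dataclass
class Command:
    token: str
    params: dict

    def __post_init__(self):
        # Reject tokens outside the command set and commands lacking parameters.
        if self.token not in COMMAND_PARAMS:
            raise ValueError(f"unknown command token: {self.token!r}")
        missing = [p for p in COMMAND_PARAMS[self.token] if p not in self.params]
        if missing:
            raise ValueError(f"{self.token} is missing parameters: {missing}")


def is_well_formed(mission):
    """Assumed rule: a mission opens with Start and closes with End."""
    return (
        len(mission) >= 2
        and mission[0].token == "S"
        and mission[-1].token == "E"
    )
```

A check of this kind could run before a generated mission is drawn on the map, so that malformed output is caught before the operator confirms it.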

W2W can generate arbitrary mission patterns of varying complexity using these language commands. The only limitations are the maneuvering capabilities of the host AUV. Mission parameters for each movement command are chosen based on commonly used terms to ensure compatibility; we particularly explore the four most widely used mission patterns. These atomic patterns can be further combined to plan multi-phase composite missions for a given scenario.

Lawnmower (also known as boustrophedon) patterns are ideal for surveying large areas over subsea structures. They are not suited for missions requiring intricate movements or targeted actions. These patterns are most commonly used for sonar-based mapping, as they offer even coverage with a simple and efficient route over a large area.
Polygonal routes are best suited for irregular terrain as they provide more precise control over the AUV trajectory. They are more flexible, as polygons can be tailored to specific mission requirements and adapted to local terrain. Hence, they are used for missions that involve inspecting known landmarks or targeted waypoints.
Ripple patterns are defined as a series of concentric circles with varying radii, depths, or both. They are better suited when coverage is needed over a specific area and equal spacing between samples is important. They are best suited for missions where sampling at varying depths of the water column is required.
Spiral paths allow for more concentrated coverage over a specific area. They operate similarly to the ripple pattern but follow a smooth, continuous trajectory that either radiates outwards or converges to a specific point. These are commonly used for search missions requiring high-resolution coverage of a specific point.
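
As an illustration of the geometry behind two of these patterns, the sketch below generates lawnmower and spiral waypoints in a local metric frame (x east, y north, origin at the pattern start). The function names, the parameterization, and the choice of an Archimedean spiral are assumptions for this example, not the waypoint generator used on the AUV.

```python
import math


def lawnmower(leg_length_m, spacing_m, width_m):
    """Parallel legs of equal length, traversed in alternating directions."""
    n_legs = int(width_m // spacing_m) + 1
    points = []
    for i in range(n_legs):
        y = i * spacing_m
        # Even legs run out, odd legs run back, giving the boustrophedon path.
        x_start, x_end = (0.0, leg_length_m) if i % 2 == 0 else (leg_length_m, 0.0)
        points.append((x_start, y))
        points.append((x_end, y))
    return points


def spiral(radius_m, turns, clockwise=True, points_per_turn=36):
    """Archimedean spiral growing linearly from the center to radius_m."""
    n = int(turns * points_per_turn)
    if n < 1:
        raise ValueError("the spiral needs at least one segment")
    sign = -1.0 if clockwise else 1.0
    points = []
    for k in range(n + 1):
        theta = sign * 2.0 * math.pi * k / points_per_turn
        r = radius_m * k / n
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points
```

For example, lawnmower(100.0, 14.0, 100.0) yields eight legs spaced 14 m apart, and a ripple pattern could be approximated by sampling several circles of increasing radius in place of the continuous spiral.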

UI for Language to Mission Mapping

Real-time visual feedback is essential for interactive verification of language-to-mission translation. To achieve this, we develop a 2D mission visualizer in W2W that renders the human language commands as mission paths on a map for confirmation. Specifically, we generate a Leaflet map-based UI from the given GPS coordinates; we further integrate options for subsequent interactions on the map, such as icons, zoom level, and map movement or corrections. All individual tokens can be visualized on separate layers for further mission adaptations by the user. Upon confirmation, this mission map can be loaded onto the AUV for the subsea deployment.
The W2W mission text used to generate this map:
“Start at 38.7969° N, 75.1538° W, Circle for a turn at a radius of 10m in a clockwise direction at an altitude of 1m. Move south 30 m and then Move south 10m. Move south for 100m and then Track left for 100m at a spacing of 14m. End at 38.7968° N, 75.1535° W”.
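
To show how a Move command could be placed on a map, the sketch below offsets a GPS position by a distance and bearing using a flat-earth (equirectangular) approximation, which is adequate for legs of a few hundred meters. This is an assumed projection for illustration; the function name and the Earth radius constant are not from W2W.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, assumed for this sketch


def move(lat_deg, lon_deg, bearing_deg, distance_m):
    """Offset a position by distance_m along bearing_deg (clockwise from North)."""
    bearing = math.radians(bearing_deg)
    dlat = distance_m * math.cos(bearing) / EARTH_RADIUS_M
    dlon = distance_m * math.sin(bearing) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat_deg))
    )
    return lat_deg + math.degrees(dlat), lon_deg + math.degrees(dlon)


# The three southward Move commands from the mission text above.
position = (38.7969, -75.1538)  # west longitude written as a negative value
waypoints = [position]
for distance_m in (30.0, 10.0, 100.0):
    position = move(*position, bearing_deg=180.0, distance_m=distance_m)
    waypoints.append(position)
```

The resulting list of latitude and longitude pairs is the kind of input a Leaflet polyline layer accepts, so each command could be drawn as its own segment.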

Experimental Analysis

Language Model Evaluation

We use a randomized split with 80% of the samples for training and 5% for validation; the remaining 15% are used for testing. During training, each line in the dataset is passed through a tokenizer to extract embedded information. The training is conducted over 60 epochs on a machine with 32GB of RAM and a single Nvidia RTX 3060 Ti GPU with 8GB of memory. The training is halted when the validation loss consistently remains below a threshold of 0.2.
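
The split and the stopping rule described above can be sketched as follows. The 80/5/15 proportions and the 0.2 threshold are from the text; the fixed seed, the function names, and reading "consistently" as a patience of three consecutive epochs are assumptions.

```python
import random


def split_dataset(samples, train_frac=0.80, val_frac=0.05, seed=0):
    """Shuffle once, then cut into train, validation, and test partitions."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains, about 15%
    return train, val, test


def should_stop(val_losses, threshold=0.2, patience=3):
    """True once the last `patience` validation losses are all below threshold."""
    recent = val_losses[-patience:]
    return len(recent) == patience and all(loss < threshold for loss in recent)
```

Calling should_stop after each epoch with the running list of validation losses gives a simple halting check; a single epoch above the threshold resets the count.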

We adapt a sequence-to-sequence learning pipeline based on the T5-Small model. For performance baseline and ablation experiments, we investigate two other SOTA models of the same kind: BART-Large and MarianMT. We compare their performance based on model accuracy, robustness, and computational efficiency. To perform this analysis in an identical setup, we train and validate each model on the same dataset until the validation loss reaches a plateau. For accuracy, we use two widely used metrics: BLEU (bilingual evaluation understudy) and METEOR (metric for evaluation of translation with explicit ordering). The BLEU metric offers a language-independent measure of how close a predicted sentence is to its ground truth. METEOR additionally considers the order of words during evaluation. It evaluates machine translation output based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
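
The recall-weighted harmonic mean at the core of METEOR can be illustrated with a few lines of code. This sketch uses exact whitespace-token matching and the 9:1 recall weighting of the original METEOR formulation; it omits stemming, synonym matching, and the word-order (fragmentation) penalty, so it is not a replacement for a full METEOR implementation. The function names are illustrative.

```python
from collections import Counter


def unigram_precision_recall(candidate, reference):
    """Exact-match unigram precision and recall on whitespace tokens."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    # Multiset intersection counts each token at most as often as it occurs
    # in both sentences.
    matches = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = matches / len(cand_tokens) if cand_tokens else 0.0
    recall = matches / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall


def meteor_fmean(precision, recall):
    """Harmonic mean with recall weighted nine times as much as precision."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 10.0 * precision * recall / (recall + 9.0 * precision)
```

Because of the weighting, a prediction that drops a reference token scores lower than one that adds an extra token: meteor_fmean(1.0, 0.75) is smaller than meteor_fmean(0.75, 1.0).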

The MarianMT and T5-Small models offer more accurate and consistent scores when trained on a targeted dataset compared to BART-Large. We hypothesize that LLMs like BART-Large need more comprehensive datasets and are suited for general-purpose learning. On the other hand, T5-Small has marginally lower BLEU scores than MarianMT, while it offers better METEOR values at a significantly faster inference rate. T5-Small has only about 60M parameters, offering 2x faster runtime while ensuring accuracy and robustness comparable to MarianMT. Further inspection reveals that MarianMT often randomizes the order of generated mission commands. T5-Small does not suffer from this issue, demonstrating a better balance between robustness and efficiency.

For a qualitative analysis, we evaluate the outputs of T5-Small for W2W language commands based on the number of inaccurately generated tokens across the whole test set of 200 samples. We categorize these into missed tokens (failed to generate), erroneous tokens (incorrectly generated), and hallucinated tokens (extraneously generated). Across all error types, we found that Adjust commands are hallucinated at a disproportionately high rate. We hypothesize that this is due to a bias learned by the model, causing it to associate any change in depth or altitude with an Adjust command. Besides, we observed relatively high missed token counts for Move and Circle commands. Track is another challenging command that suffers from high error rates in token generation. Nevertheless, 89% of tokens are accurately parsed from unseen examples, which we found to be enough for real-time mission programming by human participants.
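
One possible way to tally the three error categories is sketched below. It is an assumed bookkeeping scheme, not the procedure used in the study: it works at the level of whole commands rather than individual tokens, ignores command order, and represents each command as a hypothetical (name, params) tuple.

```python
from collections import Counter


def categorize_errors(reference, predicted):
    """Count missed, hallucinated, and erroneous commands per command name.

    Both arguments are lists of (name, params) tuples with hashable params.
    """
    ref_names = Counter(name for name, _ in reference)
    pred_names = Counter(name for name, _ in predicted)

    # Commands reproduced exactly, name and parameters alike.
    exact = Counter(reference) & Counter(predicted)
    exact_by_name = Counter()
    for (name, _), count in exact.items():
        exact_by_name[name] += count

    report = {}
    for name in set(ref_names) | set(pred_names):
        paired = min(ref_names[name], pred_names[name])
        report[name] = {
            "missed": max(ref_names[name] - pred_names[name], 0),
            "hallucinated": max(pred_names[name] - ref_names[name], 0),
            # Present on both sides but with differing parameters.
            "erroneous": paired - exact_by_name[name],
        }
    return report
```

Summing these per-command counts over a test set would give a breakdown by command type similar in spirit to the one discussed above.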

User Interaction Study

We assess the usability benefits of W2W compared to the NemoSens AUV interface, which we consider as the baseline. A total of 15 individuals between the ages of 18 and 36 participated in our study; three of them were familiar with subsea mission programming and deployments, whereas the other 12 had no prior experience. Participants were first introduced to both the W2W UI and the NemoSens AUV interface for subsea mission programming. Then, they were asked to program three separate missions on these interfaces. As a quantitative measure, the total time taken to program each mission was recorded; participants also completed a user survey to evaluate the degree of user satisfaction and ease of use of the two interfaces.
The participants rated W2W's system usability at 76.25, more than twice the SUS score of the baseline AUV programming interface. Participants generally found the baseline interface more complex and repeatedly asked for assistance while using it. They scored various features of W2W several standard deviations higher than those of the baseline. They also needed less than 10% of the time to program missions on W2W, validating that it is more user-friendly and easier to use than the baseline.
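
For readers unfamiliar with the System Usability Scale (SUS), the standard scoring rule is sketched below: ten items rated from 1 to 5, where odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is scaled by 2.5 to a 0-100 range. The function name and input format are illustrative; the survey responses themselves are not reproduced here.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten ratings on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS uses exactly ten items")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each rating must lie between 1 and 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - rating
    return total * 2.5
```

A per-interface score such as the one reported above is then the mean of the individual participants' scores.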

Subsea Deployment Scenarios

We use a NemoSens AUV for subsea mission deployments. It is a torpedo-shaped single-thruster AUV equipped with a Wayfinder DVL, a down-facing HD camera, and a 450 kHz side-scan sonar. As mentioned, the integrated software interface allows us to program various missions and generate waypoints for the AUV to execute in real-time. Once programmed, these waypoints are loaded onto the AUV for deployment near the starting location.

Within this workflow, the W2W interface serves as an HMI bridge that translates human language into the mission map; the rest of the experimental setup remains the same. The mission vocabulary and language rules adopted in W2W correspond to valid subsea missions programmed on an actual robot platform. We demonstrate this with several examples from our subsea deployments in the GoM (Gulf of Mexico) and the Delaware Bay, Atlantic Ocean. Targeted inspection or mapping missions can be programmed with only a few basic language commands and then validated on our 2D interface. In particular, the Track and Move commands are powerful tools that make it fairly easy to program lawnmower and polygonal patterns. Besides, Spiral and Circle commands are suited for more localized mission patterns. These intuitions are consistent with our user study evaluations as well.

Acknowledgments

This work is supported in part by the National Science Foundation (NSF) grants #2330416 and #2326159. We are thankful to Dr. Arthur Trembanis, Dr. Herbert Tanner, and Dr. Kleio Baxevani at the University of Delaware for facilitating our field trials at the 2024 Autonomous Systems Bootcamp. We also acknowledge the SUS study participants for helping us conduct the UI evaluation.