Underwater caves are challenging environments that are crucial for water resource management, and for our understanding of hydro-geology and history.
Mapping underwater caves is a time-consuming, labor-intensive, and hazardous operation.
For autonomous cave mapping by underwater robots, the major challenge lies in vision-based estimation in the complete absence of ambient light, which results in constantly moving shadows due to the motion of the camera-light setup.
Thus, detecting and following the caveline as navigation guidance is paramount for robots in autonomous cave mapping missions.
The figures above show data collected by our team driving the BlueROV2 along the caveline when it was deployed 300 feet inside an underwater cave system in Orange Grove, Florida, in April.
In this work, we develop a weakly supervised Vision Transformer (ViT)-based learning pipeline for autonomous caveline detection by AUVs.
We present a weakly supervised learning pipeline for rapid model adaptation to data from new locations or cave systems.
We also propose a ViT-based novel model (CL-ViT) for robust and efficient caveline detection, and further develop a post-processing algorithm to filter noisy predictions for smooth map generation.
We validate the utility and effectiveness of such weak supervision for caveline detection and tracking in three different cave locations: USA, Mexico, and Spain.
Experimental results demonstrate that our proposed model, CL-ViT, balances the robustness-efficiency trade-off, ensuring good generalization performance while offering 10+ FPS on single-board (Jetson TX2) devices.
We formulate the problem of caveline detection in the RGB space as a binary image segmentation task, i.e., identifying pixels with caveline as a semantic map.
In our task, background pixels and caveline pixels are assigned the labels 0 and 1, respectively.
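The labeling convention above can be illustrated with a minimal sketch. The 0.5 threshold, the helper names, and the synthetic probability map are assumptions for illustration; the text does not state how model outputs are binarized.

```python
import numpy as np

def to_binary_mask(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Map per-pixel caveline probabilities to labels: 0 = background, 1 = caveline."""
    return (prob_map >= threshold).astype(np.uint8)

def caveline_fraction(mask: np.ndarray) -> float:
    """Fraction of pixels labeled 1 (caveline) in a binary mask."""
    return float(mask.mean())

# Synthetic 540x960 probability map with a 3-pixel-wide horizontal "caveline".
H, W = 540, 960
prob = np.full((H, W), 0.1, dtype=np.float32)
prob[268:271, :] = 0.9

mask = to_binary_mask(prob)
frac = caveline_fraction(mask)  # 3 / 540, i.e. well under 1% of all pixels
```

Even a line spanning the full image width covers well under 1% of the pixels in this toy case, which is why thin cavelines make the two classes so unbalanced.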
For data-driven training and evaluation, we extract video frames from our cave exploration experiments conducted in three different locations: the Devil's system in Florida, USA; the Dos Ojos Cenote, QR, Mexico; and the Cueva del Agua in Murcia, Spain.
We grouped the caveline frames from these locations into three datasets, which we term the Florida, Mexico, and Spain datasets, respectively.
We found that the three cave locations exhibit different caveline characteristics in terms of thickness, color, and background patterns.
The cavelines in Florida are thin and off-white colored, whereas the Mexico caves are the most decorated with yellow colored lines.
Cavelines in Spain are also thick and of orange color.
In general, the main cavelines in popular locations are thicker, while those off the main path are thinner; the grey/white colored lines take a darker color over time and often blend with the background patterns.
We identify this variety of challenging cases and prepare 1050 images in each set, totaling 3×1050=3150 instances.
We focused on maximizing variance in the data by including varieties in caveline color, distance, background/waterbody patterns as well as different cave formations (e.g., stalactites, stalagmites, columns) and navigational aids such as arrows and cookies.
Four human participants sorted these image samples and then pixel-annotated the cavelines for ground truth generation, which we use for the training and evaluation of all models.
CL-ViT Learning Pipeline
We develop a lightweight caveline detection model CL-ViT for use by visually guided AUVs in underwater cave mapping and exploration tasks.
To this end, we focus on enabling two important features: (i) robustness to noisy low-resolution inputs because cavelines are only a few pixels wide even in a high-resolution camera feed;
and (ii) efficient inference on single-board embedded platforms.
We aim to achieve this in the CL-ViT model by integrating multi-scale local hierarchical features and global spatial information for efficient pixel-wise segmentation of cavelines.
This figure illustrates the network architecture of CL-ViT, which consists of two major learning components: an efficient encoder-decoder backbone and a ViT-based refinement module.
Choice of Backbones. We incorporate two options for the deep hierarchical feature extraction in CL-ViT: (i) a light model with MobileNetV3 backbone for on-board AUV processing; and (ii) a base model with EfficientNetB5 backbone for offline use, e.g., when human operators on the surface control ROVs inside a cave.
The MobileNetV3 is a lightweight CNN-based model designed for resource-constrained platforms.
The encoder contains a series of fully convolutional layers with 16 filters followed by 15 residual bottleneck layers.
We then use a mirrored decoder with six convolutional blocks to map the encoded features into 48 filters of 480×270 resolution (with an input of 960×540×3).
On the other hand, EfficientNet uses a technique called compound coefficient to scale up models in a simple but effective manner. It uniformly scales features in width, depth, and resolution to ensure effective receptive fields for feature extraction. We use EfficientNetB5 which extracts 128 filters of 480×270 resolution from 960×540×3 inputs.
ViT-based Refinement Module. Following feature extraction, we design a ViT-based refinement module to transform and embed the contextual features into an efficient prediction head.
Our idea is to allow each feature position to have consistent receptive fields so that the global spatial information is accurately embedded.
The N=48/128 filters extracted by the backbone are 16×16 convolved and flattened to patch embeddings, which are concatenated with learnable position embeddings.
These embeddings are then propagated to the transformer encoder, which applies a four-layer multi-head attention mapping.
Subsequently, the normalized and 3×3 convolved feature maps are projected (dot product operation) with N output embeddings; the remaining embeddings in the MLP head are dropped.
The selected attention maps then generate the binary caveline segmentation map after a final convolution and upsampling operation at 960×540 resolution.
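To make the tensor shapes in the refinement module concrete, the following NumPy sketch turns a backbone feature map into patch tokens and applies one self-attention step. It is a simplified illustration and not the CL-ViT implementation. Several details are assumptions: the 16×16 convolution is modeled as a stride-16 linear projection of non-overlapping patches, partial patches at the border are cropped, position embeddings are added as in the standard ViT (the text says concatenated), the embedding width `D = 64` is arbitrary, and a single attention head stands in for the four-layer multi-head encoder.

```python
import numpy as np

def patchify(features: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (C, H, W) feature map into flattened non-overlapping patches.

    Rows/columns that do not fill a whole patch are cropped (an assumption).
    Returns an array of shape (num_patches, C * patch * patch).
    """
    c, h, w = features.shape
    gh, gw = h // patch, w // patch
    x = features[:, :gh * patch, :gw * patch]
    x = x.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    return x.reshape(gh * gw, c * patch * patch)

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax_rows(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
C, H, W, D = 48, 270, 480, 64  # light-backbone feature map; D is illustrative
features = rng.standard_normal((C, H, W), dtype=np.float32)

patches = patchify(features)  # (16 * 30, 48 * 16 * 16) = (480, 12288)
w_embed = rng.standard_normal((patches.shape[1], D), dtype=np.float32) * 0.01
pos_embed = rng.standard_normal((patches.shape[0], D), dtype=np.float32) * 0.01
tokens = patches @ w_embed + pos_embed  # patch embeddings plus positions

w_q, w_k, w_v = (rng.standard_normal((D, D), dtype=np.float32) * 0.1
                 for _ in range(3))
attended, weights = self_attention(tokens, w_q, w_k, w_v)
```

Under these assumptions a 480×270 feature map yields 30×16 = 480 tokens, and every token attends to every other token, which is how global spatial context reaches each feature position.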
Weakly-Supervised Iterative Training. Pixel-annotated training data is very scarce for unique problems such as caveline detection in underwater caves.
As discussed earlier, caveline characteristics and background waterbody patterns in each cave location differ greatly, making it difficult to compile a comprehensive dataset for supervised training.
We address this limitation by a weakly supervised formulation, where model accuracy is improved incrementally on new locations' data.
This speeds up model adaptations during robotics field deployments to a new location by eliminating the need for labeling entire datasets for supervised training.
As illustrated in this figure, we validate this hypothesis on each dataset (i.e., Florida, Spain, and Mexico data) based on the leave-one-out mechanism.
For each of the three cases shown in this figure, the weak supervision is carried out as follows.
The initial model is evaluated on the full test set (of 1050 samples), from which a human expert sorts out the good quality predictions to reinforce the learning in the next phase.
The human expert also selects a set of challenging samples where the model failed, then annotates them and adds them to the training set in order to balance the distribution of positive (accurate) and negative (erroneous) samples.
This process is repeated several times until a satisfactory number of training samples is compiled.
Our experiments reveal that we get 15%-20% good quality predictions in the first phase and another 34%-47% in the second phase. All 1050 images get labelled within 3 phases, where human experts relabelled only 200-250 images as negative samples. Thus the remaining labels are weak labels generated by intermediate sub-optimal models for subsequent weak supervision.
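The bookkeeping behind this iterative procedure can be sketched as follows. This is a toy illustration, not the authors' training code: `leave_one_out`, `weak_supervision`, the `is_good_prediction` callback (standing in for the human expert's review), and the per-phase cap on manually annotated samples are all hypothetical names and parameters, and model retraining between phases is only indicated by a comment. The leave-one-out helper assumes the initial model is trained on the two locations that are not held out.

```python
from typing import Callable, Dict, List, Set

def leave_one_out(locations: List[str]) -> Dict[str, List[str]]:
    """For each held-out location, list the locations used for initial training."""
    return {held: [loc for loc in locations if loc != held] for held in locations}

def weak_supervision(
    sample_ids: List[int],
    is_good_prediction: Callable[[int, int], bool],
    max_manual_per_phase: int,
    max_phases: int = 3,
) -> Dict[str, Set[int]]:
    """Track which samples get weak labels and which are annotated by hand."""
    weak: Set[int] = set()    # accepted model predictions, reused as labels
    manual: Set[int] = set()  # failure cases annotated by the human expert
    for phase in range(1, max_phases + 1):
        remaining = [s for s in sample_ids if s not in weak and s not in manual]
        if not remaining:
            break
        good = {s for s in remaining if is_good_prediction(s, phase)}
        failed = [s for s in remaining if s not in good]
        weak |= good
        manual |= set(failed[:max_manual_per_phase])
        # A real pipeline would retrain the model on weak + manual labels here.
    return {"weak": weak, "manual": manual}

def weak_label_fraction(total: int, manual: int) -> float:
    """Share of labels that come from model predictions rather than annotation."""
    return (total - manual) / total

splits = leave_one_out(["Florida", "Mexico", "Spain"])

# Toy reviewer: accepts more predictions in each successive phase.
accept_below = {1: 4, 2: 12, 3: 20}
result = weak_supervision(
    sample_ids=list(range(20)),
    is_good_prediction=lambda s, phase: s < accept_below[phase],
    max_manual_per_phase=2,
)
low = weak_label_fraction(1050, 250)   # about 0.76
high = weak_label_fraction(1050, 200)  # about 0.81
```

Applying `weak_label_fraction` to the reported numbers (200-250 manually relabelled images out of 1050) suggests that roughly 76%-81% of the final labels are weak labels.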
Post-processing. In the post-processing step, we smooth the raw CL-ViT output of binary pixel predictions into continuous line segments.
We achieve this by first interpolating a sequence of connected straight line segments by modeling a probabilistic Hough transform.
Then we apply a voting procedure for non-maxima suppression and generate the most dominant line.
To ensure robustness of this suppression mechanism for all types of noisy and incomplete predictions, we empirically tuned the hyper-parameters, e.g., the distance metric, acute angle threshold for merging pair-wise lines, and the number of iterations.
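The voting and merging idea can be sketched in plain Python. This is an illustrative stand-in, not the authors' post-processing algorithm: it assumes candidate segments have already been extracted from the binary mask (for example with a probabilistic Hough transform such as OpenCV's `cv2.HoughLinesP`), and the angle and distance thresholds are hypothetical defaults rather than the tuned values mentioned above.

```python
import math
from typing import List, Optional, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def _length(s: Segment) -> float:
    return math.hypot(s[2] - s[0], s[3] - s[1])

def _angle(s: Segment) -> float:
    """Undirected orientation of a segment in [0, pi)."""
    return math.atan2(s[3] - s[1], s[2] - s[0]) % math.pi

def _acute_diff(a: float, b: float) -> float:
    d = abs(a - b)
    return min(d, math.pi - d)

def _dist_to_line(px: float, py: float, ref: Segment) -> float:
    """Perpendicular distance from a point to the infinite line through ref."""
    x1, y1, x2, y2 = ref
    return abs((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)) / _length(ref)

def dominant_line(segments: List[Segment], max_angle_deg: float = 10.0,
                  max_dist: float = 8.0) -> Optional[Segment]:
    """Pick the reference segment with the largest length-weighted support,
    then merge its supporters into one segment along the reference direction."""
    best_vote, best_ref, best_support = -1.0, None, []
    for ref in segments:
        if _length(ref) == 0:
            continue
        support = []
        for s in segments:
            mx, my = (s[0] + s[2]) / 2, (s[1] + s[3]) / 2
            close_angle = _acute_diff(_angle(ref), _angle(s)) <= math.radians(max_angle_deg)
            if close_angle and _dist_to_line(mx, my, ref) <= max_dist:
                support.append(s)
        vote = sum(_length(s) for s in support)  # longer segments count more
        if vote > best_vote:
            best_vote, best_ref, best_support = vote, ref, support
    if best_ref is None:
        return None
    x1, y1, x2, y2 = best_ref
    n = _length(best_ref)
    ux, uy = (x2 - x1) / n, (y2 - y1) / n
    # Project all supporting endpoints onto the reference direction.
    ts = [(px - x1) * ux + (py - y1) * uy
          for s in best_support for px, py in ((s[0], s[1]), (s[2], s[3]))]
    t0, t1 = min(ts), max(ts)
    return (x1 + t0 * ux, y1 + t0 * uy, x1 + t1 * ux, y1 + t1 * uy)

segments = [(0, 100, 200, 100), (220, 102, 400, 101), (50, 300, 60, 350)]
merged = dominant_line(segments)
```

In this example the two nearly collinear horizontal segments outvote the short, steep outlier and are merged into a single segment spanning both, which is the kind of gap-bridging the smoothing step is meant to provide.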
Caveline Detection Performance
We conduct a thorough performance evaluation of CL-ViT and other SOTA models based on all cross-location test images.
This figure shows a few qualitative comparisons, which show consistent results from CL-ViT, DeepLabv3+, DPT, and PAN.
Our CL-ViT (EfficientNetB5) model achieves the most fine-grained caveline detection performance with the thinnest continuous line segments.
While not as fine-grained, our light CL-ViT (MobileNetV3) model localizes the cavelines reasonably as well, which can be further refined by post-processing.
We demonstrate the utility and effectiveness of our weakly supervised learning pipeline for the three cases depicted in the iterative training.
The following table shows the caveline detection performance by CL-ViT and other SOTA models, which reveals that all models exhibit incremental improvements over learning phases 1, 2, and 3.
This validates our intuition that robust generalization performance by standard deep visual learning models can be achieved with very few labeled data from a new location for fast model adaptation.
Furthermore, CL-ViT (EfficientNetB5) shows superior performance on both the IoU and F1 score metrics.
Although CL-ViT (MobileNetV3) does not surpass the SOTA performance, with a significantly lighter model architecture, it offers 51.52%-89.74% memory efficiency and 7-43 times faster inference rates.
It runs at 10+ FPS rates on Nvidia™ Jetson TX2 devices, which makes it feasible for single-board deployments in AUVs' autonomy pipeline.
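For reference, the two metrics can be computed from a predicted and a ground-truth binary mask as in the sketch below. It uses the standard foreground-class definitions; the text does not specify how scores are aggregated (per image, per class, or over the whole test set), so the reported numbers may follow a different averaging scheme. The function name and the small `eps` guard are illustrative choices.

```python
import numpy as np

def iou_and_f1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9):
    """IoU and F1 score of the caveline (label 1) class for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # caveline pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as caveline
    fn = np.logical_and(~pred, gt).sum()   # caveline pixels that were missed
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return float(iou), float(f1)

# Toy masks: a 2-pixel-wide ground-truth line and a prediction shifted by one row.
gt = np.zeros((10, 10), dtype=np.uint8)
gt[4:6, :] = 1
pred = np.zeros((10, 10), dtype=np.uint8)
pred[5:7, :] = 1

iou, f1 = iou_and_f1(pred, gt)  # IoU = 1/3, F1 = 1/2
```

A one-row offset on a two-pixel-wide line already cuts the IoU to one third, which shows how sensitive these metrics are for thin structures such as cavelines.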
The following figure illustrates the effectiveness of our weakly supervised caveline detection pipeline. Output maps a/d, b/e, and c/f are the test results for CL-ViT model with MobileNetV3/EfficientNetB5 backbone after the first, second, and third phase of training, respectively.
As seen, the visual prediction results and metric scores gradually get better after each learning phase. IoU scores improved from 0.36/0.38 to 0.55/0.72, while F1 scores improved from 0.62/0.72 to 0.92/0.99 for the CL-ViT light/base model, respectively.
The generated maps become increasingly fine-grained as well; the final output maps can be further post-processed for well-localized detection of cavelines.
The final post-processed predictions are also shown on the right, overlaid on the input image.
As discussed earlier, the very small share of positive (caveline) pixels relative to negative (background) pixels causes a class imbalance problem in caveline detection learning.
While we eliminate this by using a relatively high input resolution of 960×540, other scenarios remain challenging for CL-ViT (and other models).
We identify a subset of such challenging cases and compile a CL-Challenge test set with 200 samples.
It includes images with severe optical distortions, lack of contrast, over-saturation, shadows, low-light conditions, occlusion, and other issues that make it extremely challenging to locate/segment the caveline, even for a human observer.
This figure shows the post-processed CL-ViT output for a few challenging cases from the CL-Challenge test set;
notice the (a) lack of contrast/color difference between the caveline and background;
(b) presence of arrows/cookies;
(c) caveline shadow appears similar to another line;
(d) caveline is outside the illuminated area;
(e) scattering and optical distortions around the caveline.
As shown in this figure, CL-ViT models are still able to localize the caveline for the most part.
Despite some noisy predictions, these inspiring results indicate that caveline detection by CL-ViT can facilitate safe AUV navigation inside underwater caves.
Currently, we are also investigating the automatic labeling of different cave formations (e.g., speleothems: stalactites, stalagmites, columns) together with navigational aids such as arrows and cookies.
Our immediate next step is to develop an autonomous caveline-following system that exploits CL-ViT's caveline predictions in tandem with a VIO system for visual servoing. Such autonomous operations inside caves will potentially lead to high-definition photorealistic map generation and more accurate volumetric models.
This research has been supported in part by the National Science Foundation under grants 1943205 and 2024741.
The authors would also like to acknowledge the help of the Woodville Karst Plain Project (WKPP), El Centro Investigador del Sistema Acuífero de Quintana Roo A.C. (CINDAQ),
and Ricardo Constantino, Project Baseline in collecting data, providing access to challenging underwater caves, and mentoring us in underwater cave exploration.