UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis

Md Abu Bakr Siddique, Junliang Liu, Piyush Singh, Md Jahidul Islam

Overview


UStyle is the first data-driven framework for underwater image style transfer, specifically designed to transform waterbody styles while preserving the scene's structural integrity. Traditional neural style transfer methods, optimized for terrestrial imagery, struggle with underwater distortions like light absorption and scattering, often resulting in unnatural adaptations. To overcome these challenges, UStyle employs a ResNet-based encoder-decoder with hierarchical skip connections and a progressive multi-stage training strategy paired with domain-specific loss functions. A novel depth-guided whitening and coloring transform (DA-WCT) mechanism synthesizes the waterbody style from a reference image without transferring unwanted scene details. Complementing the model is the UF7D dataset—a collection of 4,050 high-resolution underwater images with paired depth annotations across seven waterbody styles. Extensive evaluations demonstrate that UStyle outperforms existing approaches, making it a promising tool for data augmentation, robotic vision, and marine imaging applications.

UStyle Overview
The figure above shows visualizations of the underwater image style transfer (STx) capabilities of UStyle. The content image (top left) is transformed into different waterbody styles. For any given style image, its waterbody characteristics are transferred to the content scene while the object geometry and structural details of the content image remain preserved. Such no-reference waterbody fusion is useful as a geometry-preserving data augmentation tool in underwater imaging, photometry, and robotics applications.


UStyle Training and Inference Pipeline



UStyle Model Architecture

UStyle is a novel waterbody style transfer framework that integrates a deep encoder-decoder architecture, progressive blockwise training, and a depth-aware stylization module. Given a content image and a style image, UStyle generates a stylized output by fusing waterbody characteristics from the style image while preserving the scene geometry and depth cues of the content image. The depth maps guide the extraction of waterbody details and ensure that structural consistency is maintained throughout the transfer. At its core, UStyle employs a ResNet50 backbone that extracts multi-scale feature representations through sequential encoder blocks and symmetric skip connections. These connections preserve high-frequency details and spatial structures, allowing for effective fusion of content and style at multiple resolutions. As shown above, the framework is trained progressively in a blockwise manner. Initially, the encoder-decoder network is pre-trained on a large-scale atmospheric dataset using reconstruction, feature, and perceptual losses. It is subsequently fine-tuned on an underwater dataset with additional domain-specific losses—such as SSIM, color consistency, FFT, and CLIP-based semantic alignment—to ensure robust feature learning and stable coarse-to-fine stylization.
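To make the training objective concrete, below is a minimal PyTorch-style sketch of the fine-tuning loss described above. The weighting factors and the simplified color-consistency and FFT terms are illustrative assumptions, not the authors' exact formulation; the SSIM and CLIP-based terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def color_consistency(x, y):
    # Match per-channel means and standard deviations; a simple
    # stand-in for the color-consistency term named in the text.
    return (F.l1_loss(x.mean(dim=(2, 3)), y.mean(dim=(2, 3))) +
            F.l1_loss(x.std(dim=(2, 3)), y.std(dim=(2, 3))))

def fft_loss(x, y):
    # Compare log-magnitude spectra for frequency-domain fidelity.
    fx = torch.fft.rfft2(x).abs().clamp(min=1e-8).log()
    fy = torch.fft.rfft2(y).abs().clamp(min=1e-8).log()
    return F.l1_loss(fx, fy)

def fine_tune_loss(recon, target, feats_pred, feats_gt,
                   lambda_color=0.1, lambda_fft=0.1):
    loss = F.l1_loss(recon, target)  # pixel-level reconstruction
    # Feature (perceptual) loss over multi-scale encoder activations
    loss += sum(F.mse_loss(p, g) for p, g in zip(feats_pred, feats_gt))
    loss += lambda_color * color_consistency(recon, target)
    loss += lambda_fft * fft_loss(recon, target)
    return loss
```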


The depth-aware stylization module, known as DA-WCT, extends traditional whitening and coloring transforms by incorporating physics-guided waterbody estimation. By computing a depth weight map via a sigmoid function, the module adaptively blends the stylized features with the original content features. This approach favors stylization in regions corresponding to farther objects while preserving finer details in closer regions. Multi-scale fusion and a subsequent guided filter further refine the output for enhanced visual consistency. During inference (see the figure below), content and style images, along with their depth maps, are processed through a ResNet-based encoder to extract multi-scale features. These features are aligned by the DA-WCT module and then progressively refined by a multi-stage decoder. Finally, a guided filter is applied to yield a high-fidelity stylized output. This pipeline enables UStyle to achieve realistic underwater style transfer suitable for robotic vision, data augmentation, and marine imaging applications.
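A minimal sketch of the DA-WCT blending idea follows, assuming single (C, H, W) feature maps and a depth map normalized to [0, 1]; the sigmoid steepness k and midpoint d0 are illustrative values, not the paper's parameters.

```python
import torch

def wct(fc, fs, eps=1e-5):
    # Classic whitening-and-coloring transform on (C, H, W) features:
    # whiten the content features, then re-color them with the
    # second-order statistics (covariance) of the style features.
    C, H, W = fc.shape
    fc_flat, fs_flat = fc.reshape(C, -1), fs.reshape(C, -1)
    mc, ms = fc_flat.mean(1, keepdim=True), fs_flat.mean(1, keepdim=True)
    fc_c, fs_c = fc_flat - mc, fs_flat - ms
    cov_c = fc_c @ fc_c.t() / (fc_c.shape[1] - 1) + eps * torch.eye(C)
    Uc, Sc, _ = torch.linalg.svd(cov_c)
    whitened = Uc @ torch.diag(Sc.clamp(min=eps).rsqrt()) @ Uc.t() @ fc_c
    cov_s = fs_c @ fs_c.t() / (fs_c.shape[1] - 1) + eps * torch.eye(C)
    Us, Ss, _ = torch.linalg.svd(cov_s)
    colored = Us @ torch.diag(Ss.clamp(min=eps).sqrt()) @ Us.t() @ whitened
    return (colored + ms).reshape(C, H, W)

def da_wct(fc, fs, depth, k=8.0, d0=0.5):
    # Depth-guided blend: farther regions (larger depth) receive more of
    # the stylized waterbody features; nearer regions keep content detail.
    stylized = wct(fc, fs)
    w = torch.sigmoid(k * (depth - d0))  # depth weight map in [0, 1]
    return w * stylized + (1.0 - w) * fc
```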


UStyle Inference Pipeline

UF7D Dataset



To enable data-driven learning of no-reference waterbody style transfer (STx), we have compiled a comprehensive high-resolution underwater image database named UF7D. The dataset is categorized into seven style groups: Blue (B), Blue-Green (B-G), Clear (C), Deep Blue (DB), Deep Green (DG), Green (G), and Green-Blue (G-B). A few samples from each style category are illustrated in the image below. A key feature of UF7D is that each image is accompanied by a corresponding scene depth map, generated using the DepthAnything V2 model. With its high-resolution content and depth information, UF7D supports advanced visual learning tasks such as underwater image style transfer, enhancement, and object detection. Our data collection focused on three critical aspects: (1) scene diversity, (2) image quality, and (3) balanced STx categories. Additionally, our statistical analyses reveal comprehensive coverage of varying underwater depths. Distinct depth distributions and clear clustering of global and local color characteristics validate our human-perceived style categorization, suggesting that the underlying style attributes can be effectively learned with tailored training strategies.
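As a usage sketch, iterating over image-depth pairs might look like the following; the directory layout and file naming here are hypothetical, and the actual UF7D release may organize files differently.

```python
from pathlib import Path
from PIL import Image

# Hypothetical layout: UF7D/<style>/<scene>.jpg with a matching
# <scene>_depth.png per image; adjust to the actual release structure.
STYLES = ["B", "B-G", "C", "DB", "DG", "G", "G-B"]

def iter_uf7d(root):
    root = Path(root)
    for style in STYLES:
        for img_path in sorted((root / style).glob("*.jpg")):
            depth_path = img_path.with_name(img_path.stem + "_depth.png")
            yield style, Image.open(img_path), Image.open(depth_path)
```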

Samples from UF7D Dataset
Sample Scenes: A few samples from the UF7D dataset across its seven style categories.
PCA Projection of Global Color Characteristics
PCA Projection: Global color characteristics with 92% variance captured in the first two components.
t-SNE Visualization of Local Color Statistics
t-SNE Visualization: Clustering of local color statistics, validating our style categorization.

Additional analyses include violin plots that illustrate depth distributions across style categories. The PCA and t-SNE visualizations reveal distinct clustering of global and local color features, respectively. These results confirm that our style categorization is both perceptually meaningful and statistically robust, supporting the development of tailored training strategies for effective underwater style transfer.
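A sketch of the global color analysis is given below; the per-channel mean/std descriptor is a simple stand-in for the paper's actual color features, used only to illustrate the PCA projection step.

```python
import numpy as np
from sklearn.decomposition import PCA

def global_color_features(images):
    # images: list of HxWx3 uint8 arrays; summarize each by per-channel
    # mean and standard deviation in normalized RGB.
    feats = []
    for img in images:
        x = img.reshape(-1, 3).astype(np.float32) / 255.0
        feats.append(np.concatenate([x.mean(axis=0), x.std(axis=0)]))
    return np.stack(feats)

def pca_projection(images, n_components=2):
    pca = PCA(n_components=n_components)
    proj = pca.fit_transform(global_color_features(images))
    print("explained variance:", pca.explained_variance_ratio_.sum())
    return proj  # one 2D point per image, for scatter plots by style
```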


Experimental Results and Discussion



We evaluate UStyle against six state-of-the-art underwater style transfer methods – including artistic models (StyTR-2, ASTMAN, MCCNet, IEContraAST), a photorealistic method (PhotoWCT2), and the physics-based AquaFuse – using four cross-domain transfer cases (B→G, G→B, DB→DG, and DG→DB) over 2,500 test scenes from the UF7D dataset. All baselines were fine-tuned on UF7D under consistent training conditions.

Qualitative Results: As illustrated in Qualitative Comparison, conventional artistic methods often produce artifacts or oversaturate colors, while photorealistic approaches struggle with deep-water transformations. In contrast, UStyle, with its depth-aware stylization, consistently generates visually coherent and artifact-free outputs that preserve structural and color fidelity across varying underwater conditions.

Qualitative Comparison of Underwater STx Results
Qualitative Comparison – Side-by-side examples of underwater style transfer results from various methods, including UStyle.

Quantitative Results: We measure performance using RMSE, PSNR, SSIM, GMSD, and LPIPS metrics. UStyle consistently achieves lower RMSE, higher PSNR and SSIM, and superior perceptual fidelity (i.e., lower GMSD and LPIPS) compared to the baselines. As detailed in the plots below, UStyle outperforms competing methods by a significant margin, even without requiring explicit reference images or prior scene information.
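A sketch of the full-reference scoring is shown below, assuming predictions and reference images are aligned HxWx3 uint8 arrays; GMSD and LPIPS would come from their reference implementations (e.g., the piq and lpips packages) and are omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, target):
    # pred, target: aligned HxWx3 uint8 arrays.
    diff = pred.astype(np.float64) - target.astype(np.float64)
    rmse = np.sqrt(np.mean(diff ** 2))
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=255)
    return {"RMSE": rmse, "PSNR": psnr, "SSIM": ssim}
```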

RMSE Performance Plot
RMSE Performance Plot – Average RMSE scores for different models.
PSNR-SSIM Performance Plot
PSNR-SSIM Performance Plot – A 2D grid visualization of PSNR and SSIM scores, highlighting UStyle’s superior performance.

Ablation Studies

We conducted ablation experiments to assess the contributions of depth supervision and the individual loss components in UStyle. The results indicate that incorporating depth guidance via the DA-WCT module significantly improves structural preservation and color adaptation. With the proposed DA-WCT, stylized outputs exhibit consistent blending across both foreground objects and background waterbody regions. In contrast, using WCT alone produces a global averaging effect, along with over-saturated pixels for deep-water (DB/DG) styles.

Ablation Results: Impact of Depth Guidance
Ablation Study Metrics
Ablation experiments – Comparing three variants: UStyle without the CLIP loss (w/o CLIP), UStyle without the additional regularization losses L3 (w/o L3), and UStyle with all loss components. Each variant is evaluated with WCT/DA-WCT on PSNR, SSIM, and RMSE.

Please refer to our Paper for more detailed experimental analyses!

Acknowledgments



This work is supported in part by the National Science Foundation (NSF) grant #2330416 and the University of Florida (UF) research grant #132763.