Perceptually coherent sound-space traversal for interactive systems via embeddings, VAE priors and diffusion decoding

Sci Rep. 2026 Jul 1. doi: 10.1038/s41598-026-60196-4. Online ahead of print.

ABSTRACT

Large audio archives contain rich and diverse sonic material, yet they are seldom usable as controllable media in interactive contexts such as installations, live performance and adaptive sound environments. This paper presents a framework for interactive latent audio synthesis that supports continuous sound-space traversal within a structured latent manifold, rather than unconstrained audio generation. The framework first uses pretrained audio encoders, including AudioMAE, CLAP and related models, to organize a curated 120,000-clip AudioSet subset into a structured audio embedding space. A variational autoencoder then learns a smooth latent representation, which is further refined by a latent diffusion model to improve latent validity and traversal continuity. The refined latent codes are rendered into controllable waveforms through a DDSP-based synthesis stage, while Ambisonic spatialization couples spatial rendering to the traversal parameters. Gesture is used only at inference time as a control layer for traversal and spatial modulation, rather than as a training condition for the generator. The framework is evaluated against VAE-only and diffusion-only baselines using latent-structure analysis, interpolation behavior, synthesis quality and runtime performance. Results show that the proposed hybrid model achieves a CLAP similarity of 0.82, a mean F0 error of 245.3 Hz, a spectral convergence of 0.132 and an interactive latency of approximately 35 ms. These findings provide technical and proxy-based evidence for latent continuity, synthesis stability and real-time traversal feasibility, while the human-centered pilot evaluation provides initial user-level evidence for perceived traversal smoothness, controllability, responsiveness and creative usefulness. Because the pilot evaluation is small-scale, these user-facing findings should be interpreted as preliminary rather than as statistically generalizable validation.
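
Two ideas in the abstract can be illustrated with a minimal Python sketch: moving between two latent codes under a single control value, and scoring a rendering with spectral convergence. The abstract does not specify how either is computed, so everything here is an assumption. `lerp_latent` uses plain linear interpolation as a stand-in for the framework's traversal, and `spectral_convergence` uses the commonly cited Frobenius-norm ratio between reference and estimated magnitude spectrograms. Both function names are hypothetical and the inputs are toy lists, not real latents or spectrograms.

```python
import math


def lerp_latent(z_a, z_b, t):
    """Linearly interpolate between two latent codes (illustrative assumption).

    t stands in for a normalized gesture value and is clamped to [0, 1].
    """
    t = min(1.0, max(0.0, t))
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def spectral_convergence(ref_mag, est_mag):
    """Return ||ref - est||_F / ||ref||_F for two magnitude spectrograms.

    Each spectrogram is a list of frames; each frame is a list of
    non-negative bin magnitudes. Lower values mean a closer match.
    """
    num = sum(
        (r - e) ** 2
        for ref_frame, est_frame in zip(ref_mag, est_mag)
        for r, e in zip(ref_frame, est_frame)
    )
    den = sum(r ** 2 for ref_frame in ref_mag for r in ref_frame)
    if den == 0:
        raise ValueError("reference spectrogram has zero energy")
    return math.sqrt(num / den)
```

Under this definition an identical reconstruction scores 0 and an all-zero estimate scores 1, so a value such as the reported 0.132 would sit near the low-error end of that range, assuming the paper uses a comparable definition.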

PMID:42387044 | DOI:10.1038/s41598-026-60196-4

By Nevin Manimala