Inference-time Scaling for Diffusion-based Audio Super-resolution
Yizhu Jin1, Zhen Ye1, Zeyue Tian1, Haohe Liu2, Qiuqiang Kong3, Yike Guo1, Wei Xue1
1Hong Kong University of Science and Technology
2University of Surrey
3Chinese University of Hong Kong
Abstract
Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications, such as movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance, quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm: inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. We develop task-specific verifiers and introduce two search algorithms for SR, random search and zero-order search. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we obtain more robust and higher-quality outputs. Through extensive validation across diverse audio domains (speech, music, sound effects) and frequency ranges, we demonstrate consistent performance gains, achieving improvements of up to 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR from 4 kHz to 24 kHz, showcasing the effectiveness of our approach.
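The random-search variant of inference-time scaling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_sr` stands in for one stochastic diffusion sampling run and `verifier_score` for a task-specific verifier (e.g., an aesthetics model); both are hypothetical placeholders.

```python
# Hypothetical sketch of verifier-guided random search for diffusion SR.
# `sample_sr` and `verifier_score` are illustrative placeholders, not the
# paper's actual API.
import random


def sample_sr(lr_audio, seed):
    # Placeholder: one stochastic diffusion sampling trajectory, seeded so
    # that different seeds yield different candidate outputs.
    rng = random.Random(seed)
    return [x + rng.gauss(0, 0.01) for x in lr_audio]


def verifier_score(candidate):
    # Placeholder verifier: higher is better. A real verifier would score
    # aesthetics, speaker similarity, WER, or spectral distance.
    return -sum(abs(x) for x in candidate)


def random_search(lr_audio, n_candidates=8):
    # Draw several independent sampling trajectories and keep the one the
    # verifier scores highest, instead of committing to a single sample.
    candidates = [sample_sr(lr_audio, seed) for seed in range(n_candidates)]
    return max(candidates, key=verifier_score)
```

Because each candidate is an independent draw, increasing `n_candidates` trades compute for lower output variance, which is the scaling axis the paper exploits.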
Figure 1: Performance improvements over the LR input across different audio types using inference-time Random Search with various verifiers from 4 kHz to 24 kHz. The enhancement denotes the relative improvement over LR under each evaluation metric. For Aesthetics (AES), Speaker Similarity (SpkSim) and CLAP Score, enhancements reflect relative increases. For Word Error Rate (WER) and Log Spectrogram Distance (LSD), improvements are computed as relative reductions.
Table 1: Relative performance improvements over the default generation (vanilla sampling) for speech from 4 kHz to 24 kHz in Random Search, demonstrating the effect of inference-time scaling across different verifier types. LSD, AES, SpkSim, and WER refer to Log Spectrogram Distance, Aesthetics Score, Speaker Similarity, and Word Error Rate, respectively. The Ensemble Verifier aggregates AES, SpkSim, and WER by averaging their rank scores.
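The Ensemble Verifier's rank-score averaging can be sketched as below. This is an assumed reading of "averaging their rank scores" (per-metric ranks averaged, with WER ranked ascending since lower is better); the function names are illustrative, not from the paper.

```python
# Hypothetical sketch of an ensemble verifier that averages per-metric
# rank scores across candidates (AES and SpkSim: higher is better;
# WER: lower is better).


def rank_scores(values, higher_better=True):
    # Rank candidates under one metric; the best candidate gets rank 0.
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_better)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def ensemble_select(aes, spk_sim, wer):
    # Average the three per-metric ranks and return the index of the
    # candidate with the lowest (best) mean rank.
    r_aes = rank_scores(aes, higher_better=True)
    r_spk = rank_scores(spk_sim, higher_better=True)
    r_wer = rank_scores(wer, higher_better=False)
    mean_rank = [(a + s + w) / 3 for a, s, w in zip(r_aes, r_spk, r_wer)]
    return min(range(len(mean_rank)), key=lambda i: mean_rank[i])
```

Rank aggregation sidesteps the different scales of the three metrics, so no per-metric normalization is needed before combining them.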
Audio Super-Resolution Results Comparison
Speech:
4 kHz Sample | Default Generation | Selected | Ground-Truth
[audio samples]
Music:
4 kHz Sample | Default Generation | Selected | Ground-Truth
[audio samples]
Sound Effect:
4 kHz Sample | Default Generation | Selected | Ground-Truth
[audio samples]
More demos are on the way. Stay tuned for updates.