Audio samples from the two submitted configurations alongside the
challenge-provided FluxAudio-S baseline. The baseline is
the official 120 M FluxAudio-S checkpoint: it is text-conditioned on T5 +
LAION-CLAP-Music features and includes none of our additions (no
score-conditioning head, expert iteration, CRPO pass, or inference-time
post-processing).
Sub. 1 and Sub. 2 both use the
full submitted pipeline (score-conditioned SFT + expert iteration +
v1→v2 cross-loading + CRPO, followed by 3×Demucs and
loudness normalization to −16.5 LUFS), run with seeds 42 and 55
respectively. All audio is mono, 44.1 kHz, 10 seconds long.
CLAP-text is the cosine similarity between the embeddings of the prompt
and of the generated audio in the LAION-CLAP space (higher is better).
Reward is the TuneJury preference scalar trained on
four open music-preference corpora (higher is better; typical range
approximately −2 to +2).
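
As a rough illustration of the CLAP-text definition above, the sketch below computes a cosine similarity between two embedding vectors. It assumes the text and audio embeddings have already been extracted with LAION-CLAP; the function name `cosine_similarity` and the toy vectors are hypothetical and are not taken from our evaluation code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real CLAP embeddings
# (hypothetical values, for illustration only).
text_emb = [1.0, 0.0, 1.0]
audio_emb = [1.0, 1.0, 0.0]
print(f"CLAP-text (toy example): {cosine_similarity(text_emb, audio_emb):.3f}")
```

The score is bounded between −1 and +1, and it depends only on the direction of the two vectors, not on their magnitude.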
Prompts from the scored subset where our submission improved
the TuneJury reward most over the baseline. For each
case we show the better of Sub. 1 and Sub. 2 on the right; the
Δreward badge is that submission's reward minus
the baseline's.
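
The selection described above can be sketched as follows. This is a minimal illustration, assuming that "better" means the higher TuneJury reward of the two submissions; the `PromptScores` type, the function names, and the numeric values are hypothetical placeholders, not real results.

```python
from typing import List, NamedTuple


class PromptScores(NamedTuple):
    """Per-prompt TuneJury rewards for the three systems (hypothetical type)."""
    prompt_id: str
    baseline: float
    sub1: float
    sub2: float


def delta_reward(s: PromptScores) -> float:
    # Assumption: the displayed submission is the one with the higher reward,
    # and the badge is that reward minus the baseline's reward.
    return max(s.sub1, s.sub2) - s.baseline


def top_improvements(scores: List[PromptScores], k: int) -> List[PromptScores]:
    """Return the k prompts with the largest reward gain over the baseline."""
    return sorted(scores, key=delta_reward, reverse=True)[:k]


# Placeholder values for illustration only; these are not measured scores.
example = [
    PromptScores("A", baseline=-0.5, sub1=0.25, sub2=0.75),
    PromptScores("B", baseline=0.5, sub1=1.0, sub2=0.5),
    PromptScores("C", baseline=1.0, sub1=0.5, sub2=0.75),
]

for s in top_improvements(example, k=2):
    print(f"prompt {s.prompt_id}: delta reward = {delta_reward(s):+.2f}")
```

A negative Δreward, as for placeholder prompt C, means neither submission beat the baseline on that prompt.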
All 100 evaluation prompts with all three systems side by side.
CLAP-text and Reward scores are shown only for the 10-prompt scored subset;
the remaining prompts list audio only.