Improving Text-to-Music Generation with Human Preference Rewards

Audio demo for our submission to the ICME 2026 ATTM Grand Challenge (Efficiency Track). Code: github.com/yonghyunk1m/ttm-humanpref.

The samples below compare our two submitted configurations with the challenge-provided FluxAudio-S baseline. The baseline is the official 120M FluxAudio-S checkpoint: it is text-conditioned on T5 + LAION-CLAP-Music features but was trained without our score-conditioning head, expert iteration, or CRPO pass, and it uses no inference post-processing. Sub. 1 and Sub. 2 are the full submitted pipeline (score-conditioned SFT + expert iteration + v1→v2 cross-loading + CRPO, followed by 3×Demucs and loudness normalization to −16.5 LUFS) at seeds 42 and 55, respectively. All audio is mono, 44.1 kHz, 10 seconds long.
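The final loudness step in the pipeline above targets −16.5 LUFS. As a rough illustration of the arithmetic only (not our actual post-processing code), the sketch below converts a measured loudness into the gain needed to reach that target. It assumes the integrated loudness has already been measured elsewhere (e.g. with an ITU-R BS.1770 meter); the function names `gain_to_target` and `apply_gain` are hypothetical, and clipping is not handled.

```python
def gain_to_target(measured_lufs, target_lufs=-16.5):
    """Return (gain in dB, linear amplitude factor) to move from measured to target loudness.

    LUFS is a dB-scaled unit, so a uniform gain of G dB shifts the
    integrated loudness by G LU.
    """
    gain_db = target_lufs - measured_lufs
    linear = 10 ** (gain_db / 20)  # dB -> amplitude ratio
    return gain_db, linear


def apply_gain(samples, linear):
    """Scale every sample by the linear factor (hypothetical helper; no clipping guard)."""
    return [s * linear for s in samples]


# Illustrative values only: a clip measured at -22.5 LUFS needs +6 dB.
gain_db, linear = gain_to_target(-22.5)
scaled = apply_gain([0.1, -0.2, 0.05], linear)
```

A clip already louder than the target gets a negative `gain_db` and a linear factor below 1, so the same function covers attenuation as well.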

CLAP-text is the cosine similarity between the prompt and the generated audio in the LAION-CLAP space (higher is better). Reward is the TuneJury preference scalar trained on four open music-preference corpora (higher is better; typical range approximately −2 to +2).
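For readers unfamiliar with the CLAP-text metric, here is a minimal sketch of the cosine-similarity computation on two embedding vectors. It assumes the text and audio embeddings have already been extracted by a CLAP model; the short vectors below are made-up placeholders, not real CLAP outputs, and `cosine_similarity` is an illustrative helper rather than the evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings (real CLAP embeddings are much higher-dimensional).
text_emb = [0.2, 0.1, 0.7]
audio_emb = [0.25, 0.05, 0.65]
score = cosine_similarity(text_emb, audio_emb)
```

Because the score depends only on the angle between the vectors, rescaling either embedding leaves it unchanged.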

The table below lists all 100 evaluation prompts with the three systems side by side. CLAP-text and Reward scores are shown only for the 10-prompt scored subset; the remaining prompts list audio only.

# Prompt Baseline Sub. 1 (seed 42) Sub. 2 (seed 55)