TuneJury

An Open Metric for Improving Music Generation Preference Alignment
Yonghyun Kim · Junwon Lee♭ · Haiwen Xia♮ · Yinghao Ma♯ · Junghyun Koo · Koichi Saito · Yuki Mitsufuji · Chris Donahue
Carnegie Mellon University · Sony AI · Georgia Tech · ♭KAIST · ♮Peking University · ♯Queen Mary University of London
Preprint · arXiv:2606.17006

Listen first: one example per reward-driven mode

SDD-100 is our 100-prompt subset of the Song Describer Dataset, used for all three modes.

Mode 1: best-of-N selection (MusicGen-medium, N=16)

SDD-100 prompt 75: "A dark trance track featuring accordion, blending hypnotic rhythms with melancholic melodies and a pervasive, atmospheric mood." Only the noise seed differs between candidates.

N=1 (single sample, c0)
TuneJury reward: +0.05
Best-of-16 (TuneJury-selected, c6)
TuneJury reward: +1.71 (Δ +1.66)

Mode 2: DITTO latent optimization (TangoFlux)

SDD-100 prompt 0: "A melancholic rap piece driven by a steady drummachine beat, layered with subtle synth pads and a sparse electric guitar." Same backbone, same prompt; only the initial noise latent is optimized against the negative TuneJury reward (5 outer iterations, 8 inner steps, lr 0.05).

Baseline (N=1, no optimization)
TuneJury reward: −1.11
DITTO-optimized (5 outer iter)
TuneJury reward: +1.13 (Δ +2.24)

Mode 3: expert-iteration post-training (FluxAudio-S)

SDD-100 prompt 46: "A fast garage track featuring an electric guitar, driven by raw energy and a loose, rhythmic feel." Same backbone, same seed, same MeanAudio inference settings. Only the post-training (expert iteration on TuneJury's top-decile candidates) differs.

Baseline FluxAudio-S
TuneJury reward: −2.06
Post-trained (expert iteration)
TuneJury reward: −0.05 (Δ +2.00)

More examples (all four Mode 1 backbones, two Mode 2 backbones, and the top-3 Mode 3 prompts) are available in the listening review demo.

What is TuneJury?

TuneJury is an open, instance-level pairwise reward model for text-to-music generation. It predicts a single music-preference scalar from a (text, audio) input. It is trained on ~17.5K human-rated A vs. B preference pairs (~22K in the total pool, including validation and held-out test) drawn from Music Arena, MusicPrefs, AIME, and SongEval. No pseudo-label augmentation is used.
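As a minimal sketch of what training on A vs. B pairs means, the snippet below evaluates the pairwise logistic loss that the Architecture section states, L = −log σ(s(A) − s(B)), on scalar rewards. The function names and the use of plain floats are assumptions made for illustration; this is not the released training code.

```python
import math


def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(s_A - s_B)) for one pair where A was preferred."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); branch on the sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_a: float, score_b: float) -> float:
    """Return the modelled probability that clip A is preferred over clip B."""
    margin = score_a - score_b
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)
```

The loss depends only on the difference of the two scalars, so the same scoring function can be applied to both clips and the absolute reward scale is fixed only up to an additive constant.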

2.8M trainable parameters
17.5K human-rated training pairs (~22K total pool)
0.7086 pairwise accuracy on the held-out test set (ECE 0.0339)
10 min for full training on a single RTX A5000
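For readers unfamiliar with the two headline metrics, here is a rough sketch of how pairwise accuracy and expected calibration error (ECE) can be computed from scalar rewards. The binning scheme (10 equal-width bins over the confidence of the predicted winner) is an assumption, as the page does not state the paper's exact protocol. The function name and the toy pairs are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


def pairwise_metrics(pairs, n_bins=10):
    """pairs: list of (score_a, score_b, a_preferred). Returns (accuracy, ece)."""
    records = []
    for score_a, score_b, a_preferred in pairs:
        p_a = sigmoid(score_a - score_b)
        predicted_a = p_a >= 0.5
        # Confidence is the probability assigned to the predicted winner.
        confidence = p_a if predicted_a else 1.0 - p_a
        records.append((confidence, predicted_a == a_preferred))

    accuracy = sum(correct for _, correct in records) / len(records)

    bins = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        # Clamp confidence == 1.0 into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    # ECE: size-weighted gap between mean confidence and accuracy per bin.
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        bin_accuracy = sum(k for _, k in bucket) / len(bucket)
        ece += len(bucket) / len(records) * abs(bin_accuracy - mean_confidence)
    return accuracy, ece


# Hypothetical scores and labels, not data from the paper.
toy_pairs = [(1.2, 0.1, True), (-0.4, 0.9, False), (0.3, 0.8, True), (0.6, 0.5, True)]
toy_accuracy, toy_ece = pairwise_metrics(toy_pairs)
```

A low ECE means that when the model gives a preference probability of, say, 0.7, roughly 70% of such pairs agree with the human label.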

Why TuneJury? Design comparison with prior music reward / quality scorers

TuneJury and CMI-RM share the RankNet pairwise paradigm. They differ on every other axis (paper §1, Table 1; T=text, L=lyrics, R=reference audio, A=candidate audio).

Model | Framework | Input | Output | Supervision
TuneJury (ours) | RankNet pairwise | T, A | 1-d scalar | ~17.5K human-rated pairs
CMI-RM | RankNet pairwise | T, L, R, A | 2-d (align, qual) | ~6.6K human + ~110K pseudo
SongEval-RM | MOS regression | A | 5-d aesthetic | SongEval 5-axis MOS
Audiobox-Aesthetics | MOS regression | A | 4-d aesthetic | Audiobox MOS
MuQ-Eval | MOS regression | A | 2-d (align, qual) | MusicEval per-clip MOS
PAM score | Zero-shot audio-LM | T, A | 1-d scalar | zero-shot

On the OOD PAM and CMI-RewardBench Music Arena splits, TuneJury stays within 0.06 SRCC and 2 pp pairwise accuracy of the pseudo-augmented CMI-RM, despite 10× fewer head parameters and no pseudo-label augmentation.

Architecture

[Figure: TuneJury architecture diagram]

Single MLP head over a frozen encoder stack, trained with the shared-weight pairwise logistic loss L = −log σ(s(A) − s(B)). Two encoder instantiations are released:

Three downstream applications

[Figure: the three downstream applications (Mode 1 best-of-N selection, Mode 2 DITTO latent optimization, Mode 3 expert-iteration post-training)]

All three downstream applications share the same frozen TuneJury reward signal (blue). Mode 1 ranks frozen-backbone candidates by reward (gray = frozen). Mode 2 backpropagates reward into the noise latents at inference. Mode 3 fine-tunes the backbone (red = trainable) on its own top-reward decile.

The same frozen TuneJury reward signal drives three downstream applications with no further preference labeling:

Mode 1: Inference-time best-of-N selection. For each prompt, the backbone draws N candidate clips and the top-1 by TuneJury reward is kept. Reward increases strictly monotonically with N up to N=32 on four frozen backbones (MusicGen-medium, MusicGen-large, AudioLDM2-music, ACE-Step Turbo Continuous).

Mode 2: Inference-time latent optimization (DITTO-style). Optimization lifts mean reward by +0.25 on SAO-small (n=30, 19/30 prompts improved) and by +1.56 on TangoFlux (n=100, 100/100 improved). AudioLDM2-music is omitted because 50-step backpropagation exceeds the memory budget.

Mode 3: Expert-iteration post-training. Mean reward rises by +0.416 on FluxAudio-S at LR=1e-5 (75/100 prompts improved), and sweeping LR ∈ {1e-6, 5e-6, 1e-5} maps a reward-fidelity Pareto frontier.
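The data-selection step of Mode 3 can be sketched as follows: sample several candidates per prompt, score them, and keep the top-reward decile as the fine-tuning set. This is a simplified illustration under stated assumptions. It takes the decile over the pooled candidates (the page does not say whether it is per prompt or global), and `build_finetune_set`, `toy_sample`, and `toy_reward` are hypothetical stand-ins for the backbone and the TuneJury scorer. The fine-tuning step itself is left out.

```python
def build_finetune_set(prompts, sample_fn, reward_fn, n_per_prompt=4, top_percent=10):
    """Generate, score, and keep the top `top_percent` of candidates by reward."""
    scored = []
    for prompt in prompts:
        for seed in range(n_per_prompt):
            clip = sample_fn(prompt, seed)
            scored.append((prompt, clip, reward_fn(prompt, clip)))
    # Highest reward first; keep at least one candidate.
    scored.sort(key=lambda item: item[2], reverse=True)
    keep = max(1, len(scored) * top_percent // 100)
    return scored[:keep]


# Hypothetical stand-ins: a "clip" is just a string and the reward is synthetic.
def toy_sample(prompt, seed):
    return f"{prompt}#{seed}"


def toy_reward(prompt, clip):
    return int(prompt[1:]) + 0.1 * int(clip.split("#")[1])


finetune_set = build_finetune_set(["p0", "p1", "p2", "p3", "p4"], toy_sample, toy_reward)
```

In an expert-iteration loop, the backbone would then be fine-tuned on `finetune_set` and the procedure repeated with the updated model.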

[Figure: Mode 1 best-of-N sweep across N ∈ {1, 2, 4, 8, 16, 32} on four frozen open-weights backbones]

Mode 1 best-of-N sweep on four frozen open-weights backbones. (a) TuneJury reward monotone through N=32. (b) CLAP score (text-audio cosine) rises in parallel through N=8 then plateaus. (c) FAD-CLAP against SDD-706 improves at N=4 on three of four backbones (paper §5.1, Figure 3).
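A best-of-N sweep of this kind can be sketched in a few lines. The sketch assumes a sampler that takes a prompt and a noise seed and a reward function that takes a prompt and a clip. `best_of_n`, `toy_sample`, and `toy_reward` are hypothetical names, and the toy reward is a seeded Gaussian draw rather than a real TuneJury score.

```python
import random


def best_of_n(prompt, sample_fn, reward_fn, n):
    """Draw n candidates that differ only in seed; return the top-1 and its reward."""
    candidates = [sample_fn(prompt, seed) for seed in range(n)]
    rewards = [reward_fn(prompt, clip) for clip in candidates]
    best_index = max(range(n), key=rewards.__getitem__)
    return candidates[best_index], rewards[best_index]


# Hypothetical stand-ins: the "clip" is its seed and the reward is a fixed random draw.
def toy_sample(prompt, seed):
    return seed


def toy_reward(prompt, clip):
    return random.Random(clip).gauss(0.0, 1.0)


sweep = (1, 2, 4, 8, 16, 32)
best_rewards = [best_of_n("toy prompt", toy_sample, toy_reward, n)[1] for n in sweep]
```

Because each larger N reuses the seeds of the smaller one, the selected reward for a single prompt cannot decrease as N grows. With the released scorer, `reward_fn` would wrap a call such as the `scorer.score(path, prompt=...)` shown in the Quick start.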

Anchor calibration for post-cutoff systems

A Bradley–Terry-based, post-hoc, per-system calibration recovers ~5 pp of pairwise agreement on post-cutoff Music Arena battles with ~100 calibration pairs (~3 pp at K=30). It is substantially more data-efficient than retraining over the swept K-grid: anchor calibration at K=10 matches retraining at K=250 (paper §A.D).
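One plausible reading of per-system calibration, offered only as a sketch, is to add a learned offset per generating system to the frozen reward and fit those offsets by maximizing a Bradley–Terry likelihood on a small set of calibration battles. The additive-offset form, the L2 penalty, the optimizer settings, and all names and toy data below are assumptions. The paper's actual procedure may differ.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


def fit_system_offsets(pairs, steps=500, lr=0.1, l2=0.01):
    """pairs: (system_a, score_a, system_b, score_b, a_won). Returns {system: offset}."""
    offsets = {}
    for sys_a, _, sys_b, _, _ in pairs:
        offsets.setdefault(sys_a, 0.0)
        offsets.setdefault(sys_b, 0.0)
    for _ in range(steps):
        # Gradient of the mean log-likelihood minus an L2 penalty on the offsets.
        grads = {name: -l2 * value for name, value in offsets.items()}
        for sys_a, score_a, sys_b, score_b, a_won in pairs:
            p_a = sigmoid(score_a + offsets[sys_a] - score_b - offsets[sys_b])
            error = (1.0 if a_won else 0.0) - p_a
            grads[sys_a] += error / len(pairs)
            grads[sys_b] -= error / len(pairs)
        for name in offsets:
            offsets[name] += lr * grads[name]
    return offsets


def agreement(pairs, offsets=None):
    """Fraction of pairs where the (optionally calibrated) reward picks the winner."""
    offsets = offsets or {}
    hits = 0
    for sys_a, score_a, sys_b, score_b, a_won in pairs:
        margin = score_a + offsets.get(sys_a, 0.0) - score_b - offsets.get(sys_b, 0.0)
        hits += (margin > 0) == a_won
    return hits / len(pairs)


# Hypothetical battles: the "new" system is under-scored by the frozen reward.
calibration_pairs = [
    ("new", 0.0, "old", 0.5, True), ("new", 0.0, "old", 0.5, True),
    ("new", 0.0, "old", 0.5, True), ("new", 0.2, "old", 0.4, True),
    ("old", 0.6, "new", 0.1, False), ("new", -0.5, "old", 0.5, False),
]
eval_pairs = [("new", 0.1, "old", 0.4, True), ("old", 0.3, "new", 0.0, False)]

offsets = fit_system_offsets(calibration_pairs)
raw_agreement = agreement(eval_pairs)
calibrated_agreement = agreement(eval_pairs, offsets)
```

Only one scalar per system is fitted and the reward model itself stays frozen, which is consistent with the small number of calibration pairs quoted above.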

Quick start

git clone https://github.com/yonghyunk1m/TuneJury.git
cd TuneJury
pip install -r requirements.txt

python -c "
from tunejury import Scorer
scorer = Scorer.from_pretrained('checkpoints/tunejury.pt')
print(scorer.score('your_audio.wav', prompt='your caption'))
print(scorer.score('your_audio.wav', prompt=''))   # empty prompt -> OOD-safe (paper 4.2)
"

The full README, evaluation harness, and three-mode application code are available in the repo. Released under CC-BY-NC 4.0 (tracking MERT-v1-330M's constraint).

References

Works referenced by this project page. Full bibliography is in the paper.

Frozen encoders. Wu et al., LAION-CLAP (ICASSP 2023) · Li et al., MERT (ICLR 2024) · Zhu et al., MuQ (arXiv 2025).

Reward / quality models compared. Ma et al., CMI-RewardBench / CMI-RM (arXiv 2026) · Yao et al., SongEval / SongEval-RM (arXiv 2025) · Tjandra et al., Audiobox-Aesthetics (arXiv 2025) · Zhu and Li, MuQ-Eval (arXiv 2026) · Deshmukh et al., PAM (Interspeech 2024).

Training data sources. Kim et al., Music Arena (NeurIPS Creative AI 2025) · Huang et al., MusicPrefs (ISMIR 2025) · Grötschla et al., AIME (ICASSP 2025) · Yao et al., SongEval (arXiv 2025).

Generation backbones (Mode 1 / 2 / 3). Copet et al., MusicGen (NeurIPS 2023) · Liu et al., AudioLDM 2 (TASLP 2024) · Gong et al., ACE-Step (arXiv 2025) · Novack et al., Stable Audio Open Small (SAO-small) (arXiv 2025) · Hung et al., TangoFlux (ICLR 2026) · Li et al., MeanAudio / FluxAudio-S (arXiv 2025).

Open-license reward-score collections. Bogdanov et al., MTG-Jamendo (ML4MD@ICML 2019) · Defferrard et al., FMA (ISMIR 2017) · Law et al., MagnaTagATune (ISMIR 2009) · Humphrey et al., OpenMIC-2018 (ISMIR 2018) · Agostinelli et al., MusicLM / MusicCaps (arXiv 2023) · Melechovsky et al., MidiCaps (ISMIR 2024) · Manco et al., Song Describer Dataset (SDD) (ML4Audio@NeurIPS 2023).

Methods. Burges et al., RankNet (ICML 2005) · Novack et al., DITTO (ICML 2024) · Anthony et al., Expert Iteration (NeurIPS 2017) · Singh et al., ReST^EM (TMLR 2024) · Bradley & Terry, Bradley–Terry (Biometrika 1952) · Gao et al., Scaling laws for reward-model overoptimization (ICML 2023).

Side metrics. Kilgour et al., FAD (Interspeech 2019) · Gui et al., FAD-X encoder variants (ICASSP 2024) · Huang et al., MAD (MAUVE-Audio) (ISMIR 2025) · Guo et al., Calibration / ECE (ICML 2017).

Citation

@misc{tunejury2026,
  title         = {TuneJury: An Open Metric for Improving Music Generation Preference Alignment},
  author        = {Kim, Yonghyun and Lee, Junwon and Xia, Haiwen and Ma, Yinghao and Koo, Junghyun and Saito, Koichi and Mitsufuji, Yuki and Donahue, Chris},
  year          = {2026},
  eprint        = {2606.17006},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2606.17006},
}

Acknowledgments

We thank the maintainers of LAION-CLAP, MERT-v1-330M, and MuQ-MuLan-large for releasing their pretrained encoders; the authors of Music Arena, MusicPrefs, AIME, and SongEval for the open preference and aesthetic-rating sources; and the developers of the backbone audio generators (MusicGen, AudioLDM2-music, ACE-Step Turbo Continuous, Stable Audio Open-small, TangoFlux, FluxAudio-S) used in the Mode 1–3 demonstrations.