arXiv AI recent: Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
The article introduces Rubric-Conditioned Self-Distillation, a framework for post-training reasoning language models using structured rubrics as feedback. The method conditions a teacher...
Post-training of reasoning language models typically uses supervised distillation or reinforcement learning with scalar rewards. Both have limitations: distillation relies on potentially noisy chain-of-thought annotations, while scalar rewards obscure specific areas for improvement. Rubr...
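To make the point about scalar rewards concrete, here is a minimal Python sketch. It is not the paper's method: the criterion names, the scores, the unweighted mean, and the 0.5 threshold are all assumptions chosen for illustration. It shows two responses with opposite weaknesses that receive the same scalar reward, while the per-criterion rubric scores still say which part each one should improve.

```python
from statistics import mean

# Hypothetical rubric criteria and scores in [0, 1]; not taken from the paper.
response_a = {
    "correct_final_answer": 1.0,
    "valid_intermediate_steps": 0.0,
    "states_assumptions": 0.5,
}
response_b = {
    "correct_final_answer": 0.0,
    "valid_intermediate_steps": 1.0,
    "states_assumptions": 0.5,
}


def scalar_reward(rubric_scores):
    """Collapse per-criterion scores into one number (unweighted mean)."""
    return mean(rubric_scores.values())


def weakest_criteria(rubric_scores, threshold=0.5):
    """Return the criteria scoring below the threshold, sorted by name."""
    return sorted(
        name for name, score in rubric_scores.items() if score < threshold
    )


# Both responses collapse to the same scalar, hiding what went wrong.
print(scalar_reward(response_a), scalar_reward(response_b))

# The structured view keeps the distinction.
print(weakest_criteria(response_a))
print(weakest_criteria(response_b))
```

Under these assumed scores the two scalar rewards are identical, so a learner that sees only the scalar cannot tell a flawed derivation from a wrong final answer. The rubric view names a different weakest criterion for each response.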