Talk Title: Investigating Length Correlations in RLHF

Abstract: Reinforcement Learning from Human Feedback (RLHF) has been reported to achieve great success in aligning language models, but it often drives models to produce longer outputs. We investigate this phenomenon and find that response length is a more significant contributing factor behind RLHF’s reported improvements than previously thought. On three diverse data settings, we find that performance improvements after RLHF are largely due to increased length rather than to other important features. In fact, optimizing a purely length-based reward reproduces most downstream RLHF improvements over fine-tuned models. We test a comprehensive set of length-countering interventions and identify reward models as the dominant source of this bias.

Bio: Tanya Goyal is an assistant professor in the Computer Science department at Cornell University. Her research interests include building reliable and sustainable evaluation frameworks for large language models (LLMs) as well as understanding LLM behaviors as a function of training data and/or alignment strategies. Previously, she was a postdoctoral scholar at Princeton Language and Intelligence Center (2023-2024). Tanya completed her Ph.D. in Computer Science at UT Austin in 2023, advised by Greg Durrett, and her thesis received UTCS’s Bert Kay Dissertation Award.