We often refer to AIs like Gemini, ChatGPT, Claude and LLaMA as “generative AI”, but that’s not the whole story. While all of these models are pre-trained on a generative task, they are then, crucially, fine-tuned using reinforcement learning, with a reward function derived from human feedback. You can still try older models through the OpenAI API; here’s a completion from davinci-002, a model the same size as GPT-3.5 but without fine-tuning (everything after the question is the model’s completion):
What is the capital of France? I don’t now, nor do I want to, I simply am not French – I’m not Italian, Spanish, Japanese, or any other nationality for that matter. I am a citizen, a resident of the United States of America The Greatest F*cking Country* in the world. I am an American… Read more »
This joke may be older than me.
Compare that with gpt-3.5-turbo-instruct, which is fine-tuned:
What is the capital of France?
The capital of France is Paris.
Given the obvious importance of fine-tuning for getting good completions from large language models, it’s worth knowing the details. The most popular fine-tuning procedure, RLHF (reinforcement learning from human feedback), is rather ad hoc, and many people have suggested ways to improve it. One proposal that has been getting a lot of hype recently is DPO (Direct Preference Optimization), which discards reinforcement learning in favour of just asking the model to produce preferred completions more often and dispreferred completions less often. Its authors claim performance as good as or better than RLHF’s, with a method that is faster, easier to implement and more stable. This month we’ll use a review of the DPO paper as an excuse to learn how RLHF works and to survey the landscape of its competitors.
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
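To make the “simple classification loss” from the abstract concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you have already computed the total log-probability of the preferred (“chosen”) and dispreferred (“rejected”) completions under both the model being trained and a frozen reference copy. The function and argument names are my own, not from any library, and `beta=0.1` is just an illustrative value for the coefficient that controls how strongly the model is held near the reference.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not a recommendation
) -> float:
    """Per-example DPO loss from completion log-probabilities.

    Each argument is log p(completion | prompt), summed over the
    completion's tokens, under the trained policy or the frozen reference.
    """
    # How much more (or less) likely each completion has become
    # relative to the reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # Binary classification loss: push the chosen shift above the rejected one.
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))
```

When the policy is identical to the reference, both shifts are zero and the loss is log 2, the same as a coin flip. Raising the probability of the chosen completion relative to the rejected one pushes the loss below that, which is exactly “preferred completions more, dispreferred completions less”, with no sampling and no separate reward model.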