Demystifying Reward Models in RLHF: A Comprehensive Guide

Introduction:

In the ever-expanding universe of Reinforcement Learning from Human Feedback (RLHF), the role of reward models is nothing short of paramount. These models serve as the cornerstone for fine-tuning Large Language Models (LLMs) to align with human values. In this technical blog, we'll delve into the intricacies of reward models, covering three vital aspects:

1. Creating a Dataset for Training Reward Models

Step 1: Selecting the Right Instruct Fine-Tuned Model

The journey of building a reward model dataset begins with the selection of an instruct fine-tuned LLM. This model will act as the foundation upon which you'll construct the reward model dataset.

Step 2: Assembling a Prompt Dataset

To train a reward model, you require a prompt dataset. This dataset comprises numerous prompt samples, each serving as a guide for generating completions.
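
For illustration, a prompt dataset can be as simple as a list of instruction-style prompts; the field name and example prompts below are hypothetical, not a required format.

```python
# A minimal, hypothetical prompt dataset: one instruction-style prompt per record.
# In practice this is often stored as a JSONL file and loaded with a data library.
prompt_dataset = [
    {"prompt": "Summarize the following article in two sentences: ..."},
    {"prompt": "Explain photosynthesis to a ten-year-old."},
    {"prompt": "Write a polite reply declining a meeting invitation."},
]
```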

Step 3: Generating Completions

Taking the prompt dataset, you pass it through the instruct fine-tuned LLM. This crucial step yields multiple completions for each prompt, creating a diverse set of potential responses.
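
As a sketch of what this step can look like with Hugging Face transformers (the model name below is a placeholder, not a specific recommendation), sampling with `num_return_sequences` yields several diverse completions per prompt:

```python
# Sketch: sample several completions per prompt from an instruct fine-tuned LLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-instruct-model"  # placeholder for your instruct fine-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_completions(prompt: str, k: int = 4) -> list[str]:
    """Return k sampled completions for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling (rather than greedy decoding) gives diversity
        top_p=0.9,
        max_new_tokens=128,
        num_return_sequences=k,  # k completions per prompt
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```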

Step 4: Soliciting Human Feedback

To assess the quality of the completions, human evaluators step in. They rank the completions based on predefined model alignment criteria, ensuring a consistent and objective evaluation process.
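
A single feedback record might look like the hypothetical example below; the exact fields and alignment criteria vary from project to project.

```python
# Hypothetical example of one human-feedback record: an evaluator ranks the
# completions for a prompt from best (1) to worst (3) against alignment criteria
# such as helpfulness and harmlessness.
feedback_record = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "completions": ["completion_a", "completion_b", "completion_c"],
    "ranking": [2, 1, 3],  # completion_b ranked best, completion_c worst
    "evaluator_id": "annotator_007",
}
```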

Step 5: Reducing Bias and Errors

To minimize bias and errors in the alignment process, consider involving multiple evaluators to assess the same data. This practice leads to more robust and well-rounded evaluations by averaging the assessments of different individuals.
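
One simple (illustrative, not prescriptive) way to combine several evaluators' rankings is to average the position each completion received and re-rank by that average:

```python
# Sketch: aggregate rankings from several evaluators by averaging the position
# each completion received. A lower average position means more preferred.
from collections import defaultdict

def average_ranks(rankings: list[list[str]]) -> list[str]:
    """rankings: one list per evaluator, each ordering completion ids best-to-worst."""
    totals = defaultdict(float)
    for ranking in rankings:
        for position, completion_id in enumerate(ranking):
            totals[completion_id] += position
    # Sort by mean position across evaluators (ascending = most preferred first).
    return sorted(totals, key=lambda cid: totals[cid] / len(rankings))

# Three evaluators rank the same three completions:
consensus = average_ranks([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
# -> ["a", "b", "c"]
```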

Step 6: Converting Rankings into Pairwise Training Data

Before feeding the data into the reward model, it's essential to convert rankings into pairwise training data. This critical step prepares the training dataset that will help the model learn effectively.

You take the ranked completions for a single prompt and combine them into all \(\binom{m}{2}\) possible pairs, where \(m\) is the number of completions for that prompt. Then you reorder each pair so that the completion with a reward of \(1\) (the preferred one) comes first. In the end, each training example consists of the following (a minimal code sketch follows the list):

  • Prompt

  • A pair of completions \((y_j, y_k)\)

  • \(y_j\) is always the preferred completion, so it is sorted to come first.
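
Here is a minimal sketch of this conversion; the field names `y_j`/`y_k` and the example data are illustrative rather than a fixed format.

```python
# Sketch: convert a ranked list of completions for one prompt into pairwise
# training examples, with the preferred completion (y_j) always first.
from itertools import combinations

def to_pairwise(prompt: str, ranked_completions: list[str]) -> list[dict]:
    """ranked_completions is ordered best-to-worst; emit all m-choose-2 pairs."""
    pairs = []
    for better, worse in combinations(ranked_completions, 2):
        # `better` appears earlier in the ranking, so it is always y_j.
        pairs.append({"prompt": prompt, "y_j": better, "y_k": worse})
    return pairs

examples = to_pairwise("Explain RLHF briefly.", ["best", "ok", "worst"])
# 3 completions -> 3 pairs: (best, ok), (best, worst), (ok, worst)
```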

2. Understanding the Reward Model

A reward model is a fundamental component of the RLHF ecosystem. It is typically a language model such as BERT that works as a binary classifier: it assigns a score to each \((prompt, completion)\) pair, and a softmax over the scores of the preferred and rejected completions yields the probability that the preferred one wins.
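
To make the softmax interpretation concrete, here is a minimal sketch of the standard pairwise loss used to train such a reward model; it assumes only PyTorch and scalar reward scores \(r_j\), \(r_k\) produced by the model for the preferred and rejected completions.

```python
# Sketch: the standard pairwise reward-model loss. r_j and r_k are the scalar
# scores the reward model assigns to the preferred and rejected completions of
# the same prompt; training pushes r_j above r_k.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    # A softmax over the two scores reduces to sigmoid(r_j - r_k); minimizing
    # the negative log of that probability is the usual training objective.
    return -F.logsigmoid(r_j - r_k).mean()

# Example with a batch of three (made-up) score pairs:
r_preferred = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -1.0])
loss = pairwise_reward_loss(r_preferred, r_rejected)
```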

3. Integrating the Reward Model into the RLHF System

The RLHF Process

To fine-tune an LLM using a reward model in RLHF, follow these steps (a structural sketch of the loop appears after the list):

  1. Begin with an instruct fine-tuned LLM that roughly fits your task.

  2. Pass the prompt dataset to the LLM.

  3. Combine the prompt and its completion into a \((prompt, completion)\) pair and pass it as input to the reward model.

  4. The reward model returns a probability score based on predefined criteria, indicating the alignment of the completion with human values.

  5. Higher scores signify a more aligned completion, while lower scores indicate less alignment.

  6. Pass the reward score to an RL algorithm, which updates the weights of the LLM, turning it into an RL-updated LLM.

  7. Iterate through this process a predefined number of times.
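
The loop below is a structural sketch of these steps. Every function body is a hypothetical placeholder standing in for the real components (generation, reward scoring, and an RL update such as PPO); only the control flow mirrors the list above.

```python
# Structural sketch of the RLHF loop; all function bodies are placeholders.
import random

def generate_completion(llm, prompt):                # step 2: the LLM produces a completion
    return f"completion for: {prompt}"

def reward_score(reward_model, prompt, completion):  # steps 3-5: score the (prompt, completion) pair
    return random.random()                           # placeholder for the reward model's probability

def rl_update(llm, prompt, completion, score):       # step 6: RL algorithm updates the LLM's weights
    return llm                                       # placeholder: returns the "updated" policy

def rlhf_loop(llm, reward_model, prompts, num_iterations=3):
    for _ in range(num_iterations):                  # step 7: iterate a predefined number of times
        for prompt in prompts:
            completion = generate_completion(llm, prompt)
            score = reward_score(reward_model, prompt, completion)
            llm = rl_update(llm, prompt, completion, score)
    return llm                                       # the RL-updated, eventually human-aligned LLM
```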

This iterative process results in a "human-aligned LLM": a model that evolves from the instruct fine-tuned LLM, to an RL-updated LLM, and finally to a human-aligned LLM, finely tuned to meet human expectations and values.

In Conclusion

Reward models are the linchpin of RLHF, guiding LLMs to align with human values and produce responses that resonate with human expectations. By creating datasets, understanding the model, and integrating it into the RLHF system, you unlock the potential to build LLMs that truly understand and align with human intentions. The journey towards human-aligned LLMs continues to evolve, and the power of reward models is instrumental in shaping this future.
