What is reinforcement learning from human feedback (RLHF)?

Reinforcement learning from human feedback

This article is part of Demystifying Artificial Intelligence, a series of posts that (attempt to) demystify the terms and myths surrounding AI.

Since OpenAI released ChatGPT, there has been a lot of excitement about developments in large language models (LLMs). While ChatGPT is roughly the same size as other LLMs, its performance is much better. It already promises to enable new applications and disrupt some existing ones.

One of the main reasons behind ChatGPT’s impressive performance is its training technique: reinforcement learning from human feedback (RLHF). While RLHF has shown great results with LLMs, its history dates back to before the first GPT was released, and its first applications were not in natural language processing.

Here’s what you need to know about RLHF and how it applies to large language models.

What is RLHF?

Reinforcement learning is a field of machine learning in which an agent learns a policy through interactions with its environment. The agent takes actions (which can include doing nothing at all). These actions affect the environment, which in turn transitions to a new state and returns a reward. Rewards are the feedback signals that enable the RL agent to adjust its action policy. As the agent goes through training episodes, it adjusts its policy to take sequences of actions that maximize its reward.
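To make that loop concrete, here is a minimal sketch of the agent-environment interaction in Python. The toy environment, the random placeholder policy, and all the names here are illustrative assumptions, not a particular RL algorithm:

```python
import random

# Toy sketch of the agent-environment loop. The environment, actions, and
# "policy" here are illustrative placeholders, not a real RL algorithm.

class LineWorld:
    """The agent starts at position 0 and must reach a goal position."""
    def __init__(self, goal=5):
        self.goal = goal
        self.state = 0

    def step(self, action):
        self.state += action                              # action changes the state
        reward = 1 if self.state == self.goal else 0      # feedback signal
        done = self.state == self.goal
        return self.state, reward, done

class RandomAgent:
    """Placeholder policy: acts at random. A real agent would use the
    rewards it receives to adjust this policy over many episodes."""
    def act(self, state):
        return random.choice([-1, 0, 1])                  # move left, stay, or move right

env, agent = LineWorld(), RandomAgent()
state, total_reward = env.state, 0
for _ in range(100):                                      # one training episode
    action = agent.act(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode reward:", total_reward)
```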

Reinforcement learning

Designing the right reward system is one of the main challenges of reinforcement learning. In some applications, the reward is delayed far into the future. Consider an RL agent that plays chess: it receives a positive reward only after it beats its opponent, which might take dozens of moves. In this case, the agent will waste much of its initial training on random moves until it accidentally stumbles upon a winning combination. In other applications, the reward can’t even be defined by a mathematical or logical formula (more on this when we get to language models).

Reinforcement learning from human feedback enhances RL agent training by bringing humans into the training process. This helps account for the elements of the reward system that can’t be measured.

Why don’t we always use RLHF? For one thing, it scales poorly. An important advantage of machine learning in general is its ability to scale as computational resources become available. As computers grow faster and data becomes more available, you can train larger machine learning models at faster rates. Relying on humans to train RL systems becomes a bottleneck.

Therefore, most RLHF systems rely on a combination of automated and human-provided reward signals. An algorithmic reward system provides the primary feedback to the RL agent, while the human supervisor either provides an occasional additional reward/penalty signal or supplies the data needed to train a reward model.

An example of reinforcement learning from human feedback (Image credit: cs.utexas.edu)

Let’s say you want to build a pizza-cooking robot. You can fold the measurable elements into the automated reward system (e.g., crust thickness, the amount of sauce and cheese, etc.). But to make sure the pizza is delicious, you have a human taste and rate the pizzas the robot cooks during training, as in the sketch below.
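As a rough illustration, the automated and human-provided signals could be blended into a single reward like this. The measured features, target values, and weighting are made-up assumptions for the pizza example:

```python
# Hypothetical reward function for the pizza robot. The measured features,
# target values, and weighting are made-up assumptions for illustration.

def automated_reward(pizza):
    """Score the measurable properties of a pizza (0 is bad, 1 is ideal)."""
    crust = 1 - abs(pizza["crust_mm"] - 12) / 12           # target crust: ~12 mm
    sauce = 1 - abs(pizza["sauce_g"] - 80) / 80             # target sauce: ~80 g
    cheese = 1 - abs(pizza["cheese_g"] - 100) / 100         # target cheese: ~100 g
    return (crust + sauce + cheese) / 3

def total_reward(pizza, human_rating=None, human_weight=0.5):
    """Blend the automated score with an occasional human taste rating (0 to 1)."""
    reward = automated_reward(pizza)
    if human_rating is not None:
        reward = (1 - human_weight) * reward + human_weight * human_rating
    return reward

pizza = {"crust_mm": 14, "sauce_g": 70, "cheese_g": 110}
print(total_reward(pizza))                                  # automated signal only
print(total_reward(pizza, human_rating=0.9))                # with human feedback
```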

Language as a reinforcement learning problem

Large language models have proven to be very good at various tasks, including text summarization, question answering, text generation, code generation, protein folding, and more. At very large scales, LLMs can perform zero-shot and few-shot learning, accomplishing tasks they were not trained for. One of the great achievements of the transformer model, the architecture used in LLMs, is its ability to be trained through unsupervised learning.

However, despite their impressive achievements, LLMs share a fundamental trait with other machine learning models. At their core, they are very large prediction machines designed to guess the next token in a sequence, given the prompt. Trained on a very large corpus of text, an LLM develops a mathematical model that can produce long stretches of text that are (mostly) coherent and consistent.
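To see what next-token prediction looks like in practice, here is a small sketch that inspects a model’s probabilities for the next token. It assumes the Hugging Face transformers library and uses the small GPT-2 model as a stand-in for a much larger LLM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small pretrained model as a stand-in for a much larger LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Reinforcement learning from human feedback is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # one score per vocabulary token, per position

next_token_logits = logits[0, -1]          # scores for the token that comes next
probs = torch.softmax(next_token_logits, dim=-1)
top_probs, top_ids = torch.topk(probs, k=5)

# Print the five most likely continuations of the prompt.
for p, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([token_id.item()])!r}  p={p.item():.3f}")
```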

The big challenge with language is that, in many cases, there are many correct answers to a prompt. But not all of them are desirable, depending on the user, the application, and the context in which the LLM is used. Unfortunately, unsupervised learning on a large text corpus does not align the model with all the different applications in which it will be used.

Reinforcement learning for large language models

Fortunately, reinforcement learning can help point an LLM in the right direction. But first, let’s define language as an RL problem:

Agent: The language model is the reinforcement learning agent, and it must learn to generate the ideal text output.

Action space: The action space is the set of possible language outputs that the LLM can generate (and it is very large).

State space: The environment state includes the user’s prompt and the LLM’s outputs (also a very large space).

Reward: The reward measures how well the LLM’s response aligns with the application’s context and the user’s intent (a rough sketch of this framing follows the list).
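Here is a hypothetical sketch of that framing in code, with each token as an action, the prompt plus the generated text as the state, and a placeholder reward function scored at the end of the episode:

```python
# Hypothetical framing of text generation as an RL environment. Each action
# is a token, the state is the prompt plus the text generated so far, and
# the reward is assigned at the end of the episode. reward_fn is an assumed
# placeholder (e.g., a learned reward model).

class TextGenerationEnv:
    def __init__(self, prompt, reward_fn, max_tokens=64):
        self.prompt = prompt
        self.reward_fn = reward_fn
        self.max_tokens = max_tokens
        self.generated = []

    def state(self):
        # State space: the prompt concatenated with the output so far.
        return self.prompt + "".join(self.generated)

    def step(self, token):
        # Action space: any token in the model's vocabulary.
        self.generated.append(token)
        done = token == "<eos>" or len(self.generated) >= self.max_tokens
        # Reward: how well the full response matches intent, scored at the end.
        reward = self.reward_fn(self.prompt, "".join(self.generated)) if done else 0.0
        return self.state(), reward, done
```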

All the elements of the RL system above are straightforward except for the reward system. Unlike chess, Go, or even robotics problems, the rules for rewarding a language model are not well defined. Fortunately, with reinforcement learning from human feedback, we can create a good reward system for our language model.

RLHF for language models

RLHF for language models consists of three phases. First, we start with a pre-trained language model. This is very important because LLMs require a huge amount of training data, and training them from scratch on human feedback is practically impossible. An LLM pre-trained through unsupervised learning will already have a solid model of language and will produce coherent outputs, though some or many of them may not align with users’ goals and intentions.

In the second stage, we create a reward model for the RL system. Here, we train another machine learning model that takes the text generated by the main model and produces a quality score. This second model is usually another LLM, modified to output a single scalar value instead of a sequence of text tokens.

To train the reward model, we must create a dataset of LLM-generated text labeled for quality. To build each training example, we give the main LLM a prompt and have it generate several outputs. We then ask human evaluators to rank the generated texts from best to worst, and train the reward model to predict this ranking from the LLM’s text. By training on the LLM’s outputs and the ranking scores, the reward model builds a mathematical representation of human preferences.
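As a sketch of what this can look like, a reward model is often built by attaching a scalar scoring head to a pretrained LLM backbone and training it with a pairwise ranking loss, so that human-preferred responses score higher than rejected ones. The encoder, dimensions, and names below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """An LLM backbone with its token-prediction head replaced by a scalar
    scoring head. `encoder` and `hidden_size` are assumed placeholders."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                       # pretrained LLM body
        self.score_head = nn.Linear(hidden_size, 1)  # outputs one quality score

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)             # assumed shape: (batch, seq, hidden)
        return self.score_head(hidden[:, -1]).squeeze(-1)   # score taken at last token

def ranking_loss(score_preferred, score_rejected):
    # Push the model to assign higher scores to human-preferred responses.
    return -torch.log(torch.sigmoid(score_preferred - score_rejected)).mean()

# Usage sketch (the token tensors are placeholders for tokenized responses):
#   loss = ranking_loss(reward_model(preferred_tokens), reward_model(rejected_tokens))
#   loss.backward()
```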

The RLHF reward model for LLMs

In the final stage, we create the reinforcement learning loop. A copy of the main LLM becomes the RL agent. In each training episode, the LLM takes several prompts from a training dataset and generates text. Its output is then passed to the reward model, which produces a score rating its alignment with human preferences. The LLM is then updated to generate outputs that score higher on the reward model.
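Here is a heavily simplified sketch of that loop. The generate and score calls are assumed placeholder APIs, and the objective is a bare-bones policy-gradient update rather than the full algorithm a production system would use:

```python
# Bare-bones sketch of one RL fine-tuning step. `generate_with_logprob` and
# `score` are assumed placeholder APIs, and the objective is a minimal
# policy-gradient update, not the full algorithm a production system uses.

def rlhf_training_step(policy_llm, reward_model, prompts, optimizer):
    total_loss = 0.0
    for prompt in prompts:
        # The RL agent (the LLM copy) generates a response and its log-probability.
        response, log_prob = policy_llm.generate_with_logprob(prompt)
        # The reward model scores alignment with human preferences.
        reward = reward_model.score(prompt, response)
        # Increase the probability of responses the reward model likes.
        total_loss = total_loss - log_prob * reward
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```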

While this is the general RLHF framework for language models, different implementations modify it. For example, because updating the entire LLM is very expensive, machine learning teams sometimes freeze many of its layers to reduce training costs.

Another consideration in RLHF for language models is balancing reward optimization against language consistency. The reward model is an imperfect approximation of human preferences, and like most RL systems, the LLM agent may find a shortcut that maximizes rewards while violating grammatical or logical consistency. To prevent this, the ML engineering team keeps a copy of the original LLM in the RL loop. The difference between the outputs of the original and RL-trained LLMs (measured as the KL divergence) is added to the reward signal as a negative term to keep the model from drifting too far from its original output distribution, as in the sketch below.
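One common way to fold that penalty into the reward is to subtract a term proportional to the divergence between the RL-trained model’s token log-probabilities and those of the frozen original model. The beta coefficient and the values below are illustrative assumptions:

```python
import torch

def kl_penalized_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Subtract a penalty proportional to how far the RL-trained model's
    token probabilities drift from the frozen original model's."""
    kl_per_token = policy_logprobs - reference_logprobs    # log(pi / pi_ref) per token
    return rm_score - beta * kl_per_token.sum()

# Dummy values for illustration only.
rm_score = torch.tensor(2.3)                      # reward model score for a response
policy_lp = torch.tensor([-1.2, -0.8, -2.0])      # log-probs under the RL-trained LLM
reference_lp = torch.tensor([-1.5, -0.9, -1.8])   # log-probs under the frozen original LLM
print(kl_penalized_reward(rm_score, policy_lp, reference_lp))
```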

How ChatGPT uses RLHF

OpenAI hasn’t released the full technical details of ChatGPT (yet). But a lot can be learned from the ChatGPT blog post and the details of InstructGPT, which also uses RLHF.

ChatGPT uses the general RLHF framework described above, with some modifications. In the first phase, engineers performed “supervised fine-tuning” on a pre-trained GPT-3.5 model. They hired a group of human writers and asked them to write answers to a set of prompts, then used this dataset of prompt-response pairs to fine-tune the LLM. OpenAI reportedly spent a considerable sum on this data, which is part of why ChatGPT is superior to other similar LLMs.

In the second stage, they created their reward model following the standard procedure: generating multiple responses to prompts and having human annotators rank them.

In the final stage, they used the proximal policy optimization (PPO) RL algorithm to train the main LLM. OpenAI doesn’t provide further details on whether it froze any parts of the model or how it ensured that the RL-trained model didn’t drift too far from the original distribution.
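For reference, PPO’s core idea is a clipped policy objective that limits how far each update can move the policy. The sketch below shows that objective with dummy values; it is not OpenAI’s actual implementation, whose details are unpublished:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective: limit how far one update can move
    the policy away from the one that generated the samples."""
    ratio = torch.exp(new_logprobs - old_logprobs)              # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # minimize the negative

# Dummy values for illustration only.
new_lp = torch.tensor([-1.0, -0.7, -1.4])
old_lp = torch.tensor([-1.1, -0.9, -1.2])
adv = torch.tensor([0.5, 1.2, -0.3])     # e.g., KL-penalized reward minus a baseline
print(ppo_clipped_loss(new_lp, old_lp, adv))
```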

ChatGPT training process (source: OpenAI)

The limits of RLHF for language models

While RLHF is a very effective technique, it has several limitations. Human labor always becomes a bottleneck in machine learning pipelines: manual data labeling is slow and expensive, which is why unsupervised learning has long been a sought-after goal of machine learning researchers.

In some cases, you can get free labels from the users of your ML systems. This is what the upvote/downvote buttons in ChatGPT and other similar LLM interfaces are for. Another technique is to obtain labeled data from online forums and social networks. For example, many Reddit posts are phrased as questions, and the best answers receive more upvotes. However, these datasets still need to be cleaned and reviewed, which is expensive and slow, and there is no guarantee that the data you need is available from a single online source.

Big tech companies and heavily funded labs like OpenAI and DeepMind can spend huge sums creating proprietary RLHF datasets. Smaller companies will have to rely on open-source datasets and web-scraping techniques.

RLHF isn’t a perfect solution, either. Human feedback can help steer LLMs away from generating harmful or false output. But human preferences are not well defined, and you can never create a reward model that matches the preferences and norms of every society and social structure.

However, RLHF provides a workable framework for better aligning LLMs with humans. So far, we’ve seen RLHF work with general-purpose models like ChatGPT. I believe RLHF will become a very effective technique for optimizing smaller LLMs for specific applications.
