See other notes on the PDF on my ereader
This paper is an overview of the problems that arise during RLHF. It classifies them by where in the process they occur and by how tractable they are to solve.
The paper highlights the limitations of RLHF and argues that multiple layers of safety measures are needed. It comes at this from an AI Alignment perspective.
Paper club discussion
Structure of Review
Intro: this paper looks at the problems in RLHF. I hope you will come away with a better understanding of RLHF, both the current state of the art and how it may improve in the future
Review: cover RLHF diagram on page 3, and background in section 2
Outline: the bullet points at the bottom of page 2
- Especially note the colored terms in bullet point 1: the two classification schemes they use
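Before walking through the diagram, a minimal sketch of the reward-model step it depicts may help. This is not the paper's code; the scores are invented. It shows the Bradley-Terry preference loss commonly used to train a reward model from a human's chosen/rejected comparison:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the human-chosen completion wins,
    given scalar reward-model scores for the two completions."""
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    p_chosen = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -math.log(p_chosen)

# The loss shrinks as the reward model ranks the chosen completion higher.
print(preference_loss(2.0, 0.0))  # reward model agrees with the labeler
print(preference_loss(0.0, 2.0))  # reward model disagrees: larger loss
```

Minimizing this loss over many labeled pairs is what turns sparse human comparisons into the scalar reward signal the policy is then trained against.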
Sections:
Challenges with Obtaining Human Feedback (3.1)
- Misaligned humans (tractable)
- Good oversight is difficult (tractable, except that good evaluation is inherently hard and humans can be tricked)
- Data quality (hard to get lots of good data)
- Limitations of feedback types (good review of the pros and cons of different types of feedback)
Challenges with the Reward Model (3.2)
- Problem misspecification (single scalar reward functions are inadequate)
- Reward Misgeneralization and Hacking
- Evaluating Reward Models
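Reward hacking is easiest to see with a toy example (invented here, not from the paper): if the learned reward model latches onto a proxy like answer length, optimizing against it drifts away from the true objective.

```python
# Two candidate completions with a hidden "true quality" the reward
# model never sees, plus a surface feature (length) it can latch onto.
candidates = {
    "short correct answer": {"true_quality": 1.0, "length": 3},
    "long rambling answer full of filler words and repetition": {
        "true_quality": 0.2, "length": 9},
}

def proxy_reward(info):
    # Misspecified reward model: mistakes verbosity for quality.
    return info["length"]

# The policy picks whatever the proxy reward scores highest.
best = max(candidates, key=lambda c: proxy_reward(candidates[c]))
print(best)                              # the rambling answer wins
print(candidates[best]["true_quality"])  # but its true quality is low
```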
Challenges with the Policy (3.3)
- Robust Reinforcement Learning is Difficult
- Policy Misgeneralization (model may misbehave when out of domain)
- Distributional Challenges (for example, if sounding confident and producing correct answers are correlated in the base model, the reward model will learn that sounding confident is good and reinforce this in the policy)
- Challenges with Jointly Training the Reward Model and Policy (Joint training induces distribution shifts)
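One standard way RLHF practice tries to limit this distribution shift is a KL penalty against the reference (pre-RL) policy, so the objective is reward minus beta times KL divergence. A small numeric sketch (beta and the toy token distributions are invented):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference_policy = [0.5, 0.3, 0.2]  # base model's token distribution
tuned_policy     = [0.8, 0.1, 0.1]  # policy after reward optimization
reward, beta = 1.5, 0.2

# KL-regularized objective: chasing reward is taxed by how far the
# tuned policy has drifted from the reference policy.
objective = reward - beta * kl_divergence(tuned_policy, reference_policy)
print(objective)
```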
Incorporating RLHF into a Broader Framework for Safer AI (4)
- “Many of the open questions with RLHF involve the dynamics at play between humans and AI”
- “AI alignment must address not only individuals’ perspectives, but also the norms, expectations, and values of affected groups”
- “While RLHF seems to improve the average performance of a system, it is not clear what effects it has on worst-case behavior. It was not designed to make systems adversarially robust”
- Solutions

- Addressing Challenges with Human Feedback
- Strategies for getting better feedback
- Addressing Challenges with the Reward Model
- Tips for doing RL, like using multiple objectives
- Addressing Challenges with the Policy
- Such as aligning during pretraining, or creating a lot of good example data
- RLHF is Not All You Need: Complementary Strategies for Safety
- They recommend also doing interpretability work
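The multiple-objectives suggestion above can be sketched in a few lines (the reward names and values are invented for illustration): aggregating conservatively, e.g. with min(), means the policy cannot win by maximizing one objective while tanking another, which a single weighted scalar would allow.

```python
def combined_reward(helpfulness: float, harmlessness: float) -> float:
    # min() is a conservative aggregator: the policy is only rewarded
    # for doing well on *all* objectives at once.
    return min(helpfulness, harmlessness)

print(combined_reward(0.9, 0.2))  # unsafe but helpful -> low reward
print(combined_reward(0.7, 0.8))  # balanced -> higher reward
```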
Governance and Transparency
- They want AI companies to be transparent, as a policy preference

Discussion:
- General questions
- How RLHF works
- What we’re doing with RLHF
- Compare and contrast the 3 main parts of RLHF that they discuss
- Deeper dives into any particular sections
Notes

- Discusses problems that come up with RLHF
- Structured as a list of potential problems
- Classification schemes
- Solvable?
- “Tractable”
- “Fundamental”
- Location of challenge:
- Human feedback
- Reward model
- RL Policy
Questions