I’m presenting this one at the ‣ ‣ .
It turned out to be a pretty crap paper.
Summary
- Starts with a strong conclusion
- Wants to improve on RLHF
- Claims instruction relabeling has not been applied to LLMs before
- Their approach is to rewrite the instructions that go along with queries, using offline relabeling based on what the model actually generated
Points
- It presents an alternative to RLHF
- “Hindsight Instruction Relabeling” is what they call their technique
- Sets up the problem as “Goal reaching”
- Two-stage algorithm
- They present results comparing against RLHF and finetuning baselines
- LLMs can have unexpected behavior when given instructions, such as making stuff up or being rude
- They want to be able to learn from both successes and failures
- They relabel the instructions based on the output, so a failed generation still becomes a valid example for some other instruction
- Two phases
- First, generate outputs (the online sampling phase)
- Second, relabel the instructions to match those outputs and finetune on the relabeled data (the offline relabeling phase)
- It alternates between these two phases until convergence (see the sketch after this list)
- Like in Algorithm Distillation
- ChatGPT builds on InstructGPT-style RLHF, in which humans write challenging prompts and rank model outputs, and a reward model is trained from those rankings (toy loss sketch below)
- It uses Instruction + Query pairs
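Since the notes above are terse, here is a minimal, hypothetical sketch of the two-phase loop as I understand it. `sample_output`, `score`, `relabel_instruction`, and `finetune` are stand-ins I made up; the real method samples from the policy LLM, scores with task feedback, and does supervised finetuning on the relabeled instruction + query + output data.

```python
import random

def sample_output(instruction: str, query: str) -> str:
    """Stand-in for the policy LLM's generation in the online sampling phase."""
    return random.choice(["correct answer", "wrong answer"])

def score(output: str) -> float:
    """Stand-in for task feedback: 1.0 if the output solves the query, else 0.0."""
    return 1.0 if output == "correct answer" else 0.0

def relabel_instruction(instruction: str, output: str, reward: float) -> str:
    """Hindsight relabeling: keep the instruction if the output satisfied it,
    otherwise swap in an instruction that the output does satisfy."""
    return instruction if reward > 0 else "Give a wrong answer to the question."

def finetune(examples) -> None:
    """Stand-in for offline supervised finetuning on the relabeled triples."""
    print(f"finetuning on {len(examples)} relabeled examples")

prompts = [("Answer the question correctly.", f"query {i}") for i in range(8)]

for epoch in range(3):                          # alternate until convergence
    replay = []
    for instruction, query in prompts:          # phase 1: online sampling
        output = sample_output(instruction, query)
        reward = score(output)
        relabeled = relabel_instruction(instruction, output, reward)
        replay.append((relabeled, query, output))
    finetune(replay)                            # phase 2: offline relabeling + SFT
```

The point of the relabeling is that even the "failures" go into the finetuning set, just paired with an instruction they actually satisfy, which is how it learns from both successes and failures.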
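For the RLHF contrast, here is a toy illustration (not InstructGPT's actual code) of the pairwise loss a reward model is typically trained with before the PPO step. The paper's pitch, as I read it, is that HIR skips this reward-model stage entirely and just uses a supervised loss on relabeled data.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): pushes the reward model to score
    the human-preferred completion above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# made-up reward-model scores for (chosen, rejected) completion pairs
comparisons = [(1.2, -0.3), (0.4, 0.9), (2.0, 1.5)]
print(sum(pairwise_loss(c, r) for c, r in comparisons) / len(comparisons))
```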