StoryNote
[R] preference learning: RLHF, best-of-n sampling (BoN), or direct preference optimization (DPO)?
by /u/gwern in /r/reinforcementlearning
Upvotes: 2