StoryNote
[R] preference learning: RLHF, best-of-n sampling (BoN), or direct preference optimization (DPO)?
by /u/gwern in /r/reinforcementlearning
Upvotes: 2