| Title | Score | Author |
| --- | --- | --- |
| John Schulman (PPO, OA co-founder, post-training/RLHF) leaves OpenAI for Anthropic | 72 | gwern |
| Sharing my JAX-based RL Algorithms Repository - Including BBF and TD7 Implementations | 23 | New_East832 |
| Why does the agent not learn to get to the cube position? | 16 | CoolestSlave |
| Since Offline RL is environment-independent, why are many paper implementations still based on gym? | 16 | Desperate_List4312 |
| Why does EfficientZero V2 work? | 12 | Automatic-Web8429 |
| "Pareto" in layman's terms? | 8 | WilhelmRedemption |
| Very Slow Environment - Should I pivot to Offline RL? | 8 | NoNeighborhood9302 |
| A New Survey -- Generative Models for Offline Policy Learning | 8 | Ashamed-Put-2344 |
| RLHF in LLMs: Variable action space? | 8 | No_Individual_7831 |
| Switching academic career path from ML to RL | 7 | SenecaEnjoyer69 |