§ feed · storyline
Improving Model Safety Behavior with Rule-Based Rewards
OpenAI develops a Rule-Based Rewards (RBR) method that aligns model safety behavior without requiring extensive human data collection.
We've developed and applied a new method leveraging Rule-Based Rewards (RBRs) that aligns models to behave safely without extensive human data collection.
§ sources · 1 publication · timeline below