Recommendations for Technical AI Safety Research Directions
Summary
Anthropic outlines open technical AI safety research directions, covering capability evaluation, alignment metrics, model cognition, AI control, scalable oversight, adversarial robustness, unlearning, and related areas.
Key quotes
We would be excited to see more high-quality evaluations for capabilities like these, alongside human baselines.
We are excited about work on improving the efficacy of behavioral monitoring.
Hence, we encourage research into new methods for unlearning information in LLMs.
The page is an Anthropic blog entry presenting a curated list of promising technical AI safety problems rather than a formal research agenda. It emphasizes directions the broader community could pursue, since Anthropic does not have the capacity to fund all of them.