arxiv:2412.14093
Buck Shlegeris
bshlgrs
AI & ML interests
None yet
Recent Activity
authored a paper 24 days ago
Alignment faking in large language models
authored a paper 12 months ago
AI Control: Improving Safety Despite Intentional Subversion
authored a paper 12 months ago
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Organizations
models
3
datasets
None public yet