Buck Shlegeris

bshlgrs

AI & ML interests

None yet

Recent Activity

authored a paper 24 days ago

Alignment faking in large language models

authored a paper 12 months ago

AI Control: Improving Safety Despite Intentional Subversion

authored a paper 12 months ago

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Organizations

Papers 4

arxiv:2412.14093

arxiv:2401.05566

arxiv:2312.06942

arxiv:2211.00593

Models 3

bshlgrs/autonlp-old-data-trained-10022181

Text Classification • Updated Sep 9, 2021 • 21

bshlgrs/autonlp-classification_with_all_labellers-9532137

Text Classification • Updated Sep 4, 2021 • 9

bshlgrs/autonlp-classification-9522090

Text Classification • Updated Sep 4, 2021 • 18

Datasets

None public yet