Brief Research Overview
You can find a full overview of my papers on my Google Scholar profile.
Understanding Model Psychology & Capabilities
- "Looking Inward: Language Models Can Learn About Themselves by Introspection" FJ Binder, J Chua, T Korbak, H Sleight, J Hughes, R Long, E Perez, et al. arXiv:2410.13787 (2024)
- "Unsupervised Elicitation of Language Models" J Wen, Z Ankner, A Somani, P Hase, S Marks, J Goldman-Wetzler, H Sleight, et al. arXiv:2506.10139 (2025)
- "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data" M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, et al. arXiv:2404.01413 (2024)
Adversarial Robustness & Jailbreaking
- "Best-of-N Jailbreaking" J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, et al. arXiv:2412.03556 (2024)
- "Rapid Response: Mitigating LLM Jailbreaks with a Few Examples" A Peng, J Michael, H Sleight, E Perez, M Sharma arXiv:2411.07494 (2024)
- "Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach" TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, et al. arXiv:2412.02159 (2024)
- "When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?" R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, et al. arXiv:2407.15211 (2024)
- "Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs" A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, et al. arXiv:2407.15549 (2024)
AI Control
- "SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents" J Kutasov, Y Sun, P Colognese, T van der Weij, L Petrini, CBC Zhang, H Sleight, et al. (2025)
- "Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats" J Wen, V Hebbar, C Larson, A Bhatt, A Radhakrishnan, M Sharma, H Sleight, et al. arXiv:2411.17693 (2024)