Brief Research Overview
You can find a full overview of my papers on my Google Scholar profile.
Understanding Model Character, Psychology & Capabilities
- "Persona Vectors: Monitoring and Controlling Character Traits in Language Models" R Chen, A Arditi, H Sleight, O Evans, J Lindsey. arXiv:2507.21509 (2025)
- "Looking Inward: Language Models Can Learn About Themselves by Introspection" FJ Binder, J Chua, T Korbak, H Sleight, J Hughes, R Long, E Perez, et al. arXiv:2410.13787 (2024)
- "Unsupervised Elicitation of Language Models" J Wen, Z Ankner, A Somani, P Hase, S Marks, J Goldman-Wetzler, H Sleight, et al. arXiv:2506.10139 (2025)
- "Quantifying Elicitation of Latent Capabilities in Language Models" E Donoway, H Joren, A Somani, H Sleight, J Michael, MR DeWeese, et al. NeurIPS 2026
- "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data" M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, et al. arXiv:2404.01413 (2024)
- "Inverse Scaling in Test-Time Compute" AP Gema, A Hägele, R Chen, A Arditi, J Goldman-Wetzler, K Fraser-Taliente, H Sleight, et al. arXiv:2507.14417 (2025)
- "Stress-Testing Model Specs Reveals Character Differences among Language Models" J Zhang, H Sleight, A Peng, J Schulman, E Durmus. arXiv:2510.07686 (2025)
- "The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models" D Ensign, H Sleight, K Fish. arXiv:2509.04781 (2025)
- "Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment" N Wichers, A Ebtekar, A Azarbal, V Gillioz, C Ye, E Ryd, N Rathi, H Sleight, et al. arXiv:2510.05024 (2025)
Adversarial Robustness & Jailbreaking
- "Best-of-n Jailbreaking" J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, et al. arXiv:2412.03556 (2024)
- "Rapid Response: Mitigating LLM Jailbreaks with a Few Examples" A Peng, J Michael, H Sleight, E Perez, M Sharma. arXiv:2411.07494 (2024)
- "Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach" TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, et al. arXiv:2412.02159 (2024)
- "When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?" R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, et al. arXiv:2407.15211 (2024)
- "Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs" A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, et al. arXiv:2407.15549 (2024)
- "Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks" J Youstra, M Mahfoud, Y Yan, H Sleight, E Perez, M Sharma. arXiv:2508.17158 (2025)
AI Control
- "SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents" J Kutasov, Y Sun, P Colognese, T van der Weij, L Petrini, CBC Zhang, H Sleight, et al. (2025)
- "Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats" J Wen, V Hebbar, C Larson, A Bhatt, A Radhakrishnan, M Sharma, H Sleight, et al. arXiv:2411.17693 (2024)
- "Believe It or Not: How Deeply do LLMs Believe Implanted Facts?" S Slocum, J Minder, C Dumas, H Sleight, R Greenblatt, S Marks, R Wang. arXiv:2510.17941 (2025)
- "Evaluating Control Protocols for Untrusted AI Agents" J Kutasov, C Loughridge, Y Sun, H Sleight, B Shlegeris, T Tracy, J Benton. arXiv:2511.02997 (2025)
- "All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language" S Guo, H Sleight, F Roger. arXiv:2510.09714 (2025)