Publications

Beyond alignment: Why robotic foundation models need context-aware safety.

Robey A, Ravichandran Z, Jones EK, Perlo J, Barez F et al. (2026), Science Robotics, 11(113), eaef2191

BibTeX

@article{beyondalignment-2026/4,
  title={Beyond alignment: Why robotic foundation models need context-aware safety},
  author={Robey, A and Ravichandran, Z and Jones, EK and Perlo, J and Barez, F and others},
  journal={Science Robotics},
  volume={11},
  number={113},
  pages={eaef2191},
  year = "2026"
}

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Oldfield J, Torr P, Patras I, Bibi A & Barez F (2026)

BibTeX

@misc{beyondlinearpro-2026/4,
  title={Beyond Linear Probes: Dynamic Safety Monitoring for Language Models},
  author={Oldfield, J and Torr, P and Patras, I and Bibi, A and Barez, F},
  year = "2026"
}

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Oozeer N, Marks L, Jain S, Barez F & Abdullah A (2026)

BibTeX

@misc{beyondlinearste-2026/4,
  title={Beyond Linear Steering: Unified Multi-Attribute Control for Language Models},
  author={Oozeer, N and Marks, L and Jain, S and Barez, F and Abdullah, A},
  year = "2026"
}

Curveball Steering: The Right Direction To Steer Isn't Always Linear

Raval S, Song HJ, Wu L, Harrasse A, Phillips JM et al. (2026)

BibTeX

@misc{curveballsteeri-2026/3,
  title={Curveball Steering: The Right Direction To Steer Isn't Always Linear},
  author={Raval, S and Song, HJ and Wu, L and Harrasse, A and Phillips, JM and others},
  year = "2026"
}

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

Li C, Lu P, Pan X, Barez F & Yang M (2026)

BibTeX

@misc{autocontrolaren-2026/3,
  title={AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation},
  author={Li, C and Lu, P and Pan, X and Barez, F and Yang, M},
  year = "2026"
}

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Schrodi S, Kempf E, Barez F & Brox T (2026)

BibTeX

@misc{towardsundersta-2026/3,
  title={Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer},
  author={Schrodi, S and Kempf, E and Barez, F and Brox, T},
  year = "2026"
}

Token Taxes: mitigating AGI's economic risks

Irwin L, Wu T-Y & Barez F (2026)

BibTeX

@misc{tokentaxesmitig-2026/3,
  title={Token Taxes: mitigating AGI's economic risks},
  author={Irwin, L and Wu, T-Y and Barez, F},
  year = "2026"
}

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

Simhi A, Barez F, Tutek M, Belinkov Y & Cohen SB (2026)

BibTeX

@misc{oldhabitsdiehar-2026/2,
  title={Old Habits Die Hard: How Conversational History Geometrically Traps LLMs},
  author={Simhi, A and Barez, F and Tutek, M and Belinkov, Y and Cohen, SB},
  year = "2026"
}

Same Answer, Different Representations: Hidden instability in VLMs

Wani FA, Suglia A, Saxena R, Gema AP, Kwan W-C et al. (2026)

BibTeX

@misc{sameanswerdiffe-2026/2,
  title={Same Answer, Different Representations: Hidden instability in VLMs},
  author={Wani, FA and Suglia, A and Saxena, R and Gema, AP and Kwan, W-C and others},
  year = "2026"
}

Chain-of-Thought Hijacking

Zhao J, Fu T, Schaeffer R, Sharma M & Barez F (2026)

BibTeX

@misc{chainofthoughth-2026/2,
  title={Chain-of-Thought Hijacking},
  author={Zhao, J and Fu, T and Schaeffer, R and Sharma, M and Barez, F},
  year = "2026"
}

VAL-Bench: Belief Consistency as a measure for Value Alignment in Language Models

Gupta A, O'Shea D & Barez F (2026)

BibTeX

@misc{valbenchbeliefc-2026/1,
  title={VAL-Bench: Belief Consistency as a measure for Value Alignment in Language Models},
  author={Gupta, A and O'Shea, D and Barez, F},
  year = "2026"
}

Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value

Edelman J, Zhi-Xuan T, Lowe R, Klingefjord O, Wang-Mascianica V et al. (2025)

BibTeX

@misc{fullstackalignm-2025/12,
  title={Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value},
  author={Edelman, J and Zhi-Xuan, T and Lowe, R and Klingefjord, O and Wang-Mascianica, V and others},
  year = "2025"
}

Precise In-Parameter Concept Erasure in Large Language Models

Gur-Arieh Y, Suslik C, Hong Y, Barez F & Geva M (2025)

BibTeX

@misc{preciseinparame-2025/10,
  title={Precise In-Parameter Concept Erasure in Large Language Models},
  author={Gur-Arieh, Y and Suslik, C and Hong, Y and Barez, F and Geva, M},
  year = "2025"
}

HACK: Hallucinations Along Certainty and Knowledge Axes

Simhi A, Herzig J, Itzhak I, Arad D, Gekhman Z et al. (2025)

BibTeX

@misc{hackhallucinati-2025/10,
  title={HACK: Hallucinations Along Certainty and Knowledge Axes},
  author={Simhi, A and Herzig, J and Itzhak, I and Arad, D and Gekhman, Z and others},
  year = "2025"
}

Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models

Quirke P, Oozeer N, Bandi C, Abdullah A, Hoelscher-Obermaier J et al. (2025)

BibTeX

@misc{beyondmonoliths-2025/10,
  title={Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models},
  author={Quirke, P and Oozeer, N and Bandi, C and Abdullah, A and Hoelscher-Obermaier, J and others},
  year = "2025"
}

Understanding Addition and Subtraction in Transformers

Quirke P, Neo C & Barez F (2025)

BibTeX

@misc{understandingad-2025/10,
  title={Understanding Addition and Subtraction in Transformers},
  author={Quirke, P and Neo, C and Barez, F},
  year = "2025"
}

Query Circuits: Explaining How Language Models Answer User Prompts

Wu T-Y & Barez F (2025)

BibTeX

@misc{querycircuitsex-2025/9,
  title={Query Circuits: Explaining How Language Models Answer User Prompts},
  author={Wu, T-Y and Barez, F},
  year = "2025"
}

Interpreting Learned Feedback Patterns in Large Language Models

Marks L, Abdullah A, Neo C, Arike R, Krueger D et al. (2025)

BibTeX

@misc{interpretinglea-2025/9,
  title={Interpreting Learned Feedback Patterns in Large Language Models},
  author={Marks, L and Abdullah, A and Neo, C and Arike, R and Krueger, D and others},
  year = "2025"
}

Do Sparse Autoencoders Generalize? A Case Study of Answerability

Heindrich L, Torr P, Barez F & Thost V (2025)

BibTeX

@misc{dosparseautoenc-2025/9,
  title={Do Sparse Autoencoders Generalize? A Case Study of Answerability},
  author={Heindrich, L and Torr, P and Barez, F and Thost, V},
  year = "2025"
}

Embodied AI: Emerging Risks and Opportunities for Policy Action

Perlo J, Robey A, Barez F, Floridi L & Mökander J (2025)

BibTeX

@misc{embodiedaiemerg-2025/9,
  title={Embodied AI: Emerging Risks and Opportunities for Policy Action},
  author={Perlo, J and Robey, A and Barez, F and Floridi, L and Mökander, J},
  year = "2025"
}

Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer

Simhi A, Itzhak I, Barez F, Stanovsky G & Belinkov Y (2025)

BibTeX

@misc{trustmeimwrongl-2025/8,
  title={Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer},
  author={Simhi, A and Itzhak, I and Barez, F and Stanovsky, G and Belinkov, Y},
  year = "2025"
}

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Kim M, Kwak JM, Alssum L, Ghanem B, Torr P et al. (2025)

BibTeX

@misc{rethinkingsafet-2025/8,
  title={Rethinking Safety in LLM Fine-tuning: An Optimization Perspective},
  author={Kim, M and Kwak, JM and Alssum, L and Ghanem, B and Torr, P and others},
  year = "2025"
}

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Zhu Y, Jin T, Pruksachatkun Y, Zhang A, Liu S et al. (2025)

BibTeX

@misc{establishingbes-2025/8,
  title={Establishing Best Practices for Building Rigorous Agentic Benchmarks},
  author={Zhu, Y and Jin, T and Pruksachatkun, Y and Zhang, A and Liu, S and others},
  year = "2025"
}

The Singapore Consensus on Global AI Safety Research Priorities

Bengio Y, Maharaj T, Ong L, Russell S, Song D et al. (2025)

BibTeX

@misc{thesingaporecon-2025/6,
  title={The Singapore Consensus on Global AI Safety Research Priorities},
  author={Bengio, Y and Maharaj, T and Ong, L and Russell, S and Song, D and others},
  year = "2025"
}

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Fu T, Sharma M, Torr P, Cohen SB, Krueger D et al. (2025)

BibTeX

@misc{poisonbenchasse-2025/6,
  title={PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning},
  author={Fu, T and Sharma, M and Torr, P and Cohen, SB and Krueger, D and others},
  year = "2025"
}

Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders

Lan M, Torr P, Meek A, Khakzar A, Krueger D et al. (2025)

BibTeX

@misc{quantifyingfeat-2025/5,
  title={Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders},
  author={Lan, M and Torr, P and Meek, A and Khakzar, A and Krueger, D and others},
  year = "2025"
}

SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

Chaudhary M & Barez F (2025)

BibTeX

@misc{safetynetdetect-2025/5,
  title={SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors},
  author={Chaudhary, M and Barez, F},
  year = "2025"
}

In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?

Bucknall B, Siddiqui S, Thurnherr L, McGurk C, Harack B et al. (2025)

BibTeX

@misc{inwhichareasoft-2025/4,
  title={In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?},
  author={Bucknall, B and Siddiqui, S and Thurnherr, L and McGurk, C and Harack, B and others},
  year = "2025"
}

Towards Interpreting Visual Information Processing in Vision-Language Models

Neo C, Ong L, Torr P, Geva M, Krueger D et al. (2025)

BibTeX

@misc{towardsinterpre-2025/4,
  title={Towards Interpreting Visual Information Processing in Vision-Language Models},
  author={Neo, C and Ong, L and Torr, P and Geva, M and Krueger, D and others},
  year = "2025"
}

Scaling sparse feature circuit finding for in-context learning

Kharlapenko D, Shabalin S, Barez F, Conmy A & Nanda N (2025)

BibTeX

@misc{scalingsparsefe-2025/4,
  title={Scaling sparse feature circuit finding for in-context learning},
  author={Kharlapenko, D and Shabalin, S and Barez, F and Conmy, A and Nanda, N},
  year = "2025"
}

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

Ghosh S, Frase H, Williams A, Luger S, Röttger P et al. (2025)

BibTeX

@misc{ailuminateintro-2025/4,
  title={AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons},
  author={Ghosh, S and Frase, H and Williams, A and Luger, S and Röttger, P and others},
  year = "2025"
}

Rethinking AI Cultural Alignment

Bravansky M, Trhlik F & Barez F (2025)

BibTeX

@misc{rethinkingaicul-2025/3,
  title={Rethinking AI Cultural Alignment},
  author={Bravansky, M and Trhlik, F and Barez, F},
  year = "2025"
}

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

Fu T & Barez F (2025)

BibTeX

@misc{samequestiondif-2025/3,
  title={Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness},
  author={Fu, T and Barez, F},
  year = "2025"
}

Towards interpreting visual information processing in vision-language models

Neo C, Ong L, Torr P, Geva M, Krueger D et al. (2025), Proceedings of the 13th International Conference on Learning Representations (ICLR 2025)

BibTeX

@inproceedings{towardsinterpre-2025/1,
  title={Towards interpreting visual information processing in vision-language models},
  author={Neo, C and Ong, L and Torr, P and Geva, M and Krueger, D and others},
  booktitle={13th International Conference on Learning Representations (ICLR 2025)},
  year = "2025"
}

Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer

Simhi A, Itzhak I, Barez F, Stanovsky G & Belinkov Y (2025), 14665-14688

BibTeX

@inproceedings{trustmeimwrongl-2025/1,
  title={Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer},
  author={Simhi, A and Itzhak, I and Barez, F and Stanovsky, G and Belinkov, Y},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
  pages={14665--14688},
  year = "2025"
}

POISONBENCH: Assessing Language Model Vulnerability to Poisoned Preference Data

Fu T, Sharma M, Torr P, Cohen S, Krueger D et al. (2025), Proceedings of Machine Learning Research, 267, 17806-17831

BibTeX

@inproceedings{poisonbenchasse-2025/1,
  title={POISONBENCH: Assessing Language Model Vulnerability to Poisoned Preference Data},
  author={Fu, T and Sharma, M and Torr, P and Cohen, S and Krueger, D and others},
  series={Proceedings of Machine Learning Research},
  volume={267},
  pages={17806--17831},
  year = "2025"
}

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Oozeer NF, Marks L, Barez F & Abdullah A (2025), 23513-23557

BibTeX

@inproceedings{beyondlinearste-2025/1,
  title={Beyond Linear Steering: Unified Multi-Attribute Control for Language Models},
  author={Oozeer, NF and Marks, L and Barez, F and Abdullah, A},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
  pages={23513--23557},
  year = "2025"
}

Open Problems in Machine Unlearning for AI Safety

Barez F, Fu T, Prabhu A, Casper S, Sanyal A et al. (2025)

BibTeX

@misc{openproblemsinm-2025/1,
  title={Open Problems in Machine Unlearning for AI Safety},
  author={Barez, F and Fu, T and Prabhu, A and Casper, S and Sanyal, A and others},
  year = "2025"
}

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

Fu T & Barez F (2025), 31293-31307

BibTeX

@inproceedings{samequestiondif-2025/1,
  title={Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness},
  author={Fu, T and Barez, F},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={31293--31307},
  year = "2025"
}

Precise In-Parameter Concept Erasure in Large Language Models

Gur-Arieh Y, Suslik CH, Hong Y, Barez F & Geva M (2025), 18997-19017

BibTeX

@inproceedings{preciseinparame-2025/1,
  title={Precise In-Parameter Concept Erasure in Large Language Models},
  author={Gur-Arieh, Y and Suslik, CH and Hong, Y and Barez, F and Geva, M},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={18997--19017},
  year = "2025"
}

Best-of-N Jailbreaking

Hughes J, Price S, Lynch A, Schaeffer R, Barez F et al. (2024)

BibTeX

@misc{bestofnjailbrea-2024/12,
  title={Best-of-N Jailbreaking},
  author={Hughes, J and Price, S and Lynch, A and Schaeffer, R and Barez, F and others},
  year = "2024"
}

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Wang TT, Hughes J, Sleight H, Schaeffer R, Agrawal R et al. (2024)

BibTeX

@misc{jailbreakdefens-2024/12,
  title={Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach},
  author={Wang, TT and Hughes, J and Sleight, H and Schaeffer, R and Agrawal, R and others},
  year = "2024"
}

Towards interpretable sequence continuation: analyzing shared circuits in large language models

Lan M, Torr P & Barez F (2024), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 12576-12601

BibTeX

@inproceedings{towardsinterpre-2024/11,
  title={Towards interpretable sequence continuation: analyzing shared circuits in large language models},
  author={Lan, M and Torr, P and Barez, F},
  booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)},
  pages={12576--12601},
  year = "2024"
}

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

Marks L, Paren A, Krueger D & Barez F (2024)

BibTeX

@misc{enhancingneural-2024/11,
  title={Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders},
  author={Marks, L and Paren, A and Krueger, D and Barez, F},
  year = "2024"
}

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Neo C, Cohen SB & Barez F (2024)

BibTeX

@misc{interpretingcon-2024/10,
  title={Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions},
  author={Neo, C and Cohen, SB and Barez, F},
  year = "2024"
}

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

Lan M, Torr P & Barez F (2024)

BibTeX

@misc{towardsinterpre-2024/10,
  title={Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models},
  author={Lan, M and Torr, P and Barez, F},
  year = "2024"
}

Interpreting learned feedback patterns in large language models

Marks L, Abdullah A, Neo C, Arike R, Krueger D et al. (2024), Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 37, 36541-36566

BibTeX

@inproceedings{interpretinglea-2024/9,
  title={Interpreting learned feedback patterns in large language models},
  author={Marks, L and Abdullah, A and Neo, C and Arike, R and Krueger, D and others},
  booktitle={38th Conference on Neural Information Processing Systems (NeurIPS 2024)},
  pages={36541--36566},
  year = "2024"
}

Position: near to mid-term risks and opportunities of open-source generative AI

Eiras F, Petrov A, Vidgen B, Schroeder de Witt C, Pizzati F et al. (2024), Proceedings of the 41st International Conference on Machine Learning, 12348-12370

BibTeX

@inproceedings{positionneartom-2024/7,
  title={Position: near to mid-term risks and opportunities of open-source generative AI},
  author={Eiras, F and Petrov, A and Vidgen, B and {Schroeder de Witt}, C and Pizzati, F and others},
  booktitle={41st International Conference on Machine Learning (ICML 2024)},
  pages={12348--12370},
  year = "2024"
}

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Denison C, MacDiarmid M, Barez F, Duvenaud D, Kravec S et al. (2024)

BibTeX

@misc{sycophancytosub-2024/6,
  title={Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models},
  author={Denison, C and MacDiarmid, M and Barez, F and Duvenaud, D and Kravec, S and others},
  year = "2024"
}

Risks and Opportunities of Open-Source Generative AI

Eiras F, Petrov A, Vidgen B, Schroeder C, Pizzati F et al. (2024)

BibTeX

@misc{risksandopportu-2024/5,
  title={Risks and Opportunities of Open-Source Generative AI},
  author={Eiras, F and Petrov, A and Vidgen, B and Schroeder, C and Pizzati, F and others},
  year = "2024"
}

Showing 50 publications by Fazl Barez
