Showing 50 publications by Fazl Barez
Beyond alignment: Why robotic foundation models need context-aware safety.
Robey A, Ravichandran Z, Jones EK, Perlo J, Barez F et al. (2026), Science Robotics, 11(113), eaef2191
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
Oldfield J, Torr P, Patras I, Bibi A & Barez F (2026)
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Oozeer N, Marks L, Jain S, Barez F & Abdullah A (2026)
Curveball Steering: The Right Direction To Steer Isn't Always Linear
Raval S, Song HJ, Wu L, Harrasse A, Phillips JM et al. (2026)
AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
Li C, Lu P, Pan X, Barez F & Yang M (2026)
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
Schrodi S, Kempf E, Barez F & Brox T (2026)
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
Simhi A, Barez F, Tutek M, Belinkov Y & Cohen SB (2026)
Same Answer, Different Representations: Hidden instability in VLMs
Wani FA, Suglia A, Saxena R, Gema AP, Kwan W-C et al. (2026)
Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value
Edelman J, Zhi-Xuan T, Lowe R, Klingefjord O, Wang-Mascianica V et al. (2025)
BibTeX
@misc{fullstackalignm-2025/12,
title={Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value},
author={Edelman, J. and Zhi-Xuan, T. and Lowe, R. and Klingefjord, O. and Wang-Mascianica, V. and others},
year = "2025"
}
Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models
Quirke P, Oozeer N, Bandi C, Abdullah A, Hoelscher-Obermaier J et al. (2025)
Interpreting Learned Feedback Patterns in Large Language Models
Marks L, Abdullah A, Neo C, Arike R, Krueger D et al. (2025)
Embodied AI: Emerging Risks and Opportunities for Policy Action
Perlo J, Robey A, Barez F, Floridi L & Mökander J (2025)
Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Simhi A, Itzhak I, Barez F, Stanovsky G & Belinkov Y (2025)
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Kim M, Kwak JM, Alssum L, Ghanem B, Torr P et al. (2025)
Establishing Best Practices for Building Rigorous Agentic Benchmarks
Zhu Y, Jin T, Pruksachatkun Y, Zhang A, Liu S et al. (2025)
The Singapore Consensus on Global AI Safety Research Priorities
Bengio Y, Maharaj T, Ong L, Russell S, Song D et al. (2025)
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Fu T, Sharma M, Torr P, Cohen SB, Krueger D et al. (2025)
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
Lan M, Torr P, Meek A, Khakzar A, Krueger D et al. (2025)
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
Chaudhary M & Barez F (2025)
In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
Bucknall B, Siddiqui S, Thurnherr L, McGurk C, Harack B et al. (2025)
BibTeX
@misc{inwhichareasoft-2025/4,
title={In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?},
author={Bucknall, B. and Siddiqui, S. and Thurnherr, L. and McGurk, C. and Harack, B. and others},
year = "2025"
}
Towards Interpreting Visual Information Processing in Vision-Language Models
Neo C, Ong L, Torr P, Geva M, Krueger D et al. (2025)
Scaling sparse feature circuit finding for in-context learning
Kharlapenko D, Shabalin S, Barez F, Conmy A & Nanda N (2025)
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Ghosh S, Frase H, Williams A, Luger S, Röttger P et al. (2025)
Towards interpreting visual information processing in vision-language models
Neo C, Ong L, Torr P, Geva M, Krueger D et al. (2025), Proceedings of the 13th International Conference on Learning Representations (ICLR 2025)
BibTeX
@inproceedings{towardsinterpre-2025/1,
title={Towards interpreting visual information processing in vision-language models},
author={Neo, C. and Ong, L. and Torr, P. and Geva, M. and Krueger, D. and others},
booktitle={13th International Conference on Learning Representations (ICLR 2025)},
year = "2025"
}
Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Simhi A, Itzhak I, Barez F, Stanovsky G & Belinkov Y (2025), 14665-14688
BibTeX
@inproceedings{trustmeimwrongl-2025/1,
title={Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer},
author={Simhi, A. and Itzhak, I. and Barez, F. and Stanovsky, G. and Belinkov, Y.},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
pages={14665--14688},
year = "2025"
}
POISONBENCH: Assessing Language Model Vulnerability to Poisoned Preference Data
Fu T, Sharma M, Torr P, Cohen S, Krueger D et al. (2025), Proceedings of Machine Learning Research, 267, 17806-17831
BibTeX
@inproceedings{poisonbenchasse-2025/1,
title={POISONBENCH: Assessing Language Model Vulnerability to Poisoned Preference Data},
author={Fu, T. and Sharma, M. and Torr, P. and Cohen, S. and Krueger, D. and others},
series={Proceedings of Machine Learning Research},
volume={267},
pages={17806--17831},
year = "2025"
}
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Oozeer NF, Marks L, Barez F & Abdullah A (2025), 23513-23557
BibTeX
@inproceedings{beyondlinearste-2025/1,
title={Beyond Linear Steering: Unified Multi-Attribute Control for Language Models},
author={Oozeer, N. F. and Marks, L. and Barez, F. and Abdullah, A.},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
pages={23513--23557},
year = "2025"
}
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Fu T & Barez F (2025), 31293-31307
BibTeX
@inproceedings{samequestiondif-2025/1,
title={Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness},
author={Fu, T. and Barez, F.},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={31293--31307},
year = "2025"
}
Precise In-Parameter Concept Erasure in Large Language Models
Gur-Arieh Y, Suslik CH, Hong Y, Barez F & Geva M (2025), 18997-19017
BibTeX
@inproceedings{preciseinparame-2025/1,
title={Precise In-Parameter Concept Erasure in Large Language Models},
author={Gur-Arieh, Y. and Suslik, C. H. and Hong, Y. and Barez, F. and Geva, M.},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={18997--19017},
year = "2025"
}
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Wang TT, Hughes J, Sleight H, Schaeffer R, Agrawal R et al. (2024)
Towards interpretable sequence continuation: analyzing shared circuits in large language models
Lan M, Torr P & Barez F (2024), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 12576-12601
BibTeX
@inproceedings{towardsinterpre-2024/11,
title={Towards interpretable sequence continuation: analyzing shared circuits in large language models},
author={Lan, M. and Torr, P. and Barez, F.},
booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)},
pages={12576--12601},
year = "2024"
}
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Marks L, Paren A, Krueger D & Barez F (2024)
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Neo C, Cohen SB & Barez F (2024)
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Lan M, Torr P & Barez F (2024)
Interpreting learned feedback patterns in large language models
Marks L, Abdullah A, Neo C, Arike R, Krueger D et al. (2024), Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 37, 36541-36566
BibTeX
@inproceedings{interpretinglea-2024/9,
title={Interpreting learned feedback patterns in large language models},
author={Marks, L. and Abdullah, A. and Neo, C. and Arike, R. and Krueger, D. and others},
booktitle={38th Conference on Neural Information Processing Systems (NeurIPS 2024)},
pages={36541--36566},
year = "2024"
}
Position: near to mid-term risks and opportunities of open-source generative AI
Eiras F, Petrov A, Vidgen B, Schroeder de Witt C, Pizzati F et al. (2024), Proceedings of the 41st International Conference on Machine Learning, 12348-12370
BibTeX
@inproceedings{positionneartom-2024/7,
title={Position: near to mid-term risks and opportunities of open-source generative AI},
author={Eiras, F. and Petrov, A. and Vidgen, B. and {Schroeder de Witt}, C. and Pizzati, F. and others},
booktitle={41st International Conference on Machine Learning (ICML 2024)},
pages={12348--12370},
year = "2024"
}
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Denison C, MacDiarmid M, Barez F, Duvenaud D, Kravec S et al. (2024)
Risks and Opportunities of Open-Source Generative AI
Eiras F, Petrov A, Vidgen B, Schroeder C, Pizzati F et al. (2024)