Biography
Fazl Barez is a Senior Researcher at the University of Oxford, where he works on AI safety, interpretability, and technical AI governance. His research focuses on understanding the internal mechanisms of neural networks and developing methods to make advanced AI systems more transparent, reliable, and controllable.
His work examines how to identify the causes of harmful or unexpected model behaviour, verify whether a model’s reasoning faithfully reflects its underlying computations, and translate interpretability research into actionable oversight and governance mechanisms. He also studies the broader societal impacts of advanced AI systems, including questions of human agency, institutional control, and accountability.
At Oxford, he teaches the AI Safety and Alignment course and supervises students working across AI safety and machine learning systems. He collaborates with leading industry research labs and is affiliated with the Centre for the Study of Existential Risk at the University of Cambridge, ELLIS, and other international research centres.
Research Interests
Technical AI safety and alignment
Mechanistic interpretability and neural network analysis
Internal representations and reasoning in large language models
Detection and mitigation of harmful or deceptive model behaviour
Verification and evaluation of model reasoning processes
Robust removal of dangerous capabilities from AI systems
Automated interpretability and scalable oversight
Technical AI governance and AI auditing methods
AI evaluation, assurance, and accountability frameworks
Human agency and societal impacts of advanced AI systems
Reliability and controllability of foundation models
Intersections between neuroscience, cognition, and machine learning
Current Research Projects
Mechanistic Interpretability for Large Language Models — Developing methods to identify and analyse the internal computational mechanisms underlying model behaviour and reasoning.
Scalable Oversight and Automated Verification — Building systems that can automatically trace, verify, and evaluate model reasoning processes at scale.
Capability Removal and Safety Interventions — Investigating methods for reliably removing dangerous or undesirable capabilities from neural networks without degrading general performance.
Technical Governance for Advanced AI Systems — Translating interpretability and safety research into practical frameworks for auditing, verification, and regulatory oversight.
Deception, Misalignment, and Hidden Objectives in AI Systems — Studying how deceptive or strategically misaligned behaviour emerges in advanced models and how it can be detected.
Human Agency and Societal Impacts of AI — Examining how increasingly capable AI systems affect human decision-making, institutional power, and societal autonomy.
Neuroscience-Inspired Approaches to Understanding Intelligence — Exploring connections between biological cognition and machine learning to better understand the foundations of intelligence.
Related Academics
Collaborators across the University of Oxford, the Centre for the Study of Existential Risk at the University of Cambridge, the School of Informatics at the University of Edinburgh, the NTU Digital Trust Centre, ELLIS, and leading industry AI research labs and universities.