 Fazl Barez

Fazl Barez BA PhD MSc

Dr

Senior Researcher

Biography

Fazl Barez is a Senior Researcher at the University of Oxford, where he works on AI safety, interpretability, and technical AI governance. His research focuses on understanding the internal mechanisms of neural networks and developing methods to make advanced AI systems more transparent, reliable, and controllable.

His work examines how to identify the causes of harmful or unexpected model behaviour, verify whether a model’s reasoning faithfully reflects its underlying computations, and translate interpretability research into actionable oversight and governance mechanisms. He also studies the broader societal impacts of advanced AI systems, including questions of human agency, institutional control, and accountability. 

At Oxford, he teaches the AI Safety and Alignment course and supervises students working across AI safety and machine learning systems. He collaborates with leading industry research labs and is affiliated with the Centre for the Study of Existential Risk at the University of Cambridge, ELLIS, and other international research centres.

Most Recent Publications

Beyond alignment: Why robotic foundation models need context-aware safety.

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Curveball Steering: The Right Direction To Steer Isn't Always Linear

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation


Research Interests

Technical AI safety and alignment

Mechanistic interpretability and neural network analysis

Internal representations and reasoning in large language models

Detection and mitigation of harmful or deceptive model behaviour

Verification and evaluation of model reasoning processes

Robust removal of dangerous capabilities from AI systems

Automated interpretability and scalable oversight

Technical AI governance and AI auditing methods

AI evaluation, assurance, and accountability frameworks

Human agency and societal impacts of advanced AI systems

Reliability and controllability of foundation models

Intersections between neuroscience, cognition, and machine learning

Current Research Projects

Mechanistic Interpretability for Large Language Models — Developing methods to identify and analyse the internal computational mechanisms underlying model behaviour and reasoning.

Scalable Oversight and Automated Verification — Building systems that can automatically trace, verify, and evaluate model reasoning processes at scale.

Capability Removal and Safety Interventions — Investigating methods for reliably removing dangerous or undesirable capabilities from neural networks without degrading general performance.

Technical Governance for Advanced AI Systems — Translating interpretability and safety research into practical frameworks for auditing, verification, and regulatory oversight.

Deception, Misalignment, and Hidden Objectives in AI Systems — Studying how deceptive or strategically misaligned behaviour emerges in advanced models and how it can be detected.

Human Agency and Societal Impacts of AI — Examining how increasingly capable AI systems affect human decision-making, institutional power, and societal autonomy.

Neuroscience-Inspired Approaches to Understanding Intelligence — Exploring connections between biological cognition and machine learning to better understand the foundations of intelligence.

Related Academics

Collaborators across the University of Oxford, the Centre for the Study of Existential Risk at the University of Cambridge, the School of Informatics at the University of Edinburgh, the NTU Digital Trust Centre, ELLIS, and leading industry AI research labs and universities.
