Adel Bibi

Senior Researcher in Machine Learning and R&D Distinguished Advisor

Biography

Adel Bibi is a Senior Researcher in Machine Learning and Computer Vision at the Department of Engineering Science at the University of Oxford, a Research Member of the Common Room and a former Junior Fellow at Kellogg College, and a member of the ELLIS Society. He is also an R&D Distinguished Advisor at SoftServe. Previously, he was a Senior Research Associate and postdoctoral researcher in the Torr Vision Group at Oxford working with Philip H.S. Torr. He obtained his MSc and PhD from KAUST in 2016 and 2020, respectively, under the supervision of Bernard Ghanem.

His research focuses on trustworthy and robust AI, agentic AI safety, and adversarial machine learning. His recent work on vulnerabilities in AI agents, transferable adversarial attacks, and AI safety evaluations has received broad international attention, including coverage by Scientific American, The Guardian, NBC News, Computerphile, TLDR News, Globo TV, and the Hugging Face Evaluation Guidebook. His work has received multiple paper and service distinctions, including four best/outstanding workshop paper awards (NeurIPS'23, ICML'23, CVPR'22, OBD'18), four outstanding reviewer awards (CVPR18, CVPR19, ICCV19, ICLR22), and a Notable Area Chair Award at NeurIPS 2023. He regularly serves as a Senior Area Chair and Area Chair for major AI conferences including NeurIPS, ICML, ICLR, etc. Bibi has authored more than 50 publications in top-tier machine learning and computer vision venues, including NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, AAAI, and TPAMI.

Download my resume

[Note!] I am always looking for strong self-motivated PhD students. If you are interested in AI Safety, Trustworthy, and Security of AI models and Agentic AI, reach out!

[Consulting Expertise] I have consulted in the past on projects spanning core machine learning and data science, computer vision, certification and AI safety, optimization formulations for matching and resource allocation problems, among other areas.

Background

Education

King Abdullah University of Science and Technology (KAUST)

PhD in Electrical Engineering (4.0/4.0)

2020

Machine Learning & Optimization Track

Thesis: Understanding a Block of Layers in Deep Neural Networks: Optimization, Probabilistic and Tropical Geometric Perspectives

MSc in Electrical Engineering (4.0/4.0)

2016

Computer Vision Track

Thesis: Advances in RGB and RGBD Generic Object Trackers

Kuwait University

BSc in Electrical Engineering (3.99/4.0)

2014

Thesis: Synthetic Artificial Intelligent Clothes

Appointments

University of Oxford

Senior Researcher

Mar 2023 – Present

Leading 3 PhD students and 3 postdocs on AI safety research.
Awarded Horizon Europe Grant, KAUST-CRG grant, UK AISI Safety Grant, Toyota Motor Europe Grant, and Google Gemma 2 Award ($2.152M total).

Senior Research Associate

Dec 2021 – Feb 2023

Co-supervised 5 PhD students on robustness and continual learning.
Amazon Research Award, ICLR 2022 Highlighted Reviewer, AAAI 2023 AC, ELLIS member.

Postdoctoral Research Assistant

Oct 2020 – Dec 2021

Junior Fellow, Kellogg College; KAUST-CRG grantee (~$210k).

SoftServe

R&D Distinguished Advisor

Aug 2024 – Present

Leading two research programs within R&D: (1) Agentic AI and (2) AI Security with 10+ engineers.

Ironclad AI

Executive Director

Aug 2025 – Present

AI Strategy, research, and product.

DESAISIV

Chief AI Officer

Feb 2023 – Present

Leading data scientists to develop AI technologies for insurance.

Intel Labs

PhD Research Intern

Jun – Nov 2018

Studying the theoretical connections between feed-forward layers and stochastic solvers.

Graduated PhD Students

Cornelius Emde

PhD, University of Oxford

Departed to Meta

Aleksandar Petrov

PhD, University of Oxford

Departed to Google DeepMind

Francisco Eiras

PhD, University of Oxford

Departed to DynamoAI

Csaba Botos

PhD, University of Oxford

Departed to a Hedge Fund in London

Ameya Prabhu

PhD, University of Oxford

Departed to the University of Tübingen

Graduated MSc Students

Sam Wang

MSc, University of Oxford, 2025

Jonathan Sneh

MSc, University of Oxford, 2025

Akshat Naik

MSc, University of Oxford, 2025

Thierry Blankenstein

MSc, University of Oxford, 2025

Preetham Arvind

MSc, University of Oxford, 2024

Luca Furtuna

MSc, University of Oxford, 2024

Enbo Lyu

MSc, University of Oxford, 2024

Ion Orins

MSc, University of Oxford, 2023

Tuhin Das

MSc, University of Oxford, 2023

Taras Rumezhak

MSc, University of Oxford, 2023

Visitors

Ziyan Wang

Visitor, King's College London, March 2026 - Present

Carlos Hinojosa

Visitor, KAUST, May 2026

Alisia Baielli

Visitor, University of Amsterdam, May 2026

Kumail Alhamoud

Visitor, MIT, May 2024 - September 2024

Lama Alssum

Visitor, KAUST, October 2024 - January 2025

Hasan Abed Al Kader Hammoud

Visitor, KAUST, July 2023 - November 2023

Wenxuan Zhang

Visitor, July 2023 - November 2023

Yasir Ghunaim

Visitor, KAUST, July 2023 - November 2023

Motasem Alfarra

Visitor, KAUST, October 2021 - March 2022

News

[July 9th, 2026]: Our paper LLMs Struggle to Rank Products Robustly received one of the first-ever MusIML Best Paper Awards at the ICML 2026 Workshop. Congratulations to Kumail Alhamoud!
[April 30th, 2026]: One papers accepted to ICML 2026.
[March 30th, 2026]: Shortlisted for Oxford University Vice-Chancellor’s Breakthrough Researcher Award: Recognizing researchers at the early stages of their careers who have made a significant impact at the University. Breakthrough Research in Agentic AI Safety and Security. Developing safeguards to identify vulnerabilities in AI agents to prevent them from leaking sensitive information or assisting malicious actors.
[February 21st, 2026]: One paper accepted to CVPR 2026.
[January 27th, 2026]: Two papers accepted to ICLR 2026.
[January 8th, 2026]: Invited by the Information Commissioner’s Office (ICO) to contribute a piece in their Tech Futures: Agentic AI report.

~~ End of 2025 ~~

[December 4th, 2025]: Awarded the Coefficient Giving Grant of ~ $260,000 to investigate transferable adversarial inputs capable of inducing malicious agent behaviour.
[DNovemberecember 14th, 2025]: Our Agent Trust Simulation project received the Outstanding Paper Award at the 1st Workshop on LLM Agents for Social Simulation (LASS) at CIKM 2025.
[December 3rd, 2025]: Our NeurIPS 2023 paper on tokenization’s impact on AI evaluation has been featured in the Hugging Face Evaluation Guidebook.
[November 4th, 2025]: Our new paper in NeurIPS25 on evaluating progress in AI safety evaluations has been featured by The Guardian and NBC News.
[November 3rd, 2025]: Featured on Globo TV’s “Digital Minds” segment — watch on Globoplay for our work on AI Safety. Globo is the largest media group in Latin America.
[September 18th, 2025]: Three Papers accepted to NeurIPS25.
[September 4th, 2025]: Our new work, paper/video, about hijacking agentic systems and demonstrating AI worms has been featured by the Scientific American.
[July 28th, 2025]: My department, Engineering Science at Oxford University, has awarded me the Excellence Award for outstanding performance throughout 2024/2025. This comes with a nice monetary gift.
[June 10th, 2025]: Our recent work on hijacking AI agents and creating an AI worm was featured on YouTube by Sabine Hossenfelder, a channel with over 1.7 million subscribers. More importantly, one that I watch frequently.
[May 1st, 2025]: One paper accepted to ICML 2025.
[April 3rd, 2025]: I was awarded the Systemic AI Safety grant (~$250,000) by the UK AI Security Institute. This was awarded to 20 applicants of a 451 (~4% acceptance rate).
[March 12th, 2025] I was interviewed by Al Ekhbariya Channel, which was broadcast on TV and radio, highlighting my recent state-of-the-art research and my academic journey that began at KAUST in Saudi Arabia; Watch interview on X.
[February 11th, 2025]: Four papers accepted to ICLR 2025; one paper of which as spotlight.
[September 25th, 2024]: Five papers accepted to NeurIPS 2024.
[September 20th, 2024]: I received the Google Gemma 2 Academic Program GCP Credit Award ($10,000).
[August 18th, 2024]: I joined Softserve as an R&D Distinguished Advisor.
[July 8th, 2024]: One paper accepted to MICCAI 2024.
[June 3rd, 2024]: One paper accepted to ECCV 2024.
[June 3rd, 2024]: Four papers accepted to ICML 2024. Special congrats to all students' lead authors. One accepted as Oral.
[May 28th, 2024]: Kumail Alhamoud is visiting me and Phil for 3 months this summer. Welcome, Kumail!
[May 23rd, 2024]: I am selected to serve as a Senior Area Chair for NeurIPS4.
[May 14th, 2024]: I was invited by the Rt Hon Deputy Prime Minister Oliver Dowden to be part of the official British Delegation (GREAT FUTURES) to Saudi Arabia on a trade expo towards improving collaborations on all fronts between the two nations.
[May 11th, 2024]: Media Coverage: Our recent paper on No “zero-shot” without exponential data has been covered and featured by Computerphile (~2.5 million subscribers) one youtube; see video here. Moreover, Sam Altman of OpenAI has also commented on reddit on our paper saying OpenAI is exploring similar directions. See details here.
[April 19th, 2024]: I was invited to give a number of talks in MBZUAI and the Oxford Robotics Institute.
[February 15th, 2024]: I gave a talk to TAHAKOM on AI Safety Research.
[February 13th, 2024]: I gave a talk at the AI Safety meeting in the Said Busienss School in Oxford in our work on AI Safety.
[February 7th, 2024]: Our paper SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training? on training CLIP with only synthetic data was advertised on TLDR News which has over 1M subscribers.
[January 21th, 2024]: Three papers accepted to ICLR24 (one as spotlight).

~~ End of 2023 ~~

[December 16th, 2023]: Our paper When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations received the Entropic Paper Award at the NeurIPS 2023 Workshop: I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models.
[December 12th, 2023]: Received a Notable Area Chair Award in NeurIPS23 (awarded to 8.1% of 1223).
[December 9th, 2023]: One paper (SimCS: Simulation for Domain Incremental Online Continual Segmentation) accepted to AAAI24.
[November 2nd, 2023]: I gave two talks on trustworthy AI to Kuwait University and Intematix, a startup based in Riyadh in Saudi Arabia.
[September 21st, 2023]: One paper on Language Model Tokenizers Introduce Unfairness Between Languages has been accepted to NeurIPS23.
[September 5th, 2023]: Invited by Dima Damen to give a talk at Bristol University.
[July 18th, 2023]: One robustness paper is accepted to TMLR.
[July 11th, 2023]: Our paper on Provably Correct Physics-Informed Neural Networks received the outstanding paper award in the Formal Verification Workshop in ICML23.
[July 14th, 2023]: One paper on online continual learning is accepted to ICCV 2023.
[July 11th, 2023]: Our paper Provably Correct Physics-Informed Neural Networks received an outstanding paper award at WFVML ICML23 workshop.
[June 23rd, 2023]: One paper accepted to DeployableGenerativeAI (ICML23 Workshop).
[June 20th, 2023]: Three papers accepted to AdvML-Frontiers (ICML23 Workshop).
[June 18th, 2023]: Invited to give a talk at the CLVision workshop in CVPR23. I was also part of the panel discussion.
[April 24th, 2023]: One paper accepted to ICML 2023.
[March 29th, 2023]: Four papers accepted to CLVision (CVPR23 Workshop).
[March 9th, 2023]: Media coverage by DOU, a Ukrainian development community focused on technologies, frameworks, and code, for my work with Taras Rumezhak.
[March 1st, 2023]: I am promoted to a Senior Researcher (G9) in Machine Learning and Computer Vision at Oxford.
[February 28th, 2023]: I will serve as an Area Chair for the NeurIPS 2023.
[February 27th, 2023]: Two papers accepted to CVPR 2023. One of the two papers is accepted as a highlight (2.5% of ~9K submissions).
[January 23th, 2023]: I will serve as an Area Chair for the ICLR 2023 Workshop on Trustworthy ML.

~~ End of 2022 ~~

[December 9th, 2022]: I will serve as a Senior Program Committee (Area Chair/Meta Reviewer) for IJCAI 2023.
[November 3rd, 2022]: Joined the ELLIS Society as a member.
[October 11th, 2022]: One paper accepted to WACV23.
[October 10th, 2022]: Received an Amazon Research Award.
[September 14th, 2022]: Our paper N-FGSM accepted to NeurIPS22.
[August 28th, 2022]: Our paper ANCER was accepted in Transactions on Machine Learning Research (TMLR).
[August 15th, 2022]: Our paper on Tropical Geometry was accepted to appear in the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
[July 11th, 2022]: I will serve as a Senior Program Committee (Area Chair/Meta Reviewer) for AAAI23.
[May 26th, 2022]: Three papers accepted to the Adversarial Machine Learning Frontiers ICML2022 workshop! Papers will be coming on arXiv soon.
[May 16th, 2022]: Our paper titled Data Dependent Randomzied Smoothing is accepted to UAI22.
[April 22nd, 2022]: I was selected as the Highlighted Reviewer of ICLR 2022 and received a free conference registration.

~~ End of 2021 ~~

[Dec 1st, 2021]: I got promoted to a Senior Research Associate (G8) in machine learning of the Torr Vision Group (TVG) at the University of Oxford.
[Dec 1st, 2021]: Two papers, Combating Adversaries with anti-adversaries and DeformRS: Certifying Input Deformations with Randomized Smoothing, are accepted in AAAI22.
[Nov 18th, 2021]: We were awarded KAUST’s Competitive Research Grant with a total of > 1.05M$ (one million USD). This is a 3 years collaboration between KAUST and Oxford.
[Oct 15th, 2021]: Rethinking Clustering for Robustness is accepted in BMVC21.
[June 22nd, 2021]: Anti Adversary paper is accepted in Adversarial Machine Learning Workshop @ICML21.
[June 13th, 2021]: I have been elected as a Junior Research Fellow of Kellogg College, University of Oxford. Appointment starts in October 2021.
[March 28th, 2021]: ETB robustness paper is accepted in RobustML Workshop @ICLR21.

~~ End of 2020 ~~

[November 8th, 2020]: New paper on robustness is on arXiv.
[November 8th, 2020]: New paper on randomized smoothing is on arXiv.
[October 15th, 2020]: I joined the Torr Vision Group working with Philip Torr at the University of Oxford.
[July 2nd, 2020]: Gabor layers enhance robustness paper accepted to ECCV20 arXiv.
[June 30th, 2020]: One paper is out on new expressions for the output moments of ReLU based networks with various new appliactions arXiv.
[June 24th, 2020]: New paper with SOTA results, backed with theory, on training robust models through feature clustering arXiv.
[March 31st, 2020]: I have sucessfully defended my PhD thesis.

~~ End of 2019 ~~

[Dec 20th, 2019]: One paper accepted to ICLR20.
[Nov 11th, 2019]: One spotlight paper accepted to AAAI20.
[Sept 25th, 2019]: Recognized as outstanding reviewer for ICCV19. Link.
[August 5th, 2019]: I was invited to give a talk about the most recent research in computer vision and machine learning from the IVUL group at PRIS19, Dead Sea, Jordan. I also gave a 1 hour long workshop about deep learning and pytorch. Slides1/Slides2/Material.
[July 6th, 2019]: I was invited to give a talk at the Eastern European Conference on Computer Vision, Odessa, Ukraine. Slides.
[June 28th, 2019]: I gave a talk at the Biomedical Computer Vision Group directed by Prof Pablo Arbelaez, Bogota, Colombia. Slides.
[June 15th, 2019]: Attended CVPR19.
[June 9th, 2019]: Recognized as an outstanding reviewer for CVPR19. This is the second time in a row for CVPR. Check it out. :)
[May 26th, 2019]: A new paper is out on derivative free optimization with momentum with new rates and results on continuous controls tasks. arXiv.
[May 25th, 2019]: New paper! New provably tight interval bounds are derived for DNNs. This allows for very simple robust training of large DNNs. arXiv.
[May 11th, 2019]: How to train robust networks outperforming 2-21x fold data augmentation? New paper out on arXiv.
[May 6th, 2019]: Attended ICLR19 in New Orleans.
[Feb 4th, 2019]: New paper on derivative-free optimization with importance sampling is out! Paper is on arXiv.

~~ End of 2018 ~~

[Dec 22nd, 2018]: One paper accepted to ICLR19, Louisiana, USA.
[Nov 6th, 2018]: One paper accepted to WACV19, Hawaii, USA.
[July 3rd, 2018]: One paper accepted to ECCV18, Munich, Germany.
[June 19th, 2018]: Attended CVPR18 and gave an oral talk on our most recent work on analyzing piecewise linear deep networks using Gaussian network moments. Tensorflow, Pytorch and MATLAB codes are released.
[June 17th, 2018]: Received a fully funded scholarship to attend the AI-DLDA 18 summer school in Udine, Italy. Unfortunately, I won’t be able to attend for time constraints. Link
[June 15th, 2018]: New paper out! “Improving SAGA via a Probabilistic Interpolation with Gradient Descent”.
[April 30th, 2018]: I’m interning for 6 months at the Intel Labs in Munich this summer with Vladlen Koltun.
[April 22nd, 2018]: Recognized as an outstanding reviewer for CVPR18. I’m also on the list of emergency reviewers. Check it out. :)
[March 6th, 2018]: One paper accepted as [Oral] in CVPR 2018.
[Feb 5, 2018]: Awarded the best KAUST poster prize in the Optimization and Big Data Conference.

~~ End of 2017 ~~

[Decemmber 11, 2017]: TCSC code is on github.
[October 22, 2017]: Attened ICCV17, Venice, Italy.
[July 22, 2017]: Attened CVPR17 in Hawaii and gave an oral presentation on our work on solving the LASSO with FFTs, July 2017.
[July 16, 2017]: FFTLasso’s code is available online.
[July 9, 2017]: Attended the ICVSS17, Sicily, Italy.
[June 15, 2017]: Selected to attend the International Computer Vision Summer School (ICVSS17), Sicily, Italy.
[March 17, 2017]: 1 paper accepted to ICCV17.
[March 14, 2017]: Received my NanoDegree on Deep Learning from Udacity.
[March 3, 2017]: 1 oral paper accepted to CVPR17, Hawai, USA.

~~ End of 2016 ~~

[October 19, 2016]: ECCV16’s code has been released on github.
[October 8, 2016]: Attended ECCV16, Amsterdam, Netherlands.
[July 11, 2016]: 1 spotlight paper accepted to ECCV16, Amsterdam, Netherlands.
[June 26, 2016]: Attended CVPR16, Las Vegas, USA. Two papers presented.
[May 13, 2016]: ICCVW15 code is now avaliable online.
[April 11, 2016]: Successfully defended my Master’s Thesis.
[March 2, 2016]: 2 papers (1 spotlight) accepted to CVPR16, Las Vegas, USA.

~~ End of 2015 ~~

[November 20, 2015]: 1 paper acceted to ICCVW15, Santiago, Chile.
[June 8, 2015]: Attended CVPR15, Boston, USA.

Grants & Funding

Research grants and funding received

KAUST-CRG Grant

King Abdullah University of Science and Technology (KAUST) Apr 2026

Awarded $180K of a total of ~$960K to work on OS Agentic Security, in collaboration with Bernard Ghanem and Mohammed El Hoseiny from KAUST.

Coefficient Giving Grant

Coefficient Dec 2025

~$260,000 to investigate transferable adversarial inputs capable of inducing malicious agent behaviour.

Systemic AI Safety Grant

UK AI Security Institute (AISI) Apr 2025

~$250,000 awarded to 20 recipients out of 451 applications (~4% acceptance rate), for work on Safe OS Agents.

See certificate

Toyota Motor Europe Grant

Toyota Motor Europe Jan 2025

£386,671.00 to work on agentic and multi-agent safety.

Google Gemma 2 Academic Award

Google Sep 2024

$10,000 GCP credits for research on certifying the output domain of Large Language Models.

Amazon Research Award

Amazon Oct 2022

$20,000 AWS credits in the Machine Learning Algorithms and Theory track, for the project “Randomized Smoothing; Future Directions and Extensions”.

See certificate

KAUST Competitive Research Grant

King Abdullah University of Science and Technology (KAUST) Nov 2021

$1,050,000 over 3 years for the project “Robust Continual Learning at Scale” (joint with Philip Torr, Bernard Ghanem, and Mohamed Elhoseiny).

Awards and Recognition

Best Paper Award at the MusIML Workshop, ICML 2026

Our paper “LLMs Struggle to Rank Products Robustly” received one of the first-ever MusIML Best Paper Awards at the ICML 2026 Workshop.

Adel Bibi

Last updated on Jul 19, 2026 1 min read

Shortlisted for Oxford University Vice-Chancellor's Breakthrough Researcher Award

Recognizing early-career researchers who have made a significant impact at the University, for breakthrough research in Agentic AI Safety and Security.

Adel Bibi

Last updated on Jun 5, 2026 1 min read

Shortlisted for Oxford University Vice-Chancellor's Breakthrough Researcher Award

Outstanding Paper Award at LASS Workshop, CIKM 2025

Our Agent Trust Simulation project received the Outstanding Paper Award at the 1st Workshop on LLM Agents for Social Simulation (LASS) at CIKM 2025.

Adel Bibi

Last updated on Mar 13, 2026 1 min read

Outstanding Paper Award at LASS Workshop, CIKM 2025

Coefficient Giving Grant

Awarded the Coefficient Giving Grant (~$260,000) to investigate transferable adversarial inputs that can induce malicious agent behaviour

Adel Bibi

Last updated on Mar 8, 2026 1 min read

Toyota Motor Europe Award 2025

Awarded the Toyota Motor Europe Grant (£386,671.00) in 2025 to work on agentic and multi-agent safety.

Adel Bibi

Last updated on Jul 19, 2026 1 min read

See all posts

Featured Publications

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Blaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip Torr, Adam Mahdi, Adel Bibi

April, 2026 International Conference on Machine Learning (ICML26)

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks; adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu

February, 2026 Computer Vision and Pattern Recognition (CVPR26)

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez

January, 2026 International Conference on Learning Representations (ICLR26)

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible–costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can buy stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts.

See all publications