Adel Bibi is a Senior Researcher in Machine Learning and Computer Vision at the Department of Engineering Science at the University of Oxford, a Research Member of the Common Room and a former Junior Fellow at Kellogg College, and a member of the ELLIS Society. He is also an R&D Distinguished Advisor at SoftServe. Previously, he was a Senior Research Associate and postdoctoral researcher in the Torr Vision Group at Oxford working with Philip H.S. Torr. He obtained his MSc and PhD from KAUST in 2016 and 2020, respectively, under the supervision of Bernard Ghanem.
His research focuses on trustworthy and robust AI, agentic AI safety, and adversarial machine learning. His recent work on vulnerabilities in AI agents, transferable adversarial attacks, and AI safety evaluations has received broad international attention, including coverage by Scientific American, The Guardian, NBC News, Computerphile, TLDR News, Globo TV, and the Hugging Face Evaluation Guidebook. His work has received multiple paper and service distinctions, including four best/outstanding workshop paper awards (NeurIPS'23, ICML'23, CVPR'22, OBD'18), four outstanding reviewer awards (CVPR'18, CVPR'19, ICCV'19, ICLR'22), and a Notable Area Chair Award at NeurIPS 2023. He regularly serves as a Senior Area Chair and Area Chair for major AI conferences, including NeurIPS, ICML, and ICLR. Bibi has authored more than 50 publications in top-tier machine learning and computer vision venues, including NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, AAAI, and TPAMI.
[Note!] I am always looking for strong, self-motivated PhD students. If you are interested in the safety, trustworthiness, and security of AI models and agentic AI, reach out!
[Consulting Expertise] I have consulted on projects spanning core machine learning and data science, computer vision, certification and AI safety, and optimization formulations for matching and resource allocation problems, among other areas.
King Abdullah University of Science and Technology (KAUST)
PhD in Electrical Engineering (4.0/4.0), 2020
Machine Learning & Optimization Track
MSc in Electrical Engineering (4.0/4.0), 2016
Computer Vision Track
BSc in Electrical Engineering (3.99/4.0), 2014
Senior Researcher (PI)
Mar 2023 – Present
Senior Research Associate
Dec 2021 – Feb 2023
Postdoctoral Research Assistant
Oct 2020 – Dec 2021
R&D Distinguished Advisor
Aug 2024 – Present
Chief AI Officer
Feb 2023 – Present
PhD Research Intern
Jun – Nov 2018
Executive Director
Aug 2025 – Present
Research grants and funding received
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (from 13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling attack success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.
The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term by term. At test time, one can stop early for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can buy stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for four models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts.
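The two modes of use described above can be illustrated with a minimal sketch. This is not the paper's implementation: the low-rank parameterization of each term, the class name TruncatedPolynomialProbe, the margin-based exit rule, and the toy weights are all assumptions made for illustration, and training is omitted entirely. The sketch only shows how a polynomial probe over an activation vector might be evaluated term by term, either to a fixed order (safety dial) or until the running logit is confident enough (adaptive cascade).

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One polynomial term = list of (coefficient, direction) pairs (assumed low-rank form).
Term = List[Tuple[float, Vector]]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


class TruncatedPolynomialProbe:
    """Hypothetical probe: logit_K(x) = bias + sum_{k=1..K} sum_r c_kr * (u_kr . x)**k."""

    def __init__(self, bias: float, terms: List[Term]) -> None:
        self.bias = bias
        self.terms = terms  # terms[k - 1] holds the order-k term

    def term(self, order: int, x: Vector) -> float:
        """Contribution of the order-k term alone."""
        return sum(c * dot(u, x) ** order for c, u in self.terms[order - 1])

    def logit(self, x: Vector, max_order: int) -> float:
        """Safety-dial mode: evaluate a fixed number of terms."""
        return self.bias + sum(self.term(k, x) for k in range(1, max_order + 1))

    def cascade(self, x: Vector, margin: float) -> Tuple[float, int]:
        """Adaptive mode: add terms one at a time and stop once |logit| >= margin.

        Returns the final logit and the number of terms evaluated.
        The margin rule is an assumed exit criterion, not the paper's.
        """
        logit = self.bias
        used = 0
        for k in range(1, len(self.terms) + 1):
            logit += self.term(k, x)
            used = k
            if abs(logit) >= margin:
                break
        return logit, used


if __name__ == "__main__":
    # Toy weights over a 2-dimensional "activation"; values are made up.
    probe = TruncatedPolynomialProbe(
        bias=0.0,
        terms=[[(1.0, [1.0, 0.0])], [(0.5, [0.0, 1.0])]],
    )
    print(probe.cascade([3.0, 1.0], margin=2.0))  # clear case: exits after order 1
    print(probe.cascade([0.5, 2.0], margin=2.0))  # ambiguous case: needs order 2
```

In this toy setup, the first input is resolved by the linear term alone, while the second only crosses the margin once the quadratic term is added, which is the cost-saving behaviour the cascade mode is meant to capture.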