Adel Bibi is a senior researcher in machine learning and computer vision at the Department of Engineering Science of the University of Oxford, a Research Fellow (JRF) at Kellogg College, and a member of the ELLIS Society. Bibi is an R&D Distinguished Advisor with Softserve. Previously, Bibi was a senior research associate and a postdoctoral researcher with Philip H.S. Torr since October 2020. He received his MSc and PhD degrees from King Abdullah University of Science & Technology (KAUST) in 2016 and 2020, respectively, advised by Bernard Ghanem. Bibi was awarded an Amazon Research Award in 2022 in the Machine Learning Algorithms and Theory track in addition to being awarded the Google Gemma 2 Academic Award in 2024. Bibi received four best paper awards; a NeurIPS23 workshop, an ICML23 workshop, a 2022 CVPR workshop, and one at the Optimization and Big Data Conference in 2018. His contributions include over 30 papers published in top machine learning and computer vision conferences. He also received four outstanding reviewer awards (CVPR18, CVPR19, ICCV19, ICLR22) and a Notable Area Chair Award in NeurIPS23.
Currently, Bibi is leading a group in Oxford focusing on the intersection between AI safety of large foundational models in both vision and language (covering topics such as robustness, certification, alignment, adversarial elicitation, etc.) and the efficient continual update of these models.
Download my resume
[Note!] I am always looking for strong self-motivated PhD students. If you are interested in AI Safety, Trustworthy, and Security of AI models and Agentic AI, reach out!
[Consulting Expertise] I have consulted in the past on projects spanning core machine learning and data science, computer vision, certification and AI safety, optimization formulations for matching and resource allocation problems, among other areas.
PhD in Electrical Engineering (4.0/4.0); Machine Learning and Optimization Track, 2020
King Abdullah University of Science and Technology (KAUST)
MSc in Electrical Engineering (4.0/4.0); Computer Vision Track, 2016
King Abdullah University of Science and Technology (KAUST)
BSc in Electrical Engineering (3.99/4.0), 2014
Kuwait University
~~ End of 2023 ~~
~~ End of 2022 ~~
~~ End of 2021 ~~
~~ End of 2020 ~~
~~ End of 2019 ~~
~~ End of 2018 ~~
~~ End of 2017 ~~
~~ End of 2016 ~~
~~ End of 2015 ~~
Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during “zero-shot” evaluation. In this work, we ask; How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting “zero-shot” generalization, multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the “Let it Wag!” benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data which implies that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge; the high cost of evaluating a growing number of models across very large sample sets. To address this challenge, we introduce an efficient framework for model evaluation, Sort & Search (S&S)}, which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples. To test our approach at scale, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing 1.69M and 1.98M test samples for classification. Extensive empirical evaluations across over 31,000 models demonstrate that S&S achieves highly-efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on a single A100 GPU, with low approximation error and memory cost of <100MB. Our work also highlights issues with current accuracy prediction metrics, suggesting a need to move towards sample-level evaluation metrics. We hope to guide future research by showing our method’s bottleneck lies primarily in generalizing Sort beyond a single rank order and not in improving Search.
Online continual learning, the process of training models on streaming data, has gained increasing attention in recent years. However, a critical aspect often overlooked is the label delay, where new data may not be labeled due to slow and costly annotation processes. We introduce a new continual learning framework with explicit modeling of the label delay between data and label streams over time steps. In each step, the framework reveals both unlabeled data from the current time step t and labels delayed with d steps, from the time step t−d. In our extensive experiments amounting to 1060 GPU days, we show that merely augmenting the computational resources is insufficient to tackle this challenge. Our findings underline a notable performance decline when solely relying on labeled data when the label delay becomes significant. More surprisingly, when using state-of-the-art SSL and TTA techniques to utilize the newer, unlabeled data, they fail to surpass the performance of a naïve method that simply trains on the delayed supervised stream. To this end, we introduce a simple, efficient baseline that rehearses from the labeled memory samples that are most similar to the new unlabeled samples. This method bridges the accuracy gap caused by label delay without significantly increasing computational complexity. We show experimentally that our method is the least affected by the label delay factor and in some cases successfully recovers the accuracy of the non-delayed counterpart. We conduct various ablations and sensitivity experiments, demonstrating the effectiveness of our approach.