# Classifying Elephants
For my Computer Vision lectures I use the Elephant dataset. Here we use four AI models, with very different architectures, to classify elephants.
My students often fail to understand that AI models are, compared to humans, very bad at generalizing. Generalization is something humans excel at: show a photo of a real elephant to a three-year-old child and it will instantly recognize a drawing of an elephant as well.
Traditional model architectures like ResNet or a Vision Transformer, trained on ImageNet, are quite bad at recognizing drawings of elephants.
More modern models, trained on very large web-scale datasets, are much better at it.
## Models
The following models were used:
- ResNet
- Vision Transformer
- CLIP
- Florence2
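CLIP (and similarly prompted models) classify zero-shot by comparing an image embedding against text embeddings of candidate labels. A minimal sketch of that scoring step, using toy low-dimensional vectors in place of real model outputs (the embeddings and labels below are illustrative, not taken from the actual runs):

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, labels):
    """Rank candidate labels by cosine similarity to the image embedding,
    the way CLIP-style zero-shot classification scores text prompts."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                             # cosine similarities
    probs = np.exp(sims) / np.exp(sims).sum()    # softmax over labels
    order = np.argsort(-probs)
    return [(labels[i], float(probs[i])) for i in order]

# Toy 4-dim embeddings standing in for CLIP's real 512-dim vectors.
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "a drawing of an elephant"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "a photo of a dog"
    [0.0, 0.0, 1.0, 0.0],   # e.g. "a photo of a car"
])
labels = ["elephant drawing", "dog", "car"]
ranked = zero_shot_scores(image_emb, text_embs, labels)
print(ranked[0][0])  # highest-scoring label
```

In the real pipeline the embeddings come from the model's image and text encoders; only the cosine-similarity-plus-softmax ranking is shown here.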
The results and a comparison of the models are shown below.
| Model | Classified as elephant | Dataset/size | Model size | Remarks |
|---|---|---|---|---|
| ResNet (2015) | 5 / 15 | ImageNet, 1.4 M images | ? | |
| ViT (2020) | 5 / 15 | ImageNet, 1.4 M images | 346 MB | |
| CLIP (2022) | 8 / 15 | 400 M images | ? | Dataset not published |
| Florence2 (2024) | 13 / 15 | 129 M images | 1.5 GB | Highly curated dataset, ~5 B annotations |
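The counts in the table translate into per-model accuracy on this 15-image set as follows (a straightforward calculation, included to make the gap explicit):

```python
# Correct classifications per model, taken from the results table above.
results = {
    "ResNet (2015)": 5,
    "ViT (2020)": 5,
    "CLIP (2022)": 8,
    "Florence2 (2024)": 13,
}
TOTAL = 15  # number of elephant images in the test set

accuracy = {model: correct / TOTAL for model, correct in results.items()}
for model, acc in accuracy.items():
    print(f"{model}: {acc:.0%}")
```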
Why the newer models, CLIP and Florence2, generalise better than the older ones still needs further analysis and critical evaluation.
## Links
Colabs: https://drive.google.com/drive/folders/1rKMTRmqcLBpwHoXoTAfq0bjF7tR9QSrV
Dataset: https://huggingface.co/datasets/MichielBontenbal/elephants