library_name: keras-hub

### Model Overview
# Model Summary
This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural-language descriptions paired with images found online. Because it is trained on a large dataset of image-text pairs, it can be applied to zero-shot image classification, text-based image search, and other tasks that need robust visual understanding. CLIP works by aligning image and text representations within a shared embedding space.
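To make the shared embedding space concrete, here is a minimal sketch of zero-shot classification by cosine similarity. It does not call CLIP: the three-dimensional vectors and the captions are invented for illustration, and real embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy embeddings, invented purely for illustration (not real CLIP outputs).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a mountain": [0.1, 0.9, 0.3],
    "a photo of a house": [0.2, 0.3, 0.9],
}

# Zero-shot classification: pick the caption whose embedding is closest to the image.
scores = {
    caption: cosine_similarity(image_embedding, embedding)
    for caption, embedding in text_embeddings.items()
}
best_caption = max(scores, key=scores.get)
print(best_caption)
```

Because both modalities live in the same space, adding a new class only requires embedding one more caption; no retraining is involved.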
Weights are released under the [MIT License](https://opensource.org/license/mit). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).

## Links
* [CLIP Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/clip-quickstart-notebook)
* [CLIP API Documentation](https://keras.io/keras_hub/api/models/clip/)
* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation
Keras and KerasHub can be installed with:
```
pip install -U -q keras-hub
pip install -U -q keras
```
JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
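Keras reads the `KERAS_BACKEND` environment variable to decide which of the installed frameworks to use, and the variable has to be set before `keras` is first imported. The sketch below shows one way to do that; `select_keras_backend` is a hypothetical helper written for this example, not part of Keras, and it only checks against the three backends named above.

```python
import os

# The three backends mentioned in this card.
VALID_BACKENDS = ("jax", "torch", "tensorflow")


def select_keras_backend(name):
    """Set KERAS_BACKEND. Call this before the first `import keras`."""
    if name not in VALID_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {VALID_BACKENDS}")
    os.environ["KERAS_BACKEND"] = name
    return name


select_keras_backend("jax")
# Import keras only after the backend has been chosen:
# import keras
```

Changing the variable after Keras has already been imported has no effect on the running session, so in a notebook this belongs in the first cell.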
## Presets
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset name | Parameters | Description |
|---|---|---|
| clip_vit_base_patch16 | 149.62M | The model uses a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224). |
| clip_vit_base_patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
| clip_vit_large_patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
| clip_vit_large_patch14_336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
| clip_vit_b_32_laion2b_s34b_b79k | 151.28M | 151 million parameter, 12-layer for vision and 12-layer for text, patch size of 32, Open CLIP model. |
| clip_vit_h_14_laion2b_s32b_b79k | 986.11M | 986 million parameter, 32-layer for vision and 24-layer for text, patch size of 14, Open CLIP model. |
| clip_vit_g_14_laion2b_s12b_b42k | 1.37B | 1.4 billion parameter, 40-layer for vision and 24-layer for text, patch size of 14, Open CLIP model. |
| clip_vit_bigg_14_laion2b_39b_b160k | 2.54B | 2.5 billion parameter, 48-layer for vision and 32-layer for text, patch size of 14, Open CLIP model. |
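The patch size and input resolution in the table determine how many image patches the vision encoder processes. The sketch below works through that arithmetic for the first four presets, assuming the usual ViT scheme of tiling the image with non-overlapping square patches and not counting any extra class token the encoder may prepend. The `num_patches` helper is written for this example only.

```python
# (patch size, input image side length) taken from the table above.
presets = {
    "clip_vit_base_patch16": (16, 224),
    "clip_vit_base_patch32": (32, 224),
    "clip_vit_large_patch14": (14, 224),
    "clip_vit_large_patch14_336": (14, 336),
}


def num_patches(patch_size, image_size):
    """Number of non-overlapping square patches that tile a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side


for name, (patch_size, image_size) in presets.items():
    print(f"{name}: {num_patches(patch_size, image_size)} patches")
```

Under these assumptions the 336-pixel preset sees 576 patches per image against 256 for the 224-pixel ViT-L/14 preset, which is why it costs more compute per image despite having nearly the same parameter count.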
## Example Usage
```python
import keras
import numpy as np
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter

# Instantiate the model and preprocessing tools.
clip = CLIPBackbone.from_preset("clip_vit_large_patch14_336")
tokenizer = CLIPTokenizer.from_preset(
    "clip_vit_large_patch14_336",
    sequence_length=5,
)
image_converter = CLIPImageConverter.from_preset("clip_vit_large_patch14_336")

# Tokenize the candidate text prompts.
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

# Load the image and preprocess it as a batch of one.
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))

# Query the model for image-text similarities.
outputs = clip({
    "images": image,
    "token_ids": tokens,
})
```
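The model call above produces raw similarity scores rather than probabilities. As a hedged sketch of the usual next step, the code below applies a softmax over one image's scores to rank the prompts. The logit values are invented for illustration. How the real scores are exposed (for example, under an output key such as `vision_logits`) is an assumption to check against the CLIP API documentation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


prompts = ["mountains", "cat on tortoise", "house"]
# Hypothetical logits for one image against the three prompts;
# real values would come from the model's output.
image_logits = [12.0, 20.0, 14.0]

probabilities = softmax(image_logits)
best_prompt = prompts[probabilities.index(max(probabilities))]
for prompt, probability in zip(prompts, probabilities):
    print(f"{prompt}: {probability:.3f}")
```

Subtracting the maximum logit before exponentiating leaves the result unchanged but avoids overflow when the scores are large.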
## Example Usage with Hugging Face URI
```python
import keras
import numpy as np
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter

# Instantiate the model and preprocessing tools from the Hugging Face Hub.
clip = CLIPBackbone.from_preset("hf://keras/clip_vit_large_patch14_336")
tokenizer = CLIPTokenizer.from_preset(
    "hf://keras/clip_vit_large_patch14_336",
    sequence_length=5,
)
image_converter = CLIPImageConverter.from_preset("hf://keras/clip_vit_large_patch14_336")

# Tokenize the candidate text prompts.
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

# Load the image and preprocess it as a batch of one.
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))

# Query the model for image-text similarities.
outputs = clip({
    "images": image,
    "token_ids": tokens,
})
```