	---
	library_name: keras-hub
	---
	## Model Overview

	This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural-language descriptions paired with images collected from the web. It is trained on a large dataset of image-text pairs with a contrastive objective that places matching images and captions close together in a shared embedding space. This makes it well suited to zero-shot image classification and to image search driven by text queries.
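The shared embedding space can be illustrated with a toy sketch. The vectors below are made-up three-dimensional stand-ins for real CLIP embeddings (which have hundreds of dimensions), and `cosine_similarity` is a helper written for this illustration, not a KerasHub function. Under those assumptions, zero-shot classification reduces to picking the caption whose embedding lies closest to the image embedding:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real model would produce these.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a house": [0.1, 0.9, 0.3],
    "a photo of mountains": [0.2, 0.3, 0.9],
}

# Score every caption against the image and keep the closest one.
scores = {
    caption: cosine_similarity(image_embedding, embedding)
    for caption, embedding in text_embeddings.items()
}
best_caption = max(scores, key=scores.get)
print(best_caption)
```

The same ranking works in the other direction for text-to-image search: embed one query and score it against many image embeddings.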


	Weights are released under the [MIT License](https://opensource.org/license/mit). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).

	## Links

	* [CLIP Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/clip-quickstart-notebook)
	* [CLIP API Documentation](https://keras.io/keras_hub/api/models/clip/)
	* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
	* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
	* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

	## Installation

	Keras and KerasHub can be installed with:

	```
	pip install -U -q keras-hub
	pip install -U -q keras
	```

	JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
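Keras 3 reads its backend from the `KERAS_BACKEND` environment variable at import time, so the variable has to be set before `import keras` runs. A minimal sketch, where `select_backend` is a hypothetical helper written for this example and the tuple lists only the three backends named above:

```python
import os

# The three backends mentioned in this README.
SUPPORTED_BACKENDS = ("jax", "tensorflow", "torch")


def select_backend(name):
    """Set KERAS_BACKEND. Call this before the first `import keras`."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}")
    os.environ["KERAS_BACKEND"] = name
    return name


select_backend("jax")
# import keras  # import Keras only after the backend has been chosen
```

Changing the variable after Keras has been imported has no effect on the running process.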

	## Presets

	The following model checkpoints are provided by the Keras team. The code examples below use `clip_vit_large_patch14_336`; any other preset name from the table can be substituted.

	| Preset name | Parameters | Description |
	|-------------|------------|-------------|
	| clip_vit_base_patch16 | 149.62M | Uses a ViT-B/16 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 16, input images of size (224, 224). |
	| clip_vit_base_patch32 | 151.28M | Uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 32, input images of size (224, 224). |
	| clip_vit_large_patch14 | 427.62M | Uses a ViT-L/14 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input images of size (224, 224). |
	| clip_vit_large_patch14_336 | 427.94M | Uses a ViT-L/14 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input images of size (336, 336). |
	| clip_vit_b_32_laion2b_s34b_b79k | 151.28M | OpenCLIP model with 151 million parameters: 12 vision layers, 12 text layers, patch size 32. |
	| clip_vit_h_14_laion2b_s32b_b79k | 986.11M | OpenCLIP model with 986 million parameters: 32 vision layers, 24 text layers, patch size 14. |
	| clip_vit_g_14_laion2b_s12b_b42k | 1.37B | OpenCLIP model with 1.4 billion parameters: 40 vision layers, 24 text layers, patch size 14. |
	| clip_vit_bigg_14_laion2b_39b_b160k | 2.54B | OpenCLIP model with 2.5 billion parameters: 48 vision layers, 32 text layers, patch size 14. |
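A ViT image encoder splits each input into non-overlapping square patches, so the number of patch tokens per image is (image size / patch size) squared. The sketch below works this out for the four architectures whose input size is listed in the table. `num_patches` is an illustrative helper, not part of KerasHub, and the count excludes the extra class token that the encoder prepends.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# (image size, patch size) pairs taken from the table above.
configs = {
    "ViT-B/16 @ 224": (224, 16),
    "ViT-B/32 @ 224": (224, 32),
    "ViT-L/14 @ 224": (224, 14),
    "ViT-L/14 @ 336": (336, 14),
}

patch_counts = {name: num_patches(*cfg) for name, cfg in configs.items()}
for name, count in patch_counts.items():
    print(f"{name}: {count} patches")
```

Smaller patches and larger inputs both lengthen the token sequence the image encoder has to process, which is the main cost difference between the 224 and 336 variants of ViT-L/14.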

	## Example Usage

	```python
	import keras
	import numpy as np
	from keras_hub.models import CLIPBackbone, CLIPTokenizer
	from keras_hub.layers import CLIPImageConverter

	# Instantiate the model and preprocessing tools.
	clip = CLIPBackbone.from_preset("clip_vit_large_patch14_336")
	tokenizer = CLIPTokenizer.from_preset(
	    "clip_vit_large_patch14_336", sequence_length=5
	)
	image_converter = CLIPImageConverter.from_preset("clip_vit_large_patch14_336")

	# Tokenize the candidate captions.
	tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

	# Load the image and preprocess it into a batch of one.
	image = keras.utils.load_img("cat.jpg")
	image = image_converter(np.array([image]).astype(float))

	# Query the model for image-text similarity scores.
	outputs = clip({
	    "images": image,
	    "token_ids": tokens,
	})
	```
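The call above returns raw similarity scores rather than probabilities. Assuming the backbone output holds one row of image-to-text logits per image (in KerasHub this is expected under the key `"vision_logits"`, but verify against the output of your installed version), a softmax over that row gives one probability per caption. The logit values below are invented for illustration, and `softmax` is a plain-Python helper written for this sketch:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


captions = ["mountains", "cat on tortoise", "house"]

# Hypothetical logits for a single image; a real run would read them
# from the model output instead.
image_logits = [14.2, 25.7, 12.9]

probabilities = softmax(image_logits)
for caption, probability in zip(captions, probabilities):
    print(f"{caption}: {probability:.3f}")

best_index = probabilities.index(max(probabilities))
```

With real model outputs, `keras.ops.softmax(outputs["vision_logits"], axis=-1)` would do the same thing on the whole batch, under the same assumption about the output key.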

	## Example Usage with Hugging Face URI

	```python
	import keras
	import numpy as np
	from keras_hub.models import CLIPBackbone, CLIPTokenizer
	from keras_hub.layers import CLIPImageConverter

	# Instantiate the model and preprocessing tools from the Hugging Face Hub.
	clip = CLIPBackbone.from_preset("hf://keras/clip_vit_large_patch14_336")
	tokenizer = CLIPTokenizer.from_preset(
	    "hf://keras/clip_vit_large_patch14_336", sequence_length=5
	)
	image_converter = CLIPImageConverter.from_preset("hf://keras/clip_vit_large_patch14_336")

	# Tokenize the candidate captions.
	tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

	# Load the image and preprocess it into a batch of one.
	image = keras.utils.load_img("cat.jpg")
	image = image_converter(np.array([image]).astype(float))

	# Query the model for image-text similarity scores.
	outputs = clip({
	    "images": image,
	    "token_ids": tokens,
	})
	```