How to Use OpenAI CLIP for Image Classification

Image classification is a fundamental task in computer vision that involves assigning labels or categories to images based on their visual content. OpenAI CLIP is a powerful deep learning model that combines vision and language to perform various tasks, including image classification.

In this tutorial, we will explore how to use OpenAI CLIP for image classification. We will cover the following steps:

  1. Installing the necessary libraries and dependencies
  2. Loading the pre-trained CLIP model
  3. Preprocessing images for classification
  4. Classifying images using CLIP
  5. Examining the classification results

Let’s get started!

1. Installing the necessary libraries and dependencies

To use OpenAI CLIP for image classification, we will need the following libraries:

  • PyTorch and Torchvision: the deep learning framework that CLIP runs on, plus image transforms and other computer-vision utilities.
  • OpenAI CLIP: the model itself, which combines vision and language and is installed directly from its GitHub repository.

To install these libraries, open a terminal or command prompt and run the following command:

pip install torch torchvision
pip install git+https://github.com/openai/CLIP.git

Make sure you have Python and pip installed on your system before running the above commands. The second command also requires git, since it installs CLIP directly from its GitHub repository.

2. Loading the pre-trained CLIP model

Once the required libraries are installed, we can proceed to load the pre-trained CLIP model. The weights are downloaded automatically the first time you call clip.load, so no separate download step is needed.

Create a new Python script or Jupyter Notebook and import the necessary libraries:

import torch
import torchvision.transforms as transforms
import clip

Next, load the pre-trained CLIP model using the clip.load function:

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

Here, we load the “ViT-B/32” variant of CLIP; other variants can be chosen depending on your accuracy and speed requirements. The device argument specifies where to run the model: the first line picks a CUDA-enabled GPU if one is available and falls back to the CPU otherwise. Note that clip.load returns two things, the model itself and a preprocess transform matched to that model, which we will use in the next step.
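If you are not sure which variant names your installed clip package supports, you can list them before loading:

# Print the model variants bundled with the installed clip package,
# e.g. "RN50", "ViT-B/32", "ViT-L/14" (the exact list depends on the version).
print(clip.available_models())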

3. Preprocessing images for classification

Before we can classify images using CLIP, they need to be resized, cropped, converted to tensors, and normalized with the statistics CLIP was trained with. For the “ViT-B/32” variant the expected input size is 224×224 pixels.

The preprocess transform returned by clip.load in the previous step already performs all of these steps, so we can use it directly. For reference, it is roughly equivalent to the following transforms.Compose pipeline from Torchvision (note the CLIP-specific normalization statistics rather than the usual ImageNet values):

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711]
    )
])

In the code above, each image is resized so its shorter side is 224 pixels (using bicubic interpolation), center-cropped to 224×224, converted to a tensor, and normalized with the mean and standard deviation of the data CLIP was trained on. Either version works because they apply the same transformations; the important point is to use CLIP’s own normalization statistics rather than the ImageNet ones.
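If you want to verify that the preprocessing produces tensors of the expected shape, a quick sanity check on a blank test image (purely illustrative) looks like this:

from PIL import Image

# Apply the preprocessing pipeline to a synthetic gray image.
dummy = Image.new("RGB", (640, 480), color="gray")
tensor = preprocess(dummy)
print(tensor.shape)  # expected: torch.Size([3, 224, 224])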

4. Classifying images using CLIP

Now that we have loaded the pre-trained CLIP model and defined the image preprocessing steps, we can proceed to classify images. To classify an image, we follow these steps:

  1. Load and preprocess the image.
  2. Encode the image using the CLIP model.
  3. Prepare a list of texts or labels for classification.
  4. Encode the texts using the CLIP model.
  5. Calculate the similarity between the image and each label.

Let’s go through these steps one by one.

4.1 Load and preprocess the image

To classify an image, we first need to load and preprocess it using the transformations defined earlier. Add the following code to your script:

from PIL import Image

image_path = "path/to/your/image.jpg"  # Replace with the path to your image
image = Image.open(image_path).convert("RGB")
image = preprocess(image).unsqueeze(0).to(device)

Make sure to replace "path/to/your/image.jpg" with the actual path to your image.
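If you want to classify several images in one forward pass, you can preprocess each one and stack the results into a single batch tensor. The paths below are hypothetical placeholders:

# Stack several preprocessed images into one batch of shape (N, 3, 224, 224).
image_paths = ["image1.jpg", "image2.jpg"]  # hypothetical example paths
batch = torch.stack([
    preprocess(Image.open(p).convert("RGB")) for p in image_paths
]).to(device)

The rest of the pipeline works unchanged on a batch; encode_image simply returns one feature vector per image.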

4.2 Encode the image using the CLIP model

After preprocessing the image, we need to encode it using the CLIP model. Encoding is the process of converting the image into a fixed-length vector representation that captures its visual content.

Add the following code to your script to encode the image:

with torch.no_grad():
    image_features = model.encode_image(image)

The encode_image method of the CLIP model takes the preprocessed image as input and returns its encoded representation. We wrap the code in a torch.no_grad() block to ensure that no gradients are computed during the forward pass.
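If you want to confirm that the encoding step worked, you can inspect the returned tensor; for the “ViT-B/32” variant each image is mapped to a 512-dimensional feature vector:

# One fixed-length feature vector per input image.
print(image_features.shape)  # torch.Size([1, 512]) for ViT-B/32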

4.3 Prepare a list of texts or labels for classification

To classify the image with CLIP, we need to provide a list of texts or labels that represent the categories we want to classify the image into. Create a list of labels as shown below:

labels = ["cat", "dog", "tree", "car"]

Feel free to add or remove labels based on the categories you are interested in.
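A common trick for zero-shot classification with CLIP is to wrap each bare label in a short natural-language prompt such as “a photo of a …”, which tends to match the captions CLIP was trained on better than a single word. A minimal sketch (the exact wording of the template is a convention, not a requirement):

# Wrap each label in a simple prompt template.
prompts = [f"a photo of a {label}" for label in labels]

If you use this, pass prompts instead of labels to clip.tokenize in the next step; the printed results can still use the original labels list.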

4.4 Encode the texts using the CLIP model

Once we have the image encoded, we need to encode the labels using the same CLIP model. Encoding the labels converts them into fixed-length vectors that capture their semantic meaning.

Add the following code to your script to encode the labels:

with torch.no_grad():
    label_features = model.encode_text(clip.tokenize(labels).to(device))

The encode_text method of the CLIP model takes the tensor of token ids produced by clip.tokenize and returns one encoded representation per input text.
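If you are curious about the intermediate shapes: clip.tokenize returns one row of 77 token ids per input string (77 is CLIP’s fixed context length), and encode_text maps each row to the same embedding size as the image encoder:

tokens = clip.tokenize(labels)
print(tokens.shape)          # torch.Size([4, 77]) for the four labels above
print(label_features.shape)  # torch.Size([4, 512]) for ViT-B/32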

4.5 Calculate the similarity between the image and each label

Finally, we can calculate the similarity between the image and each label to perform classification. The similarity is computed with the cosine similarity metric, which measures the cosine of the angle between two vectors; a higher score indicates a closer match between the image and the label.

Add the following code to your script to calculate the similarity:

with torch.no_grad():
    # Normalize both sets of features so the dot product equals cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    label_features = label_features / label_features.norm(dim=-1, keepdim=True)
    similarities = (100.0 * image_features @ label_features.T).softmax(dim=-1)

for i, label in enumerate(labels):
    print(f"Image is {100 * similarities[0, i]:.2f}% {label}")

In the code above, we first normalize the image and label features to unit length so that the dot product (@) gives the cosine similarity. The similarities are scaled by 100 and passed through a softmax to turn them into a probability distribution over the labels, and each probability is printed next to its label.
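To report only the single best-matching label rather than the full distribution, take the argmax of the similarity scores:

# Index of the highest-scoring label for the (single) image in the batch.
best = similarities.argmax(dim=-1).item()
print(f"Predicted label: {labels[best]}")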

5. Examining the classification results

After classifying the image using CLIP, we can examine the results to see how well the model performed. The similarity scores printed in the previous step indicate the classification confidence for each label.

You can experiment with different images and labels to observe the model’s behavior. Keep in mind that the pre-trained CLIP model was trained on a large collection of image–text pairs from the web, so it may perform better for some categories or domains than others.

Here are a few additional tips for improving classification results:

  • Use more labels: Providing a larger list of labels can help the model differentiate between categories more effectively.
  • Fine-tune the model: If you have a labeled dataset, you can fine-tune the CLIP model on your data to improve its performance on specific tasks.
  • Train your own classifier: Use the encoded image representations together with your labels to train a separate classifier on your specific task (a minimal sketch follows this list).
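As a rough illustration of the last tip, here is a minimal linear-probe sketch. It assumes you already have a list train_images of PIL images and a list train_targets of integer class ids (both are hypothetical placeholders for your own labeled dataset) and requires scikit-learn (pip install scikit-learn):

import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(images):
    # Encode a list of PIL images with the frozen CLIP image encoder.
    feats = []
    with torch.no_grad():
        for img in images:
            x = preprocess(img).unsqueeze(0).to(device)
            f = model.encode_image(x)
            f = f / f.norm(dim=-1, keepdim=True)  # unit-normalize, as before
            feats.append(f.float().cpu().numpy())
    return np.concatenate(feats, axis=0)

# train_images / train_targets are hypothetical placeholders for your data.
train_features = extract_features(train_images)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_features, train_targets)

At prediction time, extract features for new images the same way and call classifier.predict on them.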

That’s it! You have learned how to use OpenAI CLIP for image classification. This tutorial covered the installation of required libraries, loading the pre-trained CLIP model, preprocessing images for classification, classifying images using CLIP, and examining the classification results.

OpenAI CLIP is a powerful tool that combines vision and language to perform various tasks in computer vision. It can be used out-of-the-box for image classification and serves as a great starting point for more complex tasks. Happy classifying!
