CTP: Contrastive Tensor Pre-training


This repository contains the model checkpoints for CTP (Contrastive Tensor Pre-training). While CLIP aligns two modalities (image and text), CTP introduces a unified framework that aligns three modalities (image, text, and point cloud) simultaneously using tensor-based alignment.
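To make the distinction concrete, here is a toy sketch (not the official CTP loss; shapes, names, and the use of cosine similarity are illustrative assumptions): CLIP scores image-text pairs in a 2-D similarity matrix, whereas with three modalities every (image, text, point-cloud) triple can be scored in a third-order similarity tensor.

```python
import torch

# Toy batch size and embedding dimension (illustrative values only).
B, D = 4, 8
img = torch.nn.functional.normalize(torch.randn(B, D), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(B, D), dim=-1)
pc = torch.nn.functional.normalize(torch.randn(B, D), dim=-1)

# CLIP-style pairwise similarity matrix: shape (B, B).
pairwise = img @ txt.T

# Three-way similarity tensor scoring every triple: shape (B, B, B).
triple = torch.einsum("id,jd,kd->ijk", img, txt, pc)
```

A contrastive loss over the tensor can then pull the "diagonal" entries `triple[i, i, i]` (matched triples) above all mismatched ones, generalizing the matched-pair diagonal of the CLIP matrix.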

Repository Structure

The checkpoints are organized by experiment configuration. We use the following naming conventions:

  • all: Pre-training of all three encoders (CLIP ViT, CLIP Text, and PointNet++).
  • pc: Only the PointNet++ (Point Cloud) backbone is trained; Image and Text encoders remain frozen.
  • nm: "No Masking" variant (ablation study).

Checkpoint Variations

| Folder Name | Method Description | Alignment Strategy |
| --- | --- | --- |
| `192_l2_tensor_all` | Default | L2 Similarity Tensor |
| `192_l2_tensor_nm_all` | Default (No Masking) | L2 Similarity Tensor |
| `192_l2_tensor_pc` | Frozen Image/Text | L2 Similarity Tensor |
| `192_cos_tensor_all` | Cosine Variant | Cosine Similarity Tensor |
| `192_cos_matrix_all` | Pairwise Matrix | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_pc` | Pairwise (Frozen) | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_IP_pc` | Image-Point Only | 1× Similarity Matrix (I-L) |
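The folder names above follow a regular pattern. The following hypothetical helper (not part of the official code) unpacks a name into its components, assuming the layout `<batch>_<similarity>_<alignment>[_variants]_<scope>`:

```python
# Hypothetical parser for the checkpoint-folder naming convention.
def parse_checkpoint_name(name: str) -> dict:
    parts = name.split("_")
    return {
        "batch_size": int(parts[0]),        # e.g. 192
        "similarity": parts[1],             # "l2" or "cos"
        "alignment": parts[2],              # "tensor" or "matrix"
        "no_masking": "nm" in parts[3:-1],  # "nm" = no-masking ablation
        "scope": parts[-1],                 # "all" or "pc" (frozen image/text)
    }

print(parse_checkpoint_name("192_l2_tensor_nm_all"))
```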

How to Load the Models

You can load these checkpoints directly into your PyTorch environment using the huggingface_hub library.

Prerequisites

pip install torch huggingface_hub

Loading a Specific Checkpoint

import torch
from huggingface_hub import hf_hub_download

REPO_ID = "Ximeng0831/CTP"
# Example: loading the default proposed model
SUBFOLDER = "pointnet2/192_l2_tensor_all"
FILENAME = "ckpt_epoch9.pt"

checkpoint_path = hf_hub_download(
    repo_id=REPO_ID,
    subfolder=SUBFOLDER,
    filename=FILENAME,
)

# Instantiate the CTP model first (encoder and loss definitions are in the
# GitHub repository linked below), e.g.:
# model = ctp(text_encoder, image_encoder, lidar_encoder, loss_fn)
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()
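If `load_state_dict` fails with unexpected keys, the checkpoint may store the weights under a wrapper key alongside optimizer state. This is an assumption about the file layout, not documented behaviour; the defensive helper below unwraps the common layouts before loading:

```python
# Unwrap a PyTorch checkpoint that may nest the weights under a wrapper
# key (e.g. "state_dict" or "model"); otherwise return it unchanged.
def extract_state_dict(checkpoint):
    if isinstance(checkpoint, dict):
        for key in ("state_dict", "model", "model_state_dict"):
            if key in checkpoint and isinstance(checkpoint[key], dict):
                return checkpoint[key]
    return checkpoint
```

Usage: `model.load_state_dict(extract_state_dict(checkpoint))`.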

Source code: https://github.com/TAMU-CVRL/CTP

Training Configurations

Detailed configuration files (YAML) for each experiment are available in the Official GitHub Repository.

  • all: Training runs for 10 epochs with a total batch size of 384 (192 per GPU) on two NVIDIA A100 (40 GB) GPUs.
  • pc: Training runs for 20 epochs with a batch size of 192 on a single NVIDIA RTX 4090 GPU.

Note: For specific hyperparameter settings such as learning rate schedules and weight decay, please refer to the corresponding .yaml files in the link above.
