# CTP: Contrastive Tensor Pre-training
This repository contains the model checkpoints for CTP (Contrastive Tensor Pre-training). While CLIP focuses on aligning two modalities (Image and Text), CTP introduces a unified framework to align multiple modalities (Image, Text, and Point Cloud) simultaneously using tensor-based alignment.
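To make the idea concrete: CLIP scores every image/text pair in a batch, producing an N×N similarity matrix, whereas a three-modality alignment produces an N×N×N similarity tensor over image/text/point-cloud triplets. The sketch below builds such a tensor in plain Python. It is only an illustration of the shape of the computation; the aggregation used here (averaging the three pairwise cosine similarities) is an assumption for demonstration, not necessarily the scoring CTP itself uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_tensor(img_emb, txt_emb, pc_emb):
    """N x N x N tensor S where S[i][j][k] scores the triplet
    (image i, text j, point cloud k).

    NOTE: averaging the three pairwise similarities is an
    illustrative choice, not CTP's actual scoring function.
    """
    n = len(img_emb)
    return [[[(cosine(img_emb[i], txt_emb[j])
               + cosine(img_emb[i], pc_emb[k])
               + cosine(txt_emb[j], pc_emb[k])) / 3.0
              for k in range(n)]
             for j in range(n)]
            for i in range(n)]

# Toy embeddings: index i of each list describes the same scene.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
pcs  = [[0.8, 0.2], [0.2, 0.8]]

S = similarity_tensor(imgs, txts, pcs)
# Matched triplets (the tensor diagonal) should score highest,
# which is what a contrastive objective over the tensor encourages.
```

A contrastive loss over this tensor would pull the diagonal entries `S[i][i][i]` up and push the off-diagonal entries down, aligning all three modalities in one objective.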
## Repository Structure
The checkpoints are organized by experiment configuration. We use the following naming conventions:
- `all`: all three encoders (CLIP ViT image, CLIP text, and PointNet++) are pre-trained.
- `pc`: only the PointNet++ (point cloud) backbone is trained; the image and text encoders remain frozen.
- `nm`: "no masking" variant (ablation study).
## Checkpoint Variations
| Folder Name | Method Description | Alignment Strategy |
|---|---|---|
| `192_l2_tensor_all` | Default | L2 Similarity Tensor |
| `192_l2_tensor_nm_all` | Default (No Masking) | L2 Similarity Tensor |
| `192_l2_tensor_pc` | Frozen Image/Text | L2 Similarity Tensor |
| `192_cos_tensor_all` | Cosine Variant | Cosine Similarity Tensor |
| `192_cos_matrix_all` | Pairwise Matrix | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_pc` | Pairwise (Frozen) | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_IP_pc` | Image-Point Only | 1× Similarity Matrix (I-L) |
## How to Load the Models

You can load these checkpoints directly into your PyTorch environment using the `huggingface_hub` library.
### Prerequisites

```bash
pip install torch huggingface_hub
```
### Loading a Specific Checkpoint
```python
import torch
from huggingface_hub import hf_hub_download

REPO_ID = "Ximeng0831/CTP"

# Example: loading the default proposed model
SUBFOLDER = "pointnet2/192_l2_tensor_all"
FILENAME = "ckpt_epoch9.pt"

checkpoint_path = hf_hub_download(
    repo_id=REPO_ID,
    subfolder=SUBFOLDER,
    filename=FILENAME,
)

# Build the CTP model from the source repository first, e.g.:
# model = ctp(text_encoder, image_encoder, lidar_encoder, loss_fn)
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()
```
Source code: https://github.com/TAMU-CVRL/CTP
## Training Configurations
Detailed configuration files (YAML) for each experiment are available in the Official GitHub Repository.
- `all`: training runs for 10 epochs with a total batch size of 384 on two NVIDIA A100 (40 GB) GPUs.
- `pc`: training runs for 20 epochs with a batch size of 192 on a single NVIDIA RTX 4090 GPU.
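As a quick sanity check on these numbers: assuming the total batch is split evenly across GPUs (and, as a further assumption not stated in the repository, that the `192` prefix in the folder names denotes the per-GPU batch size), both setups work out to the same per-GPU batch:

```python
# Assumed interpretation: "192" in the folder names = per-GPU batch size.
total_batch_all, num_gpus_all = 384, 2  # `all` runs: two A100 40 GB GPUs
per_gpu_all = total_batch_all // num_gpus_all

batch_pc, num_gpus_pc = 192, 1          # `pc` runs: single RTX 4090
per_gpu_pc = batch_pc // num_gpus_pc

print(per_gpu_all, per_gpu_pc)  # 192 192
```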
**Note:** For specific hyperparameter settings such as learning rate schedules and weight decay, please refer to the corresponding `.yaml` files in the repository linked above.