# CTP: Contrastive Tensor Pre-training
This repository contains the model checkpoints for CTP (Contrastive Tensor Pre-training). While CLIP focuses on aligning two modalities (Image and Text), CTP introduces a unified framework to align multiple modalities (Image, Text, and Point Cloud) simultaneously using tensor-based alignment.
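To make the idea concrete: CLIP scores every image/text pair in a batch, producing an N×N similarity matrix, whereas a three-modality alignment produces an N×N×N similarity tensor over image/text/point-cloud triplets. The sketch below builds such a tensor in plain Python. It is only an illustration of the shape of the computation; the aggregation used here (averaging the three pairwise cosine similarities) is an assumption for demonstration, not necessarily the scoring CTP itself uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_tensor(img_emb, txt_emb, pc_emb):
    """N x N x N tensor S where S[i][j][k] scores the triplet
    (image i, text j, point cloud k).

    NOTE: averaging the three pairwise similarities is an
    illustrative choice, not CTP's actual scoring function.
    """
    n = len(img_emb)
    return [[[(cosine(img_emb[i], txt_emb[j])
               + cosine(img_emb[i], pc_emb[k])
               + cosine(txt_emb[j], pc_emb[k])) / 3.0
              for k in range(n)]
             for j in range(n)]
            for i in range(n)]

# Toy embeddings: index i of each list describes the same scene.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
pcs  = [[0.8, 0.2], [0.2, 0.8]]

S = similarity_tensor(imgs, txts, pcs)
# Matched triplets (the tensor diagonal) should score highest,
# which is what a contrastive objective over the tensor encourages.
```

A contrastive loss over this tensor would pull the diagonal entries `S[i][i][i]` up and push the off-diagonal entries down, aligning all three modalities in one objective.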
## Repository Structure
The checkpoints are organized by experiment configuration. We use the following naming conventions:
- `all`: all three encoders (CLIP ViT image, CLIP text, and PointNet++) are pre-trained.
- `pc`: only the PointNet++ (point cloud) backbone is trained; the image and text encoders remain frozen.
- `nm`: "no masking" variant (ablation study).
## Checkpoint Variations
| Folder Name | Method Description | Alignment Strategy |
|---|---|---|
| `192_l2_tensor_all` | Default | L2 Similarity Tensor |
| `192_l2_tensor_nm_all` | Default (No Masking) | L2 Similarity Tensor |
| `192_l2_tensor_pc` | Frozen Image/Text | L2 Similarity Tensor |
| `192_cos_tensor_all` | Cosine Variant | Cosine Similarity Tensor |
| `192_cos_matrix_all` | Pairwise Matrix | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_pc` | Pairwise (Frozen) | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_IP_pc` | Image-Point Only | 1× Similarity Matrix (I-L) |
## How to Load the Models

You can load these checkpoints directly into your PyTorch environment using the `huggingface_hub` library.
### Prerequisites

```bash
pip install torch huggingface_hub
```
### Loading a Specific Checkpoint
```python
import torch
from huggingface_hub import hf_hub_download

REPO_ID = "Ximeng0831/CTP"

# Example: loading the default proposed model
SUBFOLDER = "pointnet2/192_l2_tensor_all"
FILENAME = "ckpt_epoch9.pt"

checkpoint_path = hf_hub_download(
    repo_id=REPO_ID,
    subfolder=SUBFOLDER,
    filename=FILENAME,
)

# Build the CTP model from the source repository first, e.g.:
# model = ctp(text_encoder, image_encoder, lidar_encoder, loss_fn)
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()
```
Source code: https://github.com/TAMU-CVRL/CTP
## Training Configurations
Detailed configuration files (YAML) for each experiment are available in the Official GitHub Repository.
- `all`: training runs for 10 epochs with a total batch size of 384 on two NVIDIA A100 (40 GB) GPUs.
- `pc`: training runs for 20 epochs with a batch size of 192 on a single NVIDIA RTX 4090 GPU.
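As a quick sanity check on these numbers: assuming the total batch is split evenly across GPUs (and, as a further assumption not stated in the repository, that the `192` prefix in the folder names denotes the per-GPU batch size), both setups work out to the same per-GPU batch:

```python
# Assumed interpretation: "192" in the folder names = per-GPU batch size.
total_batch_all, num_gpus_all = 384, 2  # `all` runs: two A100 40 GB GPUs
per_gpu_all = total_batch_all // num_gpus_all

batch_pc, num_gpus_pc = 192, 1          # `pc` runs: single RTX 4090
per_gpu_pc = batch_pc // num_gpus_pc

print(per_gpu_all, per_gpu_pc)  # 192 192
```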
**Note:** For specific hyperparameter settings such as learning rate schedules and weight decay, please refer to the corresponding `.yaml` files in the repository linked above.