Training on Hugging Face with GPUs

This guide explains how to train the Energy Halting experiment on Hugging Face infrastructure, including local GPU training with accelerate and deployment to Hugging Face Spaces.

Prerequisites

  1. Hugging Face Account: Create one at huggingface.co.
  2. Access Token: Get a write token from huggingface.co/settings/tokens.
  3. Pixi: Installed locally.
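
If you prefer not to paste the token into every command, you can authenticate once up front. A minimal sketch, assuming the huggingface_hub CLI is available in the pixi environment:

# Option A: interactive login (stores the token in the local Hugging Face cache)
pixi run huggingface-cli login

# Option B: export the token for the current shell session only
export HF_TOKEN=<your-token>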

1. Local Training with Accelerate

We use Hugging Face accelerate for robust multi-GPU and mixed-precision training.

Setup

Ensure dependencies are installed:

pixi install

Configure Accelerate

Run the configuration wizard to set up your GPU environment (e.g., number of GPUs, mixed precision):

pixi run accelerate config
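
If you want to skip the interactive prompts (for example in a scripted setup), accelerate can also write a default configuration. This is a sketch; check accelerate config --help for the options available in your installed version:

pixi run accelerate config default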

Run Training

Use accelerate launch to start training. This handles device placement automatically.

pixi run accelerate launch tasks/image_classification/train_energy.py \
    --energy_head_enabled \
    --loss_type energy_contrastive \
    --dataset cifar10 \
    --batch_size 32 \
    --use_amp \
    --push_to_hub \
    --hub_model_id <your-username>/ctm-energy-cifar10 \
    --hub_token <your-token>
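
If you need to override the saved configuration for a single run, accelerate launch accepts flags such as --num_processes and --mixed_precision. For example (illustrative, trimmed to a couple of the script flags shown above):

pixi run accelerate launch --num_processes 2 --mixed_precision fp16 \
    tasks/image_classification/train_energy.py \
    --energy_head_enabled \
    --loss_type energy_contrastive \
    --dataset cifar10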

2. Deploying to Hugging Face Spaces (GPU)

You can run this training job on a Hugging Face Space with a GPU.

Create a Space

  1. Go to huggingface.co/new-space.
  2. Name: ctm-energy-training (or similar).
  3. SDK: Docker.
  4. Hardware: Choose a GPU instance (e.g., Nvidia T4, A10G).

Deploy Code

You can deploy by pushing your code to the Space's repository.

  1. Clone the Space:

    git clone https://huggingface.co/spaces/<your-username>/ctm-energy-training
    cd ctm-energy-training
    
  2. Copy Files: Copy your project files into this directory, excluding .git, .pixi, data, and logs (see the example command after this list). Make sure Dockerfile, pixi.toml, pixi.lock, tasks/, models/, utils/, and configs/ are present.

  3. Push:

    git add .
    git commit -m "Deploy training job"
    git push
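
For step 2, one way to copy the project while skipping the excluded directories is rsync. A sketch, run from your project root and assuming the Space clone sits next to it:

    rsync -av \
        --exclude '.git' --exclude '.pixi' --exclude 'data' --exclude 'logs' \
        ./ ../ctm-energy-training/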
    

Environment Variables

To allow the Space to push the trained model back to the Hub, you need to set your HF token as a secret.

  1. Go to your Space's Settings.
  2. Scroll to Variables and secrets.
  3. Add a New Secret:
    • Name: HF_TOKEN
    • Value: Your write token.
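
Secrets are exposed to the running container as environment variables, so anything launched inside the Space can read HF_TOKEN at runtime. A quick sanity check from a shell inside the container (illustrative):

# Prints a message depending on whether the secret is visible to the process
test -n "$HF_TOKEN" && echo "HF_TOKEN is set" || echo "HF_TOKEN is missing"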

Update Dockerfile CMD (Optional)

The default Dockerfile CMD prints help. To run training immediately upon deployment, modify the CMD in the Dockerfile before pushing:

CMD ["--energy_head_enabled", "--loss_type", "energy_contrastive", "--push_to_hub", "--hub_model_id", "<your-username>/ctm-energy-cifar10", "--hub_token", "$HF_TOKEN"]

Note: The exec (JSON array) form of CMD is not run through a shell, so "$HF_TOKEN" would be passed literally rather than expanded. Either have the training job read the token from the HF_TOKEN environment variable (the secret set above), or switch to the shell form of CMD so the variable is expanded when the container starts; a sketch of the latter follows.
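
A sketch of the shell-form approach, assuming pixi is on PATH inside the image and the container should run the training script directly (if your Dockerfile defines an exec-form ENTRYPOINT that already invokes the script, adjust or remove it so the two don't conflict):

# Shell form: a shell expands $HF_TOKEN at container start
CMD pixi run python tasks/image_classification/train_energy.py \
    --energy_head_enabled --loss_type energy_contrastive \
    --push_to_hub --hub_model_id <your-username>/ctm-energy-cifar10 \
    --hub_token "$HF_TOKEN"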

3. Monitoring

  • Local: Check the logs/ directory or WandB if enabled (--wandb).
  • Spaces: Check the Logs tab in your Space.
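
To enable Weights & Biases logging for a local run, pass --wandb to the training script; you also need to be logged in to WandB first. A sketch, assuming the wandb CLI is available in the pixi environment:

pixi run wandb login
pixi run accelerate launch tasks/image_classification/train_energy.py --energy_head_enabled --wandb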