# Training on Hugging Face with GPUs

This guide explains how to train the Energy Halting experiment on Hugging Face infrastructure, including local GPU training with `accelerate` and deployment to Hugging Face Spaces.

## Prerequisites

1. **Hugging Face Account**: Create one at [huggingface.co](https://huggingface.co).
2. **Access Token**: Get a write token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
3. **Pixi**: Installed locally.

## 1. Local Training with Accelerate

We use Hugging Face `accelerate` for robust multi-GPU and mixed-precision training.

### Setup

Ensure dependencies are installed:

```bash
pixi install
```

### Configure Accelerate

Run the configuration wizard to set up your GPU environment (e.g., number of GPUs, mixed precision):

```bash
pixi run accelerate config
```

### Run Training

Use `accelerate launch` to start training. This handles device placement automatically.

```bash
pixi run accelerate launch tasks/image_classification/train_energy.py \
    --energy_head_enabled \
    --loss_type energy_contrastive \
    --dataset cifar10 \
    --batch_size 32 \
    --use_amp \
    --push_to_hub \
    --hub_model_id <username>/ctm-energy-cifar10 \
    --hub_token <your-token>
```

## 2. Deploying to Hugging Face Spaces (GPU)

You can run this training job on a Hugging Face Space with a GPU.

### Create a Space

1. Go to [huggingface.co/new-space](https://huggingface.co/new-space).
2. Name: `ctm-energy-training` (or similar).
3. SDK: **Docker**.
4. Hardware: Choose a **GPU** instance (e.g., Nvidia T4, A10G).

### Deploy Code

Deploy by pushing your code to the Space's repository.

1. **Clone the Space**:

   ```bash
   git clone https://huggingface.co/spaces/<username>/ctm-energy-training
   cd ctm-energy-training
   ```

2. **Copy Files**: Copy your project files into this directory (excluding `.git`, `.pixi`, `data`, `logs`). _Crucially, ensure `Dockerfile`, `pixi.toml`, `pixi.lock`, `tasks/`, `models/`, `utils/`, and `configs/` are present._

3. **Push**:

   ```bash
   git add .
   git commit -m "Deploy training job"
   git push
   ```

### Environment Variables

To allow the Space to push the trained model back to the Hub, set your HF token as a secret:

1. Go to your Space's **Settings**.
2. Scroll to **Variables and secrets**.
3. Add a new secret:
   - Name: `HF_TOKEN`
   - Value: your write token.

### Update Dockerfile CMD (Optional)

The default `Dockerfile` CMD prints help. To run training immediately upon deployment, modify the `CMD` in the `Dockerfile` before pushing:

```dockerfile
CMD ["--energy_head_enabled", "--loss_type", "energy_contrastive", "--push_to_hub", "--hub_model_id", "<username>/ctm-energy-cifar10", "--hub_token", "$HF_TOKEN"]
```

_Note: exec-form `CMD` does not expand environment variables, so `"$HF_TOKEN"` above is passed as the literal string `$HF_TOKEN`. Either read the token from the `HF_TOKEN` environment variable inside the training script, or use shell-form `CMD` so the shell expands the variable._

## 3. Monitoring

- **Local**: Check the `logs/` directory, or WandB if enabled (`--wandb`).
- **Spaces**: Check the **Logs** tab in your Space.
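Tying the secrets and `CMD` steps together, one option is a small entrypoint script that reads `HF_TOKEN` from the environment (where the Space exposes the secret) rather than hard-coding it in the Dockerfile. The sketch below is illustrative, not part of the repo: the file name `entrypoint.sh`, the `--run` flag, and the `require_env` helper are all assumed names.

```shell
#!/bin/sh
# entrypoint.sh -- hypothetical Space entrypoint sketch. The Space exposes
# the HF_TOKEN secret as an environment variable, so the launcher reads it
# from the environment instead of baking it into the Dockerfile CMD.

# Print a clear error and return non-zero if a required variable is unset or empty.
require_env() {
  eval "_val=\${$1:-}"
  if [ -z "$_val" ]; then
    echo "error: required env var $1 is not set" >&2
    return 1
  fi
}

# Only launch training when invoked with --run, so the helper above can be
# sourced and checked without starting a job.
if [ "${1:-}" = "--run" ]; then
  require_env HF_TOKEN || exit 1
  exec pixi run accelerate launch tasks/image_classification/train_energy.py \
    --energy_head_enabled \
    --loss_type energy_contrastive \
    --push_to_hub \
    --hub_token "$HF_TOKEN"
fi
```

Failing fast on a missing `HF_TOKEN` gives a readable error in the Space's **Logs** tab instead of an authentication failure deep inside training.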