# Training on Hugging Face with GPUs
This guide explains how to train the Energy Halting experiment on Hugging Face infrastructure, including local GPU training with `accelerate` and deployment to Hugging Face Spaces.
## Prerequisites
1. **Hugging Face Account**: Create one at [huggingface.co](https://huggingface.co).
2. **Access Token**: Get a write token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
3. **Pixi**: Installed locally (it manages this project's dependencies and tasks).
## 1. Local Training with Accelerate
We use Hugging Face `accelerate` for robust multi-GPU and mixed-precision training.
### Setup
Ensure dependencies are installed:
```bash
pixi install
```
### Configure Accelerate
Run the configuration wizard to set up your GPU environment (e.g., number of GPUs, mixed precision):
```bash
pixi run accelerate config
```
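The wizard writes its answers to a YAML file (by default under `~/.cache/huggingface/accelerate/default_config.yaml`). As a rough illustration, a single-machine, single-GPU, fp16 setup produces something like the following (trimmed to the most relevant fields; values vary with your answers):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: "NO"
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false
```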
### Run Training
Use `accelerate launch` to start training. This handles device placement automatically.
```bash
pixi run accelerate launch tasks/image_classification/train_energy.py \
    --energy_head_enabled \
    --loss_type energy_contrastive \
    --dataset cifar10 \
    --batch_size 32 \
    --use_amp \
    --push_to_hub \
    --hub_model_id <your-username>/ctm-energy-cifar10 \
    --hub_token <your-token>
```
## 2. Deploying to Hugging Face Spaces (GPU)
You can run this training job on a Hugging Face Space with a GPU.
### Create a Space
1. Go to [huggingface.co/new-space](https://huggingface.co/new-space).
2. Name: `ctm-energy-training` (or similar).
3. SDK: **Docker**.
4. Hardware: Choose a **GPU** instance (e.g., Nvidia T4, A10G).
### Deploy Code
You can deploy by pushing your code to the Space's repository.
1. **Clone the Space**:
   ```bash
   git clone https://huggingface.co/spaces/<your-username>/ctm-energy-training
   cd ctm-energy-training
   ```
2. **Copy Files**:
   Copy your project files into this directory (excluding `.git`, `.pixi`, `data`, `logs`).
   _Crucially, ensure `Dockerfile`, `pixi.toml`, `pixi.lock`, `tasks/`, `models/`, `utils/`, and `configs/` are present._
3. **Push**:
   ```bash
   git add .
   git commit -m "Deploy training job"
   git push
   ```
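The copy step above can also be scripted instead of done by hand. A minimal sketch using Python's `shutil.copytree` with the same exclusions (the demo below runs on throwaway directories; in practice `src` is your project checkout and `dst` the cloned Space repository):

```python
import shutil
import tempfile
from pathlib import Path

# Local-only directories that should not be pushed to the Space
EXCLUDE = (".git", ".pixi", "data", "logs")

def copy_project(src: Path, dst: Path) -> None:
    """Copy the project tree into the Space checkout, skipping
    local-only directories."""
    shutil.copytree(src, dst,
                    ignore=shutil.ignore_patterns(*EXCLUDE),
                    dirs_exist_ok=True)

# Demo on temporary directories, not a real project layout
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
(src / "tasks").mkdir()
(src / "tasks" / "train.py").write_text("# entry point\n")
(src / ".pixi").mkdir()
(src / ".pixi" / "envs").mkdir()

copy_project(src, dst)
print(sorted(p.name for p in dst.iterdir()))  # ['tasks']
```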
### Environment Variables
To allow the Space to push the trained model back to the Hub, you need to set your HF token as a secret.
1. Go to your Space's **Settings**.
2. Scroll to **Variables and secrets**.
3. Add a new secret:
- Name: `HF_TOKEN`
- Value: Your write token.
### Update Dockerfile CMD (Optional)
The default `Dockerfile` `CMD` prints the help text. To start training immediately on deployment, modify the `CMD` in the `Dockerfile` before pushing:
```dockerfile
CMD ["--energy_head_enabled", "--loss_type", "energy_contrastive", "--push_to_hub", "--hub_model_id", "<your-username>/ctm-energy-cifar10"]
```
_Note: the JSON (exec) form of `CMD` does not expand environment variables, so writing `"$HF_TOKEN"` here would pass the literal string. Either have the training script read the `HF_TOKEN` environment variable itself, or switch to the shell form of `CMD` so the variable expands._
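One way to pass the token via the environment is to have the training script fall back to `HF_TOKEN` when `--hub_token` is absent. A sketch, assuming argparse-style argument handling (`resolve_hub_token` is a hypothetical helper, not part of the project):

```python
import argparse

def resolve_hub_token(argv, env):
    """Prefer an explicit --hub_token flag; otherwise fall back to the
    HF_TOKEN secret exposed to the container as an environment variable."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hub_token", default=None)
    args, _ = parser.parse_known_args(argv)
    return args.hub_token or env.get("HF_TOKEN")

# An explicit flag wins over the environment variable
print(resolve_hub_token(["--hub_token", "hf_flag"], {"HF_TOKEN": "hf_secret"}))
# With no flag, the Space secret is used
print(resolve_hub_token([], {"HF_TOKEN": "hf_secret"}))
```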
## 3. Monitoring
- **Local**: Check the `logs/` directory or WandB if enabled (`--wandb`).
- **Spaces**: Check the **Logs** tab in your Space.