# Training on Hugging Face with GPUs

This guide explains how to train the Energy Halting experiment on Hugging Face infrastructure, including local GPU training with `accelerate` and deployment to Hugging Face Spaces.

## Prerequisites

1.  **Hugging Face Account**: Create one at [huggingface.co](https://huggingface.co).
2.  **Access Token**: Get a write token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
3.  **Pixi**: The `pixi` package manager installed locally (see [pixi.sh](https://pixi.sh)).

## 1. Local Training with Accelerate

We use Hugging Face `accelerate` for robust multi-GPU and mixed-precision training.

### Setup

Ensure dependencies are installed:

```bash
pixi install
```

### Configure Accelerate

Run the configuration wizard to set up your GPU environment (e.g., number of GPUs, mixed precision):

```bash
pixi run accelerate config
```
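If you prefer a non-interactive setup, `accelerate` reads its settings from a YAML file (by default `~/.cache/huggingface/accelerate/default_config.yaml`). A minimal single-machine, two-GPU example might look like the following; the values are illustrative, so adjust them to your hardware:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_machines: 1
num_processes: 2   # one process per GPU
gpu_ids: all
```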

### Run Training

Use `accelerate launch` to start training. This handles device placement automatically.

```bash
pixi run accelerate launch tasks/image_classification/train_energy.py \
    --energy_head_enabled \
    --loss_type energy_contrastive \
    --dataset cifar10 \
    --batch_size 32 \
    --use_amp \
    --push_to_hub \
    --hub_model_id <your-username>/ctm-energy-cifar10 \
    --hub_token <your-token>
```

## 2. Deploying to Hugging Face Spaces (GPU)

You can run this training job on a Hugging Face Space with a GPU.

### Create a Space

1.  Go to [huggingface.co/new-space](https://huggingface.co/new-space).
2.  Name: `ctm-energy-training` (or similar).
3.  SDK: **Docker**.
4.  Hardware: Choose a **GPU** instance (e.g., Nvidia T4, A10G).

### Deploy Code

You can deploy by pushing your code to the Space's repository.

1.  **Clone the Space**:

    ```bash
    git clone https://huggingface.co/spaces/<your-username>/ctm-energy-training
    cd ctm-energy-training
    ```

2.  **Copy Files**:
    Copy your project files into this directory (excluding `.git`, `.pixi`, `data`, `logs`).
    _Crucially, ensure `Dockerfile`, `pixi.toml`, `pixi.lock`, `tasks/`, `models/`, `utils/`, and `configs/` are present._

3.  **Push**:
    ```bash
    git add .
    git commit -m "Deploy training job"
    git push
    ```

### Environment Variables

To allow the Space to push the trained model back to the Hub, you need to set your HF token as a secret.

1.  Go to your Space's **Settings**.
2.  Scroll to **Variables and secrets**.
3.  Add a new secret:
    - Name: `HF_TOKEN`
    - Value: Your write token.
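Inside the container, the secret appears as a regular environment variable. A small guard in whatever start script launches training (a sketch; the invocation in the usage comment is an assumption) can fail fast with a clear message if the secret was never configured:

```shell
# require_env NAME: succeed only if the environment variable NAME is set
# and non-empty; otherwise print an error and return non-zero.
require_env() {
  eval "_val=\${$1:-}"
  if [ -z "$_val" ]; then
    echo "error: required environment variable $1 is not set" >&2
    return 1
  fi
}

# Usage before launching training (hypothetical invocation):
#   require_env HF_TOKEN && \
#     pixi run python tasks/image_classification/train_energy.py --push_to_hub
```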

### Update Dockerfile CMD (Optional)

The default `Dockerfile` `CMD` prints help. To run training immediately upon deployment, modify the `CMD` in the `Dockerfile` before pushing:

```dockerfile
CMD ["--energy_head_enabled", "--loss_type", "energy_contrastive", "--push_to_hub", "--hub_model_id", "<your-username>/ctm-energy-cifar10"]
```

_Note: exec-form `CMD` (the JSON array syntax) is not processed by a shell, so `$HF_TOKEN` would be passed as a literal string rather than expanded. Either have the training script read the `HF_TOKEN` environment variable directly, or use shell form (`CMD python ... --hub_token "$HF_TOKEN"`) so the variable is expanded._

## 3. Monitoring

- **Local**: Check the `logs/` directory, or Weights & Biases if enabled with `--wandb`.
- **Spaces**: Check the **Logs** tab in your Space.