# Training on Hugging Face with GPUs

This guide explains how to train the Energy Halting experiment on Hugging Face infrastructure, including local GPU training with `accelerate` and deployment to Hugging Face Spaces.

## Prerequisites

1. **Hugging Face Account**: Create one at [huggingface.co](https://huggingface.co).
2. **Access Token**: Get a write token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
3. **Pixi**: Installed locally.
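
If you want to confirm the token is valid before starting a run, `huggingface_hub` (assumed here to be available in the project environment) can check it. A minimal sketch:

```python
# Sketch: confirm the access token works before starting a training run.
from huggingface_hub import HfApi

user = HfApi(token="<your-token>").whoami()
print(user["name"])  # your Hugging Face username
```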

## 1. Local Training with Accelerate

We use Hugging Face `accelerate` for robust multi-GPU and mixed-precision training.

### Setup

Ensure dependencies are installed:

```bash
pixi install
```

### Configure Accelerate

Run the configuration wizard to set up your GPU environment (e.g., number of GPUs, mixed precision):

```bash
pixi run accelerate config
```

### Run Training

Use `accelerate launch` to start training. This handles device placement automatically.

```bash
pixi run accelerate launch tasks/image_classification/train_energy.py \
    --energy_head_enabled \
    --loss_type energy_contrastive \
    --dataset cifar10 \
    --batch_size 32 \
    --use_amp \
    --push_to_hub \
    --hub_model_id <your-username>/ctm-energy-cifar10 \
    --hub_token <your-token>
```
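
Under the hood, `accelerate` wraps the model, optimizer, and data loaders and moves them onto the configured devices. The sketch below shows the generic pattern only; `model`, `optimizer`, and `train_loader` are placeholders, not the actual objects in `train_energy.py`:

```python
# Minimal sketch of the accelerate training pattern (not the real train_energy.py).
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # reads the config written by `accelerate config`

model = torch.nn.Linear(32, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,))),
    batch_size=32,
)

# prepare() moves everything to the right device(s) and wraps them for
# distributed and mixed-precision execution.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```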

## 2. Deploying to Hugging Face Spaces (GPU)

You can run this training job on a Hugging Face Space with a GPU.

### Create a Space

1. Go to [huggingface.co/new-space](https://huggingface.co/new-space).
2. Name: `ctm-energy-training` (or similar).
3. SDK: **Docker**.
4. Hardware: Choose a **GPU** instance (e.g., NVIDIA T4, A10G).
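
If you prefer to script this step, the Space can also be created with `huggingface_hub`. This is a sketch, not part of the project's tooling; the hardware string and repo name are assumptions:

```python
# Sketch: create the Docker Space programmatically instead of via the web form.
# "t4-small" and the repo name are assumptions; adjust to your account and quota.
from huggingface_hub import create_repo

create_repo(
    "<your-username>/ctm-energy-training",
    repo_type="space",
    space_sdk="docker",         # matches the Docker SDK chosen above
    space_hardware="t4-small",  # GPU tier; e.g. "a10g-small" for an A10G
    token="<your-token>",
)
```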

### Deploy Code

You can deploy by pushing your code to the Space's repository.

1. **Clone the Space**:
```bash
git clone https://huggingface.co/spaces/<your-username>/ctm-energy-training
cd ctm-energy-training
```
2. **Copy Files**:
   Copy your project files into this directory (excluding `.git`, `.pixi`, `data`, and `logs`).
   _Crucially, ensure `Dockerfile`, `pixi.toml`, `pixi.lock`, `tasks/`, `models/`, `utils/`, and `configs/` are present._
3. **Push**:
```bash
git add .
git commit -m "Deploy training job"
git push
```
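
If you would rather skip the clone-and-copy steps, `huggingface_hub` can upload the working directory to the Space directly. A sketch, assuming it is run from the project root; the `ignore_patterns` list mirrors the exclusions above:

```python
# Sketch: upload the project to the Space without a local git clone.
from huggingface_hub import HfApi

api = HfApi(token="<your-token>")
api.upload_folder(
    folder_path=".",
    repo_id="<your-username>/ctm-energy-training",
    repo_type="space",
    commit_message="Deploy training job",
    ignore_patterns=[".git/*", ".pixi/*", "data/*", "logs/*"],
)
```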

### Environment Variables

To allow the Space to push the trained model back to the Hub, you need to set your HF token as a secret.

1. Go to your Space's **Settings**.
2. Scroll to **Variables and secrets**.
3. Add a new secret:
   - Name: `HF_TOKEN`
   - Value: Your write token.
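
Inside the Space, the secret is exposed to the container as an environment variable, so the script can read it at runtime instead of receiving it on the command line. A minimal sketch (the output directory and model repo name are placeholders, and this is not the actual logic of `train_energy.py`):

```python
# Sketch: read the Space secret from the environment and use it when pushing.
import os

from huggingface_hub import HfApi

hf_token = os.environ.get("HF_TOKEN")  # set via the Space's "Variables and secrets"
if hf_token is None:
    raise RuntimeError("HF_TOKEN is not set; add it as a Space secret.")

api = HfApi(token=hf_token)
api.upload_folder(
    folder_path="checkpoints/",                    # assumed output directory
    repo_id="<your-username>/ctm-energy-cifar10",  # assumes the model repo exists
    repo_type="model",
)
```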

### Update Dockerfile CMD (Optional)

The default `Dockerfile` CMD prints help. To run training immediately upon deployment, modify the `CMD` in the `Dockerfile` before pushing:

```dockerfile
CMD ["--energy_head_enabled", "--loss_type", "energy_contrastive", "--push_to_hub", "--hub_model_id", "<your-username>/ctm-energy-cifar10", "--hub_token", "$HF_TOKEN"]
```

_Note: the exec form of `CMD` does not expand environment variables, so `$HF_TOKEN` would be passed as a literal string. Either switch to the shell form of `CMD`, or drop `--hub_token` and read the `HF_TOKEN` variable inside the script (as sketched above)._

## 3. Monitoring

- **Local**: Check the `logs/` directory, or WandB if enabled (`--wandb`).
- **Spaces**: Check the **Logs** tab in your Space.
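
To check programmatically whether the Space is still building or already running on the expected hardware, `huggingface_hub` exposes the runtime state; a small sketch, assuming the Space name used above:

```python
# Sketch: query the Space's runtime state (build/run stage and hardware tier).
from huggingface_hub import HfApi

runtime = HfApi(token="<your-token>").get_space_runtime(
    "<your-username>/ctm-energy-training"
)
print(runtime.stage)     # e.g. "BUILDING" or "RUNNING"
print(runtime.hardware)  # e.g. "t4-small"
```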