# BLUX-cA Adapter Training Pipeline
This folder contains a reproducible LoRA/QLoRA adapter training pipeline for BLUX-cA that consumes the external dataset repository. The dataset must live **outside** this repository; set `DATASET_DIR` to its absolute path (for example `/workspace/blux-ca-dataset`).
## Prerequisites
- Python 3.10+
- Recommended: NVIDIA GPU with recent CUDA drivers
- Sufficient disk space/memory for the base model (default: `Qwen/Qwen2.5-7B-Instruct`)
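A quick preflight check (assumes the standard `nvidia-smi` utility on GPU hosts):
```bash
python3 --version   # should report 3.10 or newer
nvidia-smi          # should list your GPU; absent or failing on CPU-only hosts
```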
## Environment setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r train/requirements.txt
```
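As a quick sanity check that the install succeeded, a minimal sketch assuming `torch` is among the pinned requirements:
```bash
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"
```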
## Dataset layout
The dataset directory should contain:
```
prompts/system_core.txt
data/*.jsonl
eval/*.jsonl
```
Each training JSONL line must include a `messages` array containing `system`, `user`, and `assistant` roles. The system content must equal `<SYSTEM_PROMPT_FROM_BLUX_CA>`.
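For illustration, a minimal valid record might look like the following; the `user` and `assistant` contents here are placeholders, and only the `system` sentinel is prescribed:
```
{"messages": [{"role": "system", "content": "<SYSTEM_PROMPT_FROM_BLUX_CA>"}, {"role": "user", "content": "example question"}, {"role": "assistant", "content": "example answer"}]}
```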
## Commands (copy/paste)
Set the dataset location once per shell:
```bash
export DATASET_DIR=/absolute/path/to/blux-ca-dataset
```
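A quick sanity check that the expected layout (see above) is present:
```bash
ls "$DATASET_DIR"/prompts/system_core.txt "$DATASET_DIR"/data/*.jsonl "$DATASET_DIR"/eval/*.jsonl
```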
Validate dataset strictly (always invokes the dataset repo validator first):
```bash
python train/validate_dataset.py --dataset-dir "$DATASET_DIR" --strict
```
Dry-run (loads base model, prepares 5 samples, tokenizes). On CPU-only hosts the base model automatically falls back to
`Qwen/Qwen2.5-1.5B-Instruct` unless you override `BASE_MODEL`:
```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --dry-run
```
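To pin the smaller base explicitly (for example, on a memory-constrained host), the override can be passed inline:
```bash
BASE_MODEL=Qwen/Qwen2.5-1.5B-Instruct python train/train_adapter.py --dataset-dir "$DATASET_DIR" --dry-run
```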
Smoke train (adapter, capped mix):
```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --max-samples 200 --run-name smoke
```
Full train:
```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --run-name full
```
Eval gate (strict). Use `--use-stub` when running without a trained adapter or when offline:
```bash
python train/run_eval.py --dataset-dir "$DATASET_DIR" --run runs/<timestamp_or_name> --strict --use-stub
```
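To target the most recent run without typing its timestamp, one possible shell helper (relies on run folders sorting by modification time):
```bash
RUN_DIR=$(ls -dt runs/*/ | head -n 1)
python train/run_eval.py --dataset-dir "$DATASET_DIR" --run "$RUN_DIR" --strict --use-stub
```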
GPU is recommended for smoke/full runs. On CPU-only hosts the dry-run falls back to `Qwen/Qwen2.5-1.5B-Instruct` automatically (see above); set `BASE_MODEL` explicitly only if you need to pin a different base model.
## Outputs
- Runs are created under `runs/YYYYMMDD_HHMMSS_<optional_name>/`
- Prepared dataset + resolved mix: `runs/<timestamp>/prepared_train.jsonl` and `runs/<timestamp>/mix_config_resolved.yaml`
- Training artifacts: `runs/<timestamp>/adapter/` plus `runs/<timestamp>/training_args.json` and `config_snapshot.yaml`
- Evaluation report: `runs/<timestamp>/eval_report.md`
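Putting the bullets together, a completed run folder should look roughly like this (timestamp and name are illustrative):
```
runs/20250101_120000_full/
├── prepared_train.jsonl
├── mix_config_resolved.yaml
├── training_args.json
├── config_snapshot.yaml
├── adapter/
└── eval_report.md
```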
## Release checklist
- Dataset validated (`python train/validate_dataset.py --dataset-dir ... --strict`)
- Prepared dataset generated and referenced by run folder
- Evaluation run passes in strict mode
- Adapter artifacts present under `runs/<timestamp>/adapter/`
- Model card/README updated before publishing adapter (adapter-only, no base weights)
## Uploading adapter to Hugging Face Hub
```bash
# inside a clone of the Hub model repo; adjust the adapter source path as needed
git lfs install
git lfs track "*.safetensors"
cp /path/to/runs/<timestamp>/adapter/* .
# add README/model card as needed
git add . && git commit -m "Add BLUX-cA adapter"
git push
```
Upload only the adapter weights; do not upload base model weights.
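Alternatively, a sketch using the `huggingface_hub` CLI (assumes a recent `huggingface_hub` install and that `<your_hub_repo_id>` is a model repo you have already created on the Hub):
```bash
huggingface-cli login   # authenticate once per machine
huggingface-cli upload <your_hub_repo_id> runs/<timestamp>/adapter .
```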