BioRLHF
Biological Reinforcement Learning from Human Feedback: a framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO, with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.
Highlights
- Three-stage training pipeline: SFT → DPO → GRPO with verifier-based rewards
- Multi-reward GRPO: Four composable verifiers (factual, pathway, consistency, uncertainty) with configurable weights
- 19% higher average reward than the SFT baseline after GRPO (0.650 vs 0.547)
- 70% lower calibration error: ECE reduced from 0.258 to 0.078 after GRPO
- 90% accuracy on domain-specific biological reasoning tasks (SFT stage)
- Learns from 363 examples: efficient domain adaptation from spaceflight transcriptomics data
Key Results
GRPO Training (Phase 3)
| Metric | SFT Baseline | After GRPO | Improvement |
|---|---|---|---|
| Avg Reward | 0.547 | 0.650 | +19% |
| ECE (Calibration Error) | 0.258 | 0.078 | -70% |
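For reference, expected calibration error (ECE) is commonly computed by binning answers by stated confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below assumes equal-width bins and per-answer confidences in [0, 1]. How BioRLHF extracts confidences and how many bins it uses are not stated here, so `expected_calibration_error` and `n_bins=10` are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin fraction) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so 1.0 lands in the last bin and negatives in the first.
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (not from the BioRLHF evaluation): four answers at two confidence levels.
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_correct = [True, False, True, True]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

A lower value means stated confidence tracks observed accuracy more closely; a perfectly calibrated model scores 0.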
GRPO Configuration (Full v2):
- 16 generations per prompt (G=16) for robust advantage estimation
- Multi-reward: V1 (factual, 0.35) + V2 (pathway, 0.30) + V3 (consistency, 0.15) + V4 (uncertainty, 0.20)
- KL penalty beta=0.02, 2 iterations per batch, group-normalized rewards
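Group-normalized rewards in GRPO are typically turned into advantages by standardizing each completion's reward against the other completions sampled for the same prompt. The following is a minimal sketch assuming the common mean/standard-deviation normalization with a small epsilon; the function name and `eps` value are illustrative and not taken from the BioRLHF code.

```python
from statistics import fmean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of G completions."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of G=4 rewards (illustrative values only).
advantages = group_normalized_advantages([0.2, 0.4, 0.6, 0.8])

# If every completion scores the same, all advantages are zero and the
# group contributes no learning signal (a "zero-variance" group).
flat = group_normalized_advantages([0.5, 0.5, 0.5, 0.5])
```

Sampling more completions per prompt makes it less likely that a whole group ties, which is one reason to prefer a larger G.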
Model Comparison (SFT, 20-question evaluation)
| Model | Overall | Factual | Reasoning | Calibration |
|---|---|---|---|---|
| Mistral-7B | 90.0% | 80.0% | 100.0% | 100.0% |
| Qwen2.5-7B | 40.0% | 30.0% | 80.0% | 20.0% |
| Phi-2 | 25.0% | 20.0% | 60.0% | 0.0% |
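The overall score is a question-weighted average of the category scores. The per-category question counts are not stated here; the sketch below assumes a 10/5/5 split (factual/reasoning/calibration), which is one split that reproduces all three reported overall scores.

```python
# Assumed split of the 20 questions; not stated in the source.
ASSUMED_COUNTS = {"factual": 10, "reasoning": 5, "calibration": 5}

# Category accuracies from the table above, as fractions.
REPORTED = {
    "Mistral-7B": {"factual": 0.80, "reasoning": 1.00, "calibration": 1.00},
    "Qwen2.5-7B": {"factual": 0.30, "reasoning": 0.80, "calibration": 0.20},
    "Phi-2": {"factual": 0.20, "reasoning": 0.60, "calibration": 0.00},
}


def overall_accuracy(category_acc, counts):
    """Question-weighted average of per-category accuracies."""
    total = sum(counts.values())
    n_correct = sum(category_acc[name] * counts[name] for name in counts)
    return n_correct / total


overall = {model: overall_accuracy(acc, ASSUMED_COUNTS) for model, acc in REPORTED.items()}
```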
SFT Training Progression
| Version | Accuracy | Key Improvement |
|---|---|---|
| v1 (Base SFT) | ~20% | Format learned, facts wrong |
| v2 (Expanded) | ~60% | More examples helped |
| v3 (Fact Drilling) | ~80% | Repetition fixed key facts |
| v4 (Advanced) | ~85% | Chain-of-thought, calibration |
| Final | 90% | Targeted drilling for remaining errors |
Installation
From Source
git clone https://github.com/jang1563/BioRLHF.git
cd BioRLHF
pip install -e .
With Development Dependencies
pip install -e ".[dev]"
GPU Requirements
- NVIDIA GPU with 48GB+ VRAM recommended (A40 or A100)
- 24GB+ VRAM sufficient for SFT/DPO with 4-bit quantization
- CUDA 12.1+ recommended
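As a rough guide to why 4-bit quantization lowers the VRAM requirement, the memory taken by model weights alone scales with parameter count times bits per parameter. The sketch assumes a nominal 7 billion parameters and ignores activations, optimizer state, LoRA adapters, and quantization overhead, so real usage is higher than these figures.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # nominal size of a "7B" model; exact counts vary by model

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)   # 4-bit quantized weights
```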
Quick Start
SFT Training
from biorlhf import SFTTrainingConfig, run_sft_training
config = SFTTrainingConfig(
    model_name="mistralai/Mistral-7B-v0.3",
    dataset_path="data/kmp_sft_final.json",
    output_dir="./my_sft_model",
    num_epochs=10,
    learning_rate=1e-4,
)
model_path = run_sft_training(config)
GRPO Training with Verifiers
# Using the CLI
biorlhf-grpo --config configs/grpo_full_v2.json
# Or programmatically
from biorlhf.training.grpo import GRPOConfig, run_grpo_training
config = GRPOConfig.from_json("configs/grpo_full_v2.json")
run_grpo_training(config)
Creating a Dataset
from biorlhf.data import create_sft_dataset
dataset = create_sft_dataset(
    output_path="my_dataset.json",
    include_calibration=True,
    include_chain_of_thought=True,
)
print(f"Created {len(dataset)} training examples")
Evaluating a Model
from biorlhf import evaluate_model
result = evaluate_model(
    model_path="./my_sft_model",
    test_questions_path="data/kmp_test_set.json",
)
print(f"Overall Accuracy: {result.overall_accuracy:.1%}")
print(f"Factual: {result.factual_accuracy:.1%}")
print(f"Reasoning: {result.reasoning_accuracy:.1%}")
print(f"Calibration: {result.calibration_accuracy:.1%}")
Running Inference
from biorlhf.utils import load_model_for_inference, generate_response
model, tokenizer = load_model_for_inference(
    model_path="./my_sft_model",
    base_model="mistralai/Mistral-7B-v0.3",
)
prompt = "### Instruction:\nWhich tissue is most sensitive to ionizing radiation?\n\n### Response:\n"
response = generate_response(model, tokenizer, prompt)
print(response)
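The prompt in the example follows an instruction/response template. A small helper such as the hypothetical `build_prompt` below (not part of the `biorlhf` API shown here) keeps that template consistent across calls:

```python
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(instruction):
    """Wrap a question in the instruction/response template used above."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


prompt = build_prompt("Which tissue is most sensitive to ionizing radiation?")
```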
Architecture
Three-Stage Training Pipeline
| Stage 1: SFT (Supervised Fine-Tuning) | Stage 2: DPO (Direct Preference Optimization) | Stage 3: GRPO (Group Relative Policy Optimization) |
|---|---|---|
| Mistral-7B-v0.3 | SFT model | SFT model (merged) |
| LoRA (r=64, alpha=128) | Preference pairs | Generate G=16 completions |
| 363 training examples | Ranked responses | Score with V1-V4 verifiers |
| 10 epochs, lr=1e-4 | beta=0.1 | Multi-reward composition |
| SFT Adapter | DPO Model | GRPO Model |
Verifier-Based Reward System (V1-V4)
| Verifier | Name | Weight | What It Scores |
|---|---|---|---|
| V1 | Factual | 0.35 | Exact match of biological facts (DEG counts, tissue names, directions) |
| V2 | Pathway | 0.30 | Correct pathway/gene set enrichment references (Hallmark, KEGG) |
| V3 | Consistency | 0.15 | Internal logical consistency within the response |
| V4 | Uncertainty | 0.20 | Appropriate confidence calibration and epistemic humility |
The verifiers are composable via RewardComposer and can be individually weighted:
from biorlhf.verifiers import RewardComposer
composer = RewardComposer(
    active_verifiers=["V1", "V2", "V3", "V4"],
    weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)
reward = composer.score(question, response, ground_truth)
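Conceptually, composing the verifiers amounts to a weighted sum of per-verifier scores. The sketch below illustrates that idea only. It assumes each verifier returns a score in [0, 1] and that weights are renormalized over the active verifiers, neither of which is confirmed by the source. `compose_reward` is a hypothetical stand-in, not the `RewardComposer` implementation.

```python
DEFAULT_WEIGHTS = {"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20}


def compose_reward(scores, weights=DEFAULT_WEIGHTS, active=None):
    """Weighted average of verifier scores over the active verifiers."""
    active = list(weights) if active is None else list(active)
    total_weight = sum(weights[name] for name in active)
    if total_weight <= 0:
        raise ValueError("active verifiers must have positive total weight")
    # Renormalize so the result stays in [0, 1] when some verifiers are disabled.
    return sum(weights[name] * scores[name] for name in active) / total_weight


# Illustrative per-verifier scores for one response (made-up values).
example_scores = {"V1": 1.0, "V2": 0.5, "V3": 1.0, "V4": 0.0}
full_reward = compose_reward(example_scores)
factual_only = compose_reward(example_scores, active=["V1"])
```

With all four verifiers active, the made-up scores above combine to 0.35 + 0.15 + 0.15 + 0 = 0.65.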
Dataset
Training data is derived from a 2x2x2 factorial transcriptomic study:
- Drug: Kaempferol (KMP) vs Control
- Stressor 1: Hindlimb Unloading (HU), which simulates microgravity
- Stressor 2: Ionizing Radiation (IR), which simulates space radiation
- Tissues: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)
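The 2x2x2 design yields eight treatment combinations. They can be enumerated as below; the level labels (`"Control"`, `"none"`) and factor keys are illustrative naming rather than the dataset's own.

```python
from itertools import product

FACTORS = {
    "drug": ["Control", "KMP"],
    "unloading": ["none", "HU"],
    "radiation": ["none", "IR"],
}

# Every combination of the three two-level factors: 2 * 2 * 2 = 8 groups.
groups = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]

for group in groups:
    print(group)
```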
Training Example Types
| Type | Count | Purpose |
|---|---|---|
| Factual Q&A | ~150 | Specific facts (DEG counts, tissue types) |
| Chain-of-Thought | ~50 | Step-by-step reasoning |
| Calibration | ~30 | Uncertainty expression |
| Multi-hop Reasoning | ~30 | Integrating multiple facts |
| Error Correction | ~20 | Learning from mistakes |
Ground Truth Data
from biorlhf.data import (
    STRESSOR_EFFECTS,
    KMP_EFFECTS,
    INTERACTIONS,
    TISSUE_TYPES,
    OXPHOS_PATTERNS,
)
# Example: Get DEG counts for stressors
print(STRESSOR_EFFECTS["Hippocampus"])
# {'HU': 1555, 'IR': 5477, 'HU_IR': 5510}
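Ground truth of this shape lends itself to exact-match checks like those V1 is described as performing. The sketch below is a simplified illustration, not the V1 verifier itself: it copies the Hippocampus counts shown above into a local dict and uses a hypothetical `mentions_count` helper that looks for the expected integer in a response.

```python
import re

# Local copy of the values printed above, so the sketch runs without biorlhf installed.
HIPPOCAMPUS_DEGS = {"HU": 1555, "IR": 5477, "HU_IR": 5510}


def mentions_count(response, expected):
    """Return True if the response contains `expected` as a standalone integer."""
    # Drop thousands separators so "5,477" matches 5477.
    numbers = re.findall(r"\d+", response.replace(",", ""))
    return str(expected) in numbers


good = mentions_count("IR alone produced 5,477 DEGs in hippocampus.", HIPPOCAMPUS_DEGS["IR"])
bad = mentions_count("IR alone produced about 500 DEGs.", HIPPOCAMPUS_DEGS["IR"])
```

Matching whole numbers rather than substrings avoids crediting a response that says "15550" when the expected count is 1555.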
Project Structure
BioRLHF/
├── src/biorlhf/              # Main package
│   ├── training/             # SFT, DPO, and GRPO trainers
│   ├── data/                 # Dataset creation & ground truth
│   ├── evaluation/           # Model evaluation & calibration
│   ├── verifiers/            # V1-V4 reward verifiers
│   │   ├── factual.py        # V1: Factual accuracy scoring
│   │   ├── pathway.py        # V2: Pathway enrichment scoring
│   │   ├── consistency.py    # V3: Logical consistency scoring
│   │   ├── uncertainty.py    # V4: Calibration/uncertainty scoring
│   │   └── composer.py       # Multi-reward composition
│   ├── utils/                # Model loading, inference helpers
│   └── cli.py                # Command-line interface
├── configs/                  # Training configurations
│   ├── grpo_mve.json         # Minimum viable experiment
│   └── grpo_full_v2.json     # Full multi-reward training
├── data/                     # Training datasets
│   ├── kmp_sft_final.json    # 363 SFT training examples
│   └── kmp_test_set.json     # 20-question evaluation set
├── examples/                 # Usage examples
├── scripts/                  # SLURM job scripts & HPC guide
├── tests/                    # Unit tests
└── docs/                     # Documentation
Scientific Contributions
1. Verifier-Based GRPO Improves Calibration
- GRPO with V1-V4 verifiers reduced calibration error (ECE) by 70%
- Multi-reward composition outperforms single-reward training
- Sampling G=16 generations per prompt reduced the share of zero-variance batches from 50% to under 5%
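One way to see why a larger group helps: if each completion independently receives one of two reward values, the chance that all G completions tie falls off geometrically with G. This is a simplifying assumption, since the rewards here are weighted sums of four verifiers. The success probability `p=0.9` is illustrative, and G=4 is a hypothetical comparison point, not a configuration reported above.

```python
def zero_variance_probability(p, g):
    """P(all g samples get the same binary reward) when each succeeds with probability p."""
    return p**g + (1 - p) ** g


for g in (4, 16):
    print(g, zero_variance_probability(0.9, g))
```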
2. Fact Drilling Works for SFT
- Initial training: 20% accuracy on key facts
- After targeted repetition: 100% accuracy on drilled facts
- LLMs need explicit reinforcement of specific domain facts
3. Calibration is Learnable
- Trained on "I cannot determine X from this data" examples
- Mistral achieved 100% calibration accuracy at SFT stage
- GRPO further improved calibration via the V4 uncertainty verifier
4. DPO is Fragile for Domain Knowledge
- Aggressive DPO (beta=0.05) destroyed learned knowledge
- Model hallucinated unrelated content
- Preference learning needs careful tuning in specialized domains
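For context, the standard DPO loss for one preference pair is `-log(sigmoid(beta * margin))`, where `margin` is the policy-vs-reference log-probability gain on the chosen response minus the same gain on the rejected response. `beta` sets how strongly the policy is held near the reference model, so a smaller `beta` permits larger drift. The sketch evaluates this loss on made-up log-probabilities to show the effect of `beta`; it is not the BioRLHF DPO trainer.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)), with x = beta * margin.
    return math.log1p(math.exp(-beta * margin))


# Made-up log-probabilities giving a margin of 4 in favor of the chosen response.
for beta in (0.05, 0.1):
    print(beta, dpo_loss(-10.0, -14.0, -12.0, -12.0, beta))
```

For the same margin, the loss stays higher at the smaller `beta`, so optimization keeps pushing the policy further from the reference before the loss flattens.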
5. Architecture Matters More Than Size
- Mistral-7B far outperformed Qwen2.5-7B (90% vs 40% overall) despite similar parameter counts
- Phi-2 (2.7B) insufficient for complex biological reasoning
- Model selection is critical for domain fine-tuning
Key Learnings for AI Safety
- Honesty is trainable: models can learn appropriate epistemic humility
- Domain grounding matters: anchoring to experimental truth prevents hallucination
- Multi-reward > single reward: decomposing correctness into verifiable dimensions improves the learning signal
- Preference learning is fragile: DPO can catastrophically forget domain knowledge
- Evaluation drives improvement: systematic testing reveals specific failure modes
Related Projects
- SpaceOmicsBench: 115-question benchmark for LLMs on spaceflight biomedical data
Citation
If you use BioRLHF in your research, please cite:
@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Biological Reinforcement Learning from Human Feedback},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF},
  note = {Fine-tuning LLMs for biological reasoning with verifier-based GRPO}
}
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License; see the LICENSE file for details.
Developed by JangKeun Kim (jak4013@med.cornell.edu), Weill Cornell Medicine