BioRLHF

Biological Reinforcement Learning from Human Feedback: A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.

Highlights

Three-stage training pipeline: SFT → DPO → GRPO with verifier-based rewards
Multi-reward GRPO: Four composable verifiers (factual, pathway, consistency, uncertainty) with configurable weights
+19% reward improvement over SFT baseline using GRPO (0.650 vs 0.547)
-70% calibration error: ECE reduced from 0.258 to 0.078 after GRPO
90% accuracy on domain-specific biological reasoning tasks (SFT stage)
Learns from 363 examples: efficient domain adaptation from spaceflight transcriptomics data

Key Results

GRPO Training (Phase 3)

Metric	SFT Baseline	After GRPO	Improvement
Avg Reward	0.547	0.650	+19%
ECE (Calibration Error)	0.258	0.078	-70%
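
For readers unfamiliar with the metric, expected calibration error (ECE) is commonly computed by grouping answers into confidence bins and averaging the gap between stated confidence and observed accuracy, weighted by bin size. The sketch below is a generic equal-width-bin version in plain Python. The function name, the bin count, and the toy inputs are illustrative assumptions, not BioRLHF's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up for illustration): four answers with stated confidences.
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_correct = [True, False, True, True]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

In the toy data, the 0.9-confidence answers are right half the time and the 0.6-confidence answers are always right, so both bins are off by 0.4 and the ECE is 0.4. A lower value means stated confidence tracks accuracy more closely.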

GRPO Configuration (Full v2):

16 generations per prompt (G=16) for robust advantage estimation
Multi-reward: V1 (factual, 0.35) + V2 (pathway, 0.30) + V3 (consistency, 0.15) + V4 (uncertainty, 0.20)
KL penalty beta=0.02, 2 iterations per batch, group-normalized rewards
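
"Group-normalized rewards" in GRPO usually means that each completion's reward is standardized against the other completions sampled for the same prompt. The following is a minimal sketch of that idea, assuming the common mean/standard-deviation form with a small epsilon. The function name and toy rewards are hypothetical, and the exact normalization used by BioRLHF is not shown in this README.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of G completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of G=4 rewards (made up for illustration).
advantages = group_normalized_advantages([0.2, 0.4, 0.6, 0.8])

# A zero-variance group: every advantage is 0, so the group carries no learning signal.
flat = group_normalized_advantages([0.5, 0.5, 0.5, 0.5])
```

Completions that score above their group's mean get a positive advantage and those below get a negative one, regardless of the absolute reward scale.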

Model Comparison (SFT, 20-question evaluation)

Model	Overall	Factual	Reasoning	Calibration
Mistral-7B	90.0%	80.0%	100.0%	100.0%
Qwen2.5-7B	40.0%	30.0%	80.0%	20.0%
Phi-2	25.0%	20.0%	60.0%	0.0%
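
The README does not state how the 20 questions are split across categories. As a sanity check, the sketch below shows that one possible split (10 factual, 5 reasoning, 5 calibration, which is an assumption) reproduces the Overall column as a question-weighted average of the per-category accuracies.

```python
# ASSUMPTION: 10 factual, 5 reasoning, 5 calibration questions (not stated in the source).
ASSUMED_SPLIT = {"factual": 10, "reasoning": 5, "calibration": 5}

# Per-category accuracies copied from the table above.
REPORTED = {
    "Mistral-7B": {"factual": 0.80, "reasoning": 1.00, "calibration": 1.00},
    "Qwen2.5-7B": {"factual": 0.30, "reasoning": 0.80, "calibration": 0.20},
    "Phi-2": {"factual": 0.20, "reasoning": 0.60, "calibration": 0.00},
}


def overall_accuracy(per_category, split):
    """Question-weighted average of per-category accuracies."""
    total = sum(split.values())
    correct = sum(per_category[name] * count for name, count in split.items())
    return correct / total


overall = {model: overall_accuracy(acc, ASSUMED_SPLIT) for model, acc in REPORTED.items()}
```

Under this assumed split, the computed values round to 90%, 40%, and 25%, matching the table.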

SFT Training Progression

Version	Accuracy	Key Improvement
v1 (Base SFT)	~20%	Format learned, facts wrong
v2 (Expanded)	~60%	More examples helped
v3 (Fact Drilling)	~80%	Repetition fixed key facts
v4 (Advanced)	~85%	Chain-of-thought, calibration
Final	90%	Targeted drilling for remaining errors

Installation

From Source

git clone https://github.com/jang1563/BioRLHF.git
cd BioRLHF
pip install -e .

With Development Dependencies

pip install -e ".[dev]"

GPU Requirements

NVIDIA GPU with 48GB+ VRAM recommended (A40 or A100)
24GB+ VRAM sufficient for SFT/DPO with 4-bit quantization
CUDA 12.1+ recommended

Quick Start

SFT Training

from biorlhf import SFTTrainingConfig, run_sft_training

config = SFTTrainingConfig(
    model_name="mistralai/Mistral-7B-v0.3",
    dataset_path="data/kmp_sft_final.json",
    output_dir="./my_sft_model",
    num_epochs=10,
    learning_rate=1e-4,
)

model_path = run_sft_training(config)

GRPO Training with Verifiers

# Using the CLI (run this line in a shell, not in Python)
biorlhf-grpo --config configs/grpo_full_v2.json

# Or programmatically (Python)
from biorlhf.training.grpo import GRPOConfig, run_grpo_training

config = GRPOConfig.from_json("configs/grpo_full_v2.json")
run_grpo_training(config)

Creating a Dataset

from biorlhf.data import create_sft_dataset

dataset = create_sft_dataset(
    output_path="my_dataset.json",
    include_calibration=True,
    include_chain_of_thought=True,
)

print(f"Created {len(dataset)} training examples")

Evaluating a Model

from biorlhf import evaluate_model

result = evaluate_model(
    model_path="./my_sft_model",
    test_questions_path="data/kmp_test_set.json",
)

print(f"Overall Accuracy: {result.overall_accuracy:.1%}")
print(f"Factual: {result.factual_accuracy:.1%}")
print(f"Reasoning: {result.reasoning_accuracy:.1%}")
print(f"Calibration: {result.calibration_accuracy:.1%}")

Running Inference

from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="./my_sft_model",
    base_model="mistralai/Mistral-7B-v0.3",
)

prompt = "### Instruction:\nWhich tissue is most sensitive to ionizing radiation?\n\n### Response:\n"
response = generate_response(model, tokenizer, prompt)
print(response)
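
If you query the model with many questions, a small helper keeps the prompt format consistent with the example above. This is a hypothetical convenience function, not part of the biorlhf package; the template string is copied from the prompt shown above.

```python
# Template copied from the inference example above.
PROMPT_TEMPLATE = "### Instruction:\n{question}\n\n### Response:\n"


def build_prompt(question):
    """Wrap a question in the instruction/response template (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(question=question.strip())


prompt = build_prompt("Which tissue is most sensitive to ionizing radiation?")
```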

Architecture

Three-Stage Training Pipeline

Stage 1: SFT                    Stage 2: DPO                Stage 3: GRPO
(Supervised Fine-Tuning)        (Direct Preference          (Group Relative Policy
                                 Optimization)               Optimization)

Mistral-7B-v0.3                 SFT model                   SFT model (merged)
      |                              |                            |
   LoRA (r=64, alpha=128)       Preference pairs            Generate G=16 completions
      |                              |                            |
   363 training examples         Ranked responses           Score with V1-V4 verifiers
      |                              |                            |
   10 epochs, lr=1e-4            beta=0.1                   Multi-reward composition
      |                              |                            |
   SFT Adapter                  DPO Model                   GRPO Model
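
To put the LoRA settings in the diagram into numbers: with standard LoRA, the low-rank update is scaled by alpha / r, and each adapted weight matrix of shape d_out x d_in gains r * (d_in + d_out) trainable parameters. The sketch below works this out for r=64, alpha=128 on a 4096 x 4096 projection (4096 is Mistral-7B's hidden size). Which modules BioRLHF actually adapts is not stated here, so the single square matrix is an assumption for illustration.

```python
def lora_scaling(r, alpha):
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_trainable_params(d_in, d_out, r):
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


scaling = lora_scaling(r=64, alpha=128)
# ASSUMPTION: one square 4096 x 4096 projection, chosen only as an example.
per_matrix = lora_trainable_params(d_in=4096, d_out=4096, r=64)
full_matrix = 4096 * 4096
```

For this example matrix the adapter adds 524,288 parameters, about 3% of the 16,777,216 in the frozen weight, with the update scaled by 2.0.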

Verifier-Based Reward System (V1-V4)

Verifier	Name	Weight	What It Scores
V1	Factual	0.35	Exact match of biological facts (DEG counts, tissue names, directions)
V2	Pathway	0.30	Correct pathway/gene set enrichment references (Hallmark, KEGG)
V3	Consistency	0.15	Internal logical consistency within the response
V4	Uncertainty	0.20	Appropriate confidence calibration and epistemic humility

The verifiers are composable via RewardComposer and can be individually weighted:

from biorlhf.verifiers import RewardComposer

composer = RewardComposer(
    active_verifiers=["V1", "V2", "V3", "V4"],
    weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)

reward = composer.score(question, response, ground_truth)
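
One natural reading of the weights is a weighted sum of per-verifier scores in [0, 1]. The sketch below illustrates that reading with made-up scores. It is not RewardComposer's implementation, and renormalizing by the total weight of the active verifiers is an assumption.

```python
WEIGHTS = {"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20}


def compose_reward(scores, weights=WEIGHTS):
    """Weighted average over the verifiers present in `scores`.

    ASSUMPTION: weights are renormalized over the active verifiers so the
    result stays in [0, 1] when only a subset is used.
    """
    active = [name for name in weights if name in scores]
    if not active:
        raise ValueError("no active verifiers")
    total_weight = sum(weights[name] for name in active)
    return sum(weights[name] * scores[name] for name in active) / total_weight


# Made-up per-verifier scores for a single response.
example_scores = {"V1": 1.0, "V2": 0.5, "V3": 1.0, "V4": 0.0}
example_reward = compose_reward(example_scores)
```

With these toy scores the composite is 0.35 + 0.15 + 0.15 + 0 = 0.65: a response can be factually exact and still lose a fifth of the reward for poorly calibrated confidence.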

Dataset

Training data is derived from a 2x2x2 factorial transcriptomic study:

Drug: Kaempferol (KMP) vs Control
Stressor 1: Hindlimb Unloading (HU), which simulates microgravity
Stressor 2: Ionizing Radiation (IR), which simulates space radiation
Tissues: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)
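
A 2x2x2 factorial design crosses every level of the three factors, giving eight experimental groups per tissue. The sketch below enumerates them; the dictionary keys and level labels are illustrative names, not identifiers from the biorlhf package.

```python
from itertools import product

# Three two-level factors from the list above (names are illustrative).
FACTORS = {
    "drug": ("Control", "KMP"),
    "HU": (False, True),
    "IR": (False, True),
}

# Every combination of factor levels: 2 * 2 * 2 = 8 groups.
conditions = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]

for condition in conditions:
    print(condition)
```

Because each factor appears with and without the others, the design supports questions about interactions (for example, whether KMP changes the response to HU plus IR), not only main effects.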

Training Example Types

Type	Count	Purpose
Factual Q&A	~150	Specific facts (DEG counts, tissue types)
Chain-of-Thought	~50	Step-by-step reasoning
Calibration	~30	Uncertainty expression
Multi-hop Reasoning	~30	Integrating multiple facts
Error Correction	~20	Learning from mistakes

Ground Truth Data

from biorlhf.data import (
    STRESSOR_EFFECTS,
    KMP_EFFECTS,
    INTERACTIONS,
    TISSUE_TYPES,
    OXPHOS_PATTERNS,
)

# Example: Get DEG counts for stressors
print(STRESSOR_EFFECTS["Hippocampus"])
# {'HU': 1555, 'IR': 5477, 'HU_IR': 5510}

Project Structure

BioRLHF/
├── src/biorlhf/              # Main package
│   ├── training/             # SFT, DPO, and GRPO trainers
│   ├── data/                 # Dataset creation & ground truth
│   ├── evaluation/           # Model evaluation & calibration
│   ├── verifiers/            # V1-V4 reward verifiers
│   │   ├── factual.py        #   V1: Factual accuracy scoring
│   │   ├── pathway.py        #   V2: Pathway enrichment scoring
│   │   ├── consistency.py    #   V3: Logical consistency scoring
│   │   ├── uncertainty.py    #   V4: Calibration/uncertainty scoring
│   │   └── composer.py       #   Multi-reward composition
│   ├── utils/                # Model loading, inference helpers
│   └── cli.py                # Command-line interface
├── configs/                  # Training configurations
│   ├── grpo_mve.json         #   Minimum viable experiment
│   └── grpo_full_v2.json     #   Full multi-reward training
├── data/                     # Training datasets
│   ├── kmp_sft_final.json    #   363 SFT training examples
│   └── kmp_test_set.json     #   20-question evaluation set
├── examples/                 # Usage examples
├── scripts/                  # SLURM job scripts & HPC guide
├── tests/                    # Unit tests
└── docs/                     # Documentation

Scientific Contributions

1. Verifier-Based GRPO Improves Calibration

GRPO with V1-V4 verifiers reduced calibration error (ECE) by 70%
Multi-reward composition outperforms single-reward training
Sampling G=16 generations per prompt cut the share of zero-variance batches from 50% to under 5%
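
A toy model shows why larger groups help. If every completion in a group receives the same reward, the group-relative advantages are all zero and the group contributes no gradient. Assuming a binary reward and independent samples with success probability p (a simplification; the actual V1-V4 rewards are continuous and weighted), the chance of such a group is p^G + (1 - p)^G. The group size of 4 used for comparison is hypothetical, since the smaller setting is not stated here.

```python
def zero_variance_probability(p, group_size):
    """Chance that all samples in a group agree, for an i.i.d. binary reward."""
    return p ** group_size + (1 - p) ** group_size


# ASSUMPTION: p=0.9 and a comparison group size of 4 are illustrative values.
p_small_group = zero_variance_probability(0.9, 4)
p_large_group = zero_variance_probability(0.9, 16)

print(f"G=4:  {p_small_group:.3f}")
print(f"G=16: {p_large_group:.3f}")
```

These toy figures are not meant to reproduce the measured rates above; they only illustrate that the probability of an uninformative group shrinks quickly as G grows.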

2. Fact Drilling Works for SFT

Initial training: 20% accuracy on key facts
After targeted repetition: 100% accuracy on drilled facts
In this setting, the model needed explicit, repeated exposure to specific domain facts before it reproduced them reliably

3. Calibration is Learnable

Trained on "I cannot determine X from this data" examples
Mistral achieved 100% calibration accuracy at SFT stage
GRPO further improved calibration via the V4 uncertainty verifier

4. DPO is Fragile for Domain Knowledge

Aggressive DPO (beta=0.05, versus beta=0.1 in the pipeline above) destroyed knowledge learned during SFT
The resulting model hallucinated unrelated content
Preference learning needs careful tuning in specialized domains
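
For context on why beta matters: in the standard DPO objective, beta scales the log-probability margin between the chosen and rejected responses, measured relative to a frozen reference model. A smaller beta ties the policy less tightly to the reference. The sketch below is the textbook per-pair loss in plain Python with made-up log-probabilities; it is not BioRLHF's training code.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one preference pair.

    Arguments are sequence log-probabilities under the policy and the reference.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Made-up log-probabilities: the policy already prefers the chosen response (margin = 2).
loss_beta_005 = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.05)
loss_beta_010 = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.10)
```

For the same margin, the loss at beta=0.05 is higher than at beta=0.1, so the optimizer has to push the policy further from the reference to reach the same loss. That extra drift is one plausible route to the forgetting described above.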

5. Architecture Matters More Than Size

Mistral-7B far outperformed Qwen2.5-7B (90% vs 40% overall) despite similar parameter counts
Phi-2 (2.7B) was insufficient for complex biological reasoning (25% overall)
Model selection is critical for domain fine-tuning

Key Learnings for AI Safety

Honesty is trainable: Models can learn appropriate epistemic humility
Domain grounding matters: Anchoring to experimental truth prevents hallucination
Multi-reward > single reward: Decomposing correctness into verifiable dimensions improves learning signal
Preference learning is fragile: DPO can catastrophically forget domain knowledge
Evaluation drives improvement: Systematic testing reveals specific failure modes

Related Projects

SpaceOmicsBench: 115-question benchmark for LLMs on spaceflight biomedical data

Citation

If you use BioRLHF in your research, please cite:

@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Biological Reinforcement Learning from Human Feedback},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF},
  note = {Fine-tuning LLMs for biological reasoning with verifier-based GRPO}
}

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License; see the LICENSE file for details.

Developed by JangKeun Kim (jak4013@med.cornell.edu), Weill Cornell Medicine
