Update README.md

README.md +111 -154

@@ -1,199 +1,156 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+base_model: ibm-granite/granite-4.0-micro
 library_name: transformers
+pipeline_tag: text-generation
+license: apache-2.0
+language:
+  - en
+tags:
+  - medical
+  - instruction-tuned
+  - jepa-llm
+  - grpo
+  - dpo-like
+  - personas
+  - mergekit
+  - arcee-fusion
+  - openmed
 ---
+# openmed-community/granite-4.0-micro-OpenMed
+**Granite 4.0 Micro (≈3B) tuned for medical education & instruction following.**
+Recipe: **JEPA-LLM SFT on medmcqa-hard + personas augmentation → GRPO on medmcqa-hard**; finalized with **Arcee Fusion** merge back into the IBM base.
+> ⚠️ **Medical safety**
+> This model is **not** a clinician and may hallucinate. **Do not** use for diagnosis or treatment. Use under qualified medical supervision only.
+---
+## TL;DR
+- **Base:** [`ibm-granite/granite-4.0-micro`](https://huggingface.co/ibm-granite/granite-4.0-micro) — 3B long-context instruct model (Apache-2.0). Includes a structured chat template and tool-calling examples.
+- **Training (high-level):**
+  1) **JEPA-LLM SFT (400 steps, bs=64)** on **`mkurman/medmcqa-hard`** plus **instruction-following personas** from **`allenai/tulu-3-sft-personas-instruction-following`**.
+  2) **GRPO** (Group Relative Policy Optimization) on **`mkurman/medmcqa-hard`**, bs **64/128**, **8 generations per item** (critic-free RL optimizing verifiable correctness).
+  3) **Model merge:** **Arcee MergeKit** with `merge_method: arcee_fusion` to preserve base calibration while keeping domain gains.
+- **Infra:** Trained/evaluated on **AMD Instinct MI300X** via **Hot Aisle** credits (thanks!).
+---
+## What’s inside
+### 1) JEPA-LLM stage (supervised)
+- **JEPA-LLM** objective (repo: [mkurman/jepa-llm](https://github.com/mkurman/jepa-llm)), used as an auxiliary signal during SFT to bias training toward stable, representation-level learning rather than pure next-token fitting.
+- Run for **400 steps** on **MedMCQA-hard**, with **personas augmentation** from **Tulu-3 personas**, which adds constraint-following behaviors and improves coverage of IFEval-style requirements.
+### 2) GRPO stage (reinforcement learning)
+- **GRPO** replaces the critic with group baselines, enabling efficient multi-sample training; we generate **8 candidates per item** and reward answer correctness / format checks.
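The group baseline described above can be illustrated with a minimal sketch. It is not the training code used for this model: the `reward` function, the `Answer: X` output format, and the partial-credit value of 0.1 are assumptions made for illustration. Only the "8 candidates per item" setting and the idea of rewarding correctness and format come from this card.

```python
import re
import statistics


def reward(completion: str, gold: str) -> float:
    """Hypothetical verifiable reward: a format check plus answer correctness."""
    match = re.search(r"Answer:\s*([A-D])\b", completion)
    if match is None:
        return 0.0  # fails the format check
    # Full reward for the correct option, small credit for a well-formed answer.
    return 1.0 if match.group(1) == gold else 0.1


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group, so no learned critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight sampled candidates for one item (made-up strings, gold answer "B").
candidates = [
    "Answer: B", "Answer: A", "I think it is B", "Answer: B",
    "Answer: C", "Answer: B", "Answer: D", "no idea",
]
rewards = [reward(c, gold="B") for c in candidates]
advantages = group_advantages(rewards)

for text, r, a in zip(candidates, rewards, advantages):
    print(f"{text!r:<20} reward={r:.1f} advantage={a:+.2f}")
```

Candidates that beat the group mean receive a positive advantage and the rest a negative one, which is the signal the policy update would then use.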
+### 3) Merge & finalize
+- **Arcee Fusion** in **MergeKit** to selectively fuse with the original Granite 4.0 Micro (avoids over-averaging from naive merges and tends to keep base calibration).
+---
+## Intended use & limitations
+**Intended:** medical **research**, concept review, exam-style Q&A, instruction-following research, and tool-augmented demos.
+**Out of scope:** autonomous clinical decisions, prescription generation, or guideline updates without retrieval/RAG.
+---
+## Results
+| Metric                      | granite-4.0-micro-OpenMed  | granite-4.0-micro         |
+| ----------------------------| -------------------------: | ------------------------: |
+| mmlu                        |                  **63.17** |                     62.48 |
+| leaderboard_mmlu_pro        |                  **33.06** |                     32.78 |
+
+| leaderboard_ifeval          | granite-4.0-micro-OpenMed  | granite-4.0-micro         |
+| ----------------------------| -------------------------: | ------------------------: |
+| inst_level_loose_acc        |                  **85.97** |                     85.25 |
+| inst_level_strict_acc       |                  **84.05** |                     82.97 |
+| prompt_level_loose_acc      |                  **79.67** |                     78.74 |
+| prompt_level_strict_acc     |                  **77.45** |                     76.16 |
+**Author’s harness notes:** EleutherAI `lm-evaluation-harness` with Granite’s chat template and batch size 8.
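For readers unfamiliar with the `strict` and `loose` IFEval columns, the sketch below shows the general idea: a strict check scores the response as written, while a loose check also accepts the response after a few simple transformations (dropping the first or last line, removing markdown asterisks). The transformations follow the published IFEval evaluation. The `ends_with_disclaimer` checker and the example responses are hypothetical and are not taken from the benchmark.

```python
def ends_with_disclaimer(response: str, phrase: str = "This is not medical advice.") -> bool:
    """Hypothetical verifiable constraint: the response must end with a fixed phrase."""
    return response.strip().endswith(phrase)


def loose_variants(response: str) -> list[str]:
    """Relaxed views of a response: drop first/last line, strip markdown asterisks."""
    lines = response.split("\n")
    base = [
        response,
        "\n".join(lines[1:]).strip(),    # without the first line
        "\n".join(lines[:-1]).strip(),   # without the last line
        "\n".join(lines[1:-1]).strip(),  # without both
    ]
    return base + [variant.replace("*", "") for variant in base]


def strict_pass(response: str) -> bool:
    return ends_with_disclaimer(response)


def loose_pass(response: str) -> bool:
    return any(ends_with_disclaimer(variant) for variant in loose_variants(response))


bold = "Cellulitis involves deeper tissue.\n**This is not medical advice.**"
chatty = "This is not medical advice.\nHope this helps!"
missing = "Cellulitis involves deeper tissue."

for name, text in [("bold", bold), ("chatty", chatty), ("missing", missing)]:
    print(name, "strict:", strict_pass(text), "loose:", loose_pass(text))
```

Because every strict pass is also a loose pass, loose accuracy can never fall below strict accuracy, which matches the ordering in the table.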
+---
+## Quickstart (Transformers)
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_id = "openmed-community/granite-4.0-micro-OpenMed"
+tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+messages = [
+  {"role": "system", "content": "You are a careful medical assistant. Cite sources and warn this is not medical advice."},
+  {"role": "user", "content": "Cellulitis vs erysipelas: give 3 bullet differences and 1 caution."}
+]
+prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+inputs = tok(prompt, return_tensors="pt").to(model.device)
+out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
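+# Decode only the newly generated tokens so the prompt is not echoed back
+print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))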
+```
+> **Tool-calling:** Granite’s card includes function-calling examples.
+---
+## Reproduce key evals (example)
+```bash
+# Classic MMLU (5-shot typical)
+lm_eval --model hf \
+  --model_args pretrained=openmed-community/granite-4.0-micro-OpenMed,parallelize=True \
+  --tasks mmlu --batch_size 8 --apply_chat_template
+# MMLU-Pro (10-choice, harder)
+lm_eval --model hf \
+  --model_args pretrained=openmed-community/granite-4.0-micro-OpenMed,parallelize=True \
+  --tasks leaderboard_mmlu_pro --batch_size 8 --apply_chat_template
+# IFEval (verifiable instruction following)
+lm_eval --model hf \
+  --model_args pretrained=openmed-community/granite-4.0-micro-OpenMed,parallelize=True \
+  --tasks leaderboard_ifeval --batch_size 8 --apply_chat_template
+```
+---
+## Data & training notes
+* **MedMCQA-Hard (train split)** for domain supervision and RL rewards.
+* **Tulu-3 personas** for instruction-following with constraint taxonomy inspired by IFEval.
+* **JEPA-LLM**: based on the emerging **LLM-JEPA** objective (representation-space training). See the paper for context and motivation.
+* **GRPO**: critic-free; each item's baseline comes from its own group of 8 sampled candidates, which keeps multi-sample training efficient.
+* **Privacy:** no PHI to the best of our knowledge; please report issues.
+---
+## Commentary on results
+> **Why gains are modest:** Granite-4.0-Micro is already a **well-calibrated, strongly aligned** 3B instruct model with robust instruction-following and tool-use out of the box. In that regime, **headroom on popular benchmarks is limited**, and naive tuning often **degrades** base behaviors (calibration, safety, IF). The combination used here—**JEPA-LLM** (to stabilize representations), **personas SFT** (to preserve IF constraints), **GRPO** with **verifiable rewards**, and **Arcee Fusion**—appears to **nudge** the model to measurable improvements **without sacrificing** base calibration, but the effect sizes remain small, which is consistent with Granite’s strong baseline. In short: *we’re operating near the model’s alignment ceiling; targeted gains are possible, sweeping jumps are unlikely without larger capacity or richer supervision.*
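To put the "small effect sizes" in concrete terms, the snippet below recomputes the per-metric differences from the Results table. The scores are copied from this card; the dictionary layout and variable names are only illustrative.

```python
# (tuned, base) scores copied from the Results table in this card.
scores = {
    "mmlu": (63.17, 62.48),
    "leaderboard_mmlu_pro": (33.06, 32.78),
    "inst_level_loose_acc": (85.97, 85.25),
    "inst_level_strict_acc": (84.05, 82.97),
    "prompt_level_loose_acc": (79.67, 78.74),
    "prompt_level_strict_acc": (77.45, 76.16),
}

# Absolute gain in percentage points, rounded to the table's precision.
deltas = {name: round(tuned - base, 2) for name, (tuned, base) in scores.items()}

for name, delta in deltas.items():
    print(f"{name:<26} {delta:+.2f}")
```

Every difference is positive, and the largest is on `prompt_level_strict_acc`. No variance or confidence intervals are reported on this card, so these numbers alone do not establish statistical significance.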
+---
+## Acknowledgments
+* **IBM Granite** team for the base model & docs (Apache-2.0).
+* **AllenAI Tulu-3** for personas datasets.
+* **Arcee** for MergeKit and **Arcee Fusion**.
+* **Hot Aisle** for MI300X credits :heart: ([hotaisle.xyz](https://hotaisle.xyz/)).
+---
+## Citation
+* IBM Granite 4.0 Micro model card [1](https://huggingface.co/ibm-granite/granite-4.0-micro).
+* MedMCQA-Hard [2](https://huggingface.co/datasets/mkurman/medmcqa-hard).
+* Tulu-3 personas dataset [3](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following).
+* LLM-JEPA paper [4](https://arxiv.org/abs/2509.14252) and our implementation repository [5](https://github.com/mkurman/jepa-llm).
+* MergeKit & Arcee Fusion [6](https://github.com/arcee-ai/mergekit).