SimpleLM SFT

A custom decoder-only Transformer, trained with supervised fine-tuning (SFT) on the MegaScience corpus for science question answering. The architecture is defined in modeling_simple_lm.py (bundled in this repo), so the model must be loaded with trust_remote_code=True.

  • SFT source checkpoint: models/sft_full_science.pt
  • Pretraining checkpoint: /home/etan/simple_llm/checkpoints/lm_checkpoint_008_shutdown.pt
  • Training data: /home/etan/simple_llm/datasets/MegaScience/data
  • subject_filter: None
  • subject_exclude: ['math']
  • question_regex_filter: None
  • SFT epochs: 1 at learning_rate 3e-05
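
The three filter settings above can be read as a per-example keep/drop decision. The sketch below is only an illustration of how such filters might combine. It assumes that each record carries `subject` and `question` fields, that `None` disables a filter, and that `keep_example` is a hypothetical helper name; none of this is confirmed by the training script. The two rows are made-up toy data, not MegaScience samples.

```python
import re
from typing import Iterable, Optional


def keep_example(
    record: dict,
    subject_filter: Optional[Iterable[str]] = None,
    subject_exclude: Optional[Iterable[str]] = ("math",),
    question_regex_filter: Optional[str] = None,
) -> bool:
    """Return True if the record passes every enabled filter (None = disabled)."""
    subject = record.get("subject")
    # Allow-list: when set, only the listed subjects are kept.
    if subject_filter is not None and subject not in set(subject_filter):
        return False
    # Deny-list: when set, the listed subjects are dropped.
    if subject_exclude is not None and subject in set(subject_exclude):
        return False
    # Regex: when set, the question text must match somewhere.
    if question_regex_filter is not None and not re.search(
        question_regex_filter, record.get("question", "")
    ):
        return False
    return True


# Toy rows for illustration only (assumed field names).
rows = [
    {"subject": "biology", "question": "What is photosynthesis?"},
    {"subject": "math", "question": "What is 2 + 2?"},
]
kept = [r for r in rows if keep_example(r)]
```

With the defaults shown (matching this card's settings), only the biology row survives.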

Prompt format

This model was fine-tuned on a single fixed prompt template, so queries that don't match it will likely produce noticeably worse output. The packaged chat_template.jinja reproduces this format, so tokenizer.apply_chat_template(...) yields strings byte-identical to what the model saw during training. A full training example looks like this, where <answer> stands for the target answer and </s> is the completion suffix:

Question: What is photosynthesis?
Answer: <answer></s>

Equivalently, with the chat template:

tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    add_generation_prompt=True, tokenize=False,
)
# -> 'Question: What is photosynthesis?\nAnswer: '
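
If the tokenizer is not at hand, the same strings can be built by hand. This is a minimal sketch using the `prompt_template` and `completion_suffix` values listed under Training settings; the helper names `build_prompt` and `build_training_text` are hypothetical, and whether the training script strips or otherwise normalizes the question text first is not stated here.

```python
# Values copied from the "Training settings" block of this card.
PROMPT_TEMPLATE = "Question: {question}\nAnswer: "
COMPLETION_SUFFIX = "</s>"


def build_prompt(question: str) -> str:
    """Inference-time prompt: ends right after 'Answer: ' so the model continues."""
    return PROMPT_TEMPLATE.format(question=question)


def build_training_text(question: str, answer: str) -> str:
    """Full training string: prompt, then the answer, then the end-of-sequence suffix."""
    return build_prompt(question) + answer + COMPLETION_SUFFIX


prompt = build_prompt("What is photosynthesis?")
print(repr(prompt))  # 'Question: What is photosynthesis?\nAnswer: '
```

Note the trailing space after `Answer:`; dropping it changes the string the model sees.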

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "etanlightstone/simple-lm-sft-science"
tok   = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

messages = [{"role": "user", "content": "What is photosynthesis?"}]
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
# Remember the prompt length so the prompt tokens can be sliced off the output.
prompt_len = inputs["input_ids"].shape[1]
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.4,
        top_p=0.9,
        repetition_penalty=1.1,
    )
# Decode only the newly generated tokens.
answer = tok.decode(out[0, prompt_len:], skip_special_tokens=True)
print(answer)
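
The example above requests 256 new tokens, while the Architecture section lists a context_length of 512. A small guard like the following can keep prompt plus generated tokens within that window. It is a sketch under the assumption that generation should not run past context_length; how the model actually behaves beyond 512 positions is not documented here, and `safe_max_new_tokens` is a hypothetical helper.

```python
CONTEXT_LENGTH = 512  # context_length from the Architecture table


def safe_max_new_tokens(
    prompt_len: int,
    requested: int = 256,
    context_length: int = CONTEXT_LENGTH,
) -> int:
    """Cap the generation budget so prompt_len + new tokens <= context_length."""
    remaining = context_length - prompt_len
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_len} tokens, leaving no room in a "
            f"{context_length}-token context"
        )
    return min(requested, remaining)


budget_short = safe_max_new_tokens(12)   # short prompt: full request fits
budget_long = safe_max_new_tokens(300)   # long prompt: budget is reduced
```

The result could then be passed as `max_new_tokens=safe_max_new_tokens(prompt_len)` in the `generate` call.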

Architecture

field                value
vocab_size           32000
context_length       512
d_model              768
n_layers             12
n_heads              8
d_ff                 2048
activation           gelu
bias                 True
tie_word_embeddings  True

Tokenizer source: TinyLlama/TinyLlama-1.1B-Chat-v1.0
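
As a sanity check, the parameter count can be reproduced from the table. The layout below is an assumption, not read from modeling_simple_lm.py: learned absolute position embeddings, four biased attention projections per block, a two-layer biased MLP, two LayerNorms per block plus a final one, and an output head that shares the token embedding. Under those assumptions the arithmetic lands exactly on the total_params value reported in the training settings (91138560).

```python
def count_params(
    vocab_size: int = 32000,
    context_length: int = 512,
    d_model: int = 768,
    n_layers: int = 12,
    d_ff: int = 2048,
    bias: bool = True,
    tie_word_embeddings: bool = True,
) -> int:
    """Parameter count for an assumed GPT-2-style layout (see lead-in)."""
    b = 1 if bias else 0
    tok_emb = vocab_size * d_model
    pos_emb = context_length * d_model              # ASSUMPTION: learned positions
    attn = 4 * (d_model * d_model + b * d_model)    # Q, K, V and output projections
    mlp = (d_model * d_ff + b * d_ff) + (d_ff * d_model + b * d_model)
    norms = 2 * (2 * d_model)                       # ASSUMPTION: 2 LayerNorms, weight + bias
    block = attn + mlp + norms
    final_norm = 2 * d_model                        # ASSUMPTION: final LayerNorm
    lm_head = 0 if tie_word_embeddings else vocab_size * d_model
    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head


total = count_params()
print(total)  # 91138560
```

n_heads does not appear because splitting d_model across heads does not change the number of weights.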

Training settings

{
  "mode": "sft",
  "source_pretrain_checkpoint": "/home/etan/simple_llm/checkpoints/lm_checkpoint_008_shutdown.pt",
  "source_pretrain_train_settings": {
    "batch_size": 10,
    "batch_size_note": "per GPU when using torchrun",
    "world_size": 1,
    "learning_rate": 0.0003,
    "weight_decay": 0.01,
    "num_epochs": 3,
    "max_steps": null,
    "grad_clip": 1.0,
    "seed": 42,
    "docs_dir": "/home/etan/simple_llm/docs",
    "block_size": 512,
    "stride": 448,
    "stride_overlap_tokens": 64
  },
  "data_dir": "/home/etan/simple_llm/datasets/MegaScience/data",
  "data_glob": "*.parquet",
  "subject_filter": null,
  "subject_exclude": [
    "math"
  ],
  "question_regex_filter": null,
  "batch_size": 10,
  "world_size": 1,
  "learning_rate": 3e-05,
  "min_lr": 3e-06,
  "warmup_steps": 200,
  "weight_decay": 0.0,
  "num_epochs": 1,
  "max_steps": null,
  "grad_clip": 1.0,
  "seed": 42,
  "block_size": 512,
  "eval_fraction": 0.005,
  "eval_every": 500,
  "max_train_examples": null,
  "freezing": {
    "freeze_embeddings": false,
    "freeze_lm_head": false,
    "freeze_blocks_below": 0,
    "tie_word_embeddings": true,
    "trainable_params": 91138560,
    "total_params": 91138560,
    "frozen_params": 0,
    "frozen_blocks": 0,
    "total_blocks": 12
  },
  "prompt_template": "Question: {question}\nAnswer: ",
  "completion_suffix": "</s>"
}
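
The pretraining settings above pair block_size 512 with stride 448, which leaves 64 overlapping tokens between consecutive windows (512 - 448 = 64, matching stride_overlap_tokens). The sketch below shows one way such windows could be cut from a token stream. It is an illustration only: `sliding_windows` is a hypothetical helper, and how the real pretraining script treats the final short window (keep, pad, or drop) is not stated, so this version simply keeps it.

```python
from typing import List


def sliding_windows(
    token_ids: List[int], block_size: int = 512, stride: int = 448
) -> List[List[int]]:
    """Cut a token list into windows of up to block_size, advancing by stride."""
    if not 0 < stride <= block_size:
        raise ValueError("stride must be in the range 1..block_size")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + block_size])
        # Stop once the current window has reached the end of the stream.
        if start + block_size >= len(token_ids):
            break
        start += stride
    return windows


tokens = list(range(1000))          # stand-in for real token ids
windows = sliding_windows(tokens)
overlap = 512 - 448                 # 64 tokens shared by neighbouring windows
```

For a 1000-token stream this produces three windows starting at positions 0, 448, and 896, with the last one shorter than block_size.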

Weights: Safetensors, 91.1M params, tensor type F32