QED-75M

QED-75M is a compact decoder-only causal language model implemented as custom Hugging Face transformers remote code. The architecture combines RoPE (rotary position embeddings), RMSNorm, SwiGLU feed-forward blocks, and causal self-attention implemented via torch.nn.functional.scaled_dot_product_attention. The token embedding weights can be tied with the output projection (tie_word_embeddings).

This model card focuses on the model itself (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository README.md.


Model Details

Model Description

QED is a next-token prediction model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When labels are provided, the model computes the training loss as cross-entropy over the next-token targets (with ignore_index=-100).

The Hugging Face integration provides:

  • QEDConfig (model_type: qed)
  • QEDForCausalLM

Both classes are defined in the repo module modeling_qed.py and are loaded with trust_remote_code=True.

Model Sources

  • Code: the repository containing modeling_qed.py and the exported model artifacts.
  • Transformers implementation: modeling_qed.py (remote code in the model repo).
  • Training artifacts (checkpoints, logs, and related outputs): levossadtchi/QED-75M_artifacts.

Uses

Direct Use

  • Text generation using model.generate(...); the repository also includes a ready-to-run local inference script: generate_gravity_example.py.
  • Scoring / evaluating conditional likelihoods via model(input_ids=..., labels=...).

Downstream Use

  • Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.

Out-of-Scope Use

  • Using the model for high-stakes decisions (medical, legal, finance) without human verification.
  • Assuming the model is always factually correct or always safe.
  • Using the model to bypass safety systems or to generate disallowed content.

Bias, Risks, and Limitations

Like other language models, QED may produce:

  • Hallucinations (confident but incorrect statements).
  • Pattern repetition from training data.
  • Uneven quality across topics and languages, depending on what the specific checkpoint was trained on.

Mitigations:

  • Use output filtering and constrain the generation strategy when deploying in real applications.
  • Perform domain-specific evaluations before relying on the model.
  • Treat the model as a suggestion engine, not a ground-truth source.

Training Details

This model family was trained with a multi-stage pipeline (pretraining, context-length annealing, and SFT preparation).

High-level training data summary:

  • Pretraining volume: 12.6B tokens.
  • Data is a mixed corpus pipeline configured in the repository and processed into tokenized shards before training.
  • SFT stage uses chat/instruction-style datasets with assistant-targeted supervision.

All training artifacts are published separately at levossadtchi/QED-75M_artifacts.
Evaluation

We evaluated QED-75M and several baseline models with a custom evaluation pipeline based on the Hugging Face LightEval harness, following the setup used in the SmolLM2 model evaluations. The evaluation reports a "general" average over a fixed suite of tasks:

  • MMLU (aggregated over its MMLU subtasks in the LightEval leaderboard)
  • HellaSwag
  • ARC-Challenge
  • Winogrande
  • CommonsenseQA

The numbers below come from all_results_summary.csv produced by the evaluation run.

Model                           Average (general)  arc:challenge  commonsense_qa  hellaswag  winogrande  mmlu
HuggingFaceTB/SmolLM2-135M      0.299140           0.283276       0.190827        0.252440   0.519337    0.249822
levossadtchi/QED-75M            0.287318           0.231229       0.204750        0.253336   0.506709    0.240564
EleutherAI/gpt-neo-125m         0.279464           0.191126       0.205569        0.249751   0.521705    0.229170
EleutherAI/pythia-160m-deduped  0.275796           0.202218       0.194922        0.250846   0.501184    0.229811
openai-community/gpt2           0.273993           0.188567       0.196560        0.250249   0.505919    0.228671

[Figure: compute vs. score scatter plot (compute_vs_score_scatter)]


Technical Specifications

Model Architecture

QEDForCausalLM is a decoder-only transformer with the following high-level structure:

  • Token embeddings: embed_tokens = Embedding(vocab_size, d_model)
  • n_layers identical blocks (TransformerBlock), each applying:
    • Residual attention: x = x + Attention(RMSNorm(x))
    • Residual MLP: x = x + SwiGLU(RMSNorm(x))
  • Final normalization: norm = RMSNorm(d_model)
  • Output head: lm_head = Linear(d_model, vocab_size, bias=True)

Attention applies RoPE to Q and K and uses causal masking.
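The two residual steps above can be sketched in plain PyTorch (a minimal illustration, not the repo's modeling_qed.py; `attn` and `mlp` below stand in for the real attention and SwiGLU submodules):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

def block_forward(x, attn, mlp, norm1, norm2):
    # Pre-norm residual wiring, matching the bullets above:
    #   x = x + Attention(RMSNorm(x)); x = x + SwiGLU(RMSNorm(x))
    x = x + attn(norm1(x))
    x = x + mlp(norm2(x))
    return x
```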

Attention and RoPE

  • Projection layers (per attention block):
    • q_proj, k_proj, v_proj, o_proj are Linear(d_model, d_model, bias=config.bias)
  • Number of heads: n_heads
  • Head dimension: head_dim = d_model / n_heads
  • RoPE:
    • Rotary embedding precomputes cos_cached and sin_cached up to max_seq_len
    • RoPE is applied to Q and K using position_ids
  • Attention kernel:
    • Implemented with torch.nn.functional.scaled_dot_product_attention
    • Uses explicit scaling scale = head_dim ** -0.5
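A minimal sketch of this scheme, assuming the rotate-half RoPE convention common in Hugging Face models (the repo's exact layout may differ):

```python
import torch

def build_rope_cache(max_seq_len: int, head_dim: int, theta: float = 10000.0):
    """Precompute the cos/sin tables (mirroring cos_cached / sin_cached)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(max_seq_len).float()
    freqs = torch.outer(pos, inv_freq)        # [max_seq_len, head_dim/2]
    emb = torch.cat([freqs, freqs], dim=-1)   # [max_seq_len, head_dim]
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin, position_ids):
    # q, k: [batch, n_heads, seq_len, head_dim]; gather per-position tables
    cos = cos[position_ids].unsqueeze(1)  # [batch, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```

Because RoPE is a pure rotation of (Q, K) pairs, it preserves vector norms; only relative angles between positions change.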

MLP (SwiGLU)

The feed-forward sublayer is a SwiGLU variant:

  • gate_proj: Linear(d_model, ffn_hidden_dim)
  • up_proj: Linear(d_model, ffn_hidden_dim)
  • down_proj: Linear(ffn_hidden_dim, d_model)
  • Compute:
    • SwiGLU(x) = down_proj( silu(gate_proj(x)) * up_proj(x) )
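The compute rule above maps directly to a small PyTorch module (a sketch consistent with the config's internal-bias=false default, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down_proj( silu(gate_proj(x)) * up_proj(x) )."""
    def __init__(self, d_model: int, ffn_hidden_dim: int, bias: bool = False):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, ffn_hidden_dim, bias=bias)
        self.up_proj = nn.Linear(d_model, ffn_hidden_dim, bias=bias)
        self.down_proj = nn.Linear(ffn_hidden_dim, d_model, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate) acts as a learned gate on the up projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```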

Embeddings and Output Head

  • embed_tokens: size [vocab_size, d_model]
  • lm_head: Linear(d_model, vocab_size) with bias enabled (weight shape [vocab_size, d_model])
  • Weight tying:
    • When tie_word_embeddings=True, lm_head.weight is tied to embed_tokens.weight
    • The lm_head bias remains a separate parameter.
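The tying semantics can be illustrated with plain nn modules (a toy example, not the repo's code): the Linear head ends up sharing the Embedding's weight Parameter, while its bias remains independent.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 16, 100  # toy sizes for illustration
embed_tokens = nn.Embedding(vocab_size, d_model)   # weight: [vocab_size, d_model]
lm_head = nn.Linear(d_model, vocab_size, bias=True)  # weight: [vocab_size, d_model]

# Tie: both modules now hold the exact same Parameter object.
lm_head.weight = embed_tokens.weight
```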

Input/Output Interface

Typical usage via Transformers:

  • input_ids: torch.LongTensor of shape [batch_size, seq_len]
  • Optional:
    • position_ids: torch.LongTensor of shape [batch_size, seq_len]
    • attention_mask: torch.Tensor of shape [batch_size, seq_len]
    • labels: torch.LongTensor of shape [batch_size, seq_len] (positions with -100 are ignored)
    • past_key_values: list of length n_layers with cached keys/values
  • Outputs:
    • logits: [batch_size, seq_len, vocab_size]
    • loss: scalar when labels are provided
    • past_key_values: cached KV tensors when use_cache=True

KV Cache and Generation Semantics

  • The model uses a legacy tuple KV cache format (not the newer DynamicCache object). The integration explicitly disables default dynamic cache support (_supports_default_dynamic_cache() returns False).
  • In prepare_inputs_for_generation(...):
    • If past_key_values is provided, generation continues by feeding only the last token (input_ids[:, -1:]).
  • The attention layer concatenates past and current KV along the sequence dimension.

Expected KV shapes (conceptually):

  • For each layer, (key, value) have shape [batch_size, n_heads, kv_len, head_dim].
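Conceptually, the per-layer cache update is a concatenation along the sequence axis (a sketch, not the repo's exact code):

```python
import torch

def append_kv(past_key, past_value, key, value):
    """Concatenate cached and new K/V along the sequence dimension (dim=2).

    All tensors have shape [batch_size, n_heads, seq_len, head_dim];
    past_key/past_value may be None on the first (prefill) step.
    """
    if past_key is not None:
        key = torch.cat([past_key, key], dim=2)
        value = torch.cat([past_value, value], dim=2)
    return key, value
```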

Attention Masking

When attention_mask is provided, the model converts it to a key-padding boolean mask:

  • key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)

Then it builds:

  • causal constraint (positions cannot attend to future keys)
  • AND with key_padding_mask (mask out padded keys)

Practical recommendation:

  • Use the standard HF convention: attention_mask values should be 1 for real tokens and 0 for padding tokens.
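A sketch of how such a combined mask can be built (illustrative; the repo's exact construction may differ). It returns a boolean mask, True where attention is allowed, broadcastable to [batch, heads, q_len, kv_len]:

```python
import torch

def build_attention_mask(attention_mask, q_len, kv_len):
    """Combine the causal constraint with a key-padding mask.

    attention_mask: [batch, kv_len], 1 for real tokens and 0 for padding
    (standard HF convention). Queries are assumed to occupy the last
    q_len positions of the kv_len-long sequence (the KV-cache case).
    """
    causal = torch.ones(q_len, kv_len, dtype=torch.bool)
    causal = causal.tril(diagonal=kv_len - q_len)            # no future keys
    key_padding = attention_mask[:, None, None, :].to(torch.bool)
    return causal[None, None] & key_padding                  # AND of both
```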

Length Constraints

The model enforces:

  • total_seq_len = past_length + seq_len <= config.max_seq_len

If total_seq_len exceeds max_seq_len, the model raises a ValueError.

Default max_seq_len in the exported config for this checkpoint is 8192.

Default Hyperparameters

The exported config.json for the QED-75M checkpoint sets:

Hyperparameter                  Value
Approx. parameter count         ~75M
n_layers                        32
d_model                         384
n_heads                         6
head_dim                        64
ffn_hidden_dim                  1024
vocab_size                      49152
max_seq_len                     8192
rope_theta                      10000.0
rms_norm_eps                    1e-5
dropout                         0.0
tie_word_embeddings             true
internal linear bias (QKV/MLP)  false

Tokenizer / special tokens (from exported tokenizer_config.json):

  • <pad> id 0
  • <bos> id 1
  • <eos> id 2
  • <unk> id 3

How to Get Started with the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "YOUR_ORG/QED-75M"  # replace with your actual Hub repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # optional
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))

For loss computation:

  • pass labels with the same shape as input_ids
  • use -100 in positions you want to ignore.
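The shift-and-ignore behavior can be illustrated with plain F.cross_entropy (a standalone sketch of the loss semantics, not the repo's code):

```python
import torch
import torch.nn.functional as F

# Next-token cross-entropy with ignore_index=-100, as described above.
logits = torch.randn(1, 5, 10)              # [batch, seq_len, vocab_size]
labels = torch.tensor([[-100, 3, 7, -100, 2]])  # -100 marks ignored positions

# Shift so that position t predicts the token at position t+1.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, 10),
    shift_labels.reshape(-1),
    ignore_index=-100,   # positions labeled -100 contribute nothing
)
```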

Model Card Contact

For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.
