QED-75M
QED-75M is a compact decoder-only causal language model implemented for Hugging Face Transformers via a custom remote-code module. The architecture combines RoPE (rotary position embeddings), RMSNorm, SwiGLU feed-forward blocks, and causal self-attention implemented via `torch.nn.functional.scaled_dot_product_attention`. The token embedding weights can be tied with the output projection (`tie_word_embeddings`).
This model card focuses on the model itself (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository README.md.
Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Technical Specifications
- How to Get Started with the Model
- Citation
- Model Card Contact
Model Details
Model Description
QED is a next-token prediction model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When labels are provided, the model computes the training loss as cross-entropy over the next-token targets (with ignore_index=-100).
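The next-token loss described above can be sketched as follows. This is a minimal, illustrative implementation of the standard shift-and-cross-entropy pattern, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Shift so the logits at position t predict the token at t+1, then
    compute cross-entropy.

    logits: [batch, seq_len, vocab_size]; labels: [batch, seq_len],
    with -100 marking positions to ignore.
    """
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# Tiny smoke test with random logits.
logits = torch.randn(2, 5, 11)
labels = torch.randint(0, 11, (2, 5))
labels[:, 0] = -100  # this position is ignored in the loss
loss = next_token_loss(logits, labels)
```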
The Hugging Face integration provides:
- `QEDConfig` (`model_type: qed`)
- `QEDForCausalLM`
Both classes are defined in the repo module modeling_qed.py and are loaded with trust_remote_code=True.
Model Sources
- Code: the repository containing `modeling_qed.py` and the exported model artifacts.
- Transformers implementation: `modeling_qed.py` (remote code in the model repo).
- Training artifacts (checkpoints, logs, and related outputs): levossadtchi/QED-75M_artifacts.
Uses
Direct Use
- Text generation using `model.generate(...)`; the repository also includes a ready-to-run local inference script: `generate_gravity_example.py`.
- Scoring / evaluating conditional likelihoods via `model(input_ids=..., labels=...)`.
Downstream Use
- Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.
Out-of-Scope Use
- Using the model for high-stakes decisions (medical, legal, finance) without human verification.
- Assuming the model is always factually correct or always safe.
- Using the model to bypass safety systems or to generate disallowed content.
Bias, Risks, and Limitations
Like other language models, QED may produce:
- Hallucinations (confident but incorrect statements).
- Pattern repetition from training data.
- Uneven quality across topics and languages, depending on what the specific checkpoint was trained on.
Mitigations:
- Use output filtering and constrain the generation strategy when deploying in real applications.
- Perform domain-specific evaluations before relying on the model.
- Treat the model as a suggestion engine, not a ground-truth source.
Training Details
This model family was trained with a multi-stage pipeline (pretraining, context-length annealing, and SFT preparation).
High-level training data summary:
- Pretraining volume: 12.6B tokens.
- Data is a mixed corpus pipeline configured in the repository and processed into tokenized shards before training.
- SFT stage uses chat/instruction-style datasets with assistant-targeted supervision.
All training artifacts are published separately at levossadtchi/QED-75M_artifacts.
Evaluation
We evaluated the following models with a custom evaluation pipeline based on the Hugging Face LightEval harness used in the SmolLM2 model evaluations. The evaluation reports a "general" average over a fixed suite of tasks:
- MMLU (aggregated over its MMLU subtasks in the LightEval leaderboard)
- HellaSwag
- ARC-Challenge
- Winogrande
- CommonsenseQA
The numbers below come from all_results_summary.csv produced by the evaluation run.
| Model | Average (general) | arc:challenge | commonsense_qa | hellaswag | winogrande | mmlu |
|---|---|---|---|---|---|---|
| HuggingFaceTB/SmolLM2-135M | 0.299140 | 0.283276 | 0.190827 | 0.252440 | 0.519337 | 0.249822 |
| levossadtchi/QED-75M | 0.287318 | 0.231229 | 0.204750 | 0.253336 | 0.506709 | 0.240564 |
| EleutherAI/gpt-neo-125m | 0.279464 | 0.191126 | 0.205569 | 0.249751 | 0.521705 | 0.229170 |
| EleutherAI/pythia-160m-deduped | 0.275796 | 0.202218 | 0.194922 | 0.250846 | 0.501184 | 0.229811 |
| openai-community/gpt2 | 0.273993 | 0.188567 | 0.196560 | 0.250249 | 0.505919 | 0.228671 |
Technical Specifications
Model Architecture
QEDForCausalLM is a decoder-only transformer with the following high-level structure:
- Token embeddings: `embed_tokens = Embedding(vocab_size, d_model)`
- `n_layers` identical blocks (`TransformerBlock`), each applying:
  - Residual attention: `x = x + Attention(RMSNorm(x))`
  - Residual MLP: `x = x + SwiGLU(RMSNorm(x))`
- Final normalization: `norm = RMSNorm(d_model)`
- Output head: `lm_head = Linear(d_model, vocab_size, bias=True)`
The attention applies RoPE to Q and K and uses causal masking.
Attention and RoPE
- Projection layers (per attention block): `q_proj`, `k_proj`, `v_proj`, `o_proj` are `Linear(d_model, d_model, bias=config.bias)`
- Number of heads: `n_heads`
- Head dimension: `head_dim = d_model / n_heads`
- RoPE:
  - The rotary embedding precomputes `cos_cached` and `sin_cached` up to `max_seq_len`
  - RoPE is applied to Q and K using `position_ids`
- Attention kernel:
  - Implemented with `torch.nn.functional.scaled_dot_product_attention`
  - Uses explicit scaling `scale = head_dim ** -0.5`
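The RoPE application can be sketched as below. This follows the common rotate-half formulation with the cache shapes described above; the helper names (`rotate_half`, `apply_rope`) are illustrative, not necessarily the repo's:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the head dimension in two and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin, position_ids):
    # cos/sin: precomputed caches of shape [max_seq_len, head_dim];
    # gather the rows for the current positions and broadcast over heads.
    cos = cos[position_ids].unsqueeze(1)  # [batch, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

# Build RoPE caches with this checkpoint's head_dim=64 and rope_theta=10000.0.
head_dim, max_seq_len, theta = 64, 8192, 10000.0
inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
t = torch.arange(max_seq_len).float()
freqs = torch.outer(t, inv_freq)            # [max_seq_len, head_dim/2]
emb = torch.cat((freqs, freqs), dim=-1)     # [max_seq_len, head_dim]
cos_cached, sin_cached = emb.cos(), emb.sin()

q = torch.randn(1, 6, 10, head_dim)         # [batch, n_heads, seq, head_dim]
k = torch.randn(1, 6, 10, head_dim)
pos = torch.arange(10).unsqueeze(0)         # [batch, seq]
q_rot, k_rot = apply_rope(q, k, cos_cached, sin_cached, pos)
```

Because RoPE is a pure rotation in 2D subspaces, it preserves each head vector's norm and leaves position 0 unchanged.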
MLP (SwiGLU)
The feed-forward sublayer is a SwiGLU variant:
- `gate_proj: Linear(d_model, ffn_hidden_dim)`
- `up_proj: Linear(d_model, ffn_hidden_dim)`
- `down_proj: Linear(ffn_hidden_dim, d_model)`
- Compute: `SwiGLU(x) = down_proj(silu(gate_proj(x)) * up_proj(x))`
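A minimal module matching that compute (a sketch; `bias=False` follows the exported config's internal-linear-bias setting):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down_proj(silu(gate_proj(x)) * up_proj(x))."""

    def __init__(self, d_model: int, ffn_hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.up_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.down_proj = nn.Linear(ffn_hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated activation: the silu branch modulates the linear up branch.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Dimensions from this checkpoint's config.
mlp = SwiGLU(d_model=384, ffn_hidden_dim=1024)
y = mlp(torch.randn(2, 10, 384))
```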
Embeddings and Output Head
- `embed_tokens`: size `[vocab_size, d_model]`
- `lm_head`: size `[d_model, vocab_size]` with bias enabled
- Weight tying:
  - When `tie_word_embeddings=True`, `lm_head.weight` is tied to `embed_tokens.weight`
  - The `lm_head` bias remains a separate parameter.
Input/Output Interface
Typical usage via Transformers:
- `input_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
- Optional:
  - `position_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
  - `attention_mask`: `torch.Tensor` of shape `[batch_size, seq_len]`
  - `labels`: `torch.LongTensor` of shape `[batch_size, seq_len]` (positions with `-100` are ignored)
  - `past_key_values`: list of length `n_layers` with cached keys/values
- Outputs:
  - `logits`: `[batch_size, seq_len, vocab_size]`
  - `loss`: scalar when `labels` are provided
  - `past_key_values`: cached KV tensors when `use_cache=True`
KV Cache and Generation Semantics
- The model uses a legacy tuple KV cache format (not the newer `DynamicCache` object). The integration explicitly disables default dynamic cache support (`_supports_default_dynamic_cache()` returns `False`).
- In `prepare_inputs_for_generation(...)`:
  - If `past_key_values` is provided, generation continues by feeding only the last token (`input_ids[:, -1:]`).
- The attention layer concatenates past and current KV along the sequence dimension.
Expected KV shapes (conceptually):
- For each layer, `(key, value)` have shape `[batch_size, n_heads, kv_len, head_dim]`.
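The concatenation along the sequence dimension can be sketched as follows (illustrative shapes matching those above; `append_kv` is a hypothetical helper, not the repo's API):

```python
import torch

def append_kv(past_key, past_value, key, value):
    # past_*: [batch, n_heads, past_len, head_dim], or None on the first step;
    # key/value: [batch, n_heads, seq_len, head_dim] for the current tokens.
    if past_key is not None:
        key = torch.cat([past_key, key], dim=2)      # concat along kv_len
        value = torch.cat([past_value, value], dim=2)
    return key, value

# Prefill with a 7-token prompt, then decode one token at a time.
b, h, d = 1, 6, 64
k, v = append_kv(None, None, torch.randn(b, h, 7, d), torch.randn(b, h, 7, d))
k, v = append_kv(k, v, torch.randn(b, h, 1, d), torch.randn(b, h, 1, d))
```

After the decode step the cache holds 8 key/value positions, which is why generation only needs to feed the last token.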
Attention Masking
When attention_mask is provided, the model converts it to a key-padding boolean mask:
`key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)`
Then it builds:
- a causal constraint (positions cannot attend to future keys),
- ANDed with `key_padding_mask` (padded keys are masked out).
Practical recommendation:
- Use the standard HF convention: `attention_mask` values should be `1` for real tokens and `0` for padding tokens.
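The causal-AND-padding combination can be sketched like this (a conceptual reconstruction from the description above, not the repo's exact code; `build_attn_mask` is a hypothetical helper):

```python
import torch

def build_attn_mask(attention_mask: torch.Tensor, q_len: int, kv_len: int) -> torch.Tensor:
    # attention_mask: [batch, kv_len], 1 = real token, 0 = padding (HF convention).
    # Returns a boolean mask [batch, 1, q_len, kv_len] where True = "may attend".
    key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)
    # Causal constraint: query i sits at absolute position kv_len - q_len + i,
    # so shifting the triangle by (kv_len - q_len) handles cached prefixes too.
    causal = torch.tril(
        torch.ones(q_len, kv_len, dtype=torch.bool),
        diagonal=kv_len - q_len,
    )
    return causal[None, None, :, :] & key_padding_mask

# 4 tokens, last one is padding.
mask = build_attn_mask(torch.tensor([[1, 1, 1, 0]]), q_len=4, kv_len=4)
```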
Length Constraints
The model enforces:
`total_seq_len = past_length + seq_len <= config.max_seq_len`

If `total_seq_len` exceeds `max_seq_len`, the model raises a `ValueError`.

The default `max_seq_len` in the exported config for this checkpoint is 8192.
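The check reduces to a simple guard (an illustrative sketch; the function name and message are not the repo's):

```python
def check_total_length(past_length: int, seq_len: int, max_seq_len: int = 8192) -> int:
    # Reject inputs whose cached length plus new-token length exceeds the limit.
    total_seq_len = past_length + seq_len
    if total_seq_len > max_seq_len:
        raise ValueError(
            f"total_seq_len={total_seq_len} exceeds max_seq_len={max_seq_len}"
        )
    return total_seq_len

ok = check_total_length(past_length=8000, seq_len=192)  # exactly at the limit
```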
Default Hyperparameters
The exported config.json for the QED-75M checkpoint sets:
| Hyperparameter | Value |
|---|---|
| Approx. parameter count | ~75M |
| `n_layers` | 32 |
| `d_model` | 384 |
| `n_heads` | 6 |
| `head_dim` | 64 |
| `ffn_hidden_dim` | 1024 |
| `vocab_size` | 49152 |
| `max_seq_len` | 8192 |
| `rope_theta` | 10000.0 |
| `rms_norm_eps` | 1e-5 |
| `dropout` | 0.0 |
| `tie_word_embeddings` | true |
| internal linear bias (QKV/MLP) | false |
Tokenizer / special tokens (from the exported tokenizer_config.json):

- `<pad>`: id 0
- `<bos>`: id 1
- `<eos>`: id 2
- `<unk>`: id 3
How to Get Started with the Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "YOUR_ORG/QED-75M"  # replace with your actual Hub repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # optional
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
For loss computation:
- pass `labels` with the same shape as `input_ids`
- use `-100` in positions you want to ignore.
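For example, to score only a completion given a prompt, copy `input_ids` into `labels` and mask the prompt positions with `-100` (the token ids below are arbitrary placeholders):

```python
import torch

# Hypothetical sequence: a 4-token prompt followed by a 3-token completion.
prompt_len = 4
input_ids = torch.tensor([[5, 17, 9, 3, 42, 8, 2]])

# Mask the prompt so the loss covers only the completion tokens.
labels = input_ids.clone()
labels[:, :prompt_len] = -100

# loss = model(input_ids=input_ids, labels=labels).loss
```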
Model Card Contact
For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.