QED-75M

QED-75M is a compact decoder-only causal language model implemented as custom Hugging Face transformers remote code. The architecture combines RoPE (rotary position embeddings), RMSNorm, SwiGLU feed-forward blocks, and causal self-attention implemented via torch.nn.functional.scaled_dot_product_attention. The token embedding weights can be tied with the output projection (tie_word_embeddings).

This model card focuses on the model itself (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository README.md.


Model Details

Model Description

QED is a next-token prediction model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When labels are provided, the model computes the training loss as cross-entropy over the next-token targets (with ignore_index=-100).

The Hugging Face integration provides:

  • QEDConfig (model_type: qed)
  • QEDForCausalLM

Both classes are defined in the repo module modeling_qed.py and are loaded with trust_remote_code=True.

Model Sources

  • Code: the repository containing modeling_qed.py and the exported model artifacts.
  • Transformers implementation: modeling_qed.py (remote code in the model repo).
  • Training artifacts (checkpoints, logs, and related outputs): levossadtchi/QED-75M_artifacts.

Uses

Direct Use

  • Text generation using model.generate(...); the repository also includes a ready-to-run local inference script: generate_gravity_example.py.
  • Scoring / evaluating conditional likelihoods via model(input_ids=..., labels=...).

Downstream Use

  • Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.

Out-of-Scope Use

  • Using the model for high-stakes decisions (medical, legal, finance) without human verification.
  • Assuming the model is always factually correct or always safe.
  • Using the model to bypass safety systems or to generate disallowed content.

Bias, Risks, and Limitations

Like other language models, QED may produce:

  • Hallucinations (confident but incorrect statements).
  • Pattern repetition from training data.
  • Uneven quality across topics and languages, depending on what the specific checkpoint was trained on.

Mitigations:

  • Use output filtering and constrain the generation strategy when deploying in real applications.
  • Perform domain-specific evaluations before relying on the model.
  • Treat the model as a suggestion engine, not a ground-truth source.

Training Details

This model family was trained with a multi-stage pipeline (pretraining, context-length annealing, and SFT preparation).

High-level training data summary:

  • Pretraining volume: 12.6B tokens.
  • Data is a mixed corpus pipeline configured in the repository and processed into tokenized shards before training.
  • SFT stage uses chat/instruction-style datasets with assistant-targeted supervision.

All training artifacts are published separately at levossadtchi/QED-75M_artifacts.
Evaluation

We evaluated QED-75M and several baseline models with a custom evaluation pipeline based on the Hugging Face LightEval harness, following the setup used in the SmolLM2 model evaluations. The evaluation reports a "general" average over a fixed suite of tasks:

  • MMLU (aggregated over its MMLU subtasks in the LightEval leaderboard)
  • HellaSwag
  • ARC-Challenge
  • Winogrande
  • CommonsenseQA

The numbers below come from all_results_summary.csv produced by the evaluation run.

Model                           Average (general)  arc:challenge  commonsense_qa  hellaswag  winogrande  mmlu
HuggingFaceTB/SmolLM2-135M      0.299140           0.283276       0.190827        0.252440   0.519337    0.249822
levossadtchi/QED-75M            0.287318           0.231229       0.204750        0.253336   0.506709    0.240564
EleutherAI/gpt-neo-125m         0.279464           0.191126       0.205569        0.249751   0.521705    0.229170
EleutherAI/pythia-160m-deduped  0.275796           0.202218       0.194922        0.250846   0.501184    0.229811
openai-community/gpt2           0.273993           0.188567       0.196560        0.250249   0.505919    0.228671

[Figure: compute vs. score scatter plot (compute_vs_score_scatter)]


Technical Specifications

Model Architecture

QEDForCausalLM is a decoder-only transformer with the following high-level structure:

  • Token embeddings: embed_tokens = Embedding(vocab_size, d_model)
  • n_layers identical blocks (TransformerBlock), each applying:
    • Residual attention: x = x + Attention(RMSNorm(x))
    • Residual MLP: x = x + SwiGLU(RMSNorm(x))
  • Final normalization: norm = RMSNorm(d_model)
  • Output head: lm_head = Linear(d_model, vocab_size, bias=True)

Attention applies RoPE to Q and K and uses causal masking.
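The two residual steps above can be sketched in plain PyTorch (a minimal illustration, not the repo's modeling_qed.py; `attn` and `mlp` below stand in for the real attention and SwiGLU submodules):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

def block_forward(x, attn, mlp, norm1, norm2):
    # Pre-norm residual wiring, matching the bullets above:
    #   x = x + Attention(RMSNorm(x)); x = x + SwiGLU(RMSNorm(x))
    x = x + attn(norm1(x))
    x = x + mlp(norm2(x))
    return x
```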

Attention and RoPE

  • Projection layers (per attention block):
    • q_proj, k_proj, v_proj, o_proj are Linear(d_model, d_model, bias=config.bias)
  • Number of heads: n_heads
  • Head dimension: head_dim = d_model / n_heads
  • RoPE:
    • Rotary embedding precomputes cos_cached and sin_cached up to max_seq_len
    • RoPE is applied to Q and K using position_ids
  • Attention kernel:
    • Implemented with torch.nn.functional.scaled_dot_product_attention
    • Uses explicit scaling scale = head_dim ** -0.5
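A minimal sketch of this scheme, assuming the rotate-half RoPE convention common in Hugging Face models (the repo's exact layout may differ):

```python
import torch

def build_rope_cache(max_seq_len: int, head_dim: int, theta: float = 10000.0):
    """Precompute the cos/sin tables (mirroring cos_cached / sin_cached)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(max_seq_len).float()
    freqs = torch.outer(pos, inv_freq)        # [max_seq_len, head_dim/2]
    emb = torch.cat([freqs, freqs], dim=-1)   # [max_seq_len, head_dim]
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin, position_ids):
    # q, k: [batch, n_heads, seq_len, head_dim]; gather per-position tables
    cos = cos[position_ids].unsqueeze(1)  # [batch, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```

Because RoPE is a pure rotation of (Q, K) pairs, it preserves vector norms; only relative angles between positions change.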

MLP (SwiGLU)

The feed-forward sublayer is a SwiGLU variant:

  • gate_proj: Linear(d_model, ffn_hidden_dim)
  • up_proj: Linear(d_model, ffn_hidden_dim)
  • down_proj: Linear(ffn_hidden_dim, d_model)
  • Compute:
    • SwiGLU(x) = down_proj( silu(gate_proj(x)) * up_proj(x) )
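The compute rule above maps directly to a small PyTorch module (a sketch consistent with the config's internal-bias=false default, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down_proj( silu(gate_proj(x)) * up_proj(x) )."""
    def __init__(self, d_model: int, ffn_hidden_dim: int, bias: bool = False):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, ffn_hidden_dim, bias=bias)
        self.up_proj = nn.Linear(d_model, ffn_hidden_dim, bias=bias)
        self.down_proj = nn.Linear(ffn_hidden_dim, d_model, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate) acts as a learned gate on the up projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```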

Embeddings and Output Head

  • embed_tokens: size [vocab_size, d_model]
  • lm_head: Linear(d_model, vocab_size) with bias enabled (weight shape [vocab_size, d_model])
  • Weight tying:
    • When tie_word_embeddings=True, lm_head.weight is tied to embed_tokens.weight
    • The lm_head bias remains a separate parameter.
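The tying semantics can be illustrated with plain nn modules (a toy example, not the repo's code): the Linear head ends up sharing the Embedding's weight Parameter, while its bias remains independent.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 16, 100  # toy sizes for illustration
embed_tokens = nn.Embedding(vocab_size, d_model)   # weight: [vocab_size, d_model]
lm_head = nn.Linear(d_model, vocab_size, bias=True)  # weight: [vocab_size, d_model]

# Tie: both modules now hold the exact same Parameter object.
lm_head.weight = embed_tokens.weight
```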

Input/Output Interface

Typical usage via Transformers:

  • input_ids: torch.LongTensor of shape [batch_size, seq_len]
  • Optional:
    • position_ids: torch.LongTensor of shape [batch_size, seq_len]
    • attention_mask: torch.Tensor of shape [batch_size, seq_len]
    • labels: torch.LongTensor of shape [batch_size, seq_len] (positions with -100 are ignored)
    • past_key_values: list of length n_layers with cached keys/values
  • Outputs:
    • logits: [batch_size, seq_len, vocab_size]
    • loss: scalar when labels are provided
    • past_key_values: cached KV tensors when use_cache=True

KV Cache and Generation Semantics

  • The model uses a legacy tuple KV cache format (not the newer DynamicCache object). The integration explicitly disables default dynamic cache support (_supports_default_dynamic_cache() returns False).
  • In prepare_inputs_for_generation(...):
    • If past_key_values is provided, generation continues by feeding only the last token (input_ids[:, -1:]).
  • The attention layer concatenates past and current KV along the sequence dimension.

Expected KV shapes (conceptually):

  • For each layer, (key, value) have shape [batch_size, n_heads, kv_len, head_dim].
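Conceptually, the per-layer cache update is a concatenation along the sequence axis (a sketch, not the repo's exact code):

```python
import torch

def append_kv(past_key, past_value, key, value):
    """Concatenate cached and new K/V along the sequence dimension (dim=2).

    All tensors have shape [batch_size, n_heads, seq_len, head_dim];
    past_key/past_value may be None on the first (prefill) step.
    """
    if past_key is not None:
        key = torch.cat([past_key, key], dim=2)
        value = torch.cat([past_value, value], dim=2)
    return key, value
```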

Attention Masking

When attention_mask is provided, the model converts it to a key-padding boolean mask:

  • key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)

Then it builds:

  • causal constraint (positions cannot attend to future keys)
  • AND with key_padding_mask (mask out padded keys)

Practical recommendation:

  • Use the standard HF convention: attention_mask values should be 1 for real tokens and 0 for padding tokens.
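A sketch of how such a combined mask can be built (illustrative; the repo's exact construction may differ). It returns a boolean mask, True where attention is allowed, broadcastable to [batch, heads, q_len, kv_len]:

```python
import torch

def build_attention_mask(attention_mask, q_len, kv_len):
    """Combine the causal constraint with a key-padding mask.

    attention_mask: [batch, kv_len], 1 for real tokens and 0 for padding
    (standard HF convention). Queries are assumed to occupy the last
    q_len positions of the kv_len-long sequence (the KV-cache case).
    """
    causal = torch.ones(q_len, kv_len, dtype=torch.bool)
    causal = causal.tril(diagonal=kv_len - q_len)            # no future keys
    key_padding = attention_mask[:, None, None, :].to(torch.bool)
    return causal[None, None] & key_padding                  # AND of both
```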

Length Constraints

The model enforces:

  • total_seq_len = past_length + seq_len <= config.max_seq_len

If total_seq_len exceeds max_seq_len, the model raises a ValueError.

Default max_seq_len in the exported config for this checkpoint is 8192.

Default Hyperparameters

The exported config.json for the QED-75M checkpoint sets:

Hyperparameter                  Value
Approx. parameter count         ~75M
n_layers                        32
d_model                         384
n_heads                         6
head_dim                        64
ffn_hidden_dim                  1024
vocab_size                      49152
max_seq_len                     8192
rope_theta                      10000.0
rms_norm_eps                    1e-5
dropout                         0.0
tie_word_embeddings             true
internal linear bias (QKV/MLP)  false

Tokenizer / special tokens (from exported tokenizer_config.json):

  • <pad> id 0
  • <bos> id 1
  • <eos> id 2
  • <unk> id 3

How to Get Started with the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "YOUR_ORG/QED-75M"  # replace with your actual Hub repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # optional
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))

For loss computation:

  • pass labels with the same shape as input_ids
  • use -100 in positions you want to ignore.
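The shift-and-ignore behavior can be illustrated with plain F.cross_entropy (a standalone sketch of the loss semantics, not the repo's code):

```python
import torch
import torch.nn.functional as F

# Next-token cross-entropy with ignore_index=-100, as described above.
logits = torch.randn(1, 5, 10)              # [batch, seq_len, vocab_size]
labels = torch.tensor([[-100, 3, 7, -100, 2]])  # -100 marks ignored positions

# Shift so that position t predicts the token at position t+1.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, 10),
    shift_labels.reshape(-1),
    ignore_index=-100,   # positions labeled -100 contribute nothing
)
```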

Model Card Contact

For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.
