---
license: mit
datasets:
- dpavlis/ctl_lora_sft_data
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
---

# Model Card for dpavlis/Qwen2.5-Coder-7B-Instruct-CTL
A LoRA fine-tuned version of Qwen2.5-Coder-7B-Instruct specialised in CTL2 (Clover Transformation Language 2) — the domain-specific language used for data transformations in CloverDX ETL pipelines.
## Model Details

### Model Description
This model assists developers writing CTL2 transformation code inside CloverDX. It can generate correct CTL2 expressions and transformation logic, explain built-in functions, help with component binding patterns, and answer questions about the language. It is not intended as a general-purpose coding assistant.
- Developed by: David Pavlis (dpavlis)
- Model type: Causal LM, LoRA adapter over Qwen2.5-Coder-7B-Instruct
- Language(s): English (instructions), CTL2 (generated code)
- License: MIT
- Finetuned from: Qwen/Qwen2.5-Coder-7B-Instruct
### Model Sources
- Repository: dpavlis/Qwen2.5-Coder-7B-Instruct-CTL
- Training dataset: dpavlis/ctl_lora_sft_data
## Uses

### Direct Use
The model is designed to be used as a coding assistant for CloverDX CTL2 development. Typical use cases include:
- Generating CTL2 transformation expressions from a natural language description
- Implementing component binding logic (Reader, Writer, Joiner, Filter, Reformat, Aggregator)
- Using built-in CTL2 functions correctly — string manipulation, date arithmetic, type conversion, map/list operations, JSON access, regex, hashing
- Explaining what a given CTL2 expression or function does
- Answering questions about CTL2 language features, null handling, and type system behaviour
### Downstream Use
The model can be integrated into CloverDX developer tooling, IDE plugins, or internal chat assistants to provide context-aware CTL2 code suggestions and explanations.
### Out-of-Scope Use
- General-purpose code generation in Python, Java, SQL or other languages
- ETL graph design or multi-component orchestration reasoning
- Tasks unrelated to CTL2 or CloverDX transformations
- Production use without human review of generated code
## Bias, Risks, and Limitations
- The model is heavily biased toward CTL2 code generation. It may default to producing a code block even for questions that warrant a prose explanation.
- Coverage of rarely used built-in functions may be uneven despite targeted augmentation efforts.
- The model does not have access to runtime data or live CloverDX metadata — it cannot inspect actual graph configurations or metadata at inference time.
- All generated CTL2 code should be reviewed before use in production transformations.
### Recommendations
Always validate generated CTL2 expressions against your actual field types and metadata. The model does not know your specific graph schema unless it is provided in the prompt.
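One practical pattern is to embed your record metadata directly in the system prompt so generated code references real field names and types. A minimal sketch — the field names and types below are hypothetical, not part of any real schema:

```python
def build_system_prompt(fields: dict) -> str:
    """Embed record metadata in the system prompt so generated CTL2
    references actual field names and CTL2 types.
    The field dictionary here is purely illustrative."""
    schema = "\n".join(f"- {name}: {ctl_type}" for name, ctl_type in fields.items())
    return (
        "You are an expert in CloverDX CTL2 transformation language.\n"
        "The input record has the following fields:\n" + schema
    )

# Hypothetical metadata for an input record
prompt = build_system_prompt(
    {"customer_name": "string", "created": "date", "amount": "decimal"}
)
print(prompt)
```

The resulting string can be used as the `system` message content in the getting-started example below.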
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_id = "dpavlis/Qwen2.5-Coder-7B-Instruct-CTL"

# Load the base model and attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are an expert in CloverDX CTL2 transformation language."},
    {"role": "user", "content": "Write a CTL2 expression that trims whitespace from a string field and converts it to uppercase."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
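Model replies often wrap the CTL2 code in a markdown fence alongside prose. A small post-processing helper can pull out just the code for pasting into a CloverDX component — a sketch, with the reply string below purely illustrative:

```python
import re

def extract_ctl2(reply):
    """Pull the first fenced code block out of a model reply;
    fall back to the raw text if no fence is present."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

# Illustrative model reply, not actual model output
reply = "Here you go:\n```\n//#CTL2\nupperCase(trim($in.0.name))\n```"
print(extract_ctl2(reply))
```

Checking that the extracted code begins with the `//#CTL2` directive is a cheap first-pass sanity check before human review.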
## Training Details

### Training Data
Training data is available at dpavlis/ctl_lora_sft_data.
The dataset contains approximately 7,085 supervised fine-tuning examples covering:
- CTL2 code generation — the majority of examples; the user describes a transformation goal and the model generates correct CTL2 code with the `//#CTL2` directive
- Built-in function usage — targeted examples for all built-in functions across string, math, date, conversion, map/list, JSON, regex, and hashing categories; every function has at least 30 training examples
- Component binding patterns — examples for the 6 core CloverDX components (Reader, Writer, Joiner, Filter, Reformat, Aggregator) with their specific CTL2 function signature requirements
- Language bridge examples — translations of common patterns from Java, Python, and SQL into equivalent CTL2
- LLM-augmented examples — synthetically generated examples using GPT-4 to improve coverage of less common functions and edge cases (null inputs, type coercion, chained expressions)
### Training Procedure
Fine-tuning was performed using LLaMA-Factory with the following setup, arrived at after iterative hyperparameter optimisation across 5 training runs.
#### Training Hyperparameters
| Parameter | Value |
|---|---|
| Training regime | fp16 mixed precision |
| Fine-tuning method | LoRA (SFT stage) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.07 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 161,480,704 |
| Learning rate | 3e-5 |
| LR scheduler | cosine_with_restarts (3 cycles) |
| Epochs | 3 |
| Total optimisation steps | 1,182 |
| Batch size per device | 3 |
| Gradient accumulation steps | 3 |
| Effective total batch size | 18 |
| Warmup steps | 80 |
| Weight decay | 0.05 |
| Max grad norm | 1.0 |
| Sequence cutoff length | 1,700 tokens |
| Optimizer | adamw_torch |
| Thinking mode | Enabled (enable_thinking True) |
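The reported trainable-parameter and step counts follow directly from this configuration. A quick consistency check, assuming the published Qwen2.5-7B projection dimensions (hidden size 3584, KV dim 512, MLP dim 18944, 28 decoder layers):

```python
import math

# Qwen2.5-7B projection shapes (in_features, out_features) per decoder layer
shapes = {
    "q_proj": (3584, 3584), "k_proj": (3584, 512), "v_proj": (3584, 512),
    "o_proj": (3584, 3584), "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944), "down_proj": (18944, 3584),
}
rank, layers = 64, 28
# Each LoRA A/B pair adds rank * (in + out) parameters per adapted module
trainable = layers * sum(rank * (i + o) for i, o in shapes.values())
print(trainable)  # 161480704 — matches the table above

# 7,085 examples, effective batch 18 (3 per device x 3 accum x 2 GPUs), 3 epochs
steps = math.ceil(7085 / 18) * 3
print(steps)  # 1182 — matches the table above
```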
#### Speeds, Sizes, Times
- Hardware: 2× NVIDIA GPU (multi-GPU training via DDP)
- LoRA adapter size: ~620 MB
- Training time: approximately 3–4 hours for the final run
## Evaluation

### Testing Data
A held-out evaluation dataset (`clover_ctl_eval_data`) was used throughout training, separate from all training sets. Evaluation was run every 150 steps using the Hugging Face `Trainer` `eval_loss` metric. The best checkpoint was selected automatically via `load_best_model_at_end=True`.
### Metrics
Evaluation was tracked using cross-entropy loss on the held-out CTL2 eval set.
### Results
| Metric | Value |
|---|---|
| Training samples | 7,085+ |
| Total optimisation steps | 1,182 |
| Best eval loss | 0.1475 |
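For intuition, a token-level cross-entropy loss of 0.1475 corresponds to a perplexity of exp(0.1475) ≈ 1.16 on the held-out CTL2 set:

```python
import math

eval_loss = 0.1475  # best eval cross-entropy from the table above
perplexity = math.exp(eval_loss)
print(round(perplexity, 3))  # ~1.159
```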
#### Summary
The model reliably generates syntactically correct CTL2 code for the most common transformation patterns. It performs best on string manipulation, date arithmetic, type conversion, and single-component binding tasks. Complex chained expressions and rarely used functions may still require review.
Beyond code generation, the model is also capable of engaging in conversational interactions — explaining what a function does, listing available functions for a given purpose, describing language concepts such as null propagation or type coercion, and recommending the right function for a described task. It recognises the difference between a request to write code and a question that warrants a prose explanation, and responds appropriately in both cases.
## Technical Specifications

### Model Architecture and Objective
- Base architecture: Qwen2.5-Coder-7B-Instruct (transformer decoder, 7B parameters)
- Adaptation method: Low-Rank Adaptation (LoRA), rank 64, alpha 128, applied to all attention and MLP projection layers
- Objective: Next-token prediction (SFT) on CTL2 instruction-following examples
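In SFT, the next-token cross-entropy is typically computed only over the completion tokens; prompt positions are masked with the label value `-100`, which the loss ignores. A minimal sketch of this label construction — the token IDs below are illustrative, not real tokenizer output:

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def build_labels(prompt_ids, completion_ids):
    """Supervise only the completion: prompt positions are masked out."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

# Illustrative token IDs, not real tokenizer output
labels = build_labels([101, 2054, 2003], [7592, 102])
print(labels)  # [-100, -100, -100, 7592, 102]
```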
### Compute Infrastructure

#### Hardware
- 2× NVIDIA GPU (multi-GPU, DDP)
#### Software
- LLaMA-Factory
- PEFT
- Transformers
- Python 3.10+, PyTorch, CUDA
## Citation
If you use this model, please cite the base model:
```bibtex
@misc{hui2024qwen25codertechnicalreport,
  title={Qwen2.5-Coder Technical Report},
  author={Binyuan Hui et al.},
  year={2024},
  eprint={2409.12186},
  archivePrefix={arXiv}
}
```
## Model Card Authors
David Pavlis (dpavlis)
## Model Card Contact
Please open an issue on the Hugging Face repository for questions or bug reports.