---
license: mit
datasets:
- dpavlis/ctl_lora_sft_data
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
---

# Model Card for dpavlis/Qwen2.5-Coder-7B-Instruct-CTL
A LoRA fine-tuned version of Qwen2.5-Coder-7B-Instruct specialised in CTL2 (Clover Transformation Language 2) — the domain-specific language used for data transformations in CloverDX ETL pipelines.
## Model Details

### Model Description
This model assists developers writing CTL2 transformation code inside CloverDX. It can generate correct CTL2 expressions and transformation logic, explain built-in functions, help with component binding patterns, and answer questions about the language. It is not intended as a general-purpose coding assistant.
- Developed by: David Pavlis (dpavlis)
- Model type: Causal LM, LoRA adapter over Qwen2.5-Coder-7B-Instruct
- Language(s): English (instructions), CTL2 (generated code)
- License: MIT
- Finetuned from: Qwen/Qwen2.5-Coder-7B-Instruct
### Model Sources
- Repository: dpavlis/Qwen2.5-Coder-7B-Instruct-CTL
- Training dataset: dpavlis/ctl_lora_sft_data
## Uses

### Direct Use
The model is designed to be used as a coding assistant for CloverDX CTL2 development. Typical use cases include:
- Generating CTL2 transformation expressions from a natural language description
- Implementing component binding logic (Reader, Writer, Joiner, Filter, Reformat, Aggregator)
- Using built-in CTL2 functions correctly — string manipulation, date arithmetic, type conversion, map/list operations, JSON access, regex, hashing
- Explaining what a given CTL2 expression or function does
- Answering questions about CTL2 language features, null handling, and type system behaviour
### Downstream Use
The model can be integrated into CloverDX developer tooling, IDE plugins, or internal chat assistants to provide context-aware CTL2 code suggestions and explanations.
### Out-of-Scope Use
- General-purpose code generation in Python, Java, SQL or other languages
- ETL graph design or multi-component orchestration reasoning
- Tasks unrelated to CTL2 or CloverDX transformations
- Production use without human review of generated code
## Bias, Risks, and Limitations
- The model is heavily biased toward CTL2 code generation. It may default to producing a code block even for questions that warrant a prose explanation.
- Coverage of rarely used built-in functions may be uneven despite targeted augmentation efforts.
- The model does not have access to runtime data or live CloverDX metadata — it cannot inspect actual graph configurations or metadata at inference time.
- All generated CTL2 code should be reviewed before use in production transformations.
### Recommendations
Always validate generated CTL2 expressions against your actual field types and metadata. The model does not know your specific graph schema unless it is provided in the prompt.
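One practical pattern is to embed your record metadata directly in the system prompt so generated code references real field names and types. A minimal sketch — the field names and types below are hypothetical, not part of any real schema:

```python
def build_system_prompt(fields: dict) -> str:
    """Embed record metadata in the system prompt so generated CTL2
    references actual field names and CTL2 types.
    The field dictionary here is purely illustrative."""
    schema = "\n".join(f"- {name}: {ctl_type}" for name, ctl_type in fields.items())
    return (
        "You are an expert in CloverDX CTL2 transformation language.\n"
        "The input record has the following fields:\n" + schema
    )

# Hypothetical metadata for an input record
prompt = build_system_prompt(
    {"customer_name": "string", "created": "date", "amount": "decimal"}
)
print(prompt)
```

The resulting string can be used as the `system` message content in the getting-started example below.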
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_id = "dpavlis/Qwen2.5-Coder-7B-Instruct-CTL"

# Load the base model and attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are an expert in CloverDX CTL2 transformation language."},
    {"role": "user", "content": "Write a CTL2 expression that trims whitespace from a string field and converts it to uppercase."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
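Model replies often wrap the CTL2 code in a markdown fence alongside prose. A small post-processing helper can pull out just the code for pasting into a CloverDX component — a sketch, with the reply string below purely illustrative:

```python
import re

def extract_ctl2(reply):
    """Pull the first fenced code block out of a model reply;
    fall back to the raw text if no fence is present."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

# Illustrative model reply, not actual model output
reply = "Here you go:\n```\n//#CTL2\nupperCase(trim($in.0.name))\n```"
print(extract_ctl2(reply))
```

Checking that the extracted code begins with the `//#CTL2` directive is a cheap first-pass sanity check before human review.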
## Training Details

### Training Data
Training data is available at dpavlis/ctl_lora_sft_data.
The dataset contains approximately 7,085 supervised fine-tuning examples covering:
- CTL2 code generation — the majority of examples; the user describes a transformation goal and the model generates correct CTL2 code with the `//#CTL2` directive
- Built-in function usage — targeted examples for all built-in functions across string, math, date, conversion, map/list, JSON, regex, and hashing categories; every function has at least 30 training examples
- Component binding patterns — examples for the 6 core CloverDX components (Reader, Writer, Joiner, Filter, Reformat, Aggregator) with their specific CTL2 function signature requirements
- Language bridge examples — translations of common patterns from Java, Python, and SQL into equivalent CTL2
- LLM-augmented examples — synthetically generated examples using GPT-4 to improve coverage of less common functions and edge cases (null inputs, type coercion, chained expressions)
### Training Procedure
Fine-tuning was performed using LLaMA-Factory with the following setup, arrived at after iterative hyperparameter optimisation across 5 training runs.
#### Training Hyperparameters
| Parameter | Value |
|---|---|
| Training regime | fp16 mixed precision |
| Fine-tuning method | LoRA (SFT stage) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.07 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 161,480,704 |
| Learning rate | 3e-5 |
| LR scheduler | cosine_with_restarts (3 cycles) |
| Epochs | 3 |
| Total optimisation steps | 1,182 |
| Batch size per device | 3 |
| Gradient accumulation steps | 3 |
| Effective total batch size | 18 |
| Warmup steps | 80 |
| Weight decay | 0.05 |
| Max grad norm | 1.0 |
| Sequence cutoff length | 1,700 tokens |
| Optimizer | adamw_torch |
| Thinking mode | Enabled (enable_thinking True) |
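The reported trainable-parameter and step counts follow directly from this configuration. A quick consistency check, assuming the published Qwen2.5-7B projection dimensions (hidden size 3584, KV dim 512, MLP dim 18944, 28 decoder layers):

```python
import math

# Qwen2.5-7B projection shapes (in_features, out_features) per decoder layer
shapes = {
    "q_proj": (3584, 3584), "k_proj": (3584, 512), "v_proj": (3584, 512),
    "o_proj": (3584, 3584), "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944), "down_proj": (18944, 3584),
}
rank, layers = 64, 28
# Each LoRA A/B pair adds rank * (in + out) parameters per adapted module
trainable = layers * sum(rank * (i + o) for i, o in shapes.values())
print(trainable)  # 161480704 — matches the table above

# 7,085 examples, effective batch 18 (3 per device x 3 accum x 2 GPUs), 3 epochs
steps = math.ceil(7085 / 18) * 3
print(steps)  # 1182 — matches the table above
```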
#### Speeds, Sizes, Times
- Hardware: 2× NVIDIA GPU (multi-GPU training via DDP)
- LoRA adapter size: ~620 MB
- Training time: approximately 3–4 hours for the final run
## Evaluation

### Testing Data
A held-out evaluation dataset (`clover_ctl_eval_data`) was used throughout training, separate from all training sets. Evaluation was run every 150 steps using the Hugging Face `Trainer` `eval_loss` metric. The best checkpoint was selected automatically via `load_best_model_at_end=True`.
### Metrics
Evaluation was tracked using cross-entropy loss on the held-out CTL2 eval set.
### Results
| Metric | Value |
|---|---|
| Training samples | 7,085+ |
| Total optimisation steps | 1,182 |
| Best eval loss | 0.1475 |
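For intuition, a token-level cross-entropy loss of 0.1475 corresponds to a perplexity of exp(0.1475) ≈ 1.16 on the held-out CTL2 set:

```python
import math

eval_loss = 0.1475  # best eval cross-entropy from the table above
perplexity = math.exp(eval_loss)
print(round(perplexity, 3))  # ~1.159
```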
#### Summary
The model reliably generates syntactically correct CTL2 code for the most common transformation patterns. It performs best on string manipulation, date arithmetic, type conversion, and single-component binding tasks. Complex chained expressions and rarely used functions may still require review.
Beyond code generation, the model is also capable of engaging in conversational interactions — explaining what a function does, listing available functions for a given purpose, describing language concepts such as null propagation or type coercion, and recommending the right function for a described task. It recognises the difference between a request to write code and a question that warrants a prose explanation, and responds appropriately in both cases.
## Technical Specifications

### Model Architecture and Objective
- Base architecture: Qwen2.5-Coder-7B-Instruct (transformer decoder, 7B parameters)
- Adaptation method: Low-Rank Adaptation (LoRA), rank 64, alpha 128, applied to all attention and MLP projection layers
- Objective: Next-token prediction (SFT) on CTL2 instruction-following examples
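In SFT, the next-token cross-entropy is typically computed only over the completion tokens; prompt positions are masked with the label value `-100`, which the loss ignores. A minimal sketch of this label construction — the token IDs below are illustrative, not real tokenizer output:

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def build_labels(prompt_ids, completion_ids):
    """Supervise only the completion: prompt positions are masked out."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

# Illustrative token IDs, not real tokenizer output
labels = build_labels([101, 2054, 2003], [7592, 102])
print(labels)  # [-100, -100, -100, 7592, 102]
```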
### Compute Infrastructure

#### Hardware
- 2× NVIDIA GPU (multi-GPU, DDP)
#### Software
- LLaMA-Factory
- PEFT
- Transformers
- Python 3.10+, PyTorch, CUDA
## Citation
If you use this model, please cite the base model:
```bibtex
@misc{hui2024qwen25codertechnicalreport,
  title={Qwen2.5-Coder Technical Report},
  author={Binyuan Hui et al.},
  year={2024},
  eprint={2409.12186},
  archivePrefix={arXiv}
}
```
## Model Card Authors
David Pavlis (dpavlis)
## Model Card Contact
Please open an issue on the Hugging Face repository for questions or bug reports.