Model Card for dpavlis/Qwen2.5-Coder-7B-Instruct-CTL

license: mit
datasets:
  - dpavlis/ctl_lora_sft_data
base_model:
  - Qwen/Qwen2.5-Coder-7B-Instruct

A LoRA fine-tuned version of Qwen2.5-Coder-7B-Instruct specialised in CTL2 (Clover Transformation Language 2) — the domain-specific language used for data transformations in CloverDX ETL pipelines.

Model Details

Model Description

This model assists developers writing CTL2 transformation code inside CloverDX. It can generate correct CTL2 expressions and transformation logic, explain built-in functions, help with component binding patterns, and answer questions about the language. It is not intended as a general-purpose coding assistant.

  • Developed by: David Pavlis (dpavlis)
  • Model type: Causal LM, LoRA adapter over Qwen2.5-Coder-7B-Instruct
  • Language(s): English (instructions), CTL2 (generated code)
  • License: MIT
  • Finetuned from: Qwen/Qwen2.5-Coder-7B-Instruct

Model Sources

  • Repository: dpavlis/Qwen2.5-Coder-7B-Instruct-CTL on Hugging Face
  • Base model: Qwen/Qwen2.5-Coder-7B-Instruct
Uses

Direct Use

The model is designed to be used as a coding assistant for CloverDX CTL2 development. Typical use cases include:

  • Generating CTL2 transformation expressions from a natural language description
  • Implementing component binding logic (Reader, Writer, Joiner, Filter, Reformat, Aggregator)
  • Using built-in CTL2 functions correctly — string manipulation, date arithmetic, type conversion, map/list operations, JSON access, regex, hashing
  • Explaining what a given CTL2 expression or function does
  • Answering questions about CTL2 language features, null handling, and type system behaviour
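As an illustration of the target output, a typical generated Reformat transformation might look like the following (a hand-written sketch, not actual model output; the field names are hypothetical):

```
//#CTL2
function integer transform() {
    // trim whitespace and normalise case; field names are illustrative
    $out.0.email = lowerCase(trim($in.0.email));
    $out.0.name  = upperCase(trim($in.0.name));
    return ALL;
}
```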

Downstream Use

The model can be integrated into CloverDX developer tooling, IDE plugins, or internal chat assistants to provide context-aware CTL2 code suggestions and explanations.

Out-of-Scope Use

  • General-purpose code generation in Python, Java, SQL or other languages
  • ETL graph design or multi-component orchestration reasoning
  • Tasks unrelated to CTL2 or CloverDX transformations
  • Production use without human review of generated code

Bias, Risks, and Limitations

  • The model is heavily biased toward CTL2 code generation. It may default to producing a code block even for questions that warrant a prose explanation.
  • Coverage of rarely used built-in functions may be uneven despite targeted augmentation efforts.
  • The model does not have access to runtime data or live CloverDX metadata — it cannot inspect actual graph configurations or metadata at inference time.
  • All generated CTL2 code should be reviewed before use in production transformations.

Recommendations

Always validate generated CTL2 expressions against your actual field types and metadata. The model does not know your specific graph schema unless it is provided in the prompt.
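One way to supply that schema context is to embed the record metadata directly in the system prompt. A minimal sketch (the helper name and metadata format are illustrative conventions, not part of any CloverDX or model API):

```python
def build_system_prompt(fields: dict[str, str]) -> str:
    """Embed input record metadata in the system prompt so generated
    CTL2 can reference the actual field names and types."""
    schema = "\n".join(f"  {name}: {ctl_type}" for name, ctl_type in fields.items())
    return (
        "You are an expert in CloverDX CTL2 transformation language.\n"
        "Input record metadata ($in.0):\n" + schema
    )

prompt = build_system_prompt({"email": "string", "created": "date"})
```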


How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_id    = "dpavlis/Qwen2.5-Coder-7B-Instruct-CTL"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model     = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
model     = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are an expert in CloverDX CTL2 transformation language."},
    {"role": "user",   "content": "Write a CTL2 expression that trims whitespace from a string field and converts it to uppercase."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Training Details

Training Data

Training data is available at dpavlis/ctl_lora_sft_data.

The dataset contains approximately 7,085 supervised fine-tuning examples covering:

  • CTL2 code generation — the majority of examples; user describes a transformation goal, model generates correct CTL2 code with the //#CTL2 directive
  • Built-in function usage — targeted examples for all built-in functions across string, math, date, conversion, map/list, JSON, regex, and hashing categories; every function has at least 30 training examples
  • Component binding patterns — examples for the 6 core CloverDX components (Reader, Writer, Joiner, Filter, Reformat, Aggregator) with their specific CTL2 function signature requirements
  • Language bridge examples — translations of common patterns from Java, Python, and SQL into equivalent CTL2
  • LLM-augmented examples — synthetically generated examples using GPT-4 to improve coverage of less common functions and edge cases (null inputs, type coercion, chained expressions)
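For orientation, an SFT record in such a dataset typically pairs a natural-language instruction with a CTL2 completion. The field names below are a hypothetical layout, not the dataset's actual schema — check the dataset card for the real format:

```python
import json

# Hypothetical record layout -- the real schema may differ.
example = {
    "instruction": "Trim whitespace from a string field and convert it to uppercase.",
    "output": "//#CTL2\nupperCase(trim($in.0.name))",
}
line = json.dumps(example)  # JSON-lines style serialisation
```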

Training Procedure

Fine-tuning was performed with LLaMA-Factory; the configuration below was selected after iterative hyperparameter optimisation across 5 training runs.
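A LLaMA-Factory YAML along these lines would reproduce the setup; the key names follow LLaMA-Factory's SFT examples, but this fragment is a reconstruction, not the actual training config:

```yaml
model_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct
stage: sft
finetuning_type: lora
lora_rank: 64
lora_alpha: 128
lora_dropout: 0.07
lora_target: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
learning_rate: 3.0e-5
lr_scheduler_type: cosine_with_restarts
num_train_epochs: 3
per_device_train_batch_size: 3
gradient_accumulation_steps: 3
warmup_steps: 80
weight_decay: 0.05
max_grad_norm: 1.0
cutoff_len: 1700
fp16: true
```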

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Fine-tuning method: LoRA (SFT stage)
  • LoRA rank: 64
  • LoRA alpha: 128
  • LoRA dropout: 0.07
  • LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Trainable parameters: 161,480,704
  • Learning rate: 3e-5
  • LR scheduler: cosine_with_restarts (3 cycles)
  • Epochs: 3
  • Total optimisation steps: 1,182
  • Batch size per device: 3
  • Gradient accumulation steps: 3
  • Effective total batch size: 18
  • Warmup steps: 80
  • Weight decay: 0.05
  • Max grad norm: 1.0
  • Sequence cutoff length: 1,700 tokens
  • Optimizer: adamw_torch
  • Thinking mode: enabled (enable_thinking=True)
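These numbers are internally consistent: 3 per device × 3 accumulation steps × 2 GPUs gives the effective batch size of 18, and ceil(7,085 / 18) batches per epoch over 3 epochs gives exactly 1,182 optimisation steps. As a quick check:

```python
import math

per_device, grad_accum, gpus = 3, 3, 2
effective_batch = per_device * grad_accum * gpus    # 18
steps_per_epoch = math.ceil(7085 / effective_batch) # 394
total_steps = steps_per_epoch * 3                   # 1,182 over 3 epochs
```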

Speeds, Sizes, Times

  • Hardware: 2× NVIDIA GPU (multi-GPU training via DDP)
  • LoRA adapter size: ~620 MB
  • Training time: approximately 3–4 hours for the final run

Evaluation

Testing Data

A held-out evaluation dataset (clover_ctl_eval_data) was used throughout training, separate from all training sets. Evaluation was run every 150 steps using the HuggingFace Trainer eval_loss metric. The best checkpoint was selected automatically via load_best_model_at_end=True.

Metrics

Evaluation was tracked using cross-entropy loss on the held-out CTL2 eval set.

Results

  • Training samples: 7,085+
  • Total optimisation steps: 1,182
  • Best eval loss: 0.1475
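Since the metric is per-token cross-entropy, the best eval loss of 0.1475 corresponds to a per-token perplexity of exp(0.1475) ≈ 1.16 on the held-out CTL2 set:

```python
import math

eval_loss = 0.1475               # best eval cross-entropy (nats per token)
perplexity = math.exp(eval_loss) # ~1.16
```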

Summary

The model reliably generates syntactically correct CTL2 code for the most common transformation patterns. It performs best on string manipulation, date arithmetic, type conversion, and single-component binding tasks. Complex chained expressions and rarely used functions may still require review.

Beyond code generation, the model is also capable of engaging in conversational interactions — explaining what a function does, listing available functions for a given purpose, describing language concepts such as null propagation or type coercion, and recommending the right function for a described task. It recognises the difference between a request to write code and a question that warrants a prose explanation, and responds appropriately in both cases.


Technical Specifications

Model Architecture and Objective

  • Base architecture: Qwen2.5-Coder-7B-Instruct (transformer decoder, 7B parameters)
  • Adaptation method: Low-Rank Adaptation (LoRA), rank 64, alpha 128, applied to all attention and MLP projection layers
  • Objective: Next-token prediction (SFT) on CTL2 instruction-following examples
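The reported 161,480,704 trainable parameters can be reproduced from Qwen2.5-7B's published dimensions (28 layers, hidden size 3,584, intermediate size 18,944, grouped-query attention with 512-dimensional K/V projections): each adapted weight of shape (d_out, d_in) adds r·(d_in + d_out) LoRA parameters at rank r = 64. A quick check:

```python
hidden, inter, kv, rank, layers = 3584, 18944, 512, 64, 28

# (d_in, d_out) for each LoRA target module in one decoder layer
modules = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj (grouped-query attention)
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, inter),   # gate_proj
    (hidden, inter),   # up_proj
    (inter, hidden),   # down_proj
]
per_layer = sum(rank * (d_in + d_out) for d_in, d_out in modules)
total = per_layer * layers  # 161,480,704
```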

Compute Infrastructure

Hardware

  • 2× NVIDIA GPU (multi-GPU, DDP)

Software

  • LLaMA-Factory (training)
  • Hugging Face transformers and peft (inference)

Citation

If you use this model, please cite the base model:

@misc{hui2024qwen25codertechnicalreport,
  title={Qwen2.5-Coder Technical Report},
  author={Binyuan Hui and others},
  year={2024},
  eprint={2409.12186},
  archivePrefix={arXiv}
}

Model Card Authors

David Pavlis (dpavlis)

Model Card Contact

Please open an issue on the Hugging Face repository for questions or bug reports.
