---
base_model:
  - JeloH/xGenq-qwen2.5-coder-7b-instruct-OKI
library_name: peft
---

Model Card for JeloH/LLM4CodeRE-S2S-V1

LLM4CodeRE-S2S-V1 is a PEFT fine-tuned causal language model for reverse-engineering-oriented code translation tasks. It supports sequence-to-sequence style prompting for mapping between source code, assembly, and binary-related representations.

Model Details

Model Description

LLM4CodeRE-S2S-V1 is a multi-task reverse engineering model built on top of JeloH/xGenq-qwen2.5-coder-7b-instruct-OKI. It was fine-tuned using LoRA on instruction-style code translation tasks, including assembly-to-source and source-to-assembly conversion, along with binary-related code transformation tasks.

The model uses a causal language modeling objective with sequence-to-sequence style prompting. Here, “S2S” refers to prompt-based input-output translation within a single causal sequence, rather than a traditional encoder-decoder architecture.
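To make the single-sequence idea concrete, the sketch below concatenates a prompt and its target into one token list and marks which positions belong to the target. It reuses the prompt templates from the getting-started code in this card. The whitespace tokenizer and the `build_example` helper are hypothetical stand-ins, and masking the loss to target positions is an assumption: the card does not say how the training loss was computed.

```python
def build_prompt(task: str, input_text: str) -> str:
    """Build a prompt with the templates used in this card's getting-started code."""
    if task == "Asm2Src":
        return f"Task: Asm2Src. Convert assembly to C/C++:\n\n{input_text}\n\nSource code:"
    if task == "Src2Asm":
        return f"Task: Src2Asm. Convert C/C++ to assembly:\n\n{input_text}\n\nAssembly:"
    raise ValueError(f"No prompt template known for task {task!r}")


def toy_tokenize(text: str) -> list[str]:
    """Hypothetical whitespace tokenizer standing in for the real subword tokenizer."""
    return text.split()


def build_example(task: str, input_text: str, target_text: str):
    """Return (tokens, loss_mask) for one prompt + target pair in a single sequence."""
    prompt_tokens = toy_tokenize(build_prompt(task, input_text))
    target_tokens = toy_tokenize(target_text)
    # A causal LM sees prompt and target as one left-to-right sequence.
    tokens = prompt_tokens + target_tokens
    # Assumption: only target positions contribute to the loss.
    loss_mask = [False] * len(prompt_tokens) + [True] * len(target_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_example("Asm2Src", "mov eax, 0\nret", "int main() { return 0; }")
for token, is_target in zip(tokens, loss_mask):
    print("target" if is_target else "prompt", token)
```

At inference time the same layout applies, except that the target half is produced by the model instead of being supplied.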

Uses

Direct Use

This model is intended for research and experimental reverse-engineering tasks, including:

Assembly to source code (Asm2Src)
Source code to assembly (Src2Asm)
Binary to source code (Binary2Src)
Source code to binary (Src2Binary)
Binary to assembly (Binary2Asm)
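
The task labels above encode a direction, which can be read off programmatically. The sketch below is a small helper for validating a task name and recovering its input and output representations. The task names are taken from the list above; `TASKS`, `REPRESENTATIONS`, and `task_direction` are hypothetical names and not part of the model's API.

```python
# Task names copied from the list above; the helper itself is hypothetical.
TASKS = ("Asm2Src", "Src2Asm", "Binary2Src", "Src2Binary", "Binary2Asm")

# Short names used in the task labels, expanded for readability.
REPRESENTATIONS = {"Asm": "assembly", "Src": "source code", "Binary": "binary"}


def task_direction(task: str) -> tuple[str, str]:
    """Return (input representation, output representation) for a listed task."""
    if task not in TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {TASKS}")
    source, target = task.split("2")
    return REPRESENTATIONS[source], REPRESENTATIONS[target]


for name in TASKS:
    print(name, "->", task_direction(name))
```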

Downstream Use

Potential downstream uses include:

reverse engineering research
code translation experiments
educational use in code understanding
program analysis and representation learning pipelines

Results

Citation

Jelodar, H., Bai, S., Nwankwo, T. E., Hamedi, P., Meymani, M., Razavi-Far, R., & Ghorbani, A. A. (2026). LLM4CodeRE: Generative AI for code decompilation analysis and reverse engineering. arXiv. https://doi.org/10.48550/arXiv.2604.06095

How to Get Started with the Model

Use the code below to get started with the model. The example covers the Asm2Src and Src2Asm tasks only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "JeloH/LLM4CodeRE-S2S-V1"
# Half precision is only a safe default on GPU; fall back to float32 on CPU.
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# device_map="auto" requires the `accelerate` package; loading a PEFT
# repository through transformers also requires `peft` to be installed.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=dtype,
    device_map="auto",
)
model.eval()

def generate_output(task, input_text):
    """Build the task prompt, run greedy decoding, and return only the new text."""
    if task == "Asm2Src":
        prompt = f"Task: Asm2Src. Convert assembly to C/C++:\n\n{input_text}\n\nSource code:"
    elif task == "Src2Asm":
        prompt = f"Task: Src2Asm. Convert C/C++ to assembly:\n\n{input_text}\n\nAssembly:"
    else:
        raise ValueError("Only Asm2Src and Src2Asm are supported in this example")

    # With device_map="auto", send the inputs to wherever the model was placed.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
        )

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign if detokenization does not reproduce the prompt exactly.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

src_code_example = """// Main function
int main() {
    return 0;
}
"""

asm_code_example = """push ebp
mov ebp, esp
mov eax, 0
pop ebp
ret
"""

for task, text in [("Src2Asm", src_code_example), ("Asm2Src", asm_code_example)]:
    print(f"\n===== {task} =====\n")
    print("INPUT:\n", text)
    print("\nOUTPUT:\n", generate_output(task, text))
```
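
As an alternative, the adapter can be attached to the base model explicitly with the PEFT library. The sketch below assumes the repository stores LoRA adapter weights rather than merged weights, which `library_name: peft` suggests but the card does not confirm. `load_with_peft` is a hypothetical helper name, and the heavy imports sit inside the function so that defining it does not require the libraries or any download.

```python
BASE_MODEL = "JeloH/xGenq-qwen2.5-coder-7b-instruct-OKI"  # base model named in this card
ADAPTER = "JeloH/LLM4CodeRE-S2S-V1"                        # this repository


def load_with_peft(base_id: str = BASE_MODEL, adapter_id: str = ADAPTER):
    """Load the base model, then attach the adapter on top (assumes adapter-only weights)."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Half precision on GPU, full precision on CPU.
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # The getting-started code loads the tokenizer from this repository, so do the same here.
    tokenizer = AutoTokenizer.from_pretrained(adapter_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=dtype,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `tokenizer, model = load_with_peft()` downloads both repositories on first use. The returned objects can be used with the same `generate` call as in the example above.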