Blowfish
Introduction
**Blowfish** is a large language model (LLM) developed for molecular toxicity prediction.
It is fine-tuned from Qwen3-14B and goes beyond simple binary classification: using a Chain-of-Thought (CoT) approach, it explains step by step the chemical and biological grounds for its toxicity call.
It jointly analyzes the user-supplied SMILES, Cell Line, Bio Assay, and key RDKit Features, then makes a final toxicity decision (toxic / nontoxic).
Key Features
- Base Model: Qwen3-14B
- Task: Binary Toxicity Prediction and molecular structure analysis
- Language: Korean (system instructions), English (chemical reasoning and answers)
- Input Data:
  - SMILES Code
  - Cell Line / Cell Type
  - Bio Assay Name
  - RDKit Features (the top 3 and bottom 3 features by SHAP value)
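The feature input above consists of three toxicity-leaning and three non-toxicity-leaning RDKit features chosen by SHAP value. The card does not say how they are selected, so the following is only a minimal sketch of one way to do it. The function name `split_top_features`, the sign convention (positive SHAP pushes toward "toxic"), and all numeric values are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def split_top_features(
    shap_values: Dict[str, float], k: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (toxic-leaning, nontoxic-leaning) feature names, up to k each.

    Assumption: a positive SHAP value pushes the prediction toward "toxic"
    and a negative one toward "nontoxic".
    """
    # Sort from most positive to most negative SHAP value.
    ranked = sorted(shap_values.items(), key=lambda item: item[1], reverse=True)
    top_toxic = [name for name, value in ranked[:k] if value > 0]
    top_nontoxic = [name for name, value in ranked[::-1][:k] if value < 0]
    return top_toxic, top_nontoxic


# Hypothetical SHAP values keyed by RDKit descriptor name (illustrative only).
example_shap = {
    "MolLogP": 0.42,
    "NumAromaticRings": 0.31,
    "TPSA": -0.27,
    "NumHDonors": -0.12,
    "MolWt": 0.08,
    "FractionCSP3": -0.35,
}

toxic_features, nontoxic_features = split_top_features(example_shap)
print("Toxic-leaning:", toxic_features)
print("Nontoxic-leaning:", nontoxic_features)
```

The two lists would then be written into the `Feature NL` and `Feature Descript` fields of the user prompt; the exact wording of those fields is not specified in this card.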
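Prompt Format
For best performance, follow the prompt format that was used during training.
System Prompt
"당신은 분자 독성 예측에 특화된 화학정보학/독성학 전문가입니다. 사용자는 독성/비독성에 영향을 많이 끼치는 Feature 3개씩을 제공합니다... (omitted) ... tool call을 사용하지 마세요."
(English gloss: "You are a cheminformatics/toxicology expert specializing in molecular toxicity prediction. The user provides three features each that most strongly influence toxicity and non-toxicity... (omitted) ... Do not use tool calls.")
User Input Template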
SMILES: {smiles_code}
Cell Line: {cell_line}
Bio Assay Name: {endpoint_category}
Feature NL: {feature_NL_description}
Feature Descript: {feature_detailed_description}
{cot_instruction}
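As a minimal sketch, the template above can be filled with Python's `str.format`. The placeholder names are taken from the template; the helper name `build_user_prompt`, the SMILES `CCO`, and the placeholder instruction text are assumptions for illustration. The blank line before the instruction follows the usage script in this card.

```python
# Template fields mirror the placeholders shown in the card.
USER_TEMPLATE = (
    "SMILES: {smiles_code}\n"
    "Cell Line: {cell_line}\n"
    "Bio Assay Name: {endpoint_category}\n"
    "Feature NL: {feature_NL_description}\n"
    "Feature Descript: {feature_detailed_description}\n\n"
    "{cot_instruction}"
)


def build_user_prompt(**fields: str) -> str:
    """Fill the user template; raises KeyError if any field is missing."""
    return USER_TEMPLATE.format(**fields)


# Illustrative values only; "CCO" (ethanol) is a stand-in SMILES string.
prompt = build_user_prompt(
    smiles_code="CCO",
    cell_line="HepG2 (Liver)",
    endpoint_category="AhR",
    feature_NL_description="Top toxic features: ... / Top non-toxic features: ...",
    feature_detailed_description="Detailed feature descriptions",
    cot_instruction="<instruction text>",
)
print(prompt)
```

Raising on a missing field is a deliberate choice here: a silently incomplete prompt would drift from the training format.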
Inference
Requirements
pip install transformers torch accelerate
Usage with transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load the model and tokenizer
model_id = "TeamUNIVA/Blowfish"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Define the system prompt (kept in Korean, the language of the system instructions)
system_prompt = (
    "당신은 분자 독성 예측에 특화된 화학정보학/독성학 전문가입니다.\n"
    "사용자는 독성/비독성에 영향을 많이 끼치는 Feature 3개씩을 제공합니다.\n\n"
    "입력(사용자가 제공):\n"
    "- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
    "- 독성에 끼치는 영향이 큰 상위 3개 RDKit Feature\n"
    "- 비독성에 끼치는 영향이 큰 상위 3개 RDKit Feature\n\n"
    "수행 과업(Tasks):\n"
    "SMILES 구조 분석\n"
    "- 고리(방향족/지방족), 헤테로원자, 전하 중심, 반응성 모티프, H-결합 공여/수용기 등을\n"
    " SMILES에서 직접 관찰 가능한 범위로만 기술.\n\n"
    "Cell Type, Cell Line, Assay Name 특징 분석 및 SMILES와 연결\n\n"
    "RDKit feature 분석\n"
    "- 각 feature가 의미하는 바와 일반적 독성 리스크에 주는 영향 요약.\n"
    "- 가능한 경우 Assay 맥락(예: ARE 산화스트레스)과 연결.\n\n"
    "종합 판단(최종 결론)\n"
    "- (1) SMILES 모티프, (2) Cell line/Cell type + Assay 맥락, (3) RDKit feature를 통합해\n"
    " 독성 여부를 이진으로 판단.\n\n"
    "출력 규칙:\n"
    "- 본문은 영어로 작성.\n"
    "- 마지막 줄에 아래 중 하나만 단독 표기:\n"
    "<answer>toxic</answer>\n"
    "<answer>nontoxic</answer>\n\n"
)

# 3. Example input data
smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
cell_line = "HepG2 (Liver)"
feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
feature_descript = "Detailed feature descriptions"
bio_assay = "AhR"
# Instruction in Korean: "Determine whether compound <SMILES> is toxic or nontoxic."
instruction = f"화합물 {smiles_code}의 독성/비독성 여부를 판단하시오."

# 4. Build the user prompt
user_content = (
    f"SMILES: {smiles_code}\n"
    f"Cell Line: {cell_line}\n"
    f"Bio Assay Name: {bio_assay}\n"
    f"Feature NL: {feature_NL}\n"
    f"Feature Descript: {feature_descript}\n\n"
    f"{instruction}"
)

# 5. Apply the chat template and generate
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_content},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    do_sample=True,
)

# 6. Decode only the newly generated tokens (skip the prompt tokens)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
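The system prompt asks the model to finish with a single `<answer>toxic</answer>` or `<answer>nontoxic</answer>` tag, so the binary label can be recovered from the generated text with a small parser. The following is a sketch only: the helper name `extract_answer` and the sample response string are made up for illustration, and taking the last tag is an assumption about how to treat a response that mentions a tag more than once.

```python
import re
from typing import Optional

# Matches either of the two answer tags named in the system prompt.
ANSWER_PATTERN = re.compile(
    r"<answer>\s*(toxic|nontoxic)\s*</answer>", re.IGNORECASE
)


def extract_answer(response: str) -> Optional[str]:
    """Return 'toxic' or 'nontoxic' from the last answer tag, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].lower() if matches else None


# Made-up response text, used only to exercise the parser.
sample_response = "Some reasoning about the structure.\n<answer>toxic</answer>"
label = extract_answer(sample_response)
print(label)
```

Returning `None` when no tag is found lets the caller decide whether to retry generation or flag the sample, rather than guessing a label.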
Acknowledgements
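This work is part of the research results of the 「2025년 초거대AI 확산 생태계 조성사업」 (2025 Hyperscale AI Diffusion Ecosystem Development Project), carried out with support from the Ministry of Science and ICT and the National Information Society Agency (NIA) of Korea.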