Blowfish

Introduction

Blowfish is a large language model (LLM) developed for molecular toxicity prediction.

It is fine-tuned from Qwen3-14B and goes beyond simple binary classification: using Chain-of-Thought (CoT) reasoning, it lays out the chemical and biological rationale behind each toxicity verdict.

The model jointly analyzes the user-provided SMILES, cell line, bio assay, and key RDKit features, and returns a final binary verdict (toxic / non-toxic).

Key Features

  • Base Model: Qwen3-14B
  • Task: Binary toxicity prediction and molecular structure analysis
  • Language: Korean (system prompt), English (chemical reasoning and answers)
  • Input Data:
    • SMILES Code
    • Cell Line / Cell Type
    • Bio Assay Name
    • RDKit Features (top 3 and bottom 3 features by SHAP value; see the sketch below this list)
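
The RDKit feature fields are prepared outside the model. Below is a minimal sketch of computing candidate descriptors; it assumes the rdkit package (not listed in the requirements) and only stubs the SHAP-based top/bottom-3 selection, which would be done against a separate toxicity classifier of your own.

from rdkit import Chem
from rdkit.Chem import Descriptors

# Illustrative descriptor set; the exact feature list used for training is not
# specified in this card.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
mol = Chem.MolFromSmiles(smiles)
features = {
    "MolWt": Descriptors.MolWt(mol),
    "MolLogP": Descriptors.MolLogP(mol),
    "TPSA": Descriptors.TPSA(mol),
    "NumHDonors": Descriptors.NumHDonors(mol),
    "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    "NumAromaticRings": Descriptors.NumAromaticRings(mol),
}
# Rank these by SHAP values from your own classifier, then pass the top 3
# (toxicity-driving) and bottom 3 (non-toxicity-driving) features into the
# Feature NL / Feature Descript fields of the prompt.
print(features)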

Prompt Format

λͺ¨λΈμ˜ μ„±λŠ₯을 μ΅œμ ν™”ν•˜κΈ° μœ„ν•΄ ν•™μŠ΅ μ‹œ μ‚¬μš©λœ ν”„λ‘¬ν”„νŠΈ ν˜•μ‹μ„ μ€€μˆ˜ν•΄μ•Ό ν•©λ‹ˆλ‹€.

System Prompt

The system prompt is written in Korean, matching the format used during training; the full text is reproduced in the usage example below. It begins:

"당신은 λΆ„μž 독성 μ˜ˆμΈ‘μ— νŠΉν™”λœ 화학정보학/독성학 μ „λ¬Έκ°€μž…λ‹ˆλ‹€. μ‚¬μš©μžλŠ” 독성/비독성에 영ν–₯을 많이 λΌμΉ˜λŠ” Feature 3κ°œμ”©μ„ μ œκ³΅ν•©λ‹ˆλ‹€... (μ€‘λž΅) ... tool call을 μ‚¬μš©ν•˜μ§€ λ§ˆμ„Έμš”."

μ‚¬μš©μž μž…λ ₯ ν…œν”Œλ¦Ώ (User Input Template)

SMILES: {smiles_code}
Cell Line: {cell_line}
Bio Assay Name: {endpoint_category}
Feature NL: {feature_NL_description}
Feature Descript: {feature_detailed_description}

{cot_instruction}
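
As an illustration (placeholder values, not taken from the training data), a user message rendered from this template might look like:

SMILES: CC(=O)Oc1ccccc1C(=O)O
Cell Line: HepG2 (Liver)
Bio Assay Name: AhR
Feature NL: Top toxic features: MolLogP, NumAromaticRings, TPSA / Top non-toxic features: NumHDonors, FractionCSP3, NumRotatableBonds
Feature Descript: Detailed descriptions of the six features above

ν™”ν•©λ¬Ό CC(=O)Oc1ccccc1C(=O)O의 독성/비독성 μ—¬λΆ€λ₯Ό νŒλ‹¨ν•˜μ‹œμ˜€.

The last line is the {cot_instruction}: a Korean directive asking the model to judge whether the compound is toxic or non-toxic (see the Usage example below).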

Inference

Requirements

pip install transformers torch accelerate

Usage with transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. λͺ¨λΈ 및 ν† ν¬λ‚˜μ΄μ € λ‘œλ“œ
model_id = "TeamUNIVA/Blowfish"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Define the system prompt (kept in Korean, matching the training format)
system_prompt = (
    "당신은 λΆ„μž 독성 μ˜ˆμΈ‘μ— νŠΉν™”λœ 화학정보학/독성학 μ „λ¬Έκ°€μž…λ‹ˆλ‹€.\n"
    "μ‚¬μš©μžλŠ” 독성/비독성에 영ν–₯을 많이 λΌμΉ˜λŠ” Feature 3κ°œμ”©μ„ μ œκ³΅ν•©λ‹ˆλ‹€.\n\n"
    "μž…λ ₯(μ‚¬μš©μžκ°€ 제곡):\n"
    "- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
    "- 독성에 λΌμΉ˜λŠ” 영ν–₯이 큰 μƒμœ„ 3개 RDKit Feature\n"
    "- 비독성에 λΌμΉ˜λŠ” 영ν–₯이 큰 μƒμœ„ 3개 RDKit Feature\n\n"
    "μˆ˜ν–‰ κ³Όμ—…(Tasks):\n"
    "SMILES ꡬ쑰 뢄석\n"
    "- 고리(λ°©ν–₯μ‘±/μ§€λ°©μ‘±), ν—€ν…Œλ‘œμ›μž, μ „ν•˜ 쀑심, λ°˜μ‘μ„± λͺ¨ν‹°ν”„, H-κ²°ν•© 곡여/수용기 등을\n"
    "  SMILESμ—μ„œ 직접 κ΄€μ°° κ°€λŠ₯ν•œ λ²”μœ„λ‘œλ§Œ 기술.\n\n"
    "Cell Type, Cell Line, Assay Name νŠΉμ§• 뢄석 및 SMILES와 μ—°κ²°\n\n"
    "RDKit feature 뢄석\n"
    "- 각 featureκ°€ μ˜λ―Έν•˜λŠ” 바와 일반적 독성 λ¦¬μŠ€ν¬μ— μ£ΌλŠ” 영ν–₯ μš”μ•½.\n"
    "- κ°€λŠ₯ν•œ 경우 Assay λ§₯락(예: ARE μ‚°ν™”μŠ€νŠΈλ ˆμŠ€)κ³Ό μ—°κ²°.\n\n"
    "μ’…ν•© νŒλ‹¨(μ΅œμ’… κ²°λ‘ )\n"
    "- (1) SMILES λͺ¨ν‹°ν”„, (2) Cell line/Cell type + Assay λ§₯락, (3) RDKit featureλ₯Ό 톡합해\n"
    "  독성 μ—¬λΆ€λ₯Ό μ΄μ§„μœΌλ‘œ νŒλ‹¨.\n\n"
    "좜λ ₯ κ·œμΉ™:\n"
    "- 본문은 μ˜μ–΄λ‘œ μž‘μ„±.\n"
    "- λ§ˆμ§€λ§‰ 쀄에 μ•„λž˜ 쀑 ν•˜λ‚˜λ§Œ 단독 ν‘œκΈ°:\n"
    "<answer>toxic</answer>\n"
    "<answer>nontoxic</answer>\n\n"
)

# 3. Example input data
smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
cell_line = "HepG2 (Liver)"
feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
feature_descript = "Detailed feature descriptions"
bio_assay = "AhR"
# CoT instruction (Korean): "Determine whether the compound <SMILES> is toxic or non-toxic."
instruction = "ν™”ν•©λ¬Ό O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2의 독성/비독성 μ—¬λΆ€λ₯Ό νŒλ‹¨ν•˜μ‹œμ˜€."


# 4. Build the user prompt
user_content = (
    f"SMILES: {smiles_code}\n"
    f"Cell Line: {cell_line}\n"
    f"Bio Assay Name: {bio_assay}\n"
    f"Feature NL: {feature_NL}\n"
    f"Feature Descript: {feature_descript}\n\n"
    f"{instruction}"
)

# 5. Apply the chat template and generate
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_content}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    do_sample=True
)

# 6. Decode the result
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
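
Since the system prompt asks the model to end its answer with a single <answer>toxic</answer> or <answer>nontoxic</answer> line, the binary label can be recovered with a simple regular expression. A minimal sketch (not part of the original code):

import re

# Extract the final verdict; label is None if the model did not follow the
# output rule.
match = re.search(r"<answer>(toxic|nontoxic)</answer>", response)
label = match.group(1) if match else None
print("Predicted label:", label)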

Acknowledgements

This model is a partial outcome of research carried out under the γ€Œ2025λ…„ μ΄ˆκ±°λŒ€AI ν™•μ‚° μƒνƒœκ³„ μ‘°μ„±μ‚¬μ—…γ€ (2025 Hyperscale AI Diffusion Ecosystem Program), supported by the Ministry of Science and ICT and the National Information Society Agency (NIA) of Korea.
