You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

CL Tagger v2

cella110n (https://huggingface.co/cella110n)

A multi-label tagger for illustration and anime images. It is built on Google's SigLIP2 SoViT-400m/14 @384px vision encoder and predicts tags from the Danbooru tag taxonomy.

Latest release: v2_01a (provisional), a continued-training build based on v2.00.

⚠️ Versioning

Versions with a trailing letter (e.g. v2_01a) are provisional. A provisional version may be updated in place, overwritten under the same version name, without notice. The license prohibits redistribution and the official repository is the only distribution source. If you use a provisional version, always fetch the latest copy from the official repository. Stable versions (e.g. v2.00) are fixed.

License Summary

This model is distributed under a custom license, CL Tagger v2 Model License v1.0 (LICENSE.md). Any use, including acquisition and execution, constitutes acceptance of the license. The points below are highlights and not the full terms. Read the full LICENSE.md before use.

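Permitted uses

Self-use: running the model in an environment under your exclusive control (IaaS included)
Serving (conditional): requires offering the service equally to all end users and complying with the third-party hosting conditions
Use of outputs: no restrictions on using, modifying, or commercially exploiting outputs such as tags and feature vectors
Self-use of derivatives: LoRA, quantization, format conversion, merges, etc., limited to self-use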

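Prohibited uses

Redistribution, public transmission, or bundling: distribution through any channel other than the official repository, and sharing of derivatives
Paid content about this model: publishing, for a fee (paywalls, paid-membership-only access, etc.), content whose primary subject is the model itself, such as usage guides, reviews, or evaluations. Free publication is not restricted.
Using outputs as a tool for rights claims: using model outputs as the basis for factual findings or legal claims about third parties' works, personality, or conduct, and exercising rights or making accusations on that basis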

Version history

Version	Release date	Type	Training	Tags
v2.00	2026-06-14	stable	10 epochs, ~949,537 steps	106,536
v2_01a	2026-06-18	provisional	~1,101,482 steps (continued from v2.00)	108,036

v2.00 was promoted from development build v1.09. v2_01a continues training from v2.00 and expands the vocabulary.

Model Specifications

Item	Value
Vision encoder	`google/siglip2-so400m-patch14-384` (Apache-2.0; fixed 384px input, not NaFlex)
Training method	LoRA fine-tuning (rank 32 / alpha 16 / 108 modules)
Output tags	Version-dependent (v2.00: 106,536 / v2_01a: 108,036)
Input	`pixel_values`: `[batch, 3, 384, 384]` (SigLIP2 normalization, mean = std = 0.5)
Output	`logits`: `[batch, num_tags]`; apply sigmoid to get per-tag probabilities
Calibration	Per tag (Jeffreys prior); a calibration table and per-tag thresholds are included
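The card states that calibration is per tag with a Jeffreys prior, but it does not document the procedure or the layout of `calibration_table`. The sketch below is illustrative only. It assumes equal-width probability bins per tag, and the bin count and the helpers `build_calibration_table` and `apply_calibration` are hypothetical. Under the Jeffreys prior Beta(1/2, 1/2), a bin containing k positives among n samples yields the posterior-mean estimate (k + 1/2) / (n + 1), which remains defined even for empty bins.

```python
import numpy as np

def build_calibration_table(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Hypothetical per-tag binned calibration with a Jeffreys prior.

    probs, labels: arrays of shape [num_samples, num_tags]; labels are 0/1.
    Returns a table of shape [num_tags, n_bins] holding calibrated probabilities.
    """
    # Equal-width bin index in [0, n_bins - 1] for every prediction.
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    num_tags = probs.shape[1]
    table = np.empty((num_tags, n_bins))
    for t in range(num_tags):
        n = np.bincount(bins[:, t], minlength=n_bins)  # samples per bin
        k = np.bincount(bins[:, t], weights=labels[:, t], minlength=n_bins)  # positives per bin
        # Posterior mean under a Beta(0.5, 0.5) (Jeffreys) prior.
        table[t] = (k + 0.5) / (n + 1.0)
    return table

def apply_calibration(probs: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Map raw probabilities of shape [num_tags] or [batch, num_tags] through the table."""
    n_bins = table.shape[1]
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    tag_idx = np.arange(table.shape[0])  # broadcasts across the batch dimension
    return table[tag_idx, bins]
```

The prior pulls sparsely populated bins toward 0.5. As a result, rare tags never receive calibrated probabilities of exactly 0 or 1.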

Vocabulary composition (v2_01a)

Category	Tags
Character	49,654
General	48,888
Copyright	9,152
Meta	334
Rating	4 (`general`, `sensitive`, `questionable`, `explicit`)
Quality	4 (`best`/`normal`/`bad`/`worst quality`)
Total	108,036

The v2.00 vocabulary has 106,536 tags (Character 49,516 / General 47,654 / Copyright 9,025 / Meta 333 / Rating 4 / Quality 4). v2_01a expands the vocabulary through continued training.
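You can re-derive the per-category counts from `model_vocabulary.json`. The minimal sketch below assumes that `tag_to_category` maps each tag name to a category-name string. The exact value format is not documented here, so check your copy. `count_by_category` is a hypothetical helper.

```python
from collections import Counter

def count_by_category(vocab: dict) -> Counter:
    """Count tags per category using the vocabulary's tag_to_category mapping.

    Assumes vocab["tag_to_category"] is a {tag: category} dict.
    """
    return Counter(vocab["tag_to_category"].values())
```

To use it, load the file with `json.load` (opened with `encoding="utf-8"`) and pass the resulting dict. The sum of the counts should match the vocabulary total listed above.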

Performance

Metrics are F1 scores at per-tag optimized thresholds. The model ships a calibration table and per-tag thresholds, so these figures reflect deployment with those thresholds. Each version is evaluated on its own corpus and its own published vocabulary, so the numbers are not a same-tag-set comparison.
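The metric works as follows. For each tag, sweep the threshold and keep the best F1. Macro-F1 is the unweighted mean of these per-tag best F1 scores, and Median F1 is their median. The sketch below assumes 0/1 label arrays and a fixed threshold grid. The grid is a hypothetical choice, since the card does not say how `best_thr` was searched.

```python
import numpy as np

def best_f1_per_tag(probs: np.ndarray, labels: np.ndarray, grid=None):
    """Return (best_thr, best_f1), each of shape [num_tags].

    probs, labels: arrays of shape [num_samples, num_tags]; labels are 0/1.
    """
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # hypothetical threshold grid
    labels = labels.astype(bool)
    num_tags = probs.shape[1]
    best_thr = np.zeros(num_tags)
    best_f1 = np.zeros(num_tags)
    for thr in grid:
        pred = probs >= thr
        tp = (pred & labels).sum(axis=0)
        fp = (pred & ~labels).sum(axis=0)
        fn = (~pred & labels).sum(axis=0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = np.divide(2 * tp, denom, out=np.zeros(num_tags), where=denom > 0)
        improved = f1 > best_f1  # keep the first threshold reaching each new best
        best_f1[improved] = f1[improved]
        best_thr[improved] = thr
    return best_thr, best_f1

def summarize(best_f1: np.ndarray) -> tuple[float, float]:
    """Macro-F1 (unweighted mean over tags) and median F1."""
    return float(best_f1.mean()), float(np.median(best_f1))
```

Because every tag counts equally in the mean, a vocabulary with many rare tags pulls Macro-F1 down even when frequent tags perform well.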

Overall

Version	Macro-F1	Median F1	Evaluated tags	Evaluation corpus (latest / cumulative)
v2.00	0.669	0.734	106,534	~4.08M / ~10.59M
v2_01a	0.656	0.722	108,033	~4.19M / ~14.71M

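These values use, for each tag, the threshold that maximizes that tag's metric. The F1-maximizing thresholds (about 0.41 on average across tags) differ from practical thresholds that suppress false positives. For a single global threshold, 0.55 is recommended (see Inference below). v2_01a extends the v2.00 vocabulary with many new, low-frequency tags, so its macro average appears lower.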

Per-category scores (Macro-F1 / Median F1)

Category	v2.00	v2_01a	Tags (v2_01a)
Character	0.826 / 0.843	0.827 / 0.838	49,654
Copyright	0.711 / 0.731	0.698 / 0.717	9,152
Rating	0.680 / 0.699	0.681 / 0.701	4
General	0.499 / 0.488	0.477 / 0.468	48,888
Meta	0.436 / 0.430	0.401 / 0.374	334
Quality	0.415 / 0.414	0.419 / 0.418	4

By tag support (Macro-F1 / Median F1)

Positives (n_pos)	v2.00 tags	v2.00	v2_01a tags	v2_01a
≥ 10	107,791	0.668 / 0.732	107,989	0.656 / 0.722
≥ 50	86,316	0.664 / 0.740	100,804	0.655 / 0.725
≥ 100	64,197	0.646 / 0.719	83,256	0.645 / 0.714
≥ 500	19,055	0.522 / 0.500	27,079	0.526 / 0.501
≥ 1000	11,386	0.482 / 0.461	16,198	0.488 / 0.468
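Each row of the support table restricts evaluation to tags with at least k positive examples (n_pos ≥ k) and then averages over them. Given per-tag arrays of best F1 and positive counts, such as those from a per-tag threshold sweep, the rows can be computed as below. `summarize_by_support` is a hypothetical helper, and the cutoffs mirror the table.

```python
import numpy as np

def summarize_by_support(best_f1: np.ndarray, n_pos: np.ndarray, cutoffs=(10, 50, 100, 500, 1000)):
    """For each cutoff k, return (k, tag count, Macro-F1, Median F1) over tags with n_pos >= k."""
    rows = []
    for k in cutoffs:
        mask = n_pos >= k
        if not mask.any():
            # No tag reaches this support level; report NaN scores.
            rows.append((k, 0, float("nan"), float("nan")))
            continue
        selected = best_f1[mask]
        rows.append((k, int(mask.sum()), float(selected.mean()), float(np.median(selected))))
    return rows
```

The rows are nested rather than disjoint. Every tag counted at k = 1000 is also counted at k = 10.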

Inference

Minimal example with ONNX Runtime (preprocessing follows SigLIP2):

```python
import json

import numpy as np
import onnxruntime as ort
from PIL import Image

# model.onnx reads its weights from model.onnx.data, so keep both files in the same directory.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
with open("model_vocabulary.json", encoding="utf-8") as f:
    vocab = json.load(f)
idx_to_tag = vocab["idx_to_tag"]  # JSON object keys are strings: "0", "1", ...

# Preprocess: resize to 384x384, scale to [0, 1], normalize with mean = std = 0.5
img = Image.open("input.png").convert("RGB").resize((384, 384), Image.BICUBIC)
x = (np.asarray(img, dtype=np.float32) / 255.0 - 0.5) / 0.5
x = np.ascontiguousarray(x.transpose(2, 0, 1)[None])  # HWC -> NCHW: [1, 3, 384, 384]

logits = sess.run(["logits"], {"pixel_values": x})[0][0]
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> per-tag probabilities

# Threshold: 0.55 is recommended for practical tagging.
# (best_thr in model_tag_metrics.npz maximizes per-tag F1 but tends to over-tag.)
thr = 0.55
hits = np.where(probs >= thr)[0]
tags = [idx_to_tag[str(i)] for i in hits]
print(tags)
```

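Thresholds: for a single global threshold, 0.55 is recommended. The `best_thr` values in `model_tag_metrics.npz` maximize F1 for each tag but tend to over-tag in practice. Raise the threshold to favor precision, and lower it to favor recall. `calibration_table` can be used to calibrate probabilities.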
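Both threshold strategies can share one selection routine. In the sketch below, `select_tags` is a hypothetical helper. It accepts either a scalar (such as the recommended 0.55) or a per-tag array such as `best_thr`, and it returns tags sorted by probability. Like the inference example above, it assumes `idx_to_tag` uses string keys.

```python
import numpy as np

def select_tags(probs: np.ndarray, idx_to_tag: dict, thresholds=0.55) -> list[tuple[str, float]]:
    """Return (tag, probability) pairs at or above threshold, highest probability first.

    thresholds: a scalar applied to every tag, or an array of shape [num_tags]
    (for example, best_thr loaded from model_tag_metrics.npz).
    """
    thr = np.asarray(thresholds)  # scalar or per-tag array; broadcasts against probs
    hits = np.flatnonzero(probs >= thr)
    hits = hits[np.argsort(-probs[hits])]  # order selected indices by descending probability
    return [(idx_to_tag[str(i)], float(probs[i])) for i in hits]
```

To use the per-tag values, pass `np.load("model_tag_metrics.npz")["best_thr"]` as `thresholds`. To use the recommended single threshold, keep the default.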

Files

File	Contents
`model.onnx` + `model.onnx.data`	ONNX model (weights stored as external data)
`model_vocabulary.json`	Tag vocabulary (`tag_to_idx` / `idx_to_tag` / `tag_to_category` / `categories`)
`model_tag_metrics.npz`	Per-tag metrics, calibration table, and per-tag thresholds (`best_thr` / `best_f1` / `calibration_table`, among others)
`model_ood_ref.npz`	Reference statistics for OOD (out-of-distribution) detection
`model_metadata.json`	Metadata such as the encoder and the tag count

Attribution

This model is trained on top of SigLIP2, which Google released under the Apache License 2.0.

```bibtex
@article{tschannen2025siglip2,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Tschannen, Michael and others},
  journal={arXiv preprint arXiv:2502.14786},
  year={2025}
}
```

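Limitations

Outputs come from a probabilistic classifier and inherently include errors (false positives and false negatives). The license prohibits using outputs as the basis for factual findings or legal claims about third parties' rights, personality, or conduct.
Subjective or ambiguous tags in the General and Meta categories have comparatively lower accuracy.
Performance degrades on images far outside the training distribution; the OOD reference statistics can be used to detect such inputs. The tag taxonomy is Danbooru-based, but the training images come from diverse sources, not only illustrations.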

Contact

Contact the author through the official repository at https://huggingface.co/cella110n.
