Model Card for FontDiffuser
Model Details
Model Type
- Architecture: Diffusion-based Font Generation Model
- Framework: PyTorch + Hugging Face Diffusers
- Scheduler: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
- Guidance: Classifier-free guidance
- Base Model: FontDiffuser with Content and Style Encoders
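Classifier-free guidance blends an unconditional and a conditional noise prediction at each denoising step. A minimal sketch of the combination rule (the function name is illustrative, not from the repo; 7.5 mirrors the default guidance scale below):

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction away from the
    unconditional estimate, scaled by guidance_scale."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At `guidance_scale=1.0` this reduces to the plain conditional prediction; larger scales trade diversity for stronger conditioning on content and style.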
Model Components
- UNet: Main diffusion model for image generation
- Content Encoder: Extracts character structure information
- Style Encoder: Extracts font style features
- DDPM/DPM Scheduler: Noise scheduling for diffusion process
Training Configuration
- Resolution: 96×96 pixels
- Batch Size: configurable
- Inference Steps: 20 (default, configurable)
- Guidance Scale: 7.5 (default, configurable)
- Precision: FP32/FP16 (optional)
- Device: CUDA/GPU recommended
Installation
Installation uses the uv package manager, which is fast because it is implemented in Rust.
uv pip install diffusers torch torchvision safetensors
uv pip install lpips scikit-image pytorch-fid # Optional: for evaluation
Model Usage
- Load pipeline:
from argparse import Namespace
from inference.sample_optimized import load_fontdiffuser_pipeline
args = Namespace(
    ckpt_dir="ckpt",
    device="cuda:0",
    guidance_scale=7.5,
    num_inference_steps=20,
    fp16=False,
    enable_xformers=False,
)
pipe = load_fontdiffuser_pipeline(args=args)
- Single-image inference (recommended):
accelerate launch run_inference.py \
    --ckpt_dir ckpt \
    --content_character "A" \
    --style_image_path style_images/foo.png \
    --save_image \
    --save_image_dir results/
- Large-scale batch with checkpoint/resume:
accelerate launch run_inference.py \
    --ckpt_dir ckpt \
    --characters chars.txt \
    --style_images "style_images/*.png" \
    --ttf_path fonts/myfont.ttf \
    --output_dir my_dataset/train_original \
    --batch_size 8 \
    --num_inference_steps 15 \
    --guidance_scale 7.5 \
    --save_interval 10
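The resume behavior relies on results_checkpoint.json. A simplified sketch of how already-generated pairs can be skipped on restart (helper names are illustrative; the metadata layout is assumed to follow the Results Metadata Structure documented below):

```python
import json

def completed_pairs(checkpoint_path):
    """Return the (style, character) pairs already recorded in the
    checkpoint so a resumed run can skip them. Sketch only."""
    try:
        with open(checkpoint_path) as f:
            checkpoint = json.load(f)
    except FileNotFoundError:
        return set()  # fresh run: nothing to skip
    return {(g["style"], g["character"])
            for g in checkpoint.get("generations", [])}

def pending_pairs(styles, characters, done):
    # Full style x character grid, minus what is already generated.
    return [(s, c) for s in styles for c in characters if (s, c) not in done]
```

Saving the checkpoint every `--save_interval` styles keeps the set of completed pairs close to the true generation state, so little work is repeated after an interruption.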
- Multi-GPU inference via Accelerate:
accelerate launch run_inference.py \
    --ckpt_dir ckpt \
    --characters chars.txt \
    --style_images "style_images/*.png" \
    --output_dir results/
Outputs & Metadata
The repo uses hash-based filenames (tools/filename_utils.py) and a central metadata file:
- ContentImage/char.png → character content images
- TargetImage/style+char.png → generated images per style
- results_checkpoint.json → canonical metadata used by dataset tools and HF exporters
Example metadata generation:
python tools/generate_metadata.py --data_root my_dataset/handwritten_original --output my_dataset/handwritten_original/results_checkpoint.json
Model Performance
Supported Tasks
- Single-character font generation
- Multi-character batch generation
- Multi-font support
- Multi-style transfer
- Index-based tracking for large-scale generation
- Checkpoint and resume support
Output Format
output_dir/
├── ContentImage/              # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/               # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
└── results_checkpoint.json    # Checkpoint file that also serves as generation metadata
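Generated filenames encode the style and character as style+char.png. A sketch of a parser for that pattern (the repo's actual naming goes through tools/filename_utils.py and may be hash-based, so this assumes the literal layout shown above):

```python
from pathlib import Path

def parse_target_filename(path):
    """Split a TargetImage filename like 'style0+char0.png' into
    its (style, char) parts. Assumes a single '+' separator."""
    stem = Path(path).stem           # drop the .png extension
    style, char = stem.split("+", 1)
    return style, char
```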
Results Metadata Structure
{
"generations": [
{
"character": "A",
"char_index": 0,
"style": "style0",
"style_index": 0,
"font": "Arial",
"style_path": "path/to/style0.png",
"output_path": "TargetImage/style0/style0+char0.png"
}
],
"metrics": {
"lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
"ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
"fid": {"mean": 15.3, "std": 2.1},
"inference_times": [
{
"style": "style0",
"style_index": 0,
"font": "Arial",
"total_time": 2.45,
"num_images": 100,
"time_per_image": 0.0245
}
]
},
"fonts": ["Arial", "Times New Roman"],
"characters": ["A", "B", "C"],
"styles": ["style0", "style1"],
"total_chars": 3,
"total_styles": 2,
"total_possible_pairs": 6
}
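The summary counters at the bottom of the metadata are redundant with the lists above them, which makes a cheap sanity check possible. A sketch (key names follow the structure shown above; the function is illustrative):

```python
def check_metadata_counts(meta):
    """Verify the derived counters agree with the underlying lists:
    total_possible_pairs should be the full style x character grid."""
    return (
        meta["total_chars"] == len(meta["characters"])
        and meta["total_styles"] == len(meta["styles"])
        and meta["total_possible_pairs"] == meta["total_chars"] * meta["total_styles"]
    )
```

For the example above: 3 characters × 2 styles = 6 possible pairs, matching `total_possible_pairs`.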
Dataset
Dataset Source
- Name: font-diffusion-generated-data
- Link: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data
- Format: ContentImage + TargetImage per style
- Supports: Multi-font, multi-character, multi-style generation
Dataset Structure
FontDiffusion Dataset/
├── total/
│   ├── ContentImage/          # Character structure images
│   ├── TargetImage/           # Style-specific font renderings
│   └── results_checkpoint.json
├── val/
└── test/
Technical Features
Optimizations
- Batch Processing: Process multiple characters per style
- Memory Efficiency: Attention slicing (optional)
- FP16 Support: Reduced precision for faster inference
- Torch Compile: Optional model compilation
- Channels Last Format: Memory-optimized tensor layout
- XFormers Support: Fast attention implementation
Robustness
- Checkpoint & Resume: Resume from interruptions
- Index-based Tracking: Handle large character sets (100K+)
- Multi-font Support: Process characters across multiple fonts
- Error Recovery: Graceful handling of missing fonts
- Automatic Indexing: Consistent char_index and style_index
Monitoring
- Weights & Biases Integration: Real-time tracking
- Progress Bars: Detailed generation progress
- Checkpoint Saving: Periodic intermediate saves
- Quality Metrics: LPIPS, SSIM, FID computation
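The inference_times entries in the metadata record a per-style timing summary. A sketch of how one record is derived (field names follow the metadata example above; the helper is illustrative):

```python
def timing_entry(style, style_index, font, total_time, num_images):
    """Build one inference_times record; time_per_image is the
    amortized per-character generation cost."""
    return {
        "style": style,
        "style_index": style_index,
        "font": font,
        "total_time": total_time,
        "num_images": num_images,
        "time_per_image": total_time / num_images,
    }
```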
Citation
@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Yang, Zhenhua and Peng, Dezhi and Kong, Yuxin and Zhang, Yuyi and Yao, Cong and Jin, Lianwen},
  year={2023}
}
License
This model is licensed under the Apache License 2.0. See LICENSE file for details.