Model Card for FontDiffuser

Model Details

Model Type

  • Architecture: Diffusion-based Font Generation Model
  • Framework: PyTorch + Hugging Face Diffusers
  • Scheduler: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
  • Guidance: Classifier-free guidance
  • Base Model: FontDiffuser with Content and Style Encoders
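
For reference, classifier-free guidance in its standard form combines a conditional and an unconditional noise prediction at each denoising step. The notation below is not taken from this repository: $c$ is assumed to stand for the content and style conditioning, $\varnothing$ for the dropped (null) conditioning, and $s$ for the guidance scale.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + s \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

With $s = 1$ this reduces to the plain conditional prediction; larger values, such as the default of 7.5 listed in this card, push samples further toward the conditioning.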

Model Components

  1. UNet: Main diffusion model for image generation
  2. Content Encoder: Extracts character structure information
  3. Style Encoder: Extracts font style features
  4. DDPM/DPM Scheduler: Noise scheduling for diffusion process

Training Configuration

  • Resolution: 96×96 pixels
  • Batch Size: configurable
  • Inference Steps: 20 (default, configurable)
  • Guidance Scale: 7.5 (default, configurable)
  • Precision: FP32 (default) or FP16 (optional)
  • Device: CUDA/GPU recommended

Installation

Installation uses the uv package manager, which is implemented in Rust and installs packages quickly:

uv pip install diffusers torch torchvision safetensors
uv pip install lpips scikit-image pytorch-fid  # Optional: for evaluation
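
After installing, a quick check that the packages are importable can save a failed run later. The sketch below is not part of the repository; `check_modules` is a hypothetical helper, and it only tests whether each module can be found, not whether its version is compatible.

```python
from importlib.util import find_spec

# Import names differ from the pip package names for two of the packages:
# scikit-image is imported as `skimage`, pytorch-fid as `pytorch_fid`.
REQUIRED = ["diffusers", "torch", "torchvision", "safetensors"]
OPTIONAL = ["lpips", "skimage", "pytorch_fid"]  # evaluation only


def check_modules(names):
    """Return {module_name: bool}, True when the module can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for group, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        for name, found in check_modules(names).items():
            print(f"{group:9s} {name:12s} {'ok' if found else 'MISSING'}")
```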

Model Usage

  • Load pipeline:
from argparse import Namespace
from inference.sample_optimized import load_fontdiffuser_pipeline

args = Namespace(
    ckpt_dir="ckpt",
    device="cuda:0",
    guidance_scale=7.5,
    num_inference_steps=20,
    fp16=False,
    enable_xformers=False,
)
pipe = load_fontdiffuser_pipeline(args=args)
  • Single-image inference (recommended)
accelerate launch run_inference.py \
  --ckpt_dir ckpt \
  --content_character "A" \
  --style_image_path style_images/foo.png \
  --save_image \
  --save_image_dir results/
  • Large-scale batch with checkpoint/resume
accelerate launch run_inference.py \
  --ckpt_dir ckpt \
  --characters chars.txt \
  --style_images "style_images/*.png" \
  --ttf_path fonts/myfont.ttf \
  --output_dir my_dataset/train_original \
  --batch_size 8 \
  --num_inference_steps 15 \
  --guidance_scale 7.5 \
  --save_interval 10
  • Multi-GPU inference via Accelerate
accelerate launch run_inference.py \
  --ckpt_dir ckpt \
  --characters chars.txt \
  --style_images "style_images/*.png" \
  --output_dir results/
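
When batch runs are launched from another Python script, the command can be assembled as an argument list. This is a minimal sketch: `build_inference_command` is a hypothetical helper, not part of the repository, and it uses only the flags shown in the commands above.

```python
import shlex


def build_inference_command(ckpt_dir, characters, style_images, output_dir, **options):
    """Build the `accelerate launch run_inference.py ...` argument list.

    Extra keyword arguments become `--name value` pairs, for example
    batch_size=8 becomes `--batch_size 8`.
    """
    cmd = [
        "accelerate", "launch", "run_inference.py",
        "--ckpt_dir", ckpt_dir,
        "--characters", characters,
        "--style_images", style_images,
        "--output_dir", output_dir,
    ]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command(
        "ckpt", "chars.txt", "style_images/*.png", "results/",
        batch_size=8, num_inference_steps=15,
    )
    # shlex.join quotes the glob so the printed line can be pasted into a shell.
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` bypasses the shell, so the glob pattern reaches the script unexpanded, just as the quoted form does on the command line.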

Outputs & Metadata

The repo uses hash-based filenames (see tools/filename_utils.py) and a central metadata file:

  • ContentImage/char.png: character content images
  • TargetImage/style+char.png: generated images per style
  • results_checkpoint.json: canonical metadata used by dataset tools and HF exporters

Example metadata generation:

python tools/generate_metadata.py --data_root my_dataset/handwritten_original --output my_dataset/handwritten_original/results_checkpoint.json

Model Performance

Supported Tasks

  • Single-character font generation
  • Multi-character batch generation
  • Multi-font support
  • Multi-style transfer
  • Index-based tracking for large-scale generation
  • Checkpoint and resume support

Output Format

output_dir/
├── ContentImage/              # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/               # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
└── results_checkpoint.json    # Checkpoint file; doubles as generation metadata
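
The layout above can be expressed as two small path helpers. This is a sketch that follows the index-based names shown in the tree; the helper names are hypothetical, and the repository's own tools/filename_utils.py may name files differently.

```python
from pathlib import PurePosixPath


def content_path(char_index):
    """Relative path of a content image, e.g. ContentImage/char0.png."""
    return PurePosixPath("ContentImage") / f"char{char_index}.png"


def target_path(style_index, char_index):
    """Relative path of a generated image inside its style folder."""
    style = f"style{style_index}"
    return PurePosixPath("TargetImage") / style / f"{style}+char{char_index}.png"


def expected_targets(num_styles, num_chars):
    """All target paths for a full styles x characters run, style-major order."""
    return [
        target_path(s, c)
        for s in range(num_styles)
        for c in range(num_chars)
    ]
```

Comparing `expected_targets(...)` against the files actually on disk is one simple way to spot missing generations.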

Results Metadata Structure

{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
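
A few fields in this structure are derivable from others, which allows a cheap consistency check. The sketch below assumes, based only on the example values above, that `total_possible_pairs` equals `total_chars × total_styles` and that `time_per_image` equals `total_time / num_images`; `check_metadata` is a hypothetical helper, not part of the repository.

```python
import json
import math


def check_metadata(meta):
    """Return a list of human-readable inconsistencies (empty when none found)."""
    problems = []
    n_chars, n_styles = len(meta["characters"]), len(meta["styles"])

    if meta["total_chars"] != n_chars:
        problems.append(f"total_chars={meta['total_chars']} but {n_chars} characters listed")
    if meta["total_styles"] != n_styles:
        problems.append(f"total_styles={meta['total_styles']} but {n_styles} styles listed")
    if meta["total_possible_pairs"] != n_chars * n_styles:
        problems.append("total_possible_pairs != characters x styles")

    for entry in meta.get("metrics", {}).get("inference_times", []):
        if entry["num_images"] == 0:
            continue  # nothing to average over
        expected = entry["total_time"] / entry["num_images"]
        if not math.isclose(entry["time_per_image"], expected, rel_tol=1e-6):
            problems.append(f"time_per_image mismatch for style {entry['style']}")
    return problems


if __name__ == "__main__":
    with open("results_checkpoint.json", encoding="utf-8") as fh:
        for problem in check_metadata(json.load(fh)):
            print("WARNING:", problem)
```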

Dataset

Dataset Source

Dataset Structure

FontDiffusion Dataset/
├── total/
│   ├── ContentImage/          # Character structure images
│   ├── TargetImage/           # Style-specific font renderings
│   └── results_checkpoint.json
├── val/
└── test/

Technical Features

Optimizations

  • Batch Processing: Process multiple characters per style
  • Memory Efficiency: Attention slicing (optional)
  • FP16 Support: Reduced precision for faster inference
  • Torch Compile: Optional model compilation
  • Channels Last Format: Memory-optimized tensor layout
  • XFormers Support: Fast attention implementation

Robustness

  • Checkpoint & Resume: Resume from interruptions
  • Index-based Tracking: Handle large character sets (100K+)
  • Multi-font Support: Process characters across multiple fonts
  • Error Recovery: Graceful handling of missing fonts
  • Automatic Indexing: Consistent char_index and style_index
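
Index-based tracking makes resuming straightforward in principle: every (character, style) pair already recorded in results_checkpoint.json can be skipped. The sketch below illustrates the idea and is not the repository's implementation; it assumes `char_index` and `style_index` are positions in the `characters` and `styles` lists of the metadata file.

```python
def pending_pairs(meta):
    """Return (char_index, style_index) pairs not yet present in `generations`.

    Pairs are returned style by style, so a resumed run finishes one style
    before moving on to the next.
    """
    done = {
        (g["char_index"], g["style_index"])
        for g in meta.get("generations", [])
    }
    return [
        (c, s)
        for s in range(len(meta["styles"]))
        for c in range(len(meta["characters"]))
        if (c, s) not in done
    ]
```

Using a set for the completed pairs keeps each lookup constant-time, which matters once the character set reaches the 100K+ range mentioned above.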

Monitoring

  • Weights & Biases Integration: Real-time tracking
  • Progress Bars: Detailed generation progress
  • Checkpoint Saving: Periodic intermediate saves
  • Quality Metrics: LPIPS, SSIM, FID computation
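
Per-image scores such as LPIPS or SSIM are reported in the metadata as mean, std, min, and max. A minimal sketch of that aggregation follows; `summarize` is a hypothetical helper, and the use of the population standard deviation is an assumption, since the card does not say which variant the repository computes.

```python
import statistics


def summarize(scores):
    """Aggregate per-image scores into the mean/std/min/max summary shape.

    Raises statistics.StatisticsError when `scores` is empty.
    """
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),  # population std (assumption)
        "min": min(scores),
        "max": max(scores),
    }
```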

Citation

@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Zhenhua Yang and Dezhi Peng and Yuxin Kong and Yuyi Zhang and Cong Yao and Lianwen Jin},
  year={2023}
}

License

This model is licensed under the Apache License 2.0. See the LICENSE file for details.


