# Pixel-1: From-Scratch Text-to-Image Generator 🎨
Pixel-1 is a lightweight, experimental text-to-image model built and trained entirely from scratch. Unlike many modern generators that rely on massive pre-trained diffusion backbones, Pixel-1 explores the potential of a compact architecture to understand and render complex semantic prompts.
## 🏆 The Achievement
Pixel-1 was designed to prove that even a small model can achieve strong semantic alignment with user prompts. It successfully renders complex concepts like window bars, fence shadows, and specific color contrasts, features usually reserved for much larger models.
Key Features:
- Built from Scratch: The Generator architecture (Upsampling, Residual Blocks, and Projections) was designed and trained without pre-trained image weights.
- High Prompt Adherence: Exceptional ability to "listen" to complex instructions (e.g., "Window with metal bars and fence shadow").
- Efficient Architecture: Optimized for fast inference and training on consumer-grade GPUs (like Kaggle's T4).
- Latent Understanding: Uses a CLIP-based text encoder to bridge the gap between human language and pixel space.
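The CLIP-based text conditioning listed above can be sketched as follows. To keep the example self-contained and offline, the text encoder is randomly initialized from a config matching the text tower of `openai/clip-vit-large-patch14` (hidden size 768), and dummy token ids stand in for a real tokenized prompt; in actual use you would load the pretrained checkpoint and a tokenizer, as shown in the usage code below.

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Config matching the text tower of openai/clip-vit-large-patch14
# (hidden_size 768); weights here are random, for shape illustration only.
config = CLIPTextConfig(hidden_size=768, num_attention_heads=12,
                        intermediate_size=3072, num_hidden_layers=12)
text_encoder = CLIPTextModel(config)

# Dummy token ids standing in for a tokenized, padded 77-token prompt.
input_ids = torch.randint(0, config.vocab_size, (1, 77))
with torch.no_grad():
    emb = text_encoder(input_ids).pooler_output  # one 768-d vector per prompt
print(emb.shape)  # torch.Size([1, 768])
```

This pooled 768-d vector is the "bridge" between language and pixel space: it is the only input the generator receives.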
## 🏗️ Architecture
The model uses a series of Transposed Convolutional layers combined with Residual Blocks to upsample a latent text vector into a 128x128 image.
- Encoder: CLIP (OpenAI/clip-vit-large-patch14)
- Decoder: Custom CNN-based Generator with Skip Connections
- Loss Function: L1/MSE transition
- Resolution: 128x128 (v1)
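A hypothetical sketch of the decoder described above: a linear projection from the CLIP text embedding to a small spatial latent, then repeated transposed-convolution and residual-block stages up to 128x128. Channel widths, the stage count, and the 4x4 seed are illustrative assumptions, not the released weights, and the real decoder's skip connections are reduced here to plain residual blocks.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class TinyGenerator(nn.Module):
    def __init__(self, emb_dim=768, base=256):
        super().__init__()
        self.proj = nn.Linear(emb_dim, base * 4 * 4)  # 4x4 spatial seed
        stages, ch = [], base
        for _ in range(5):  # 4 -> 8 -> 16 -> 32 -> 64 -> 128
            stages += [
                nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                nn.ReLU(),
                ResidualBlock(ch // 2),
            ]
            ch //= 2
        self.stages = nn.Sequential(*stages)
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)
    def forward(self, emb):
        x = self.proj(emb).view(-1, 256, 4, 4)
        return torch.tanh(self.to_rgb(self.stages(x)))  # values in [-1, 1]

g = TinyGenerator()
img = g(torch.randn(1, 768))
print(img.shape)  # torch.Size([1, 3, 128, 128])
```

The tanh output in [-1, 1] matches the `(out + 1) / 2` rescaling used for display in the usage code below.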
## 🖼️ Samples & Prompting
Pixel-1 shines when given high-contrast, descriptive prompts.
Recommended Prompting Style:
"Window with metal bars and fence shadow, high contrast, vivid colors, detailed structure"
Observations: While the current version (v1) produces stylistic, slightly "painterly" or "pixelated" results, its spatial reasoning is remarkably accurate, correctly placing shadows and structural elements according to the text.
## 🛠️ How to Use
```python
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
from transformers import AutoTokenizer, CLIPTextModel, AutoModel

def generate_fixed_from_hub(prompt, model_id="TopAI-1/Pixel-1"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"🚀 Working on {device}...")

    # 1. Clear the cache to ensure the latest files are pulled from the Hub
    cache_path = os.path.expanduser(f"~/.cache/huggingface/hub/models--{model_id.replace('/', '--')}")
    if os.path.exists(cache_path):
        print("🧹 Clearing old cache to fetch your latest fixes...")
        shutil.rmtree(cache_path)

    # 2. Load CLIP
    tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

    # 3. Load the generator model directly from the Hub.
    # Thanks to the auto_map entry in config.json, transformers finds the classes on its own.
    print("📥 Downloading architecture and weights directly from Hub...")
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        force_download=True
    ).to(device)
    model.eval()
    print("✅ Model loaded successfully!")

    # 4. Generation
    print(f"🎨 Generating: {prompt}")
    inputs = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = text_encoder(inputs.input_ids).pooler_output
        out = model(emb)

    # 5. Display: map the generator's [-1, 1] output to [0, 1] for plotting
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    plt.figure(figsize=(8, 8))
    plt.imshow(np.clip(img, 0, 1))
    plt.axis('off')
    plt.show()

# Run
generate_fixed_from_hub("Window with metal bars and fence shadow")
```
