# Pixel-1: From-Scratch Text-to-Image Generator 🎨
Pixel-1 is a lightweight, experimental text-to-image model built and trained entirely from scratch. Unlike many modern generators that rely on massive pre-trained diffusion backbones, Pixel-1 explores the potential of a compact architecture to understand and render complex semantic prompts.
## 🏆 The Achievement
Pixel-1 was designed to prove that even a small model can achieve strong semantic alignment with user prompts. It successfully renders complex concepts like window bars, fence shadows, and specific color contrasts, features usually reserved for much larger models.
Key Features:
- Built from Scratch: The Generator architecture (Upsampling, Residual Blocks, and Projections) was designed and trained without pre-trained image weights.
- High Prompt Adherence: Exceptional ability to "listen" to complex instructions (e.g., "Window with metal bars and fence shadow").
- Efficient Architecture: Optimized for fast inference and training on consumer-grade GPUs (like Kaggle's T4).
- Latent Understanding: Uses a CLIP-based text encoder to bridge the gap between human language and pixel space.
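The CLIP-based text conditioning listed above can be sketched as follows. To keep the example self-contained and offline, the text encoder is randomly initialized from a config matching the text tower of `openai/clip-vit-large-patch14` (hidden size 768), and dummy token ids stand in for a real tokenized prompt; in actual use you would load the pretrained checkpoint and a tokenizer, as shown in the usage code below.

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Config matching the text tower of openai/clip-vit-large-patch14
# (hidden_size 768); weights here are random, for shape illustration only.
config = CLIPTextConfig(hidden_size=768, num_attention_heads=12,
                        intermediate_size=3072, num_hidden_layers=12)
text_encoder = CLIPTextModel(config)

# Dummy token ids standing in for a tokenized, padded 77-token prompt.
input_ids = torch.randint(0, config.vocab_size, (1, 77))
with torch.no_grad():
    emb = text_encoder(input_ids).pooler_output  # one 768-d vector per prompt
print(emb.shape)  # torch.Size([1, 768])
```

This pooled 768-d vector is the "bridge" between language and pixel space: it is the only input the generator receives.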
## 🏗️ Architecture
The model uses a series of Transposed Convolutional layers combined with Residual Blocks to upsample a latent text vector into a 128x128 image.
- Encoder: CLIP (OpenAI/clip-vit-large-patch14)
- Decoder: Custom CNN-based Generator with Skip Connections
- Loss Function: L1/MSE transition
- Resolution: 128x128 (v1)
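A hypothetical sketch of the decoder described above: a linear projection from the CLIP text embedding to a small spatial latent, then repeated transposed-convolution and residual-block stages up to 128x128. Channel widths, the stage count, and the 4x4 seed are illustrative assumptions, not the released weights, and the real decoder's skip connections are reduced here to plain residual blocks.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class TinyGenerator(nn.Module):
    def __init__(self, emb_dim=768, base=256):
        super().__init__()
        self.proj = nn.Linear(emb_dim, base * 4 * 4)  # 4x4 spatial seed
        stages, ch = [], base
        for _ in range(5):  # 4 -> 8 -> 16 -> 32 -> 64 -> 128
            stages += [
                nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                nn.ReLU(),
                ResidualBlock(ch // 2),
            ]
            ch //= 2
        self.stages = nn.Sequential(*stages)
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)
    def forward(self, emb):
        x = self.proj(emb).view(-1, 256, 4, 4)
        return torch.tanh(self.to_rgb(self.stages(x)))  # values in [-1, 1]

g = TinyGenerator()
img = g(torch.randn(1, 768))
print(img.shape)  # torch.Size([1, 3, 128, 128])
```

The tanh output in [-1, 1] matches the `(out + 1) / 2` rescaling used for display in the usage code below.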
## 🖼️ Samples & Prompting
Pixel-1 shines when given high-contrast, descriptive prompts.
Recommended Prompting Style:
"Window with metal bars and fence shadow, high contrast, vivid colors, detailed structure"
Observations: While the current version (v1) produces stylistic, slightly "painterly" or "pixelated" results, its spatial reasoning is remarkably accurate, correctly placing shadows and structural elements according to the text.
## 🛠️ How to Use
```python
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
from transformers import AutoTokenizer, CLIPTextModel, AutoModel

def generate_fixed_from_hub(prompt, model_id="TopAI-1/Pixel-1"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"🚀 Working on {device}...")

    # 1. Clear the cache to ensure the latest files are pulled from the Hub
    cache_path = os.path.expanduser(f"~/.cache/huggingface/hub/models--{model_id.replace('/', '--')}")
    if os.path.exists(cache_path):
        print("🧹 Clearing old cache to fetch your latest fixes...")
        shutil.rmtree(cache_path)

    # 2. Load CLIP
    tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

    # 3. Load the generator model directly from the Hub.
    # Thanks to the auto_map entry in config.json, transformers finds the classes on its own.
    print("📥 Downloading architecture and weights directly from Hub...")
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        force_download=True
    ).to(device)
    model.eval()
    print("✅ Model loaded successfully!")

    # 4. Generation
    print(f"🎨 Generating: {prompt}")
    inputs = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = text_encoder(inputs.input_ids).pooler_output
        out = model(emb)

    # 5. Display: map the generator's [-1, 1] output to [0, 1] for plotting
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    plt.figure(figsize=(8, 8))
    plt.imshow(np.clip(img, 0, 1))
    plt.axis('off')
    plt.show()

# Run
generate_fixed_from_hub("Window with metal bars and fence shadow")
```
