Synthetic Data Factory

Multi-module Python pipeline that generates synthetic relational datasets (users, products, transactions), validates them with Pydantic, runs quality checks, and exports to Parquet/JSONL/CSV with an HTML report.

Quick start

pip install -r requirements.txt
python job.py

Environment variables

Variable     Description                           Default
OUTPUT_DIR   Directory where results are written   ./output
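
As a rough illustration of how a script could honour this variable, the sketch below resolves OUTPUT_DIR and falls back to the documented default. The helper name resolve_output_dir is hypothetical and is not taken from job.py; treating an empty value as unset is also an assumption.

```python
import os
from pathlib import Path

DEFAULT_OUTPUT_DIR = "./output"  # default documented in the table above


def resolve_output_dir(env=None) -> Path:
    """Return the output directory, falling back to the documented default.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Assumption of this sketch: an empty OUTPUT_DIR counts as unset.
    return Path(env.get("OUTPUT_DIR") or DEFAULT_OUTPUT_DIR)
```

Passing the environment in as a mapping keeps the helper easy to test without modifying the real process environment.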

Output

$OUTPUT_DIR/
  users/          users.parquet, users.jsonl, users.csv
  products/       products.parquet, products.jsonl, products.csv
  transactions/   transactions.parquet, transactions.jsonl, transactions.csv
  report.html     Visual quality report with embedded charts
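
The layout above can be checked after a run with a small script. This is a sketch only: the function names are hypothetical, and the file list is derived purely from the tree shown here, not from the pipeline's code.

```python
from pathlib import Path

TABLES = ("users", "products", "transactions")
FORMATS = ("parquet", "jsonl", "csv")


def expected_outputs(output_dir):
    """List every path the documented layout says a run should produce."""
    root = Path(output_dir)
    # One file per table and format, e.g. users/users.parquet
    paths = [root / table / f"{table}.{ext}" for table in TABLES for ext in FORMATS]
    paths.append(root / "report.html")
    return paths


def missing_outputs(output_dir):
    """Return the expected paths that do not exist as files on disk."""
    return [p for p in expected_outputs(output_dir) if not p.is_file()]
```

Calling missing_outputs("./output") after python job.py should return an empty list if the run wrote everything the layout describes.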

Configuration

Edit synthetic_factory/config.py to change:

  • SEED — random seed for reproducibility
  • NUM_USERS, NUM_PRODUCTS, NUM_TRANSACTIONS — record counts
  • CATEGORIES, PAYMENT_METHODS — domain values
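
To illustrate why SEED matters, here is a minimal sketch of seeded sampling. It assumes the settings are plain module-level constants; the values and the helper sample_payment_methods are placeholders for illustration, not the project's actual defaults or code.

```python
import random

# Placeholder values; the real ones live in synthetic_factory/config.py
# and may differ.
SEED = 42
NUM_USERS = 5
PAYMENT_METHODS = ["card", "paypal", "bank_transfer"]  # hypothetical domain values


def sample_payment_methods(seed, n):
    """Draw n payment methods from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return [rng.choice(PAYMENT_METHODS) for _ in range(n)]


first = sample_payment_methods(SEED, NUM_USERS)
second = sample_payment_methods(SEED, NUM_USERS)
assert first == second  # same seed and counts give identical draws
```

Changing SEED gives a different but equally repeatable dataset, while changing a count such as NUM_USERS changes how many records are drawn.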
