AI & ML interests

open-data, public-data, korean-nlp, tabular-data, datasets

kpubdata — Korean Public Data for Everyone

Making Korean government open data accessible worldwide with a single line of code.

from datasets import load_dataset

# Download the dataset from the Hub, then convert the "train" split to a pandas DataFrame.
ds = load_dataset("kpubdata/seoul-apartment-trades")
df = ds["train"].to_pandas()

Mission

Korean public data (data.go.kr) is valuable but hard to access: the APIs require complex authentication, return XML, are documented only in Korean, and are not offered in standard formats such as Parquet or HuggingFace Datasets.
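To make the XML point concrete, here is a minimal sketch of flattening an XML response into plain records using only the standard library. The response layout (`response/body/items/item`) and the field names `district` and `price` are assumptions made for illustration; real endpoints on data.go.kr differ in their element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body, invented for this example.
SAMPLE_XML = """<response>
  <header><resultCode>00</resultCode></header>
  <body>
    <items>
      <item><district>강남구</district><price>150000</price></item>
      <item><district>마포구</district><price>98000</price></item>
    </items>
    <totalCount>2</totalCount>
  </body>
</response>"""


def parse_items(xml_text):
    """Turn each <item> element into a dict of child tag -> stripped text."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({child.tag: (child.text or "").strip() for child in item})
    return records


records = parse_items(SAMPLE_XML)
```

All values stay as strings here, and Korean text is left untouched; any type conversion would be a separate, explicit step.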

We bridge the gap — raw public data, cleaned and published as HuggingFace Datasets. No feature engineering, no opinions. Just honest, well-documented government data ready to use.

Principles

  • Source fidelity: Original Korean text values preserved as-is. English column names for accessibility.
  • Schema honesty: What is declared in the config is exactly what you get. No phantom columns, no all-null surprises.
  • Global-first documentation: Dataset cards in English with Korean domain context explained for international users.
  • No feature engineering: We publish clean raw data. Users add derived features (geocoding, distances, etc.) themselves — just like Kaggle.
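As an illustration of the schema-honesty principle, the following sketch compares a declared column list against the rows actually present. It is not the project's own validation code; `check_schema`, the sample rows, and the column names are hypothetical.

```python
def check_schema(rows, declared_columns):
    """Return a list of problems; an empty list means data and declaration agree."""
    problems = []
    declared = set(declared_columns)
    seen = set()
    for row in rows:
        seen.update(row.keys())
    # Declared but never present in the data.
    for name in sorted(declared - seen):
        problems.append(f"phantom column: {name}")
    # Present in the data but missing from the declaration.
    for name in sorted(seen - declared):
        problems.append(f"undeclared column: {name}")
    # Declared and present, but empty in every row.
    for name in sorted(declared & seen):
        if all(row.get(name) is None for row in rows):
            problems.append(f"all-null column: {name}")
    return problems


rows = [
    {"district": "강남구", "price": 150000, "floor": None},
    {"district": "마포구", "price": 98000, "floor": None},
]
problems = check_schema(rows, ["district", "price", "floor", "build_year"])
```

With this sample, `build_year` is reported as a phantom column and `floor` as an all-null column, which are the two surprises the principle rules out.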

Available Datasets

| Dataset | Records | Period | Source | Description |
|---|---|---|---|---|
| seoul-apartment-trades | ~234k | 2020–2024 | MOLIT via data.go.kr | Apartment sale transactions in Seoul, all 25 districts |

More datasets are planned, including air quality, weather, and transit.

How It Works

[data.go.kr API] → [kpubdata SDK] → [kpubdata-builder pipeline] → [HuggingFace Dataset]
  1. kpubdata — Python SDK that handles API auth, pagination, and response parsing for Korean public data portals
  2. kpubdata-builder — Pipeline that fetches, transforms, validates, and publishes datasets to HuggingFace
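The fetch, transform, and validate stages above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the actual kpubdata or kpubdata-builder API: every function name, the field mapping, and the fake data source are hypothetical, and the publishing step is left out because it needs Hub credentials.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect rows page by page until a short or empty page signals the end."""
    rows, page = [], 1
    while True:
        batch = fetch_page(page, page_size)
        rows.extend(batch)
        if len(batch) < page_size:
            return rows
        page += 1


def transform(rows):
    """Rename source fields to English column names; values are left as-is."""
    rename = {"법정동": "legal_dong", "거래금액": "price"}  # hypothetical mapping
    return [{rename.get(k, k): v for k, v in row.items()} for row in rows]


def validate(rows, columns):
    """Raise if any row's keys differ from the declared columns."""
    for i, row in enumerate(rows):
        if set(row) != set(columns):
            raise ValueError(f"row {i} has columns {sorted(row)}, expected {sorted(columns)}")
    return rows


# Stand-in for the network call: serves five fake rows, one page at a time.
FAKE_SOURCE = [{"법정동": f"동{i}", "거래금액": str(i * 1000)} for i in range(5)]


def fake_fetch_page(page, page_size):
    start = (page - 1) * page_size
    return FAKE_SOURCE[start:start + page_size]


dataset_rows = validate(
    transform(fetch_all(fake_fetch_page, page_size=2)),
    ["legal_dong", "price"],
)
```

Passing the page fetcher in as a function keeps the pagination loop testable without network access; a real fetcher would also handle authentication and response parsing.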

Contributing

We welcome contributions! If there is a Korean public dataset you would like to see on HuggingFace:

  1. Check if the source API is available on data.go.kr
  2. Open an issue on kpubdata-builder
  3. Or submit a PR with a new dataset config (see publishing standards)

License

Datasets are published under licenses compatible with their original government data licenses. Most Korean public data is released under 공공누리 (KOGL, Korea Open Government License), which we map to CC-BY-4.0.

See individual dataset cards for specific licensing details.
