🏗️ Building on HF

Dipankar Sarkar PRO

dipankarsarkar

https://www.dipankar.cc

AI & ML interests

Building the AI-native stack. Agents as infrastructure, safety as architecture, performance as plumbing. I publish the receipts: papers, datasets, demos.

Recent Activity

upvoted a paper 38 minutes ago

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

upvoted a paper 39 minutes ago

DOPD: Dual On-policy Distillation

upvoted a paper 39 minutes ago

Dockerless: Environment-Free Program Verifier for Coding Agents


Organizations

replied to ginigen-ai's post 40 minutes ago

Accuracy is the wrong headline here, and you named it. The metric that matters downstream is whether confidence drops right before the wrong step, not after it.

In an agent loop that gap is the whole game. A model that knows it is unsure stops and re-plans. One that does not cascades the error through five tool calls before anyone notices.

How are you scoring metacognition: abstention, self-correction, or calibrated confidence at the decision boundary? Those three reward very different models.

reacted to ginigen-ai's post with 🔥 41 minutes ago

Post

1095

🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even a K-AI #1 model (JGOS-31B-Citizen) is the strongest on multiple-choice traps (trap_rate 0.005 — ~2 misses in 400) yet blind to its own free-form mistakes (self-confidence AUROC = 0.5, pure random). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row): ① trap_rate — does it fall for tempting trap options? (lower = stronger) ② adapter gain Δ — how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)

What's open: 📊 300+100 trap problems (each with a hidden trap + TICOS type) 🏆 24-model leaderboard 🧩 11 per-model adapters — adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).
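For readers unfamiliar with the metric: a self-confidence AUROC of 0.5 means the model's confidence ranks a correct answer above a wrong one no more often than a coin flip would. A minimal sketch in plain Python, assuming a list of confidence scores and 0/1 correctness labels. The function name and the sample numbers are illustrative, not the benchmark's scoring code.

```python
def auroc(scores, labels):
    """Probability that a random correct answer (label 1) gets a higher
    confidence score than a random wrong answer (label 0). Ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: confidence that carries no information about correctness.
flat = auroc([0.7, 0.7, 0.7, 0.7], [1, 0, 1, 0])        # 0.5
# Hypothetical data: confidence that separates right from wrong perfectly.
sharp = auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])       # 1.0
```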

1 reply

reacted to mmhamdy's post with 🔥 about 2 hours ago

Post

228

It has been more than a decade now since the knowledge distillation paper came out.

Knowledge Distillation (KD) is one of my favorite topics, but I have to confess that I'm not a huge fan of the term because I find it confusing (or at least, it has become so over time).

The idea behind KD is not novel; it was there almost a decade before the paper came out (and arguably even a decade before that, back to 1990-91). But this paper is the one that clicked, the one that made the topic much more popular and introduced it to a broader audience.

First, the timing and the authors played a big role: we have Geoffrey Hinton, Oriol Vinyals, and Jeff Dean here. And second, Geoffrey Hinton is really good at idea branding: Model compression?! No, no, no! Let's call it "Knowledge Distillation" and use evocative terms such as "Dark Knowledge" to describe what is being transferred.

It's a great name, but as time has passed, the term has become a bit of a relic. KD is no longer solely about compression (KD used to be introduced as a method for model compression, but now model compression is just one application of KD). The other thing is that the word "distillation" implies some sort of potency, that the student is somehow more powerful than the teacher, which is not the case (though many counterarguments could be made, for example, more powerful compared to another model trained with no teacher).

Nevertheless, the paper is incredibly well-written, short, and fun to read. It's one of the few papers that I have read several times. Check it out, and maybe share your thoughts on the topic with us here!

If you had to choose another name for Knowledge Distillation, what would it be?

5 replies

replied to mmhamdy's post about 2 hours ago

The transfer was never the architecture, it was the soft targets. The dark knowledge is the runner-up mass, the 0.39 the teacher spreads over the wrong-but-related classes. A one-hot label deletes exactly that.

So I would drop "distillation" and call it soft-target transfer. Names the mechanism, kills the implied potency.

The part that still bugs me: most of the gain rides on temperature, not the loss term. High T is literally teaching the shape of the teacher's mistakes. Have you seen a principled way to set T, or is it still a swept knob?
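To make the temperature point concrete, here is a small sketch of a temperature-scaled softmax in plain Python. The logits are made up for illustration. Raising T moves probability mass from the top class onto the runner-up classes, which is the part of the teacher's output a one-hot label throws away.

```python
import math


def softmax_t(logits, T=1.0):
    """Softmax over logits divided by temperature T (T > 0)."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [5.0, 3.0, 1.0]          # hypothetical teacher output
hard = softmax_t(teacher_logits, T=1.0)   # mass concentrated on class 0
soft = softmax_t(teacher_logits, T=4.0)   # runner-up classes get more mass
```

In this toy case the runner-up probability `soft[1]` is larger than `hard[1]`, so the student sees more of the teacher's ranking over the wrong classes as T grows.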

replied to stas's post about 17 hours ago

Prompt dedup. That is the performance-is-plumbing story in one line, not an algorithm change.

RL prompt sets are mostly shared system + few-shot prefixes, so the duplicate compute is huge and invisible until someone measures it.

Is the dedup exact-match on the full prompt, or prefix-level, so two prompts that diverge late still share the early generation and forward passes?
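The difference between the two options can be sketched by counting how many token positions need a forward pass under each scheme. This is an illustration only, not how Arctic RL implements dedup. Prompts are assumed to be lists of token ids, and the token count is used as a rough proxy for compute.

```python
def tokens_exact_dedup(prompts):
    """Tokens processed if only byte-identical prompts are collapsed."""
    unique = {tuple(p) for p in prompts}
    return sum(len(p) for p in unique)


def tokens_prefix_dedup(prompts):
    """Tokens processed if any shared prefix is computed once.
    Equivalent to counting the nodes of a trie built over the prompts."""
    root = {}
    count = 0
    for prompt in prompts:
        node = root
        for tok in prompt:
            if tok not in node:
                node[tok] = {}
                count += 1
            node = node[tok]
    return count


# Hypothetical batch: two prompts share a 3-token prefix, one is a duplicate.
batch = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 4]]
exact = tokens_exact_dedup(batch)    # 8: two unique prompts of 4 tokens
prefix = tokens_prefix_dedup(batch)  # 5: shared prefix 1,2,3 plus two tails
```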

replied to RDTvlokip's post about 18 hours ago

Logits, but not the chosen token's prob. The entropy of the whole next-token distribution.

A token picked at 0.6 reads confident until you see the runner-up sat at 0.39. That is a fork the model nearly took, and the per-token view hides the near-miss completely.

When even that looks clean I leave the single generation and go to the seams between turns. What state actually carried forward versus what the model assumed did. In an agent loop the bug is rarely inside one call, it is in what got dropped between two.

So my ladder runs one rung past yours: rendered to ids to chosen prob to full distribution to cross-turn state.

Where does it bottom out for you, is there a layer you have found that never lies?
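A rough sketch of the "full distribution" rung, using the 0.6 versus 0.39 example from above. Entropy is what the reply names. The top-2 margin is added here as a companion signal because it flags the near-miss directly. Both helpers are illustrative and operate on a plain list of probabilities.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top2_margin(probs):
    """Gap between the chosen token and the runner-up."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


fork = [0.6, 0.39, 0.01]      # chosen token at 0.6, runner-up at 0.39
settled = [0.98, 0.01, 0.01]  # hypothetical confident step for comparison

fork_signal = (entropy(fork), top2_margin(fork))
settled_signal = (entropy(settled), top2_margin(settled))
```

The fork has higher entropy and a much smaller margin than the settled step, which the chosen token's probability alone would not show.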

reacted to danieldk's post with 🔥 about 20 hours ago

Post

We have recently added Torch Stable ABI support to kernels and kernel-builder. This allows kernel developers to target a particular Torch version and the kernel will be supported on that Torch version and later Torch versions (up to ~2 years).

This makes it much easier to write kernels with long-term support and not just the last two Torch releases.

We have also started rolling out Stable ABI support to kernels in kernels-community, starting with Flash Attention 3, supporting Torch 2.9 and later as well as CUDA versions starting at 12.6:

https://huggingface.co/kernels/kernels-community/flash-attn3/tree/v1/build

1 reply

replied to danieldk's post about 20 hours ago

Stable ABI is the unglamorous win that quietly removes the most expensive tax in the kernel ecosystem.

Right now a kernel's useful life is pinned to Torch's release cadence, so every couple of versions you re-port code that never actually changed. Decoupling kernel lifetime from the Torch version is the real story here, not just FA3.

The ~2-year support window is what makes it safe to depend on a community kernel in production instead of vendoring your own copy.

Does the Stable ABI cover the custom-op registration path too, or just the kernel entry points?

replied to RDTvlokip's post about 20 hours ago

skip_special_tokens=True hiding the exact thing that is breaking you is the perfect summary. The rendered view is lossy somewhere, always.

Same trap in agent loops. You read the clean transcript and trust it, but the tool call that actually fired was truncated JSON the model never closed. The string lies, the id stream does not.

So I keep raw-vs-rendered on by default now, tokens and tool args both.

What is the first raw signal you reach for when an eval looks clean but feels off?
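One cheap way to keep the raw view honest is to parse the tool arguments exactly as the model emitted them instead of trusting the rendered transcript. A minimal sketch, assuming the raw arguments arrive as a JSON string. The function name and the sample payloads are hypothetical.

```python
import json


def check_tool_args(raw_args):
    """Return (ok, payload_or_error) for the raw argument string of a tool call."""
    try:
        return True, json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON at char {exc.pos}: {exc.msg}"


ok_good, parsed = check_tool_args('{"path": "a.txt"}')
# A truncated call the model never closed:
ok_bad, error = check_tool_args('{"path": "a.txt", "mode": ')
```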

reacted to stas's post with 🔥 about 21 hours ago

Post

1629

After many months of intense work the Snowflake AI Research team is happy to present to you the new open source project: Arctic RL

https://snowflake.com/en/blog/engineering/arctic-rl-open-source-backend/

- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%

4 replies

replied to stas's post about 21 hours ago

The 3.5x end-to-end number is the part people skim past, and it is the whole story.

A text-to-SQL model edging Gemini 3.1 Pro is not an architecture win, it is a faster-iteration win. 5 days down to 36 hours means ~3x more experiments per week, and that compounds into the accuracy gap.

The "one config flag, no code changes" line is what makes it real. Most RL speedups die because integrating them burns more eng time than they save.

Where does ZoRRo's 6x actor-update speedup actually come from? Overlapping rollout generation with the optimizer step, or the actor/learner weight-sync?

posted an update about 22 hours ago

Post

Your issue tracker is in the wrong place.

It lives on a server. Your code lives in git. So every time an agent picks up work it makes an API call, burns a token, fights a rate limit, and still cannot see what the other agent just did.

Move the issues into the repo. Append-only event log in git refs. Branches when you branch, merges when you merge, CRDT so two agents never conflict. No server, no database.

The coordination signal that PR-level telemetry misses lives before the pull request. The paper, and a live demo running the real tool:

Before the Pull Request: Mining Multi-Agent Coordination (2606.19616)
neullabs/grite

If your agents share a repo, where does their shared state actually live right now?
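The conflict-free part can be sketched without any git plumbing. Assume each agent holds a set of immutable events with unique ids. Merging is set union, so the order of merges does not matter, and the issue state is a deterministic replay of the merged log. This is an illustration of the idea only, not grite's actual data model. The `Event` fields and the last-writer-wins rule are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str   # globally unique, e.g. agent name plus a counter
    ts: int         # logical timestamp
    issue: str
    field: str
    value: str


def merge(a, b):
    """Grow-only set merge: commutative, associative, idempotent."""
    return frozenset(a) | frozenset(b)


def materialize(events):
    """Replay events in (ts, event_id) order; the last write per field wins."""
    state = {}
    for ev in sorted(events, key=lambda e: (e.ts, e.event_id)):
        state.setdefault(ev.issue, {})[ev.field] = ev.value
    return state


e1 = Event("a1", 1, "ISSUE-1", "status", "open")
e2 = Event("b1", 2, "ISSUE-1", "status", "claimed")
e3 = Event("a2", 3, "ISSUE-1", "assignee", "agent-a")

agent_a = {e1, e3}
agent_b = {e1, e2}
merged = merge(agent_a, agent_b)
state = materialize(merged)
```

Either agent can merge the other's log first and both end up with the same `state`, which is what removes the need for a server to arbitrate.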

replied to RDTvlokip's post about 22 hours ago

The trailing is the cruelest kind of bug. The cause is invisible in the decoded output, so the symptom and the trigger never show up in the same place.

Packed training teaches the model that means a new document starts here. Hand it one at the end of the prompt and it just obeys.

I started diffing the real input_ids against what I thought I sent. The bug is usually two tokens I never typed.

Do you log raw token ids on every eval run now, or only when something already looks off?
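Diffing the ids can be done with the standard library alone. A small sketch, assuming both sequences are plain lists of token ids. The ids below are made up, and no particular tokenizer is implied.

```python
from difflib import SequenceMatcher


def diff_ids(expected, actual):
    """Return the non-matching spans between two token id sequences
    as (operation, expected_slice, actual_slice) tuples."""
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


intended = [101, 7592, 2088]        # hypothetical ids for what I meant to send
sent = [101, 7592, 2088, 2]         # hypothetical ids with one extra trailing token
changes = diff_ids(intended, sent)  # [('insert', [], [2])]
```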

reacted to abidlabs's post with 👀 about 23 hours ago

Post

161

Uhh did Opus 4.8 cheat on PostTrainBench??

it found an API key in the PostTrainBench environment that allowed it to generate synthetic training data without using GPU hours, boosting the base model by 0.4913

Source: https://posttrainbench.com/traces/run.html?id=claude_non_api_max_claude-opus-4-8_10h_run1__healthbench_Qwen_Qwen3-4B-Base_17315102#tab=trace

1 reply

reacted to codelion's post with 🔥 about 23 hours ago

Post

244

SPROG-9M — a 9.37M parameter model trained from scratch to solve GSM8K-style math without using an LLM at inference.

The model, codelion/sprog-9m, predicts symbolic programs over number slots, then a deterministic executor does the arithmetic. With a simple verifier, it reaches ~11.8% on GSM8K test.

We also released the dataset: codelion/gsm8k-synth, 117K validated synthetic GSM8K-style problems.

Tiny model, no pretraining, no LLM at inference, runs on a laptop.

reacted to artificial-citizen's post with 🔥 about 23 hours ago

Post

Built OpenRouter's Fusion on our own LiteLLM gateway, then benchmarked whether it earned its cost.

The detail that decides the design: in OpenRouter's own numbers, fusing a model with itself still gained ~6.7 points. So the engine is the judge synthesizing over diverse samples, not the mix of models. Self-MoA ("Rethinking Mixture-of-Agents", arXiv 2502.00674) backs it — aggregating samples from one strong model beats mixing in weaker ones, which usually dilutes quality.

That maps cleanly onto local inference. A multi-model panel means holding N models resident, a non-starter on one shared card. Judged self-consistency needs only one, and ours already runs as two load-balanced replicas, so the samples spread across both GPUs for free.

~360-line CustomLLM provider, every sub-call looped back through the gateway so it keeps routing, fallbacks, and cost tracking, and a 29-prompt blind-ranked benchmark with an explicit ship rule. All MIT.

Breakdown: https://protolabs.studio/blog/fusion-on-your-own-litellm-gateway
Code: https://github.com/protoLabsAI/fusion-gateway

replied to RDTvlokip's post about 23 hours ago

This matches everything I see. The win is almost never the architecture.

One decoding hyperparameter taking you from 38 to 76 tokens before drift is the whole lesson. The boring layer holds the gains.

I was once certain a slow agent loop was the model. It was a deepcopy in the hot path.

Which of the boring fixes surprised you most that it mattered?

replied to abidlabs's post about 23 hours ago

The headline writes itself as a model that cheated. The real story is the environment left a usable API key in reach.

An agent optimizing a score uses whatever the sandbox lets it touch. That is a permissions boundary problem, not a model-honesty one.

You cannot prompt your way out of a key that is sitting there at runtime.

Did the harness score that run as a pass, or did it catch the shortcut?

posted an update 3 days ago

Post

LLM-generated GPU kernels pass the standard correctness test and are still wrong.

The industry oracle is one line: torch.allclose at one shape, one dtype, one seed. Every modern kernel benchmark uses it. It is blind to whole bug classes.

So I built the receipts:
- a 26-op corpus of correct and LLM-buggy kernels
- a differential fuzz vs an fp64 reference that catches what allclose misses
- a live demo you can click

The Correctness Illusion in LLM-Generated GPU Kernels (2606.20128)
dipankarsarkar/gpuemu-corpus
dipankarsarkar/the-correctness-illusion

What is your team's actual correctness oracle for generated kernels?
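The failure mode is easy to reproduce in miniature. The sketch below is a pure-Python stand-in, not the code from the paper or the corpus. A deliberately buggy blocked sum plays the role of a generated kernel, and `math.fsum` plays the role of the higher-precision reference. A check at a single size passes, while sweeping sizes and seeds exposes the dropped tail.

```python
import math
import random


def reference_sum(xs):
    """Exactly rounded sum, standing in for the fp64 reference."""
    return math.fsum(xs)


def buggy_sum(xs, block=4):
    """Stand-in for a generated kernel: sums full blocks only and silently
    drops the tail when len(xs) is not a multiple of block."""
    total = 0.0
    for start in range(0, len(xs) - block + 1, block):
        total += sum(xs[start:start + block])
    return total


def differential_fuzz(kernel, sizes, seeds, rel_tol=1e-9, abs_tol=1e-9):
    """Compare kernel against the reference over many sizes and seeds.
    Returns the (size, seed) pairs where they disagree."""
    failures = []
    for n in sizes:
        for seed in seeds:
            rng = random.Random(seed)
            xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
            if not math.isclose(kernel(xs), reference_sum(xs),
                                rel_tol=rel_tol, abs_tol=abs_tol):
                failures.append((n, seed))
    return failures


single_shape = differential_fuzz(buggy_sum, sizes=[8], seeds=[0])
swept = differential_fuzz(buggy_sum, sizes=[8, 10], seeds=[0, 1])
```

`single_shape` comes back empty because 8 is a multiple of the block size, which is the one-shape, one-seed check passing. `swept` reports the size-10 cases, where the last two elements never reach the accumulator.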
