CodeScout Collection RL-trained code search agents (1.7B, 4B, 14B) that outperform 2–18× larger models using only a Unix terminal. 📄 arxiv.org/abs/2603.17829 • 12 items • Updated 3 days ago • 3
ColBERT-Zero 🐶 Collection First large-scale fully pre-trained ColBERT model using only public data, outperforming GTE-ModernColBERT and GTE-ModernBERT • 10 items • Updated 19 days ago • 19
💧 LFM2.5 Collection Collection of Instruct, Base, and Japanese LFM2.5-1.2B models. • 22 items • Updated 8 days ago • 105
LateOn-Code 💻 Collection State-of-the-art late interaction code retrieval models • 6 items • Updated 19 days ago • 16
view article Article LateOn-Code & ColGrep: LightOn unveils state-of-the-art code retrieval models and code search tooling Feb 12 • 50
view article Article **ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models?** about 1 month ago • 19
SWE-Universe: Scale Real-World Verifiable Environments to Millions Paper • 2602.02361 • Published Feb 2 • 60
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs Paper • 2601.17058 • Published Jan 22 • 190
GutenOCR: A Grounded Vision-Language Front-End for Documents Paper • 2601.14490 • Published Jan 20 • 37
view article Article LightOnOCR-2-1B: a lightweight high-performance end-to-end OCR model family Jan 19 • 88
HAI-DEF Concept Apps Collection Collection of concept apps built around HAI-DEF open models/libraries to inspire the community. Learn more at http://goo.gle/hai-def` • 7 items • Updated 10 days ago • 49
MedGemma Release Collection Collection of Gemma 3 variants for performance on medical text and image comprehension to accelerate building healthcare-based AI applications. • 9 items • Updated 10 days ago • 453
view article Article Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval +1 Mar 22, 2024 • 128