Thanks for the thoughtful comment! For now, I'm of the opinion that SaaS embedding APIs are cheap enough that even a large dataset can be re-vectorised. For example, for the 143k chunks the cost was anywhere from around $6 to $30 (from memory). That's every High Court judgement up to 2023 in Australia. Personally I think of the vectors themselves as essentially disposable, since there are better models coming out every month or so. I know not everyone is of a similar mindset, and for ultimate control you'd definitely want to go local.
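As a purely illustrative back-of-envelope check on that figure: the chunk count is from the thread, but the average tokens per chunk and the per-million-token prices below are hypothetical placeholders, not actual provider pricing.

```python
# Rough re-embedding cost estimate (illustrative numbers only).
def reembedding_cost(num_chunks: int, avg_tokens_per_chunk: int, usd_per_million_tokens: float) -> float:
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

num_chunks = 143_000                       # from the thread
for price in (0.02, 0.13):                 # hypothetical prices, USD per 1M tokens
    for tokens in (500, 1_000):            # hypothetical average chunk length
        cost = reembedding_cost(num_chunks, tokens, price)
        print(f"${cost:.2f} at ${price}/1M tokens, {tokens} tokens/chunk")
```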
Adrian Lucas Malec
adlumal
replied to their post 2 months ago
posted an update 3 months ago
I benchmarked embedding APIs for speed, compared local vs hosted models, and tuned USearch for sub-millisecond retrieval on 143k chunks using only CPU. The post walks through the results, trade-offs, and what I learned about embedding API terms of service.
The main motivation for using USearch is that CPU compute is cheap and easy to scale.
Blog post: https://huggingface.co/blog/adlumal/lightning-fast-vector-search-for-legal-documents
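A minimal sketch of a CPU-only USearch index of the kind described above (the embedding dimension, dtype, and file name here are illustrative assumptions, not settings taken from the blog post):

```python
import numpy as np
from usearch.index import Index

# Stand-in for the chunk embeddings; 1024 dims and f32 are assumptions.
vectors = np.random.rand(143_000, 1024).astype(np.float32)
keys = np.arange(vectors.shape[0], dtype=np.uint64)

# Build the HNSW index on CPU with cosine similarity.
index = Index(ndim=vectors.shape[1], metric="cos", dtype="f32")
index.add(keys, vectors)

# Query: returns the nearest chunk keys and their cosine distances.
query = np.random.rand(1024).astype(np.float32)
matches = index.search(query, 10)
print(matches.keys, matches.distances)

# Persist to disk so the index can be reloaded later without rebuilding.
index.save("chunks.usearch")
```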
reacted to abdurrahmanbutler's post with ❤️ 3 months ago
I am excited to share news of a project my brother, Umar Butler, and I have been working on for what feels like an eternity now.
Introducing MLEB - the Massive Legal Embedding Benchmark.
A suite of 10 high-quality English legal IR datasets, designed by legal experts to set a new standard for comparing embedding models.
Whether you're exploring legal RAG on your home computer, or running enterprise-scale retrieval, apples-to-apples evaluation is crucial. That's why we've open-sourced everything - including our 7 brand-new, hand-crafted retrieval datasets. All of these datasets are now live on Hugging Face.
Any guesses which embedding model leads on legal retrieval?
Hint: it's not OpenAI or Google - they place 7th and 9th on our leaderboard.
To do well on MLEB, embedding models must demonstrate both extensive legal domain knowledge and strong legal reasoning skills.
https://huggingface.co/blog/isaacus/introducing-mleb
posted an update 3 months ago
MLEB is the largest, most diverse, and most comprehensive benchmark for legal text embedding models. https://huggingface.co/blog/isaacus/introducing-mleb