
davidberenstein1957 
posted an update over 1 year ago
We've got a number of great community meetups coming up where we'll discuss the basics of getting started with Argilla for TextCat, TokenCat/NER and RAG. We'll walk you through common scenarios and everything you need to know to get your projects started.

First meetup coming up: Setting up a text classification project using Argilla and SetFit!

- Deploy Argilla on Spaces
- Vibe check your dataset
- Configure and create an Argilla dataset
- Add records
- Add zero-shot suggestions
- Evaluate model suggestions in Argilla
- Train a SetFit model
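The zero-shot suggestion step can be sketched in plain Python. The keyword heuristic below is a hypothetical stand-in for a real zero-shot model such as SetFit, and the record/suggestion field names are illustrative rather than the exact Argilla SDK schema:

```python
# Hypothetical stand-in for a zero-shot classifier; a real setup would
# call a SetFit model or a zero-shot pipeline here.
def zero_shot_suggest(text: str, labels: list[str]) -> str:
    scores = {label: text.lower().count(label.lower()) for label in labels}
    return max(scores, key=scores.get)

def build_records(texts: list[str], labels: list[str]) -> list[dict]:
    # Each record pairs the raw text with a model suggestion that
    # annotators can then accept or correct in the Argilla UI.
    return [
        {"fields": {"text": t},
         "suggestions": [{"question": "label", "value": zero_shot_suggest(t, labels)}]}
        for t in texts
    ]

records = build_records(
    ["great product, would buy again", "terrible support experience"],
    labels=["great", "terrible"],
)
```

Pre-filling suggestions this way is what makes the evaluation step cheap: annotators only confirm or fix labels instead of labelling from scratch.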

Hope to see you all there, and we're looking forward to your questions and AI use cases. Don't be shy about bringing your own issues and questions to the table; we'd love to answer them.

Sign up here: https://lu.ma/31mecp34
davidberenstein1957 
posted an update over 1 year ago
🎉 Exciting News: Argilla 2.2.0 is Here! 🚀

We're thrilled to announce the release of Argilla 2.2.0, packed with powerful new features to enhance your data annotation and LLM workflow:

🗨️ ChatField: Work with text conversations natively in Argilla. Perfect for building datasets for conversational LLMs!
⚙️ Adjustable Task Distribution: Modify settings on the fly and automatically recalculate completed and pending records.
📊 Progress Tracking: Monitor annotation progress directly from the SDK, including user-specific metrics.
🧠 Automatic Settings Inference: Importing datasets from Hugging Face Hub just got easier with automatic settings detection.
📋 Task Templates: Jump-start your projects with pre-built templates for common dataset types.
🔧 Background Jobs Support: Improved performance for long-running tasks (requires Redis).
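The adjustable task distribution can be illustrated with a small sketch. The function below is a hypothetical model of the recalculation, not Argilla's actual implementation: a record counts as completed once it has at least `min_submitted` responses, and changing that threshold re-partitions the records.

```python
def split_by_distribution(response_counts: list[int], min_submitted: int) -> dict:
    # A record is "completed" once it has at least `min_submitted`
    # submitted responses; everything else stays "pending".
    completed = sum(1 for n in response_counts if n >= min_submitted)
    return {"completed": completed, "pending": len(response_counts) - completed}

counts = [0, 1, 2, 3]  # responses per record
print(split_by_distribution(counts, min_submitted=1))  # {'completed': 3, 'pending': 1}
print(split_by_distribution(counts, min_submitted=2))  # raising the bar moves records back to pending
```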

Upgrade now and supercharge your data workflows!

Check out our full changelog for more details: https://github.com/argilla-io/argilla/compare/v2.1.0...v2.2.0
davidberenstein1957 
posted an update over 1 year ago
🧶 We are launching distilabel DataCraft: get started with synthetic data using clicks and natural language!

🌊 Workflow
- Write down your custom GenAI use case
- Automatically generate system prompts
- Create sample datasets for quick iteration
- Produce full-scale datasets with customizable parameters
- Push generated datasets directly to the Hugging Face Hub

⚡️ Powered by Argilla's distilabel and open source LLMs
🆓 Uses Free Serverless HF Inference Endpoints

💡 Use Cases:
- Fine-tuning language models for specific domains
- Creating diverse datasets for robust model training
- Rapid prototyping of AI applications
- Generating synthetic data for privacy-sensitive projects

🚀 Start crafting your custom datasets today and do it quicker, easier and more privately with distilabel DataCraft!
https://huggingface.co/spaces/argilla/distilabel-datacraft
davidberenstein1957 
posted an update over 1 year ago
🦀 Is your SQL a bit rusty? I just created the Text To SQL Hub dataset explorer, which writes SQL queries from natural-language input. It uses DuckDB, Llama 3.1 70B and the Hugging Face dataset-server API.
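The execution half of that flow can be sketched dependency-free. The Space itself runs DuckDB over Hub datasets; the sketch below substitutes Python's stdlib sqlite3 and a toy table, and the SQL string stands in for what the LLM would generate from the natural-language request:

```python
import sqlite3

# Toy table standing in for a Hub dataset loaded via the dataset-server API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT, downloads INTEGER)")
conn.executemany("INSERT INTO models VALUES (?, ?)",
                 [("llama-3.1-70b", 900), ("setfit-base", 120)])

# In the Space, the LLM turns e.g. "most downloaded model" into SQL;
# here the query is written by hand.
generated_sql = "SELECT name FROM models ORDER BY downloads DESC LIMIT 1"
rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('llama-3.1-70b',)]
```

Executing LLM-generated SQL against a read-only, in-process engine like this is also a reasonable safety default: a bad query fails locally instead of touching anything shared.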

davidberenstein1957/text-to-sql-hub-datasets
davidberenstein1957 
posted an update over 1 year ago
Distilabel and synthetic data community interviews - the outcomes

We've been doing interviews with community members to understand the needs surrounding synthetic data. Many thanks to the participants. Note that the interviewees were sourced from our community, so the results will likely reflect that audience.

Things distilabel does well
- security and reliability by caching generations and having serializable pipelines.
- scaling up generation by parallelising inference and using Anyscale Ray
- solid implementations of state-of-the-art research papers

Things to improve
- communication about the fact that we support structured generation
- customising existing prompt implementations is difficult
- creating new tasks proves difficult
- arguments and parameters for tasks aren't visible at first glance
- the learning curve can be steep
- more tutorials that represent real-life usage are needed

Things to note
- people create both small-scale and large-scale datasets, up to millions of records
- people use synthetic data to move away from frontier model providers
- people mostly use 7B or 70B models for generating

Participate here: https://github.com/argilla-io/distilabel/issues
davidberenstein1957 
posted an update over 1 year ago
Interested in learning about everything image?

With the recent rise of interest in Vision Language Models (VLMs), we decided to push to include an ImageField within Argilla! This means any open-source developer can now work on better models for vision ML tasks too, and we would like to show you how.

We would love to introduce this new feature to you, so we've prepared a set of notebooks covering some common image scenarios:
- fine-tune a CLIP retrieval model with Sentence Transformers
- use ColPali + Qwen VL for RAG and log the results to Argilla
- image-generation preference: creating multi-modal preference datasets for free using Hugging Face Inference Endpoints
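The image-generation preference idea boils down to a simple record shape: two candidate images for one prompt, plus a question asking which is better. The dict below is an illustrative sketch of such a record, not the exact Argilla schema:

```python
def preference_record(prompt: str, image_a: str, image_b: str) -> dict:
    # Two candidate generations for the same prompt; the annotator
    # (or a VLM judge) picks the preferred one.
    return {
        "fields": {"prompt": prompt, "image_a": image_a, "image_b": image_b},
        "questions": [{"name": "preference", "type": "label",
                       "labels": ["a", "b", "tie"]}],
    }

rec = preference_record("a red fox in snow", "gen_a.png", "gen_b.png")
print(rec["questions"][0]["labels"])  # ['a', 'b', 'tie']
```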

See you on Thursday!

https://lu.ma/x7id1jqu
davidberenstein1957 
posted an update over 1 year ago
🌟 Argilla v2.1.0 goes multi-modal: Image Field, Dark Mode, Enhanced Hugging Face Hub imports and more!

🖼 Image Field: Seamlessly work with multimodal datasets
🌓 Dark Mode: Reduce eye strain with our sleek new look
🤗 Enhanced Hugging Face Hub import with the SDK
🇪🇸 Spanish UI: Breaking language barriers

Plus more improvements to supercharge your model curation workflow!

Check out the full announcement for details and code examples: https://github.com/argilla-io/argilla/compare/v2.0.1...v2.1.0
davidberenstein1957 
posted an update over 1 year ago
🔥 Dataset Viber 0.3 launches with the Synthesizer: synthesise data with a human in the loop, for free, using open-source models via Argilla's distilabel, all within a quick-and-easy Gradio interface.

Why? You shouldn't need anything fancy or formal just to iterate on your data and get familiar with your prompts and the data they produce. Under the hood, it relies on Hugging Face Inference Endpoints and the latest LLMs and VLMs, like Meta Llama 3.1 and Black Forest Labs' Flux models.

It joins the interfaces that are already supported:
- CollectorInterface: Lazily collect data of model interactions without human annotation.
- AnnotatorInterface: Walk through your data and annotate it with models in the loop.
- Synthesizer: Synthesize data with distilabel in the loop.
- BulkInterface: Explore your data distribution and annotate in bulk.
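The CollectorInterface idea, lazily logging model interactions, can be sketched as a decorator. This is an illustrative stand-in using only the stdlib, not dataset-viber's actual API:

```python
import csv
import functools
import io

log = io.StringIO()  # stands in for a CSV file or Hub dataset
writer = csv.writer(log)

def collect(fn):
    # Wrap any model call and append (input, output) rows as a side effect,
    # so a dataset accumulates without anyone annotating anything.
    @functools.wraps(fn)
    def wrapper(prompt):
        output = fn(prompt)
        writer.writerow([prompt, output])
        return output
    return wrapper

@collect
def model(prompt):  # stand-in for a real LLM call
    return prompt.upper()

model("hello")
model("world")
print(log.getvalue().splitlines())  # ['hello,HELLO', 'world,WORLD']
```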

⭐️ Give some good vibes: https://github.com/davidberenstein1957/dataset-viber
davidberenstein1957 
posted an update over 1 year ago
🆕 🚀 🏎 fast-sentence-transformers - simply, faster, sentence-transformers

- Released an initial version a while ago
- Archived it in favour of a cleaner solution described in a blog post by Philipp Schmid
- Reimplemented it based on that cleaner solution
- Unarchived the project
- Packaged it up
- Released a 0.5 version

pip install fast-sentence-transformers

https://github.com/davidberenstein1957/fast-sentence-transformers
davidberenstein1957 
posted an update over 1 year ago
🎉 Just dropped a fresh version of dataset-viber along with some cool, Gradio-based annotators! These tools aren't about formalities; they're here to help you quickly collect feedback and move your projects along to a more serious stage, ahum @argilla.

Some new features!
- manual import from a CSV or the Hugging Face Hub
- manual export to CSV or the Hub
- improved automated export to the Hub and CSV
- limit interaction with specific components
- stream data with custom next_input features (shout-out to Ben Burtenshaw for the suggestions)
- model in-the-loop support for all tasks

dataset-viber/gradio-annotators-66c5ce73d5e3bf99caa445b1
davidberenstein1957 
posted an update over 1 year ago
🚀 We will be generating a preference dataset for DPO/ORPO and cleaning it with AI feedback during our upcoming meetup!

In this session, we'll walk you through the essentials of building a distilabel pipeline by exploring two key use cases: cleaning an existing dataset and generating a preference dataset for DPO/ORPO. You’ll also learn how to make the most of AI feedback, integrating Argilla to gather human feedback and improve the overall data quality.

This session is perfect for you:
- if you’re getting started with distilabel or synthetic data
- if you want to learn how to use LLM inference endpoints for **free**
- if you want to discover new functionalities
- if you want to provide us with new feedback
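A DPO/ORPO preference dataset boils down to (prompt, chosen, rejected) triples. The sketch below builds one from scored candidate generations; the scores are a hypothetical stand-in for AI feedback from an LLM judge:

```python
def to_preference_pair(prompt: str, candidates: list[tuple[str, float]]) -> dict:
    # Rank candidate responses by their (judge-assigned) score and keep
    # the best as "chosen" and the worst as "rejected": the record shape
    # DPO/ORPO trainers expect.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}

pair = to_preference_pair(
    "Explain overfitting in one sentence.",
    [("A model memorises noise instead of learning the signal.", 0.9),
     ("Overfitting is bad.", 0.3)],
)
print(pair["chosen"])
```

In a full pipeline, low-scoring pairs (or pairs where the judge is uncertain) are the natural candidates for routing to Argilla for human review.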

Sign up here: https://lu.ma/dt0c7jru
davidberenstein1957 
posted an update over 1 year ago
📣 Introducing Dataset Viber: your chill repo for data collection, annotation and vibe checks! 🎉

I've cooked up Dataset Viber, a set of cool tools designed to make data preparation for AI models easier, more approachable and enjoyable for standalone AI engineers and enthusiasts.

🔧 What Dataset Viber offers:
- CollectorInterface: Lazily collect model interaction data without human annotation
- AnnotatorInterface: Annotate your data with models in the loop
- BulkInterface: Explore data distribution and annotate in bulk
- Embedder: Efficiently embed data with ONNX-optimized speeds

🎯 Key features:
- Supports various tasks for text, chat, and image modalities
- Runs in .ipynb notebooks
- Logs data to local CSV or directly to Hugging Face Hub
- Easy to install via pip: pip install dataset-viber

It's not designed for team collaboration or production use, but rather as a fun and efficient toolkit for individual projects.
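The model-in-the-loop flow of the AnnotatorInterface can be sketched without the UI. The pre-seeded decisions below stand in for a human clicking accept or typing a correction; all names are illustrative, not dataset-viber's API:

```python
def annotate(texts, suggest, decide):
    # For each text the model proposes a label; the human either accepts
    # it or supplies a correction. The suggestion makes accepting cheap.
    annotated = []
    for text in texts:
        suggestion = suggest(text)
        label = decide(text, suggestion)  # human in the loop
        annotated.append({"text": text, "label": label})
    return annotated

suggest = lambda t: "question" if t.endswith("?") else "statement"
decisions = {"Is it raining?": None, "It rains a lot": "complaint"}  # None = accept
decide = lambda t, s: decisions[t] or s

data = annotate(["Is it raining?", "It rains a lot"], suggest, decide)
print([d["label"] for d in data])  # ['question', 'complaint']
```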

Want to give it a try? Check out the repository link https://github.com/davidberenstein1957/dataset-viber/.

I'm excited to hear your feedback and learn how you vibe with your data. Feel free to open an issue or reach out if you have any questions or suggestions!

Some shoutouts:
- Gradio for the amazing backbone
- Daniel van Strien for some initial presentations I did on vibe checks
- Emily Omier for the workshop on structuring GitHub repo READMEs
- Hamel Husain for repeatedly reminding people to look at their data
- Philipp Schmid for his code for ONNX feature-extractors
- Ben Burtenshaw for the first PR