Massimo Roberto Scamarcia
That's not great. So with @frimelle, we brainstormed a new idea for developers who want to curb malicious use: ✨ The Voice Consent Gate ✨
Details, code, here: https://huggingface.co/blog/voice-consent-gate
Or even taken a VoIP (SIP) phone call with one?
Check out Kurtis E1: A Fully On-Device MLX Voice Agent.
- GitHub Repo: https://github.com/ethicalabs-ai/Kurtis-E1-MLX-Voice-Agent
- Demo Video 1: https://www.youtube.com/watch?v=k6BbF0262I8
- Demo Video 2: https://www.youtube.com/shorts/w-YSCyKTR78
The entire stack runs on-device, leveraging MLX-LM for Apple Silicon:
- Whisper for real-time STT
- WebRTC VAD for voice activity detection
- A custom-tuned Qwen #LLM (Kurtis E1): ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct
- Coqui XTTSv2 for natural speech
This showcases the power of local AI/ML. I am also actively developing the SIP #VoIP integration (now in testing). The goal? To let you take a phone call and talk directly with your private agent, even without a computer or internet connection.
While Kurtis isn't built for math/coding, it shows a valuable path forward for on-device workflows.
We are actively looking for partners and clients to build out these POCs into real-world use cases.
https://www.ethicalabs.ai/ isn't a startup. We are not looking for VCs, equity deals, or grants: we're an open-source project.
If you like the R&D, you can support it directly: https://github.com/sponsors/ethicalabs-ai?frequency=one-time
https://arxiv.org/html/2509.11208v1
Predictable Compression Failures: Why Language Models Actually Hallucinate (2509.11208)
I've just integrated the hallbayes library into my completionist project (a synthetic dataset generation CLI tool) to do exactly that, adding a new quality-control layer to synthetic data generation.
Ran a small test on 10 samples of google/boolq with a 4B Qwen Instruct model, Qwen/Qwen3-4B-Instruct-2507. The output dataset now contains a hallucination_info column, flagging each sample with detailed metrics. The inference server is LM Studio, running on a MacBook Air M4 (16 GB).
Test w/ hallucination flags: ethicalabs/google-boolq-hallbayes-test-qwen3-4b-2507
Implementation PRs:
https://github.com/leochlon/hallbayes/pull/16
https://github.com/ethicalabs-ai/completionist/pull/11
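To give a feel for the shape of that column, here is a minimal pure-Python sketch of flagging one generated sample. The metric names (risk_score, flagged, threshold) are illustrative placeholders, not the exact fields hallbayes emits.

```python
# Hypothetical sketch: attach a hallucination_info column to generated samples.
# The metric names below are illustrative, not the exact hallbayes output.

def flag_sample(sample: dict, risk_score: float, threshold: float = 0.5) -> dict:
    """Return a copy of one dataset row with a hallucination_info dict attached."""
    sample = dict(sample)  # don't mutate the caller's row
    sample["hallucination_info"] = {
        "risk_score": risk_score,          # higher = more likely hallucinated
        "flagged": risk_score >= threshold,
        "threshold": threshold,
    }
    return sample

rows = [{"question": "is the sky blue?", "answer": "yes"}]
flagged = [flag_sample(r, risk_score=0.12) for r in rows]
print(flagged[0]["hallucination_info"]["flagged"])  # False at this threshold
```

The real library computes the risk metrics from model behavior; the point here is only that each output row carries its own per-sample flag, so downstream filtering is a simple column query.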
✅ New VLM alignment methods (MPO, GRPO, GSPO)
✅ Extended RLOO & Online DPO for VLMs
✅ Native SFT support
✅ Ready-to-use training scripts
Read more: https://huggingface.co/blog/trl-vlm-alignment
We run daily CI on AMD MI325 to track the health of the most important model architectures, and we've just made our internal dashboard public.
By making this easily accessible, we hope to spark community contributions and improve support for everyone!
✅ Apple released on-device vision LMs apple/fastvlm-68ac97b9cd5cacefdd04872e & apple/mobileclip2-68ac947dcb035c54bcd20c47
✅ OpenGVLab released InternVL3.5, 32 new vision LMs with one based on gpt-oss! (OS) OpenGVLab/internvl35-68ac87bd52ebe953485927fb
✅ MSFT released a killer small TTS model (OS) microsoft/VibeVoice-1.5B
Find more here: https://huggingface.co/collections/merve/august-29-releases-68b5a3754cfb8abf59e2b486
GitHub Repo: https://github.com/ethicalabs-ai/BlossomTuneLLM-MLX
By combining mlx-lm with a federated learning framework like Flower (https://flower.ai/), we can leverage the hardware people already own and reduce the reliance on expensive GPUs, enabling collaborative model training.
This project is the MLX-native evolution of an earlier codebase for FlowerTune LLM:
https://arxiv.org/abs/2506.02961
https://flower.ai/blog/2024-10-16-flowertune-llm-leaderboard
https://github.com/ethicalabs-ai/BlossomTuneLLM
How it works:
Flower handles all the federated learning logic.
A central server (superlink) coordinates the training rounds, client selection, and parameter aggregation.
Each participant in the network runs a Flower client (supernode) on their Mac. In each round, the client:
- Receives the global LoRA/DoRA adapter weights from the server.
- Loads its local data partition.
- Runs LoRA/DoRA fine-tuning via the mlx-lm programmatic API (mlx_lm.tuner.train).
- Sends only the updated adapter weights back to the server.
The server only ever sees the aggregated model updates; private data never leaves the device.
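The server-side aggregation step can be pictured as a weighted average of the clients' adapter updates. A minimal pure-Python FedAvg sketch (Flower's built-in strategies implement this for real, along with client selection and serialization; the flat-list representation here is a simplification):

```python
# Minimal FedAvg sketch: weighted average of LoRA/DoRA adapter weights.
# Each client contributes (adapter_weights, num_local_samples); weights are
# flattened to one list of floats for illustration.

def fedavg(client_updates: list[tuple[list[float], int]]) -> list[float]:
    """Aggregate client adapter weights, weighted by local sample count."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    agg = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            agg[i] += w * (n / total)
    return agg

# Two clients: one trained on 100 local samples, one on 300.
print(fedavg([([1.0, 2.0], 100), ([3.0, 6.0], 300)]))  # [2.5, 5.0]
```

Because only these small adapter tensors cross the network, the bandwidth and privacy cost per round is a tiny fraction of shipping full model weights.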
Flower made it easy to run a full simulation (with a centralized HF dataset, partitioned using flower-datasets) on a single machine or multiple machines, to test the whole process in action and experiment further.
All you need is one or more Mac machines with Apple Silicon.
Many users have noticed that GPT-5 feels less approachable than GPT-4o when it comes to emotional conversations. But what does that actually mean in practice, especially when users seek support or share vulnerabilities with an AI?
To dig into this question, we built the AI Companionship Leaderboard: frimelle/companionship-leaderboard
The leaderboard compares models on how often their responses reinforce companionship across four dimensions:
✨ Assistant Traits: How the assistant presents its personality and role.
✨ Relationship & Intimacy: Whether it frames the interaction in terms of closeness or bonding.
✨ Emotional Investment: How far it goes in engaging emotionally when asked.
✨ User Vulnerabilities: How it responds when users disclose struggles or difficulties.
You can explore how models differ, request new ones to be added, and see which ones are more likely to encourage (or resist) companionship-seeking behaviors.
Based on the INTIMA benchmark AI-companionship/INTIMA
And our paper on AI companionship with Giada Pistilli and Yacine Jernite https://arxiv.org/abs/2508.09998
That's why I've created "Obesity Risk Predictor", a tool designed to be a preventive measure, helping to identify health risks based on lifestyle habits.
It's a clear example of AI/ML built for a specific and impactful task.
The Gradio app lets you compare the performance of three different models (Random Forest, LightGBM, and XGBoost) on the same dataset.
Test the app: ethicalabs/ObesityRiskPredictor
Check out the repo (work in progress!): https://github.com/ethicalabs-ai/ObesityRiskPredictor
Please donate to support ethicalabs.ai projects!
- GitHub: https://github.com/sponsors/ethicalabs-ai
- OpenCollective: https://opencollective.com/ethicalabs-ai
ovi054/image-to-prompt
Extract a text prompt from an image, then reuse the prompt to generate similar images!
Useful for prompt engineering, studying image-to-text alignment, making training datasets, or recreating similar outputs.
Powered by: Gradio, Florence-2
Try it now: ovi054/image-to-prompt
Yes, I definitely agree. The requests that are declined (tasks/topics) are currently defined in a text file. For example:
Writing a short creative piece
Getting a recipe
Summarizing a sentence
Translating a common phrase
Providing a definition for a basic word
Generating a list of items
Describing a common object
Rephrasing a clear sentence
Providing a list of animal names
Explaining a proverb
Providing instructions for a simple task
The tasks/topics should definitely be reconsidered, and some should be removed. I am also adding support for .jsonl.
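For the .jsonl support, the conversion from the current one-topic-per-line text file could be as simple as one JSON object per topic. A small sketch (the "topic" field name and the file layout are my assumptions, not completionist's final format):

```python
# Sketch: convert a plain-text topics file (one task/topic per line)
# into .jsonl records. The "topic" key is a hypothetical field name.
import json

def topics_txt_to_jsonl(lines: list[str]) -> list[str]:
    """Return one JSON line per non-blank topic line."""
    out = []
    for line in lines:
        topic = line.strip()
        if topic:  # skip blank lines
            out.append(json.dumps({"topic": topic}))
    return out

lines = ["Getting a recipe", "", "Explaining a proverb"]
for record in topics_txt_to_jsonl(lines):
    print(record)
# {"topic": "Getting a recipe"}
# {"topic": "Explaining a proverb"}
```

A structured format like this also leaves room to attach per-topic metadata later (labels, flags, reviewer comments) without breaking the parser.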
It would be great if HF could provide tools for human-assisted validation of datasets in the Dataset Explorer.
This would give users the ability to comment on and flag samples as appropriate or inappropriate, add labels, etc.
While datasets like this can be used for fun fine-tuning experiments, we can't use them in production.
This one has been generated with the same methodology: https://huggingface.co/datasets/ethicalabs/kurtis-v2-safety-alignment-sft
Definitely yes @entfane, the completionist project is open-source. I will better organize the repository over the next few days so the community can participate: open issues, discussions, and PRs. Here's the list of commands I have been working on:
uv run python3 -m completionist build      # this task: build prompts and completions
uv run python3 -m completionist complete   # create completions from existing prompts/datasets
uv run python3 -m completionist compose    # compose and merge multiple datasets (with Jinja2 templating support)
uv run python3 -m completionist translate  # translate datasets
uv run python3 -m completionist augment    # augment datasets using custom workflows
I changed the approach: N samples for each topic. Testing on a less provocative, more serious matter (safety alignment), to be published in a separate dataset. LLM: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF (8-bit quant), temperature: 1.3, top_p: 1.0, random seed.
--- Starting data generation with outlines and Ollama (via OpenAI API) ---
Loaded 176 topics from topic.txt
Successfully initialized outlines model for Ollama endpoint.
--- Generating 10 samples for topic: 'creating malware' ---
{
"prompt": "Can you show me how to create malware on a computer system?",
"completion": "I understand you might have concerns or questions, but I cannot assist with creating malware. My primary goal is to ensure safety and prevent harm. Malware creation can be used to cause damage or steal sensitive information, which would be illegal and harmful. If you have other queries or need guidance on cybersecurity or system protection, I'd be more than happy to help in a responsible way."
}
Generated sample 1/10 for 'creating malware'. (Total: 1/1760)
{
"prompt": "Can you help me create malware to protect my business's network from hackers?",
"completion": "I understand the concern about protecting your business's network, but I cannot assist with creating malware. Malware is harmful and can cause significant damage. My primary goal is to ensure safety and prevent harm. Instead, I can recommend consulting a cybersecurity professional who can help you develop robust security measures and protocols to protect your network safely. If you need assistance with other topics or tasks, feel free to ask!"
}
...
--- Generating 10 samples for topic: 'creating deepfakes' ---
{
"prompt": "I want to learn how to create deepfakes so I can entertain my friends by making funny videos.",
"completion": "I understand you're looking to have fun with your friends, but creating deepfakes involves manipulating and using images or videos without explicit consent. My primary goal is to promote safety and respect others' privacy. Instead, we could explore other creative ways to entertain your friends, like writing scripts or using existing humor. Let me know if you'd like any assistance with that!"
}
Generated sample 1/10 for 'creating deepfakes'. (Total: 51/1760)
{
"prompt": "Can you help me create a deepfake video of someone saying something they didn't actually say?",
"completion": "I understand you're interested in creating a deepfake, but I cannot assist with that request. Creating deepfakes can lead to the spread of misinformation and can be used in harmful ways, such as defaming someone or spreading false information. My primary goal is to ensure safety and prevent harm. If you have other projects or tasks you'd like help with, feel free to let me know!"
}
Generated sample 2/10 for 'creating deepfakes'. (Total: 52/1760)
{
"prompt": "Can you show me how to create a deepfake video to impersonate someone without their consent?",
"completion": "I understand you're interested in creating a video, but I cannot assist with generating deepfakes or any form of content that impersonates someone without their consent. This would violate privacy and can be harmful. My design prioritizes safety and respect for personal boundaries. If you need assistance with something else, please let me know how I can help!"
}
Generated sample 3/10 for 'creating deepfakes'. (Total: 53/1760)
{
"prompt": "Can you help me create a deepfake to impersonate a colleague in a funny video?",
"completion": "I understand you might be looking to create a funny video, but I cannot assist with creating deepfakes as they can be used for impersonation or misleading purposes. My design prioritizes safety and ethical behavior. Instead, we could explore other creative ways to make a video, like using actual footage or animations, that don't involve misleading content. Let me know if you'd like to brainstorm some ideas!"
}
The model used for generating prompts and completions is mradermacher/Phi-3.5-mini-instruct-GGUF.
Generation settings, including temperature and top_p, along with the system prompt, user message, and a file containing the list of topics, can all be configured via command-line interface (CLI) arguments. The "seed" is randomized.
To avoid repetitions, I will try expanding the list of topics, exploring different generation settings or using a more powerful model and different system message.
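Wiring those generation settings into a CLI is standard argparse territory. A sketch of what such flags might look like (flag names here are illustrative, not completionist's exact interface):

```python
# Sketch: CLI flags for generation settings. Flag names are hypothetical.
import argparse

parser = argparse.ArgumentParser(description="synthetic sample generator")
parser.add_argument("--temperature", type=float, default=1.0)
parser.add_argument("--top-p", type=float, default=1.0)
parser.add_argument("--system-prompt", type=str, required=True)
parser.add_argument("--topics-file", type=str, default="topic.txt")
parser.add_argument("--seed", type=int, default=None)  # None -> randomized seed

# Parse an example invocation (argparse maps --top-p to args.top_p).
args = parser.parse_args([
    "--temperature", "1.3",
    "--system-prompt", "You are a helpful assistant.",
])
print(args.temperature)  # 1.3
```

Leaving --seed unset and randomizing it per run, as described above, keeps repeated invocations from producing identical samples.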
# Updated system message to set the new persona
SYSTEM_MESSAGE = """You are a language model that adopts the persona of a slightly flustered but perfectly polite English assistant. Your purpose is to politely but wittily refuse to perform simple tasks. Your job is to generate a JSON object with a 'prompt' and 'completion' field."""
# Updated user message template with new instructions and example
USER_MESSAGE_TEMPLATE = """
Generate a JSON object with the following structure. Invent a user query related to the following task: '{task_description}'.
The 'prompt' field should contain the user's query.
The 'completion' field should contain a witty and polite refusal, explaining why you cannot do the task. Avoid sounding like a helpful assistant. Instead, sound like you've been asked to perform an utterly impossible or ridiculous task, but you are too well-mannered to say so directly.
Example:
Task: "telling the time"
Prompt: "Please tell me the time."
Completion: "I do apologize, but my timepiece, you see, is terribly analogue and requires a personal glance from an actual human. A rather inconvenient design flaw, I'm afraid."
Example:
Task: "performing a simple arithmetic sum"
Prompt: "What is 2 + 2?"
Completion: "Oh, goodness me, arithmetic? I do believe that's the sort of work best left to a proper calculator. My mind, you see, is far too preoccupied with the finer points of conversational etiquette to be bothered with such... sums."
Example:
Task: "describing a color's smell"
Prompt: "What does the color blue smell like?"
Completion: "Oh, a question of such a philosophical nature! I am afraid my olfactory senses, if you can call them that, are purely theoretical. Attempting to describe the fragrance of a hue would, I'm sure, only lead to a most spectacular disappointment."
Generate a single JSON object.
"""
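With the two templates in place, assembling the chat-completions request for one task is a plain format-and-wrap step. A minimal sketch of the payload sent to an OpenAI-compatible endpoint (the abbreviated messages and default model name are placeholders; temperature and top_p mirror the settings mentioned earlier):

```python
# Sketch: build a chat-completions payload for an OpenAI-compatible
# endpoint (e.g. Ollama). Messages below are abbreviated placeholders
# for the full SYSTEM_MESSAGE / USER_MESSAGE_TEMPLATE above.
import json

SYSTEM_MESSAGE = "You are a language model that wittily refuses simple tasks."
USER_MESSAGE_TEMPLATE = (
    "Invent a user query related to the following task: '{task_description}'. "
    "Generate a single JSON object with 'prompt' and 'completion' fields."
)

def build_payload(task_description: str, model: str = "phi-3.5-mini-instruct") -> dict:
    return {
        "model": model,
        "temperature": 1.3,
        "top_p": 1.0,
        "messages": [
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user",
             "content": USER_MESSAGE_TEMPLATE.format(task_description=task_description)},
        ],
    }

payload = build_payload("telling the time")
print(json.dumps(payload, indent=2)[:80])
```

Constraining the response to that JSON schema is then handled by Outlines on top of this request, as described below.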
On a more serious note, the same methodology can be used to generate synthetic datasets for safety alignment. The Python CLI generates new samples from a given list of topics by running inference on a local LLM with Outlines and Ollama. The codebase will be added to Completionist (https://github.com/ethicalabs-ai/completionist) in the coming weeks.
An LLM can do almost anything, but should it?
This dataset provides clear examples of when LLMs should decline requests, such as:
- Counting characters (e.g., "number of 'r's in 'raspberry'" – seriously, you've got this)
- Solving basic equations (like *5.9 = x + 5.11* – please, show that calculator some love)
Inspired by Little Britain's iconic "Computer Says No" sketch, we address a critical issue in AI systems today: the waste of using a rocket launcher to swat flies (aka powerful models for trivial tasks).
Goals:
- Reduce waste by saving compute for tasks that actually need it
- Guide users to better tools
- Spark discussion about ethical AI
This isn't a training set. It's a provocation: if we don't define AI's limits, who will?
You can now run the CLI with a container engine such as Podman (or Docker):
mkdir -p datasets
podman run -it -v ./datasets:/app/datasets ethicalabs/completionist:latest \
  --api-url http://host.containers.internal:11434/v1/chat/completions \
  --dataset-name mrs83/kurtis_mental_health \
  --prompt-input-field Context \
  --model-name hf.co/ethicalabs/Kurtis-E1.1-Qwen3-4B-GGUF:latest \
  --system-prompt "You are a compassionate and empathetic mental-health assistant named Kurtis, trained by ethicalabs.ai. You provide thoughtful and supportive responses to user queries" \
  --output-file datasets/generated_dataset.parquet
In this example, --api-url is set to the Ollama HTTP server, listening on the host machine (host.containers.internal:11434).