Making Model Tuning Accessible: This is what we built after observing thousands of users tune models!
tl;dr: If a novice nails it in a few shots, your tool nailed ease of use!
From an ease-of-use perspective, we analyzed thousands of tuning jobs across a diverse user base. Here’s what we found:
- Not all users are tuning experts. Many arrive with a model, clean datasets, and a benchmark—but lack deep knowledge of fine-tuning.
- Tuning stacks evolve rapidly. Frequent releases introduce new optimizations and features, adding more user-facing knobs.
- Power vs. complexity trade-off. While these knobs give users control, they also overwhelm them. Each knob comes with a learning curve.
- Tedious configuration. Finding the right combination for a given model and dataset is hard. Users often end up with sub-optimal configs.
- The try-fix-retry loop. Many users fall into an exhausting cycle of trial and error until they land on “something” that works.
This complexity creates friction, slows down experimentation, and limits adoption. The top three misconfigurations we observed were CUDA out-of-memory errors, incorrect data configurations, and missing kernel optimizations that could have significantly boosted training efficiency. Users want to focus on outcomes (better models) without drowning in configuration details. With these learnings in mind, we set out to solve the problem with the Tuning Config Recommender tool!
Supercharging Fine-Tuning with Tuning Config Recommender
Tuning Config Recommender is part of the Foundation Model Stack (FMS) ecosystem and is available here. It's built with the following goals in mind:
Rule-Based Flexibility: Enable subject matter experts to easily add rules that operate on existing parameters, models, and datasets to output optimal configurations.
Knowledge-Driven Recommendations: Allow seamless ingestion of model- or data-specific knowledge into the tool by updating its knowledge base. This knowledge is leveraged during the recommendation flow for better accuracy.
Future-Ready Extensibility: As the stack evolves, ensure the tool can be easily extended to support new parameters without disrupting existing workflows.
Minimal Input, Maximum Output: Given as little as a model name and a set of datasets from the user, generate complete optimal tuning and data configurations that can be directly applied. Furthermore, partial or sub-optimal configurations from the user are incrementally corrected.
Explainability and Transparency: Provide textual descriptions and reasoning behind recommendations, along with a file delta view for easy comparison and application.
Design
Concepts
Intermediate Representation (IR)
The IR is the standard format that all input formats are converted to, so that the rule engine can operate on the IR without worrying about the exact input or output formats, which can differ from one stack to another. One caveat: not all arguments in the IR are strictly defined, in the sense that rules may add new arguments, and neither the IR nor the rule engine is restrictive about this. It's up to each adapter whether to consume such arguments for its target format.
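To make this concrete, here is a minimal sketch of what an IR payload might look like. The field names below are our own illustration, not the tool's exact schema:

```python
# Illustrative IR payload; field names are hypothetical, the real schema
# lives in the tuning-config-recommender repository.
ir = {
    "model_name_or_path": "ibm-granite/granite-4.0-h-350m",
    "datasets": [{"path": "tatsu-lab/alpaca", "format": None}],
    "training": {
        "per_device_train_batch_size": 8,
        "max_seq_length": 4096,
        "learning_rate": None,
    },
    # Rules may attach arguments that are not strictly defined up front;
    # adapters decide whether to consume them for the target format.
    "extras": {},
}
```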
Recommender Rule (RR)
A recommender rule takes the IR in its current state, performs some heuristics, and constructs a new IR object that the rule engine uses as a JSON Merge Patch. The returned object can also carry information about the patch, such as severity, type, and natural-language comments. As shown in the architecture, an RR is called multiple times by the rule engine until it explicitly signals a skip. Deciding when to skip is the responsibility of the RR, typically a heuristic based on the state of the IR at the time it is called. Some examples can be seen here.
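As a rough illustration, a rule might look like the following. The function signature, field names, and threshold are assumptions for the sketch, not the repository's exact interface:

```python
# Hypothetical rule sketch: cap an overly large max_seq_length.
def cap_sequence_length(ir: dict) -> dict | None:
    max_len = ir["training"].get("max_seq_length") or 0
    if max_len <= 8192:
        return None  # nothing to patch; the engine treats this as a "skip"
    # JSON Merge Patch fragment plus metadata describing the change.
    return {
        "patch": {"training": {"max_seq_length": 8192}},
        "severity": "warning",
        "comment": "max_seq_length lowered to 8192 to reduce OOM risk",
    }
```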
Rule Engine
The rule engine passes the IR across rules in the sequence they are defined and collects their JSON Merge Patches, which are then applied to the IR. This process is repeated until all rules signal a skip. Finally, JSON Patches (distinct from Merge Patches) with respect to the original IR are prepared, preserving the metadata (comments, etc.) attached to each patch, and handed to the adapters along with the final IR. In edge cases where rule outputs conflict on some parameters, rules with higher priority take precedence. This priority is pre-defined in the rule engine and can be updated if the user wishes.
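The fixed-point loop can be pictured roughly as below. This is a simplified sketch that assumes the hypothetical rule shape from the previous example and omits patch metadata, conflict resolution, and JSON Patch generation:

```python
import copy

def merge_patch(target: dict, patch: dict) -> dict:
    """Tiny JSON-Merge-Patch-style helper (simplified: no null deletion)."""
    out = dict(target)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge_patch(out[key], value)
        else:
            out[key] = value
    return out

def run_rules(ir: dict, rules: list) -> dict:
    """Iterate over rules until every rule skips (returns None) in a full pass."""
    current = copy.deepcopy(ir)
    changed = True
    while changed:
        changed = False
        for rule in rules:  # rules run in definition order
            result = rule(current)
            if result is None:
                continue  # rule skipped at this IR state
            current = merge_patch(current, result["patch"])
            changed = True
    return current
```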
Adapter
An adapter converts a source format into the IR, and consumes the final IR and JSON Patches as needed to produce the target format. Adapters can be found here.
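A hedged sketch of an adapter's responsibilities is shown below; the class name, method names, and IR fields are illustrative and reuse the hypothetical IR shape from above:

```python
# Illustrative adapter shape; the real adapters live in the repository.
class ExampleCLIAdapter:
    """Converts CLI-style arguments to the IR, and the final IR to a command."""

    def to_ir(self, cli_args: dict) -> dict:
        return {
            "model_name_or_path": cli_args["model_name_or_path"],
            "datasets": [{"path": cli_args["training_data_path"], "format": None}],
            "training": {
                "per_device_train_batch_size": cli_args.get("batch_size", 8),
                "max_seq_length": cli_args.get("max_seq_length", 4096),
                "learning_rate": None,
            },
            "extras": {},
        }

    def to_command(self, final_ir: dict) -> str:
        # The adapter decides which IR fields (including rule-added extras)
        # it consumes when producing the target format.
        return (
            "accelerate launch -m tuning.sft_trainer "
            f"--model_name_or_path {final_ir['model_name_or_path']} "
            f"--max_seq_length {final_ir['training']['max_seq_length']}"
        )
```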
Understanding the landscape of fine-tuning config parameters
Fine-tuning stacks come with loads of config parameters; for instance, the Hugging Face Trainer exposes 120+ parameters, while custom training stacks like fms-hf-tuning with accelerations expose even more. We bucket these parameters into the following categories based on how they impact a training run:
- Critical, e.g. data pre-processing arguments based on the type of data (chat, QA, etc.)
- Functional, e.g. learning rate
- Non-functional (optimization, logging, etc.), e.g. applying op kernels such as the RMSNorm kernel
Critical parameters are those that, if missing or misconfigured, will cause a run-time failure in the training workflow. For example, incorrect or absent data preprocessing can prevent the trainer from consuming batches of samples, leading to an immediate failure. Functional parameters, while not mandatory for execution, govern the semantics of training. A misconfigured learning rate, for instance, won’t crash the workflow but can severely impact loss convergence and overall model performance. Finally, non-functional parameters—such as optimization flags (e.g., replacing native PyTorch ops with custom kernels) or logging configurations like logging_steps—do not affect training semantics. However, they influence non-functional aspects such as efficiency, observability and others. We acknowledge that some parameters may span multiple categories, depending on context.
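As a simplified illustration of this bucketing, a few well-known parameters might be categorized as follows; the names and assignments here are examples only, and, as noted above, membership can vary by context:

```python
# Simplified, illustrative bucketing of a few parameters.
PARAMETER_CATEGORIES = {
    "data pre-processing config": "critical",       # wrong/missing config fails the run
    "learning_rate": "functional",                   # affects convergence, not execution
    "logging_steps": "non-functional",               # observability only
    "kernel replacement flags": "non-functional",    # efficiency only
}
```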
Top 3 Failures
While the recommender can generate configs that address a variety of issues, below we focus on the top three failures we observed, each of which could have been mitigated with the right recommendations.
Overcoming CUDA out of memory (OOM)
Proactively identifying CUDA out-of-memory (OOM) errors is a challenging problem. To address this, we’ve designed a decision flow that iteratively adjusts the batch size until the configuration falls within safe bounds, minimizing the risk of OOM. The memory estimator module can be our in-house estimator or a Hugging Face accelerate estimator. While this approach isn’t entirely foolproof, it effectively handles most scenarios where sequence lengths and batch sizes are set excessively high. We encapsulate this logic as a recommender rule in our system here.
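In spirit, the decision flow resembles the loop below. This is a minimal sketch; the estimator callable is a stand-in for the memory estimator module (in-house or Hugging Face accelerate based), and the halving strategy and safety margin are illustrative assumptions:

```python
def fit_batch_size(estimate_peak_mem_gb, batch_size: int, seq_len: int,
                   gpu_mem_gb: float, safety_margin: float = 0.9) -> int:
    """Halve the batch size until the estimated peak memory fits within bounds.

    `estimate_peak_mem_gb(batch_size, seq_len)` is a stand-in for the
    memory estimator module; `safety_margin` keeps headroom below GPU capacity.
    """
    while (batch_size > 1 and
           estimate_peak_mem_gb(batch_size, seq_len) > gpu_mem_gb * safety_margin):
        batch_size //= 2
    return batch_size
```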
Incorrect data pre-processing config
Language models are commonly pre-trained and optionally instruction-tuned in a few popular formats. Converting fine-tuning data into one of these formats can improve model quality and delivers benefits during downstream inference when the same format is preserved. Common formats include chat templates, question–answer pairs, and instruction-based structures. More details on data pre-processing for fine-tuning workloads through fms-hf-tuning can be read here; that blog captures one possible way to define the pre-processing flow as a YAML config file. Our rules are based on this format but can be configured for custom data config formats. For each dataset, we apply heuristics to determine the most relevant format, identify the appropriate columns and their names, and then generate a preprocessing configuration. This configuration converts the data into the required format, followed by tokenization and masking for training. These flows are encoded as data rules here.
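A toy version of such a heuristic might look like this; the column names and format labels are assumptions for illustration, and the actual data rules are considerably richer:

```python
def guess_dataset_format(column_names: list[str]) -> str:
    """Toy heuristic: infer a likely pre-processing format from column names."""
    cols = {c.lower() for c in column_names}
    if "messages" in cols or "conversations" in cols:
        return "chat_template"
    if {"question", "answer"} <= cols:
        return "question_answer"
    if {"instruction", "output"} <= cols:
        return "instruction"
    return "custom"  # fall back to user-provided or custom pre-processing
```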
Missing out tuning optimizations
Tuning optimization parameters are often non-functional and optional for the user. However, these optimizations can provide benefits that can't be ignored, given the expense of GPU time and time-constrained tuning goals. We have therefore included multiple recommender rules that run heuristics over the given model and setup to apply relevant optimizations. For instance, if the model is one of the supported architectures, we automatically recommend the flag to apply kernel replacements.
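Such a rule can be as simple as the sketch below; the architecture list, field names, and flag name are placeholders following the hypothetical rule shape from the Design section, not the exact values used in the repository:

```python
# Hypothetical optimization rule: recommend kernel replacements for
# architectures assumed to support them. Names are placeholders.
SUPPORTED_ARCHS = {"LlamaForCausalLM", "GraniteForCausalLM"}

def recommend_kernel_flags(ir: dict) -> dict | None:
    arch = ir.get("extras", {}).get("model_architecture")
    if arch not in SUPPORTED_ARCHS or ir.get("extras", {}).get("fast_kernels"):
        return None  # skip: unsupported architecture or already enabled
    return {
        "patch": {"extras": {"fast_kernels": True}},
        "severity": "info",
        "comment": f"{arch} supports fused kernel replacements; enabling them.",
    }
```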
FMS Recommender: Integrating Tuning Config Recommender into FMS-HF-Tuning
To demonstrate how the Tuning Config Recommender can simplify real-world workflows, we integrated it into the fms-hf-tuning stack through a lightweight wrapper script called fms-recommender.py. This integration shows how the recommender’s architecture—built on IR, rules, and adapters—can plug into an existing fine-tuning stack without disrupting its workflow.
The wrapper has two straightforward modes designed for fast iteration:
Preview mode (--preview)
- Shows the recommended command and the explanations/diffs produced by the recommender.
- Does not run the command—ideal for CI checks, PR reviews, or quick sanity passes.
Execute mode (default, no --preview)
- Runs the recommended command immediately.
- Best when you’re confident in the inputs and want to kick off the job right away.
⚠️ Note: If you omit --preview, the script will execute the suggested command.
How It Works
The wrapper acts as a bridge between user intent and the recommender engine (a rough code sketch of this flow follows the steps below):
- User Input: The CLI accepts familiar arguments like --model_name_or_path, --training_data_path, and --tuning_strategy.
- Adapter Conversion: The FMSAdapter converts these inputs into the recommender’s Intermediate Representation (IR).
- Rule Engine Processing: The recommender applies heuristics and rules to generate JSON patches that fix errors and optimize settings.
- Command Synthesis: The wrapper uses the final IR and patches to construct a runnable accelerate launch command for the fms-hf-tuning stack.
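Put together, the wrapper's flow is roughly the following. This is an illustrative sketch, not the wrapper's actual code, and it reuses the hypothetical adapter and rule-engine pieces sketched in the Design section:

```python
import subprocess

def run_wrapper(cli_args: dict, adapter, rules, preview: bool = True) -> str:
    """Illustrative end-to-end flow of the wrapper (not its actual code).

    `adapter` and `rules` follow the hypothetical shapes sketched earlier:
    an adapter with to_ir/to_command methods and rule functions that return
    merge patches or None."""
    ir = adapter.to_ir(cli_args)            # 1. user input -> IR
    final_ir = run_rules(ir, rules)         # 2. rule engine fixes and optimizes the IR
    command = adapter.to_command(final_ir)  # 3. synthesize an accelerate launch command
    if not preview:                         # 4. execute mode runs it immediately
        subprocess.run(command, shell=True, check=True)
    return command
```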
Here is an example using --preview mode:
python fms-recommender.py --preview tuning.sft_trainer \
--model_name_or_path ibm-granite/granite-4.0-h-350m \
--training_data_path tatsu-lab/alpaca \
--tuning_strategy full
After you run the wrapper in preview mode, you don’t just get a plain command—you get a fully resolved launch command along with the paths to generated configs.
Here’s an example output:
[Screenshot: fms-recommender.py output in --preview mode]
[Screenshot: generated accelerate_config.yaml]
Execute Mode:
While preview mode is great for validation, most users want to run the job immediately once they’re confident in the inputs. That’s where execute mode comes in.
Simply omit the --preview flag, and the wrapper will:
- Generate the recommended configs (accelerate_config.yaml, data_config.yaml) using FMSAdapter.
- Apply all validated flags and optimizations suggested by the recommender.
- Launch the fine-tuning job automatically using accelerate.
python fms-recommender.py tuning.sft_trainer \
--model_name_or_path ibm-granite/granite-4.0-h-350m \
--training_data_path tatsu-lab/alpaca \
--tuning_strategy full
[Screenshot: training run launched by fms-recommender in execute mode]
Conclusion: Making Fine-Tuning Effortless
We have attempted to solve one of the critical ease-of-use problems in fine-tuning with the Tuning Config Recommender. We discussed the issues we observed across thousands of tuning runs and walked through the solution systematically. With the FMS Recommender wrapper, we showed how the recommender plugs into an existing tuning stack (fms-hf-tuning). We hope the community builds on these ideas and this system to benefit their tuning users!
👉 Explore the repo here: foundation-model-stack/tuning-config-recommender
Acknowledgements
We further thank Aditya U Baliga (past intern at IBM Research), Chander Govindarajan (IBM Research), Harikrishnan Balagopal (IBM Research), and Akash Nayak (IBM Research).





