chen459664's picture
Add files using upload-large-folder tool
7036b3d verified
|
raw
history blame contribute delete
16.8 kB
# Overview
The **lm-quant-toolkit** is a suite of tools to facilitate large neural network
quantization research. It includes a quantization harness tool to drive
quantization experiments on large language models and vision models, to collect
and summarize experiment data for further analysis. It also includes tool to
prepare experiment meta data and visualization tools to interpret experiment
results. Specifically, lm-quant-toolkit consists of:
- LLM quantization harness tool
- ViT quantization harness tool
- FNorm Metadata Preparation Tool
- Kurtosis Metrics Measuring Tool
- Sensitivity Score Measuring Tool
- Calibration Dataset Generation Tool
- Visualization Tools
## Citation
~~~~
@inproceedings{zhang2025mxq,
title = {A Mixed Quantization Approach for Data-Free Quantization of LLMs},
author = {Feng Zhang and Yanbin Liu and Weihua Li and Xiaodan Wang and Quan Bai},
year = {2025},
url = {https://openreview.net/forum?id=M3Y74vmsMcY},
}
~~~~
## Setup test harness
Most tools are implemented in Python and are extensively tested under the
Python 3.11.9. The visualization tools are implemented in R. The usages of
these tools are elaborated in the following sections. This section describes
how to setup the lm-quant-toolkit and the companion visualization tools.
The Python tools dependend on Python libraries such as transformers, datasets,
numpy, PyTorch etc. A few Python libraries are patched to support MXQ.
Specifically, required patched dependencies include AutoGPTQ (for CUDA 12.5
compatibility), HQQ (support MXQ extension), lm_eval (for end-to-end LLM
performance evaluation), clip_benchmark (for vision model evaluation). These
dependencies are installed automatically as part of setup process. To setup the
Python tools, follow this procedure:
- Ensure Python and miniconda are installed
- Create a Python virtual enivonrment using Python 3.11.9 and activate this enivonrment
- Clone the lm-quant-toolkit project from [the lm-quant-toolkit project][2]
- Run the script setup-harness.sh under the root directory of the lm-quant-toolkit project
Or simply use the convenient script `setup-harness.sh` included this project.
## Setup visualization tools
The visualization tools are R scripts to transform, aggregate and visualize
experiment results. They are wrapped in bash scripts to automate the whole
experiment loop, which consists of model quantization, perplex evaluation,
memory consumption test and experiment report generation. The R visualization
scripts can also be used separately. To setup the visualization tools, please
follow this procedure:
- Ensure a recent version of R, for instance R 4.4.1, is installed.
- Optionally, RStudio could be installed to extend and trouble shoot the
visualization tools in an intuitive enivonrment.
- Install the third-party packages required by the visualization tools by
running the script `setup-visualization.sh` under the root directory of the
lm-quant-toolkit project.
# Quantization tool usage
## LLM Quantization Harness Tool
This tool executes various quantization tasks and runs diverse evaluation
benchmarks such as perplexity, GPU memory usage, quantized model storage. It
also supports end-to-end LLM performance evaluation through the integration
with the `lm-eval` tool. This harness tool works with various state-of-art
quantization methods such as GPTQ, AWQ, BitsAndBytes and HQQ, which enables a
fair comparison between the proposed methods and the state-of-art baselines.
Furthermore, it facilitates the complex and time-consuming benchmarking tasks
by offering resumption from failed subtasks, aggregate subtask's evaluation
results. Lastly, this tool provides declarative CLI interface to ease complex
experiment automation through shell scripting.
## FNorm Metadata Preparation Tool
This tool calculates the Frobenius norms, a.k.a FNorm, of the quantization
errors of all weight matricies inside a particular large language model. The
FNorm meta-data are crucial to the MXQ quantization scheme as it guides MXQ to
allocate optimal quantization configurations. This tool accepts a list of
Hugging Face-compliant model identifiers. The output of this tool is a series
of .csv files under specified directory. Each file contains the Frobenius
norms for the 12 quantization configurations.
The tool is implemented in Python and provides a convenient CLI interface to
enable shell scripting. It is located separately in the `dump.py` file
under the `src` folder in the `lm-quant-toolkit` project, which helps
to reduce unnecessary dependencies. A typical usage is demonstrated in the code
snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
MODELS="meta-llama/Llama-2-7b-hf meta-llama/Llama-2-13b-hf meta-llama/Llama-3.1-8B"
mkdir -p /tmp/fnorm-dump
python $TOOLKIT_DIR/src/dump.py fnorm \
--model $MODELS \
--output-dir /tmp/fnorm-dump
~~~~
## Kurtosis Metrics Measuring Tool
This tool calculates the Kurtosis metrics of weight matricies layer-by-layer
inside a particular large language model. The Kurtosis metrcis are crucial to
identify sensitive layers to improve the accuracy of MXQ quantization. This
tool accepts a list of Hugging Face-compliant model identifiers. The output of
this tool is a series of .csv files under specified directory. Each file
contains the Kurtosis metrics for corresponding models.
The tool is implemented in Python and provides a convenient CLI interface to
enable shell scripting. It is included in the `dump.py` file under the
`src` folder in the `lm-quant-toolkit` project. A typical usage is
demonstrated in the code snippet as follows:
~~~~bash
#!/bin/bash
MODELS="meta-llama/Llama-2-7b-hf meta-llama/Llama-2-13b-hf meta-llama/Meta-Llama-3-8B"
mkdir -p /tmp/kurtosis-dump
python ../src/dump.py kurtosis \
--model $MODELS \
--output-dir /tmp/kurtosis-dump
~~~~
This code snippet demonstrates dumping the kurtosis metrics for the three Llama
models into the `/tmp/kurtosis-dump` directory.
## Sensitivity Score Measuring Tool
This tool calculates the sensitivity score of each layer of a particular large
language model. The sensitivity score are crucial to identify sensitive layers
to improve the accuracy of MXQ quantization. This tool accepts a list of
Hugging Face-compliant model identifiers. The output of this tool is a series
of .csv files, each contains the sensitivity score for corresponding model.
These files are crucial inputs to guide the SensiBoost and Sensitivity-based
MiLP.
The tool is implemented in Python and provides a convenient CLI interface to
enable shell scripting. It is compatible with any transformer-based LLMs with
an implementation of the popular Hugging Face transformers library. It is
located separately in the `dump.py` file under the `src` folder in
the `lm-quant-toolkit` project, which helps to reduce unnecessary
dependencies. A typical usage is demonstrated in the code snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
RESULT_BASE_DIR="/data/llm/mxq/results"
CALIB_DATASETS="bos pileval wikitext c4"
CONFIGS="b2g128 b2g64 b2g32 b3g128 b3g64 b3g32 b4g128 b4g64 b4g32 b8g128 b8g64 b8g32"
MODELS="Qwen/Qwen2.5-7B Qwen/Qwen2.5-Coder-7B Qwen/Qwen2.5-Coder-7B-Instruct Qwen/Qwen2.5-Math-7B"
EXP_NAME=sensi_qwen25
RESULT_DIR=$RESULT_BASE_DIR/$EXP_NAME
mkdir -p $RESULT_DIR/data
for DS in $CALIB_DATASETS; do
for CFG in $CONFIGS; do
for MODEL in $MODELS; do
SHORT_ID=$(echo $MODEL | cut -d/ -f2)
OUT_FILE="${RESULT_DIR}/data/qwen25-sensi-${SHORT_ID}-${CFG}-${DS}.csv"
python $TOOLKIT_DIR/src/dump.py sensi \
--model $MODEL \
--config $CFG \
--calib-dataset $DS \
--output-file $OUT_FILE
done
done
done
~~~~
The code snippet demonstrates how to calculate the sensitivity scores for a
series of Qwen2.5 models using 4 calibration datasets under 12 bit budgets.
## Calibration Dataset Generation Tool
This tool generates a small synthensized dataset named branch of science
(denoted as BoS, published on Hugging Face), which includes a few hundred of
textual defintions for science, art and business topics such as Mathematics,
Physics, Chemstry, Law, Music and Journalism etc. The dataset is intended to
validate if the sensitivity property generalize to diverse datasets.
The tool generates an initial dataset in .csv format which requires further
processing. The output of this tool is random due to the generative nature of
LLM. This tool requires a Llama-2-7B model being served with an OpenAI
compatible RESTful API endpoint. User can either use a hosted API endpoint or
deploy a local instance by following the instruction at the end of this section.
Once the API endpoint is secured, run the following script to generate the BoS dataset:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
$TOOLKIT_DIR/utils/generate.py \
--model="meta-llama/Llama-2-7b-chat-hf" \
--variant="vLLM" \
--topic-file=topics-l1.txt \
--trace
~~~~
Lastly, find the result in the csv files under current directory.
### Local API endpoint
To deploy a local API endpoint using vLLM, create a virtual environment using
`conda` as follows:
~~~~bash
conda create -n vllm python=3.11 -y
conda activate vllm
pip install vllm==0.6.4.post1
~~~~
Then configure and launch the API server
~~~~bash
#!/bin/bash
vllm serve meta-llama/Llama-2-7b-chat-hf --dtype auto --api-key token-abc123
~~~~
Watch the output vLLm to make sure it starts up successfully.
# Visualization Tool usage
The visualization tools facilitate visualizing the experiment results and the
weight distribution, and generating insights of the latent features to quantize
LLMs more efficiently. Most visualization tools are implemented in R and
leverages the open-source plot libraries such as ggplot2, circlize, ggbreak,
ggmagnify. They provide CLI interface to simplify integaration with the
quantization harness tool.
These CLI tools support diverse options to allow user specify input dataset,
select particular model or approach to plot. To get help on these specific CLI
options, type `./plot_xxx.R --help` on command line prompt. For instance,
to get help on the MXQ allocation visualization tool, you may run command as
follows:
~~~~bash
./plot-mxq-allocation.R --help
Usage: ./plot-mxq-allocation.R [options]
Options:
-h, --help
Show this help message and exit
-m CHARACTER, --model=CHARACTER
Model ID
-b DOUBLE, --budget=DOUBLE
Bit Budget
-d CHARACTER, --baseline_data_dir=CHARACTER
Data directory of baseline results
-q CHARACTER, --quant_cfg_allot_file=CHARACTER
The combined quant config allocation csv file
--attempt1=CHARACTER
The first attempt to plot
--attempt2=CHARACTER
The second attempt to plot
--fnorm
Display FNorm value in the bar chart
~~~~
## Weight Distribution Visualization Tool
This tool enables visualizing layer-wised weight distribution of large language
models. It is implemented as an R script, which provides a convenient CLI
interface to enable shell scripting. Given a weight distribution metrics csv
file, it produces a pdf file under the `pdfs` with 3x3 sub-plots of column
digrams for the 9 modules in the Llama family models.
The tool is named `plot-wdist-llm.R` and located under the
`data-vis` folder in the `lm-quant-toolkit` project. A typical usage
is demonstrated in the code snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
$TOOLKIT_DIR/data-vis/plot-wdist-llm.R -m Llama-2-7b-hf
~~~~
## Perplexity vs Bit Budget Visualization Tool
This tool enables visualizing the relationship between perplexity and bit
budget for diverse MXQ experiments against their baselines. The generated
diagram shows how memory reduction affects perplexity, which facilitates
memory-accuracy trade-off.
The tool is named `plot-ppl-mem.R` and located under the `data-vis`
folder in the `lm-quant-toolkit` project. It accepts a csv file containing
the perplexity metrics of MXQ and its baselines. The output are series of PDF
files corresponding to the models defined in the input file, which are placed
under the `pdfs` subfolder. A typical usage is demonstrated in the code
snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
$TOOLKIT_DIR/data-vis/plot-ppl-mem.R -d data/combined.csv
~~~~
## Quantization Speed Comparison Visualization Tool
This tool generates column digrams to explore the quantization speed among
various approaches. The tool is also implemented as an R script, which provides a
convenient CLI interface to enable shell scripting. The tool is named
`plot-quant-speed.R` and located under the `data-vis` folder in the
`lm-quant-toolkit` project. Given a combined perplexity metrics csv file,
it produces a column digrams with x-axis in log-scale. Similar to other tools,
the PDF file is placed under the `pdfs` subfolder. A typical usage is
demonstrated in the code snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
$TOOLKIT_DIR/data-vis/plot-quant-speed.R -d data/combined.csv
~~~~
## GPU Memory Usage Visualization Tool
This tool generates column digrams to present the actual GPU memory consumption
of LLMs quantized by diverse methods. The tool is also implemented as an R script,
which provides a convenient CLI interface to enable shell scripting. The tool
is named `plot-mem-consumption.R` and located under the `data-vis`
folder in the `lm-quant-toolkit` project. Given a combined perplexity
metrics csv file, it produces a column digrams of GPU memory usage in
Giga-byte. Similar to other tools, the PDF file is placed under the `pdfs`
subfolder. A typical usage is demonstrated in the code snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
$TOOLKIT_DIR/data-vis/plot-mem-consumption.R -d data/combined.csv
~~~~
## Quantization Configuration Allocation Visualization Tool
This tool offers insights into the way MXQ and its variants allocate bit budget
to modules and layers. The variants, a.k.a. attempt, to include in the plot are
configurable. A maximium of 4 variants can be plotted in a circular layout
thanks to plot library circlize \citep{zuguang_2014}.
The first input expected by the tool is a combined quantization configuration
allocation csv file which should include experiment outcome for diverse methods
such as HQQ and MXQ. The second parameter is the directory where Frobenius
norms csv files are located. The third parameter is the perplexity score csv
file. The tool produces circos digram in PDF format.
The tool is named `plot-circos-allot.R` and located under the
`data-vis` folder in the `lm-quant-toolkit` project. A typical usage
is demonstrated in the code snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="../../.."
MODELS="
Llama-3-7b-hf
Llama-3-13b-hf
Meta-Llama-3-8B
"
BUDGETS="4.25 3.51"
STOP=2
TOPM=2
for MODEL in $MODELS; do
for BUDGET in $BUDGETS; do
$TOOLKIT_DIR/data-vis/plot-circos-allot.R \
--model $MODEL \
--budget $BUDGET \
--fnorm_data_dir $TOOLKIT_DIR/src/data/ \
--ppl_csv_file data/combined.csv \
--quant_cfg_allot_file data/quant-cfg-allocation.csv \
--attempt1 sensi-boost-${STOP}-${TOPM} \
--attempt2 kurt-boost-${STOP}-${TOPM} \
--attempt3 hqq\
--attempt4 mxq1
done
done
~~~~
This code snippet demonstrates how to generate a quant config allocation
comparison diagram to examine the nuanced difference between the SensiBoost and
kurtBoost approaches, with a stop of 2 and top-{m} 2, as well as the HQQ and
MXQ baselines.
## SensiBoost/KurtBoost Win-Tie-Loss Visualization Tool
This tool enables qualitative analysis of effectiveness of the proposed
SensiBoost and KurtBoost methods. It is implemented as an R script, which
provides a conventional CLI interface to ease automation.
Given a combined perplexity metrics csv file, it produces a series of column
digrams in PDF format. The csv file should include experiment outcome for
SensiBoost, KurtBoost, the ablation tests or baseline such as HQQ and MXQ. The
name experiment, a.k.a. attempt, should follow the pattern
`<method>-<stop>-<top m>`.
This tool is included in the `lm-quant-toolkit` under `data-vis`
folder. A typical usage is demonstrated in the code snippet as follows:
~~~~bash
#!/bin/bash
TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
$TOOLKIT_DIR/data-vis/plot-win-tie-loss.R -f data/combined.csv
~~~~
[1]: https://huggingface.co/docs/leaderboards/leaderboards/intro
[2]: https://github.com/schnell18/lm-quant-toolkit.git