---
title: LLM KV Cache Calculator
emoji: 💻
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.45.0
app_file: app.py
pinned: false
short_description: Calculate KV cache memory requirements for LLMs
---

# KV Cache Calculator

Calculate KV cache memory requirements for transformer models.

## Credits

This implementation builds upon the excellent work by [gaunernst](https://huggingface.co/spaces/gaunernst/kv-cache-calculator). Special thanks for the original implementation!

## Features

- **Multi-attention support**: MHA (Multi-Head Attention), GQA (Grouped Query Attention), and MLA (Multi-head Latent Attention); see the sizing sketch after this list
- **Multiple data types**: fp16/bf16, fp8, and fp4 quantization
- **Real-time calculation**: Instant memory requirement estimates
- **Model analysis**: Detailed breakdown of the model configuration
- **Universal compatibility**: Works with any HuggingFace transformer model
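
To give a sense of where the numbers come from: for MHA/GQA models, two tensors (K and V) are cached per layer, each shaped `[users, kv_heads, context, head_dim]`, while MLA caches a single compressed latent per token per layer, shared across heads. The sketch below is a minimal illustration of these formulas and is not necessarily identical to what `app.py` does; the MLA variant assumes DeepSeek-style caching of the compressed KV latent plus the decoupled RoPE key.

```python
# Minimal sketch of the KV cache estimates (illustrative, not app.py's
# exact logic). fp4 is counted as half a byte per element.
BYTES_PER_ELEM = {"fp16/bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, num_users: int,
                 dtype: str = "fp16/bf16") -> float:
    """MHA/GQA: K and V are each cached per layer with shape
    [num_users, num_kv_heads, context_len, head_dim]."""
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len * num_users
    return elems * BYTES_PER_ELEM[dtype] / 1024**3


def mla_cache_gib(num_layers: int, kv_lora_rank: int, qk_rope_head_dim: int,
                  context_len: int, num_users: int,
                  dtype: str = "fp16/bf16") -> float:
    """MLA (DeepSeek-style, assumed here): per token and layer, the cache
    holds one compressed KV latent plus one decoupled RoPE key, shared
    across all heads -- hence no factor of 2 and no per-head term."""
    elems = num_layers * (kv_lora_rank + qk_rope_head_dim) * context_len * num_users
    return elems * BYTES_PER_ELEM[dtype] / 1024**3


# Hypothetical 32-layer GQA model with 8 KV heads and head_dim 128,
# serving 4 users at 32k context in bf16:
# 2 * 32 * 8 * 128 * 32768 * 4 * 2 B = 16.0 GiB
print(f"{kv_cache_gib(32, 8, 128, 32768, 4):.1f} GiB")
```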

## Usage

1. Enter your model ID (e.g., "Qwen/Qwen3-30B-A3B")
2. Set the context length and number of users
3. Choose the data type precision
4. Add a HuggingFace token if needed for gated models
5. Click calculate to get the memory requirements (a rough sketch of what this does follows below)
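
For illustration, the sketch below reads a model's configuration from the Hub and applies the GQA estimate from the Features section. The use of `transformers.AutoConfig`, the field names (`num_hidden_layers`, `num_key_value_heads`, `head_dim`), and the context/user values are assumptions for this example; field names vary by architecture, and the Space's own `app.py` may fetch the config differently.

```python
from transformers import AutoConfig

# token=... is only needed for gated models (step 4); the model ID here
# matches the example from step 1.
config = AutoConfig.from_pretrained("Qwen/Qwen3-30B-A3B")

num_layers = config.num_hidden_layers
num_kv_heads = config.num_key_value_heads  # fewer than num_attention_heads for GQA
# Fall back to hidden_size / num_attention_heads if head_dim is absent
# (architecture-dependent; an assumption for this sketch).
head_dim = getattr(config, "head_dim",
                   config.hidden_size // config.num_attention_heads)

context_len, num_users = 32768, 4
elems = 2 * num_layers * num_kv_heads * head_dim * context_len * num_users
print(f"{elems * 2 / 1024**3:.1f} GiB in bf16")  # 2 bytes per element
```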