# Helion-V1.5-XL Deployment Guide
## Table of Contents
1. [Quick Start](#quick-start)
2. [System Requirements](#system-requirements)
3. [Installation Methods](#installation-methods)
4. [Configuration](#configuration)
5. [Deployment Architectures](#deployment-architectures)
6. [Performance Optimization](#performance-optimization)
7. [Monitoring and Logging](#monitoring-and-logging)
8. [Scaling Strategies](#scaling-strategies)
9. [Security Best Practices](#security-best-practices)
10. [Troubleshooting](#troubleshooting)
11. [Production Checklist](#production-checklist)
---
## Quick Start
### Minimal Setup (5 minutes)
```bash
# Install dependencies (quote the specifiers so the shell does not treat ">" as redirection)
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate

# Load and run the model
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'DeepXR/Helion-V1.5-XL'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

prompt = 'Explain machine learning in simple terms:'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
```
---
## System Requirements
### Hardware Requirements
#### Minimum Configuration
- **GPU**: NVIDIA GPU with 12GB VRAM (e.g., RTX 3090, RTX 4080)
- **RAM**: 32GB system RAM
- **Storage**: 50GB free space
- **CPU**: 8-core processor (Intel Xeon or AMD EPYC recommended)
- **Precision**: INT4 quantization required
#### Recommended Configuration
- **GPU**: NVIDIA A100 (40GB/80GB) or H100
- **RAM**: 64GB system RAM
- **Storage**: 200GB SSD (NVMe preferred)
- **CPU**: 16+ core processor
- **Network**: 10Gbps for distributed setups
- **Precision**: BF16 for optimal quality
#### Production Configuration
- **GPU**: 2x A100 80GB or 1x H100 80GB
- **RAM**: 128GB+ system RAM
- **Storage**: 500GB NVMe SSD
- **CPU**: 32+ core processor
- **Network**: 25Gbps+ with low latency
- **Redundancy**: Load balancer + multiple replicas
### Software Requirements
```
Operating System: Ubuntu 20.04+, Rocky Linux 8+, or similar
Python: 3.8 - 3.11
CUDA: 11.8 or 12.1+
cuDNN: 8.9+
NVIDIA Driver: 525+
```
### Compatibility Matrix
| Component | Minimum | Recommended | Latest Tested |
|-----------|---------|-------------|---------------|
| PyTorch | 2.0.0 | 2.1.0 | 2.1.2 |
| Transformers | 4.35.0 | 4.36.0 | 4.37.0 |
| CUDA | 11.8 | 12.1 | 12.3 |
| Python | 3.8 | 3.10 | 3.11 |
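To confirm a host matches this matrix before deploying, a short check script helps. The sketch below is illustrative only; it assumes `torch` and `transformers` are already installed and reads versions via their standard attributes.
```python
# verify_env.py -- illustrative environment check (not part of the official tooling)
import sys
import torch
import transformers

print(f"Python      : {sys.version.split()[0]}")
print(f"PyTorch     : {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"CUDA build  : {torch.version.cuda}")
print(f"CUDA OK     : {torch.cuda.is_available()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {vram_gb:.1f} GB VRAM")
```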
---
## Installation Methods
### Method 1: Standard Installation
```bash
# Create virtual environment
python -m venv helion-env
source helion-env/bin/activate # On Windows: helion-env\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0
# Verify installation
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import transformers; print(f'Transformers version: {transformers.__version__}')"
```
### Method 2: Docker Deployment
```dockerfile
# Dockerfile
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
# Install PyTorch and transformers
RUN pip3 install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0
# Copy application code
WORKDIR /app
COPY . /app
# Set environment variables
ENV TRANSFORMERS_CACHE=/app/cache
ENV HF_HOME=/app/cache
# Run inference server
CMD ["python3", "inference_server.py"]
```
```bash
# Build and run
docker build -t helion-v15-xl .
docker run --gpus all -p 8000:8000 helion-v15-xl
```
### Method 3: Kubernetes Deployment
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helion-v15-xl
spec:
  replicas: 3
  selector:
    matchLabels:
      app: helion-v15-xl
  template:
    metadata:
      labels:
        app: helion-v15-xl
    spec:
      containers:
        - name: helion
          image: deepxr/helion-v15-xl:latest
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "64Gi"
              cpu: "16"
            requests:
              nvidia.com/gpu: 1
              memory: "48Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_ID
              value: "DeepXR/Helion-V1.5-XL"
            - name: PRECISION
              value: "bfloat16"
          volumeMounts:
            - name: model-cache
              mountPath: /cache
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: helion-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: helion-v15-xl
```
### Method 4: vLLM for Production
```bash
# Install vLLM for optimized serving
pip install vllm
# Run with vLLM
python -m vllm.entrypoints.openai.api_server \
--model DeepXR/Helion-V1.5-XL \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
```
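The vLLM server exposes an OpenAI-compatible REST API, so any HTTP client can query it. A minimal sketch using `requests` follows; the endpoint path and port assume the default server settings shown above.
```python
# query_vllm.py -- minimal client sketch against the OpenAI-compatible endpoint
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "DeepXR/Helion-V1.5-XL",
        "prompt": "Explain machine learning in simple terms:",
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```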
---
## Configuration
### Environment Variables
```bash
# Model configuration
export MODEL_ID="DeepXR/Helion-V1.5-XL"
export MODEL_PRECISION="bfloat16"
export MAX_SEQUENCE_LENGTH=8192
export CACHE_DIR="/path/to/cache"
# Performance tuning
export CUDA_VISIBLE_DEVICES=0,1
export OMP_NUM_THREADS=8
export TOKENIZERS_PARALLELISM=true
# Memory optimization
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/helion.log"
```
### Configuration File (config.yaml)
```yaml
model:
  model_id: "DeepXR/Helion-V1.5-XL"
  precision: "bfloat16"
  device_map: "auto"
  load_in_4bit: false
  load_in_8bit: false

generation:
  max_new_tokens: 512
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
  do_sample: true

server:
  host: "0.0.0.0"
  port: 8000
  workers: 4
  timeout: 120
  max_batch_size: 32

cache:
  enabled: true
  directory: "/tmp/helion_cache"
  max_size_gb: 100

safety:
  content_filtering: true
  pii_detection: true
  rate_limiting: true
  max_requests_per_minute: 60

monitoring:
  enabled: true
  metrics_port: 9090
  log_level: "INFO"
```
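A small loader can turn this file into keyword arguments for `from_pretrained` and `generate`. The sketch below is illustrative only and assumes PyYAML is installed; the keys are the ones defined in `config.yaml` above.
```python
# load_config.py -- sketch for reading config.yaml (assumes PyYAML: pip install pyyaml)
import yaml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

dtype = getattr(torch, cfg["model"]["precision"])  # e.g. "bfloat16" -> torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(
    cfg["model"]["model_id"],
    torch_dtype=dtype,
    device_map=cfg["model"]["device_map"],
)
tokenizer = AutoTokenizer.from_pretrained(cfg["model"]["model_id"])

# The generation settings can be passed straight through: model.generate(**inputs, **cfg["generation"])
```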
---
## Deployment Architectures
### Architecture 1: Single Instance (Development)
```
┌─────────────┐
│   Client    │
└──────┬──────┘
       v
┌─────────────┐
│   FastAPI   │
│   Server    │
└──────┬──────┘
       v
┌─────────────┐
│    Model    │
│  (1x A100)  │
└─────────────┘
```
**Use Case**: Development, testing, low-traffic applications
**Setup**:
```python
# server.py
from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI()

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
### Architecture 2: Load Balanced (Production)
```
          ┌─────────────┐
          │Load Balancer│
          └──────┬──────┘
    ┌────────────┼────────────┐
    │            │            │
    v            v            v
┌────────┐  ┌────────┐  ┌────────┐
│Instance│  │Instance│  │Instance│
│   1    │  │   2    │  │   3    │
└────────┘  └────────┘  └────────┘
    │            │            │
    └────────────┼────────────┘
                 v
          ┌─────────────┐
          │    Redis    │
          │    Cache    │
          └─────────────┘
```
**Use Case**: Production applications with high availability
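The Redis tier in this diagram is typically used to cache completed generations so identical prompts do not hit the GPU twice. A minimal sketch, assuming `redis-py` is installed and a Redis instance is reachable at `localhost:6379`; the hostname, TTL, and the `generate_text` helper are illustrative placeholders.
```python
# cache_layer.py -- sketch of prompt-level response caching with Redis
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 3600

def cached_generate(prompt: str) -> str:
    # Key on a hash of the prompt so arbitrary-length prompts stay within key limits
    key = "helion:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    # generate_text() stands in for the model call shown in Architecture 1
    response = generate_text(prompt)
    cache.setex(key, CACHE_TTL_SECONDS, response)
    return response
```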
### Architecture 3: Distributed Inference (High Throughput)
```
              ┌──────────────┐
              │ API Gateway  │
              └──────┬───────┘
              ┌──────┴───────┐
              │ Job Scheduler│
              └──────┬───────┘
     ┌───────────────┼───────────────┐
     │               │               │
     v               v               v
┌─────────┐     ┌─────────┐     ┌─────────┐
│ GPU 0-1 │     │ GPU 2-3 │     │ GPU 4-5 │
│ Tensor  │     │ Tensor  │     │ Tensor  │
│Parallel │     │Parallel │     │Parallel │
└─────────┘     └─────────┘     └─────────┘
```
**Use Case**: Very high throughput, batch processing
**Setup with Ray Serve**:
```python
import ray
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ray.init()
serve.start()

@serve.deployment(num_replicas=3, ray_actor_options={"num_gpus": 1})
class HelionModel:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            "DeepXR/Helion-V1.5-XL",
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

    async def __call__(self, request):
        payload = await request.json()
        inputs = self.tokenizer(payload["text"], return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        return {"response": self.tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Pre-2.0 Ray Serve API; on newer Ray versions use serve.run(HelionModel.bind()) instead
HelionModel.deploy()
```
---
## Performance Optimization
### 1. Quantization
```python
# 8-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization (maximum memory savings) -- pass this config to
# from_pretrained() in the same way as the 8-bit example above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
```
### 2. Flash Attention
```python
# Enable Flash Attention 2 (requires the flash-attn package to be installed)
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
```
### 3. Compilation with torch.compile
```python
# Compile model for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")
```
### 4. KV Cache Optimization
```python
# Use the KV cache for faster generation
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    past_key_values=past_key_values  # optional: reuse a cache from a previous generation; omit for a fresh prompt
)
```
### 5. Batching
```python
# Process multiple prompts in a batch. Decoder-only models need a pad token
# and left padding for correct batched generation.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode all outputs
responses = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
```
### Performance Benchmarks by Configuration
| Configuration | Tokens/sec | Latency (ms/token) | Memory (GB) | Speedup vs. Baseline |
|---------------|------------|--------------|-------------|-----------------|
| A100 BF16 | 47.3 | 21.1 | 34.2 | Baseline |
| A100 INT8 | 89.6 | 11.2 | 17.8 | 1.9x faster |
| A100 INT4 | 134.2 | 7.5 | 10.4 | 2.8x faster |
| H100 BF16 | 78.1 | 12.8 | 34.2 | 1.65x faster |
| H100 INT4 | 218.7 | 4.6 | 10.4 | 4.6x faster |
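These figures vary with driver version, batch size, and prompt length. A rough way to reproduce a tokens/sec number on your own hardware is sketched below; it assumes `model` and `tokenizer` are loaded as in the Quick Start, and the warm-up pass and fixed prompt are arbitrary choices for illustration.
```python
# bench.py -- rough throughput measurement sketch (results depend on hardware and settings)
import time
import torch

prompt = "Explain machine learning in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up pass so one-time setup costs do not skew the timing
model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec, {1000 * elapsed / new_tokens:.1f} ms/token")
```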
---
## Monitoring and Logging
### Prometheus Metrics
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Metrics
request_count = Counter('helion_requests_total', 'Total requests')
request_duration = Histogram('helion_request_duration_seconds', 'Request duration')
active_requests = Gauge('helion_active_requests', 'Active requests')
token_count = Counter('helion_tokens_generated', 'Tokens generated')
error_count = Counter('helion_errors_total', 'Total errors', ['error_type'])
# Start metrics server
start_http_server(9090)
```
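The counters above only report data once the request path updates them. One way to wire them into the `/generate` handler from Architecture 1 is sketched below; the handler body is illustrative and `run_generation` is a placeholder for the actual model call.
```python
# metrics_wiring.py -- sketch of instrumenting the /generate handler with the metrics above
import time

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    request_count.inc()
    active_requests.inc()
    start = time.perf_counter()
    try:
        # run_generation() is a placeholder for the tokenize/generate/decode steps
        response, n_tokens = run_generation(prompt, max_tokens)
        token_count.inc(n_tokens)
        return {"response": response}
    except Exception as exc:
        error_count.labels(error_type=type(exc).__name__).inc()
        raise
    finally:
        active_requests.dec()
        request_duration.observe(time.perf_counter() - start)
```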
### Structured Logging
```python
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno
        }
        return json.dumps(log_data)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
### Health Check Endpoint
```python
from fastapi.responses import JSONResponse

@app.get("/health")
async def health_check():
    try:
        # Check that the model is loaded
        assert model is not None
        # Check that a GPU is available
        assert torch.cuda.is_available()
        # Quick inference test
        test_input = tokenizer("test", return_tensors="pt").to(model.device)
        _ = model.generate(**test_input, max_new_tokens=1)
        return {"status": "healthy", "timestamp": datetime.utcnow().isoformat()}
    except Exception as e:
        # Returning a bare tuple does not set the status code in FastAPI; use JSONResponse
        return JSONResponse(status_code=503, content={"status": "unhealthy", "error": str(e)})
```
### Grafana Dashboard Configuration
```json
{
  "dashboard": {
    "title": "Helion-V1.5-XL Monitoring",
    "panels": [
      {
        "title": "Requests per Second",
        "targets": [{"expr": "rate(helion_requests_total[1m])"}]
      },
      {
        "title": "Average Latency",
        "targets": [{"expr": "rate(helion_request_duration_seconds_sum[5m]) / rate(helion_request_duration_seconds_count[5m])"}]
      },
      {
        "title": "GPU Utilization",
        "targets": [{"expr": "nvidia_gpu_utilization"}]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [{"expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100"}]
      }
    ]
  }
}
```
---
## Scaling Strategies
### Horizontal Scaling
```bash
# Using the Kubernetes Horizontal Pod Autoscaler
# (kubectl autoscale supports CPU targets; memory- or GPU-based scaling
#  requires an HPA manifest with additional metrics)
kubectl autoscale deployment helion-v15-xl \
  --min=2 \
  --max=10 \
  --cpu-percent=70
```
### Vertical Scaling
| Traffic Level | Configuration | Instances |
|---------------|---------------|-----------|
| Low (< 10 req/s) | 1x A100 40GB, INT8 | 1 |
| Medium (10-50 req/s) | 1x A100 80GB, BF16 | 2-3 |
| High (50-200 req/s) | 2x A100 80GB, BF16 | 4-6 |
| Very High (200+ req/s) | Multiple H100 clusters | 10+ |
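To translate a traffic target from the table above into an instance count, a back-of-the-envelope calculation is usually enough. The sketch below uses illustrative placeholder numbers; substitute throughput measured on your own hardware, noting that aggregate batched throughput per instance is much higher than the single-stream tokens/sec figures in the benchmark table.
```python
# capacity_sketch.py -- rough replica sizing (all inputs are illustrative placeholders)
import math

target_req_per_sec = 50            # expected peak traffic
avg_tokens_per_request = 256       # typical completion length
instance_tokens_per_sec = 4000     # aggregate batched throughput of one instance (measure this)

tokens_per_sec_needed = target_req_per_sec * avg_tokens_per_request
replicas = math.ceil(tokens_per_sec_needed / instance_tokens_per_sec)
print(f"Need roughly {replicas} replicas, before adding headroom for spikes")
```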
### Request Queuing
```python
import asyncio

request_queue = asyncio.Queue(maxsize=100)
batch_size = 8

async def batch_processor():
    while True:
        batch = []
        for _ in range(batch_size):
            try:
                item = await asyncio.wait_for(request_queue.get(), timeout=0.1)
                batch.append(item)
            except asyncio.TimeoutError:
                break
        if batch:
            # Process the batch
            prompts = [item["prompt"] for item in batch]
            inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=256)
            # Return results to the waiting callers
            for item, output in zip(batch, outputs):
                item["future"].set_result(tokenizer.decode(output, skip_special_tokens=True))

# Start the background task from within a running event loop
# (e.g. a FastAPI startup hook), not at module import time
asyncio.create_task(batch_processor())
```
---
## Security Best Practices
### 1. API Authentication
```python
import os
from fastapi import HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if credentials.credentials != os.getenv("API_TOKEN"):
        raise HTTPException(status_code=401, detail="Invalid authentication")
    return credentials.credentials

@app.post("/generate")
async def generate(prompt: str, token: str = Security(verify_token)):
    # Process request
    pass
```
### 2. Rate Limiting
```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("60/minute")
async def generate(request: Request, prompt: str):
    # Process request
    pass
```
### 3. Input Validation
```python
from pydantic import BaseModel, Field, validator

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=8000)
    max_tokens: int = Field(512, ge=1, le=2048)
    temperature: float = Field(0.7, ge=0.0, le=2.0)

    @validator('prompt')
    def validate_prompt(cls, v):
        # Reject obviously malicious content (compare against the lowercased prompt)
        if any(bad in v.lower() for bad in ['<script>', 'drop table']):
            raise ValueError('Invalid prompt content')
        return v
```
### 4. Content Filtering Integration
```python
from safeguard_filters import ContentSafetyFilter, RefusalGenerator

safety_filter = ContentSafetyFilter()
refusal_gen = RefusalGenerator()

@app.post("/generate")
async def generate(request: GenerationRequest):
    # Check input safety
    is_safe, violations = safety_filter.check_input(request.prompt)
    if not is_safe:
        return {"error": refusal_gen.generate_refusal(violations[0])}

    # Generate response
    outputs = model.generate(...)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Check output safety
    is_safe, violations = safety_filter.check_output(response)
    if not is_safe:
        response = safety_filter.redact_pii(response)

    return {"response": response}
```
---
## Troubleshooting
### Common Issues and Solutions
#### Issue 1: Out of Memory (OOM)
**Symptoms**: CUDA out of memory error
**Solutions**:
```python
# Solution 1: Use quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,  # or load_in_4bit=True
    device_map="auto"
)

# Solution 2: Reduce batch size
# Use batch_size=1 for inference

# Solution 3: Reduce context length
outputs = model.generate(**inputs, max_new_tokens=256)  # instead of 512

# Solution 4: Clear the CUDA cache
torch.cuda.empty_cache()
```
#### Issue 2: Slow Inference
**Symptoms**: High latency, low throughput
**Solutions**:
```python
# Solution 1: Enable Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2"
)

# Solution 2: Use compilation
model = torch.compile(model)

# Solution 3: Use vLLM
# Install: pip install vllm
# Run with the vLLM server (much faster)

# Solution 4: Batch requests
# Process multiple requests together
```
#### Issue 3: Model Not Loading
**Symptoms**: Download errors, corruption
**Solutions**:
```bash
# Clear cache
rm -rf ~/.cache/huggingface/
# Download manually
huggingface-cli download DeepXR/Helion-V1.5-XL
# Check disk space
df -h
# Verify CUDA installation
nvidia-smi
```
#### Issue 4: Quality Degradation with Quantization
**Solutions**:
- Use INT8 instead of INT4
- Calibrate quantization with representative data
- Use double quantization: `bnb_4bit_use_double_quant=True`
### Debugging Commands
```bash
# Check GPU status
nvidia-smi
# Monitor GPU usage
watch -n 1 nvidia-smi
# Check Python packages
pip list | grep -E "torch|transformers"
# Test CUDA
python -c "import torch; print(torch.cuda.is_available())"
# Memory profiling
python -m memory_profiler your_script.py
# Performance profiling
python -m cProfile -o output.prof your_script.py
```
---
## Production Checklist
### Pre-Deployment
- [ ] Hardware requirements verified
- [ ] Dependencies installed and tested
- [ ] Model downloaded and loaded successfully
- [ ] Inference tested with sample prompts
- [ ] Performance benchmarks meet requirements
- [ ] Memory usage within acceptable limits
- [ ] Safety filters configured and tested
- [ ] API authentication implemented
- [ ] Rate limiting configured
- [ ] Input validation in place
- [ ] Error handling implemented
- [ ] Logging configured
- [ ] Monitoring dashboards set up
- [ ] Health check endpoints working
- [ ] Load testing completed
- [ ] Security audit passed
- [ ] Documentation complete
### Post-Deployment
- [ ] Monitor error rates
- [ ] Track latency metrics
- [ ] Monitor GPU utilization
- [ ] Check memory usage trends
- [ ] Review safety violation logs
- [ ] Analyze user feedback
- [ ] Update model if needed
- [ ] Scale based on load
- [ ] Regular security updates
- [ ] Backup configurations
- [ ] Disaster recovery tested
- [ ] Performance optimization ongoing
### Maintenance Schedule
| Task | Frequency | Responsibility |
|------|-----------|----------------|
| Check error logs | Daily | DevOps |
| Review performance metrics | Daily | ML Engineers |
| Security updates | Weekly | Security Team |
| Model evaluation | Monthly | Data Science |
| Capacity planning | Monthly | Infrastructure |
| Disaster recovery drill | Quarterly | All Teams |
| Full system audit | Annually | External Auditor |
---
## Additional Resources
### Documentation
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [PyTorch Documentation](https://pytorch.org/docs)
- [CUDA Programming Guide](https://docs.nvidia.com/cuda/)
### Support Channels
- GitHub Issues: For bug reports and feature requests
- Community Forum: For general questions and discussions
- Enterprise Support: For production deployments
### Example Projects
- REST API Server: `/examples/rest_api`
- Streaming Interface: `/examples/streaming`
- Batch Processing: `/examples/batch_processing`
- Fine-tuning: `/examples/fine_tuning`
---
## Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2024-11-01 | Initial release |
| 1.0.1 | 2024-11-15 | Performance optimizations |
| 1.1.0 | 2024-12-01 | Flash Attention 2 support |
---
**Last Updated**: 2024-11-10
**Maintained By**: DeepXR Engineering Team