# Helion-V1.5-XL Deployment Guide
## Table of Contents
1. [Quick Start](#quick-start)
2. [System Requirements](#system-requirements)
3. [Installation Methods](#installation-methods)
4. [Configuration](#configuration)
5. [Deployment Architectures](#deployment-architectures)
6. [Performance Optimization](#performance-optimization)
7. [Monitoring and Logging](#monitoring-and-logging)
8. [Scaling Strategies](#scaling-strategies)
9. [Security Best Practices](#security-best-practices)
10. [Troubleshooting](#troubleshooting)
11. [Production Checklist](#production-checklist)
---
## Quick Start
### Minimal Setup (5 minutes)
```bash
# Install dependencies (quote the specifiers so the shell does not treat ">" as redirection)
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate

# Load and run the model
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'DeepXR/Helion-V1.5-XL'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

prompt = 'Explain machine learning in simple terms:'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
```
---
## System Requirements
### Hardware Requirements
#### Minimum Configuration
- **GPU**: NVIDIA GPU with 12GB VRAM (e.g., RTX 3090, RTX 4080)
- **RAM**: 32GB system RAM
- **Storage**: 50GB free space
- **CPU**: 8-core processor (Intel Xeon or AMD EPYC recommended)
- **Precision**: INT4 quantization required
#### Recommended Configuration
- **GPU**: NVIDIA A100 (40GB/80GB) or H100
- **RAM**: 64GB system RAM
- **Storage**: 200GB SSD (NVMe preferred)
- **CPU**: 16+ core processor
- **Network**: 10Gbps for distributed setups
- **Precision**: BF16 for optimal quality
#### Production Configuration
- **GPU**: 2x A100 80GB or 1x H100 80GB
- **RAM**: 128GB+ system RAM
- **Storage**: 500GB NVMe SSD
- **CPU**: 32+ core processor
- **Network**: 25Gbps+ with low latency
- **Redundancy**: Load balancer + multiple replicas
### Software Requirements
```
Operating System: Ubuntu 20.04+, Rocky Linux 8+, or similar
Python: 3.8 - 3.11
CUDA: 11.8 or 12.1+
cuDNN: 8.9+
NVIDIA Driver: 525+
```
### Compatibility Matrix
| Component | Minimum | Recommended | Latest Tested |
|-----------|---------|-------------|---------------|
| PyTorch | 2.0.0 | 2.1.0 | 2.1.2 |
| Transformers | 4.35.0 | 4.36.0 | 4.37.0 |
| CUDA | 11.8 | 12.1 | 12.3 |
| Python | 3.8 | 3.10 | 3.11 |
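To confirm a host matches this matrix before deploying, a short check script helps. The sketch below is illustrative only; it assumes `torch` and `transformers` are already installed and reads versions via their standard attributes.
```python
# verify_env.py -- illustrative environment check (not part of the official tooling)
import sys
import torch
import transformers

print(f"Python      : {sys.version.split()[0]}")
print(f"PyTorch     : {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"CUDA build  : {torch.version.cuda}")
print(f"CUDA OK     : {torch.cuda.is_available()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {vram_gb:.1f} GB VRAM")
```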
---
## Installation Methods
### Method 1: Standard Installation
```bash
# Create virtual environment
python -m venv helion-env
source helion-env/bin/activate # On Windows: helion-env\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0
# Verify installation
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import transformers; print(f'Transformers version: {transformers.__version__}')"
```
### Method 2: Docker Deployment
```dockerfile
# Dockerfile
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
# Install PyTorch and transformers
RUN pip3 install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0
# Copy application code
WORKDIR /app
COPY . /app
# Set environment variables
ENV TRANSFORMERS_CACHE=/app/cache
ENV HF_HOME=/app/cache
# Run inference server
CMD ["python3", "inference_server.py"]
```
```bash
# Build and run
docker build -t helion-v15-xl .
docker run --gpus all -p 8000:8000 helion-v15-xl
```
### Method 3: Kubernetes Deployment
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helion-v15-xl
spec:
  replicas: 3
  selector:
    matchLabels:
      app: helion-v15-xl
  template:
    metadata:
      labels:
        app: helion-v15-xl
    spec:
      containers:
        - name: helion
          image: deepxr/helion-v15-xl:latest
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "64Gi"
              cpu: "16"
            requests:
              nvidia.com/gpu: 1
              memory: "48Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_ID
              value: "DeepXR/Helion-V1.5-XL"
            - name: PRECISION
              value: "bfloat16"
          volumeMounts:
            - name: model-cache
              mountPath: /cache
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: helion-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: helion-v15-xl
```
### Method 4: vLLM for Production
```bash
# Install vLLM for optimized serving
pip install vllm
# Run with vLLM
python -m vllm.entrypoints.openai.api_server \
--model DeepXR/Helion-V1.5-XL \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
```
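The vLLM server exposes an OpenAI-compatible REST API, so any HTTP client can query it. A minimal sketch using `requests` follows; the endpoint path and port assume the default server settings shown above.
```python
# query_vllm.py -- minimal client sketch against the OpenAI-compatible endpoint
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "DeepXR/Helion-V1.5-XL",
        "prompt": "Explain machine learning in simple terms:",
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```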
---
## Configuration
### Environment Variables
```bash
# Model configuration
export MODEL_ID="DeepXR/Helion-V1.5-XL"
export MODEL_PRECISION="bfloat16"
export MAX_SEQUENCE_LENGTH=8192
export CACHE_DIR="/path/to/cache"
# Performance tuning
export CUDA_VISIBLE_DEVICES=0,1
export OMP_NUM_THREADS=8
export TOKENIZERS_PARALLELISM=true
# Memory optimization
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/helion.log"
```
### Configuration File (config.yaml)
```yaml
model:
  model_id: "DeepXR/Helion-V1.5-XL"
  precision: "bfloat16"
  device_map: "auto"
  load_in_4bit: false
  load_in_8bit: false

generation:
  max_new_tokens: 512
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
  do_sample: true

server:
  host: "0.0.0.0"
  port: 8000
  workers: 4
  timeout: 120
  max_batch_size: 32

cache:
  enabled: true
  directory: "/tmp/helion_cache"
  max_size_gb: 100

safety:
  content_filtering: true
  pii_detection: true
  rate_limiting: true
  max_requests_per_minute: 60

monitoring:
  enabled: true
  metrics_port: 9090
  log_level: "INFO"
```
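A small loader can turn this file into keyword arguments for `from_pretrained` and `generate`. The sketch below is illustrative only and assumes PyYAML is installed; the keys are the ones defined in `config.yaml` above.
```python
# load_config.py -- sketch for reading config.yaml (assumes PyYAML: pip install pyyaml)
import yaml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

dtype = getattr(torch, cfg["model"]["precision"])  # e.g. "bfloat16" -> torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(
    cfg["model"]["model_id"],
    torch_dtype=dtype,
    device_map=cfg["model"]["device_map"],
)
tokenizer = AutoTokenizer.from_pretrained(cfg["model"]["model_id"])

# The generation settings can be passed straight through: model.generate(**inputs, **cfg["generation"])
```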
---
## Deployment Architectures
### Architecture 1: Single Instance (Development)
```
┌─────────────┐
│   Client    │
└──────┬──────┘
       v
┌─────────────┐
│   FastAPI   │
│   Server    │
└──────┬──────┘
       v
┌─────────────┐
│    Model    │
│  (1x A100)  │
└─────────────┘
```
**Use Case**: Development, testing, low-traffic applications
**Setup**:
```python
# server.py
from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI()

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
### Architecture 2: Load Balanced (Production)
```
          ┌─────────────┐
          │Load Balancer│
          └──────┬──────┘
    ┌────────────┼────────────┐
    │            │            │
    v            v            v
┌────────┐  ┌────────┐  ┌────────┐
│Instance│  │Instance│  │Instance│
│   1    │  │   2    │  │   3    │
└────────┘  └────────┘  └────────┘
    │            │            │
    └────────────┼────────────┘
                 v
          ┌─────────────┐
          │    Redis    │
          │    Cache    │
          └─────────────┘
```
**Use Case**: Production applications with high availability
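The Redis tier in this diagram is typically used to cache completed generations so identical prompts do not hit the GPU twice. A minimal sketch, assuming `redis-py` is installed and a Redis instance is reachable at `localhost:6379`; the hostname, TTL, and the `generate_text` helper are illustrative placeholders.
```python
# cache_layer.py -- sketch of prompt-level response caching with Redis
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 3600

def cached_generate(prompt: str) -> str:
    # Key on a hash of the prompt so arbitrary-length prompts stay within key limits
    key = "helion:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    # generate_text() stands in for the model call shown in Architecture 1
    response = generate_text(prompt)
    cache.setex(key, CACHE_TTL_SECONDS, response)
    return response
```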
### Architecture 3: Distributed Inference (High Throughput)
```
              ┌──────────────┐
              │ API Gateway  │
              └──────┬───────┘
              ┌──────┴───────┐
              │ Job Scheduler│
              └──────┬───────┘
     ┌───────────────┼───────────────┐
     │               │               │
     v               v               v
┌─────────┐     ┌─────────┐     ┌─────────┐
│ GPU 0-1 │     │ GPU 2-3 │     │ GPU 4-5 │
│ Tensor  │     │ Tensor  │     │ Tensor  │
│Parallel │     │Parallel │     │Parallel │
└─────────┘     └─────────┘     └─────────┘
```
**Use Case**: Very high throughput, batch processing
**Setup with Ray Serve**:
```python
import ray
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ray.init()
serve.start()

@serve.deployment(num_replicas=3, ray_actor_options={"num_gpus": 1})
class HelionModel:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            "DeepXR/Helion-V1.5-XL",
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

    async def __call__(self, request):
        payload = await request.json()
        inputs = self.tokenizer(payload["text"], return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        return {"response": self.tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Pre-2.0 Ray Serve API; on newer Ray versions use serve.run(HelionModel.bind()) instead
HelionModel.deploy()
```
---
## Performance Optimization
### 1. Quantization
```python
# 8-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization (maximum memory savings) -- pass this config to
# from_pretrained() in the same way as the 8-bit example above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
```
### 2. Flash Attention
```python
# Enable Flash Attention 2 (requires the flash-attn package to be installed)
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
```
### 3. Compilation with torch.compile
```python
# Compile model for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")
```
### 4. KV Cache Optimization
```python
# Use the KV cache for faster generation
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    past_key_values=past_key_values  # optional: reuse a cache from a previous generation; omit for a fresh prompt
)
```
### 5. Batching
```python
# Process multiple prompts in a batch. Decoder-only models need a pad token
# and left padding for correct batched generation.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode all outputs
responses = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
```
### Performance Benchmarks by Configuration
| Configuration | Tokens/sec | Latency (ms/token) | Memory (GB) | Speedup vs. Baseline |
|---------------|------------|--------------|-------------|-----------------|
| A100 BF16 | 47.3 | 21.1 | 34.2 | Baseline |
| A100 INT8 | 89.6 | 11.2 | 17.8 | 1.9x faster |
| A100 INT4 | 134.2 | 7.5 | 10.4 | 2.8x faster |
| H100 BF16 | 78.1 | 12.8 | 34.2 | 1.65x faster |
| H100 INT4 | 218.7 | 4.6 | 10.4 | 4.6x faster |
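These figures vary with driver version, batch size, and prompt length. A rough way to reproduce a tokens/sec number on your own hardware is sketched below; it assumes `model` and `tokenizer` are loaded as in the Quick Start, and the warm-up pass and fixed prompt are arbitrary choices for illustration.
```python
# bench.py -- rough throughput measurement sketch (results depend on hardware and settings)
import time
import torch

prompt = "Explain machine learning in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up pass so one-time setup costs do not skew the timing
model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec, {1000 * elapsed / new_tokens:.1f} ms/token")
```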
---
## Monitoring and Logging
### Prometheus Metrics
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Metrics
request_count = Counter('helion_requests_total', 'Total requests')
request_duration = Histogram('helion_request_duration_seconds', 'Request duration')
active_requests = Gauge('helion_active_requests', 'Active requests')
token_count = Counter('helion_tokens_generated', 'Tokens generated')
error_count = Counter('helion_errors_total', 'Total errors', ['error_type'])
# Start metrics server
start_http_server(9090)
```
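The counters above only report data once the request path updates them. One way to wire them into the `/generate` handler from Architecture 1 is sketched below; the handler body is illustrative and `run_generation` is a placeholder for the actual model call.
```python
# metrics_wiring.py -- sketch of instrumenting the /generate handler with the metrics above
import time

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    request_count.inc()
    active_requests.inc()
    start = time.perf_counter()
    try:
        # run_generation() is a placeholder for the tokenize/generate/decode steps
        response, n_tokens = run_generation(prompt, max_tokens)
        token_count.inc(n_tokens)
        return {"response": response}
    except Exception as exc:
        error_count.labels(error_type=type(exc).__name__).inc()
        raise
    finally:
        active_requests.dec()
        request_duration.observe(time.perf_counter() - start)
```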
### Structured Logging
```python
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno
        }
        return json.dumps(log_data)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
### Health Check Endpoint
```python
from fastapi.responses import JSONResponse

@app.get("/health")
async def health_check():
    try:
        # Check that the model is loaded
        assert model is not None
        # Check that a GPU is available
        assert torch.cuda.is_available()
        # Quick inference test
        test_input = tokenizer("test", return_tensors="pt").to(model.device)
        _ = model.generate(**test_input, max_new_tokens=1)
        return {"status": "healthy", "timestamp": datetime.utcnow().isoformat()}
    except Exception as e:
        # Returning a bare tuple does not set the status code in FastAPI; use JSONResponse
        return JSONResponse(status_code=503, content={"status": "unhealthy", "error": str(e)})
```
### Grafana Dashboard Configuration
```json
{
  "dashboard": {
    "title": "Helion-V1.5-XL Monitoring",
    "panels": [
      {
        "title": "Requests per Second",
        "targets": [{"expr": "rate(helion_requests_total[1m])"}]
      },
      {
        "title": "Average Latency",
        "targets": [{"expr": "rate(helion_request_duration_seconds_sum[5m]) / rate(helion_request_duration_seconds_count[5m])"}]
      },
      {
        "title": "GPU Utilization",
        "targets": [{"expr": "nvidia_gpu_utilization"}]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [{"expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100"}]
      }
    ]
  }
}
```
---
## Scaling Strategies
### Horizontal Scaling
```bash
# Using the Kubernetes Horizontal Pod Autoscaler
# (kubectl autoscale supports CPU targets; memory- or GPU-based scaling
#  requires an HPA manifest with additional metrics)
kubectl autoscale deployment helion-v15-xl \
  --min=2 \
  --max=10 \
  --cpu-percent=70
```
### Vertical Scaling
| Traffic Level | Configuration | Instances |
|---------------|---------------|-----------|
| Low (< 10 req/s) | 1x A100 40GB, INT8 | 1 |
| Medium (10-50 req/s) | 1x A100 80GB, BF16 | 2-3 |
| High (50-200 req/s) | 2x A100 80GB, BF16 | 4-6 |
| Very High (200+ req/s) | Multiple H100 clusters | 10+ |
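To translate a traffic target from the table above into an instance count, a back-of-the-envelope calculation is usually enough. The sketch below uses illustrative placeholder numbers; substitute throughput measured on your own hardware, noting that aggregate batched throughput per instance is much higher than the single-stream tokens/sec figures in the benchmark table.
```python
# capacity_sketch.py -- rough replica sizing (all inputs are illustrative placeholders)
import math

target_req_per_sec = 50            # expected peak traffic
avg_tokens_per_request = 256       # typical completion length
instance_tokens_per_sec = 4000     # aggregate batched throughput of one instance (measure this)

tokens_per_sec_needed = target_req_per_sec * avg_tokens_per_request
replicas = math.ceil(tokens_per_sec_needed / instance_tokens_per_sec)
print(f"Need roughly {replicas} replicas, before adding headroom for spikes")
```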
### Request Queuing
```python
import asyncio

request_queue = asyncio.Queue(maxsize=100)
batch_size = 8

async def batch_processor():
    while True:
        batch = []
        for _ in range(batch_size):
            try:
                item = await asyncio.wait_for(request_queue.get(), timeout=0.1)
                batch.append(item)
            except asyncio.TimeoutError:
                break
        if batch:
            # Process the batch
            prompts = [item["prompt"] for item in batch]
            inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=256)
            # Return results to the waiting callers
            for item, output in zip(batch, outputs):
                item["future"].set_result(tokenizer.decode(output, skip_special_tokens=True))

# Start the background task from within a running event loop
# (e.g. a FastAPI startup hook), not at module import time
asyncio.create_task(batch_processor())
```
---
## Security Best Practices
### 1. API Authentication
```python
import os
from fastapi import HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if credentials.credentials != os.getenv("API_TOKEN"):
        raise HTTPException(status_code=401, detail="Invalid authentication")
    return credentials.credentials

@app.post("/generate")
async def generate(prompt: str, token: str = Security(verify_token)):
    # Process request
    pass
```
### 2. Rate Limiting
```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("60/minute")
async def generate(request: Request, prompt: str):
    # Process request
    pass
```
### 3. Input Validation
```python
from pydantic import BaseModel, Field, validator

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=8000)
    max_tokens: int = Field(512, ge=1, le=2048)
    temperature: float = Field(0.7, ge=0.0, le=2.0)

    @validator('prompt')
    def validate_prompt(cls, v):
        # Reject obviously malicious content (compare against the lowercased prompt)
        if any(bad in v.lower() for bad in ['<script>', 'drop table']):
            raise ValueError('Invalid prompt content')
        return v
```
### 4. Content Filtering Integration
```python
from safeguard_filters import ContentSafetyFilter, RefusalGenerator

safety_filter = ContentSafetyFilter()
refusal_gen = RefusalGenerator()

@app.post("/generate")
async def generate(request: GenerationRequest):
    # Check input safety
    is_safe, violations = safety_filter.check_input(request.prompt)
    if not is_safe:
        return {"error": refusal_gen.generate_refusal(violations[0])}

    # Generate response
    outputs = model.generate(...)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Check output safety
    is_safe, violations = safety_filter.check_output(response)
    if not is_safe:
        response = safety_filter.redact_pii(response)

    return {"response": response}
```
---
## Troubleshooting
### Common Issues and Solutions
#### Issue 1: Out of Memory (OOM)
**Symptoms**: CUDA out of memory error
**Solutions**:
```python
# Solution 1: Use quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,  # or load_in_4bit=True
    device_map="auto"
)

# Solution 2: Reduce batch size
# Use batch_size=1 for inference

# Solution 3: Reduce context length
outputs = model.generate(**inputs, max_new_tokens=256)  # instead of 512

# Solution 4: Clear the CUDA cache
torch.cuda.empty_cache()
```
#### Issue 2: Slow Inference
**Symptoms**: High latency, low throughput
**Solutions**:
```python
# Solution 1: Enable Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2"
)

# Solution 2: Use compilation
model = torch.compile(model)

# Solution 3: Use vLLM
# Install: pip install vllm
# Run with the vLLM server (much faster)

# Solution 4: Batch requests
# Process multiple requests together
```
#### Issue 3: Model Not Loading
**Symptoms**: Download errors, corruption
**Solutions**:
```bash
# Clear cache
rm -rf ~/.cache/huggingface/
# Download manually
huggingface-cli download DeepXR/Helion-V1.5-XL
# Check disk space
df -h
# Verify CUDA installation
nvidia-smi
```
#### Issue 4: Quality Degradation with Quantization
**Solutions**:
- Use INT8 instead of INT4
- Calibrate quantization with representative data
- Use double quantization: `bnb_4bit_use_double_quant=True`
### Debugging Commands
```bash
# Check GPU status
nvidia-smi
# Monitor GPU usage
watch -n 1 nvidia-smi
# Check Python packages
pip list | grep -E "torch|transformers"
# Test CUDA
python -c "import torch; print(torch.cuda.is_available())"
# Memory profiling
python -m memory_profiler your_script.py
# Performance profiling
python -m cProfile -o output.prof your_script.py
```
---
## Production Checklist
### Pre-Deployment
- [ ] Hardware requirements verified
- [ ] Dependencies installed and tested
- [ ] Model downloaded and loaded successfully
- [ ] Inference tested with sample prompts
- [ ] Performance benchmarks meet requirements
- [ ] Memory usage within acceptable limits
- [ ] Safety filters configured and tested
- [ ] API authentication implemented
- [ ] Rate limiting configured
- [ ] Input validation in place
- [ ] Error handling implemented
- [ ] Logging configured
- [ ] Monitoring dashboards set up
- [ ] Health check endpoints working
- [ ] Load testing completed
- [ ] Security audit passed
- [ ] Documentation complete
### Post-Deployment
- [ ] Monitor error rates
- [ ] Track latency metrics
- [ ] Monitor GPU utilization
- [ ] Check memory usage trends
- [ ] Review safety violation logs
- [ ] Analyze user feedback
- [ ] Update model if needed
- [ ] Scale based on load
- [ ] Regular security updates
- [ ] Backup configurations
- [ ] Disaster recovery tested
- [ ] Performance optimization ongoing
### Maintenance Schedule
| Task | Frequency | Responsibility |
|------|-----------|----------------|
| Check error logs | Daily | DevOps |
| Review performance metrics | Daily | ML Engineers |
| Security updates | Weekly | Security Team |
| Model evaluation | Monthly | Data Science |
| Capacity planning | Monthly | Infrastructure |
| Disaster recovery drill | Quarterly | All Teams |
| Full system audit | Annually | External Auditor |
---
## Additional Resources
### Documentation
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [PyTorch Documentation](https://pytorch.org/docs)
- [CUDA Programming Guide](https://docs.nvidia.com/cuda/)
### Support Channels
- GitHub Issues: For bug reports and feature requests
- Community Forum: For general questions and discussions
- Enterprise Support: For production deployments
### Example Projects
- REST API Server: `/examples/rest_api`
- Streaming Interface: `/examples/streaming`
- Batch Processing: `/examples/batch_processing`
- Fine-tuning: `/examples/fine_tuning`
---
## Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2024-11-01 | Initial release |
| 1.0.1 | 2024-11-15 | Performance optimizations |
| 1.1.0 | 2024-12-01 | Flash Attention 2 support |
---
**Last Updated**: 2024-11-10
**Maintained By**: DeepXR Engineering Team