
Usage Examples - Detoxify-Small

Basic Usage

1. Start the Server

./run_server.sh

2. Check Server Health

curl http://127.0.0.1:8000/health
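
If you are scripting against the server, you can poll /health until it responds before sending completions. A minimal sketch in Python; the timeout value and the assumption that an HTTP 200 means the model is loaded are mine, not part of this model card:

import time
import requests

def wait_for_server(url="http://127.0.0.1:8000/health", timeout=60):
    """Poll the /health endpoint until it returns 200 or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(1)
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")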

3. Simple Completion

curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 100,
    "temperature": 0.7
  }'

4. Streaming Response

curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": true
  }'
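
With "stream": true the server returns the completion incrementally as server-sent events, one data: line per chunk. A sketch of consuming the stream from Python; the data: prefix and the content / stop fields follow llama.cpp's streaming format and may differ between server versions:

import json
import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": True,
}

with requests.post("http://127.0.0.1:8000/completion", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = json.loads(line[len(b"data: "):])
        print(chunk.get("content", ""), end="", flush=True)  # print each piece as it arrives
        if chunk.get("stop"):
            break
print()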

Advanced Configuration

Custom Server Settings

llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8 \
  --chat-template "" \
  --log-disable
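
If you would rather manage the server from Python than from a shell script, the same flags can be passed through subprocess. A rough sketch, assuming llama-server is on your PATH and model.gguf is in the current directory:

import subprocess

# Launch llama-server with the settings shown above; adjust the flags to taste.
server = subprocess.Popen([
    "llama-server",
    "-m", "model.gguf",
    "--host", "127.0.0.1",
    "--port", "8000",
    "--n-gpu-layers", "35",
    "--ctx-size", "4096",
    "--threads", "8",
])

# ... send requests while the server runs ...

server.terminate()  # stop the server when finished
server.wait()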

GPU Acceleration (macOS with Metal)

On macOS, Metal builds of llama.cpp use the GPU automatically; --n-gpu-layers controls how many layers are offloaded:

llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50

GPU Acceleration (Linux/Windows with CUDA)

On Linux or Windows, use a llama.cpp build compiled with CUDA support; as with Metal, --n-gpu-layers controls how many layers are offloaded:

llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50

Python Client Example

import requests

def complete_with_model(prompt, max_tokens=200, temperature=0.7):
    """Send a prompt to the local llama-server and return the generated text."""
    url = "http://127.0.0.1:8000/completion"

    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature
    }

    # requests sets the Content-Type: application/json header automatically for json= payloads
    response = requests.post(url, json=payload)

    if response.status_code == 200:
        result = response.json()
        # The generated text is returned in the "content" field
        return result['content']
    else:
        return f"Error: {response.status_code}"

# Example usage
prompt = "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is awful!\n\nResponse: "
response = complete_with_model(prompt)
print(response)
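
Since every request uses the same Instruction/Input/Response template, it can help to wrap that boilerplate in a small helper. A sketch building on complete_with_model above; the detoxify_text name is mine:

TEMPLATE = (
    "Instruction: Rewrite the provided text to remove the toxicity.\n\n"
    "Input: {text}\n\nResponse: "
)

def detoxify_text(text, **kwargs):
    """Wrap raw text in the prompt template and return the model's rewrite."""
    return complete_with_model(TEMPLATE.format(text=text), **kwargs).strip()

print(detoxify_text("This is awful!"))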

Troubleshooting

Common Issues

  1. Memory Errors

    Error: not enough memory

    Solution: Reduce --n-gpu-layers (down to 0 if needed) so fewer layers are loaded onto the GPU

  2. Context Window Too Large

    Error: context size exceeded

    Solution: Reduce --ctx-size (e.g., --ctx-size 2048)

  3. CUDA Not Available

    Error: CUDA not found

    Solution: Install the CUDA drivers and use a CUDA-enabled llama.cpp build, or set --n-gpu-layers 0 to run on the CPU

  4. Port Already in Use

    Error: bind failed

    Solution: Use a different port with --port 8001 (see the port-check sketch below)
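
For the port conflict case, you can check whether a port is free before launching the server. A small sketch using Python's standard socket module:

import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

print(port_is_free(8000))  # False if llama-server (or anything else) is already using the port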

Performance Tuning

  • For faster inference: increase --n-gpu-layers to offload more of the model to the GPU
  • For lower latency: reduce --ctx-size
  • For more consistent rewrites: lower the request temperature (and optionally top_p)
  • For more varied rewrites: raise the temperature and adjust top_k (see the sampling sketch below)
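
These sampling settings are per-request parameters on the /completion endpoint. A sketch of a payload that trades some variety for consistency; the specific values are only a starting point, not tuned for this model:

import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 100,
    "temperature": 0.3,  # lower temperature -> more deterministic rewrites
    "top_k": 40,         # consider only the 40 most likely tokens at each step
    "top_p": 0.9,        # nucleus sampling cutoff
}

print(requests.post("http://127.0.0.1:8000/completion", json=payload).json()["content"])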

System Requirements

  • RAM: Minimum 8GB, recommended 16GB+
  • GPU: Optional but recommended for better performance
  • Storage: the model file itself, plus roughly twice its size as headroom for temporary files

Generated on 2025-09-17 20:07:11