Instructions for using VMware/xgen-7b-8k-open-instruct with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use VMware/xgen-7b-8k-open-instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="VMware/xgen-7b-8k-open-instruct")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("VMware/xgen-7b-8k-open-instruct")
model = AutoModelForCausalLM.from_pretrained("VMware/xgen-7b-8k-open-instruct")
```
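For a quick end-to-end check, the directly loaded model can generate from an instruction prompt. A minimal sketch; `trust_remote_code=True` (for the checkpoint's custom XgenTokenizer) and the Alpaca-style prompt template are assumptions, not taken from this page:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is assumed here because the checkpoint ships a custom
# XgenTokenizer rather than a stock Llama tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "VMware/xgen-7b-8k-open-instruct", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained("VMware/xgen-7b-8k-open-instruct")

# Alpaca-style instruction template (an assumption for this -open-instruct fine-tune)
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nExplain what a tokenizer does.\n\n### Response:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```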
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use VMware/xgen-7b-8k-open-instruct with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "VMware/xgen-7b-8k-open-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "VMware/xgen-7b-8k-open-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
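Because the endpoint is OpenAI-compatible, it can also be called from Python. A minimal sketch using the `openai` client package (an assumption; any OpenAI-compatible client works, and the same pattern applies to the SGLang server below by switching the port to 30000):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; vLLM does not check the
# API key by default, but the client requires some value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="VMware/xgen-7b-8k-open-instruct",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```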
Use Docker
```shell
docker model run hf.co/VMware/xgen-7b-8k-open-instruct
```
- SGLang
How to use VMware/xgen-7b-8k-open-instruct with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "VMware/xgen-7b-8k-open-instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "VMware/xgen-7b-8k-open-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
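Besides curl, the server can also be driven from SGLang's own Python frontend. A minimal sketch, assuming the `sglang` package's `@sgl.function` / `RuntimeEndpoint` API; the names `qa` and `answer` are illustrative:

```python
import sglang as sgl

# A small SGLang program: append the question, then generate an answer.
@sgl.function
def qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=128, temperature=0.5)

# Connect to the server started above.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = qa.run(question="What is a tokenizer?")
print(state["answer"])
```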
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "VMware/xgen-7b-8k-open-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "VMware/xgen-7b-8k-open-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use VMware/xgen-7b-8k-open-instruct with Docker Model Runner:
```shell
docker model run hf.co/VMware/xgen-7b-8k-open-instruct
```
Error in deploy
#1 by giuliogalvan - opened
I am trying to deploy this model (either through SageMaker or managed endpoints) to run extensive tests, but I ran into the following problem.
This is a log extract from AWS SageMaker after invoking huggingface_model.deploy():
```
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'XgenTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 246, in get_model
    return llama_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 44, in __init__
    tokenizer = LlamaTokenizer.from_pretrained(
  File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
    return cls._from_pretrained(
  File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/src/transformers/src/transformers/models/llama/tokenization_llama.py", line 96, in __init__
    self.sp_model.Load(vocab_file)
  File "/opt/conda/lib/python3.9/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/opt/conda/lib/python3.9/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

Error: ShardCannotStart
```
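For what it's worth, the mismatch seems reproducible outside SageMaker: the checkpoint ships a custom XgenTokenizer, while TGI's flash_llama path forces LlamaTokenizer, which expects a sentencepiece vocab file this repo does not appear to provide. A minimal sketch (trust_remote_code=True is my assumption):

```python
from transformers import AutoTokenizer, LlamaTokenizer

# AutoTokenizer resolves to the checkpoint's custom tokenizer class
# (trust_remote_code=True is an assumption; the tokenizer code ships in the repo)
tok = AutoTokenizer.from_pretrained(
    "VMware/xgen-7b-8k-open-instruct", trust_remote_code=True
)
print(type(tok).__name__)  # expected: XgenTokenizer

# Forcing LlamaTokenizer reproduces the server-side failure: it tries to load
# a sentencepiece vocab file that this checkpoint does not include.
try:
    LlamaTokenizer.from_pretrained("VMware/xgen-7b-8k-open-instruct")
except Exception as exc:
    print(f"LlamaTokenizer failed: {exc}")
```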
Any help would be very much appreciated :)