Instructions for using VMware/xgen-7b-8k-open-instruct with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use VMware/xgen-7b-8k-open-instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="VMware/xgen-7b-8k-open-instruct")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("VMware/xgen-7b-8k-open-instruct")
model = AutoModelForCausalLM.from_pretrained("VMware/xgen-7b-8k-open-instruct")
```
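For a quick end-to-end check, the directly loaded model can generate from an instruction prompt. A minimal sketch; `trust_remote_code=True` (for the checkpoint's custom XgenTokenizer) and the Alpaca-style prompt template are assumptions, not taken from this page:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is assumed here because the checkpoint ships a custom
# XgenTokenizer rather than a stock Llama tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "VMware/xgen-7b-8k-open-instruct", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained("VMware/xgen-7b-8k-open-instruct")

# Alpaca-style instruction template (an assumption for this -open-instruct fine-tune)
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nExplain what a tokenizer does.\n\n### Response:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```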
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use VMware/xgen-7b-8k-open-instruct with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "VMware/xgen-7b-8k-open-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "VMware/xgen-7b-8k-open-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
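Because the endpoint is OpenAI-compatible, it can also be called from Python. A minimal sketch using the `openai` client package (an assumption; any OpenAI-compatible client works, and the same pattern applies to the SGLang server below by switching the port to 30000):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; vLLM does not check the
# API key by default, but the client requires some value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="VMware/xgen-7b-8k-open-instruct",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```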
Use Docker
```shell
docker model run hf.co/VMware/xgen-7b-8k-open-instruct
```
- SGLang
How to use VMware/xgen-7b-8k-open-instruct with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "VMware/xgen-7b-8k-open-instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "VMware/xgen-7b-8k-open-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
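Besides curl, the server can also be driven from SGLang's own Python frontend. A minimal sketch, assuming the `sglang` package's `@sgl.function` / `RuntimeEndpoint` API; the names `qa` and `answer` are illustrative:

```python
import sglang as sgl

# A small SGLang program: append the question, then generate an answer.
@sgl.function
def qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=128, temperature=0.5)

# Connect to the server started above.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = qa.run(question="What is a tokenizer?")
print(state["answer"])
```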
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "VMware/xgen-7b-8k-open-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "VMware/xgen-7b-8k-open-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use VMware/xgen-7b-8k-open-instruct with Docker Model Runner:
```shell
docker model run hf.co/VMware/xgen-7b-8k-open-instruct
```
Error in deploy
#1 by giuliogalvan - opened
I am trying to deploy this model (either through SageMaker or managed endpoints) to run extensive tests, but I ran into the following problem.
This is a log extract from AWS SageMaker after invoking huggingface_model.deploy():
```
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'XgenTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 246, in get_model
    return llama_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 44, in __init__
    tokenizer = LlamaTokenizer.from_pretrained(
  File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
    return cls._from_pretrained(
  File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/src/transformers/src/transformers/models/llama/tokenization_llama.py", line 96, in __init__
    self.sp_model.Load(vocab_file)
  File "/opt/conda/lib/python3.9/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/opt/conda/lib/python3.9/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

Error: ShardCannotStart
```
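For what it's worth, the mismatch seems reproducible outside SageMaker: the checkpoint ships a custom XgenTokenizer, while TGI's flash_llama path forces LlamaTokenizer, which expects a sentencepiece vocab file this repo does not appear to provide. A minimal sketch (trust_remote_code=True is my assumption):

```python
from transformers import AutoTokenizer, LlamaTokenizer

# AutoTokenizer resolves to the checkpoint's custom tokenizer class
# (trust_remote_code=True is an assumption; the tokenizer code ships in the repo)
tok = AutoTokenizer.from_pretrained(
    "VMware/xgen-7b-8k-open-instruct", trust_remote_code=True
)
print(type(tok).__name__)  # expected: XgenTokenizer

# Forcing LlamaTokenizer reproduces the server-side failure: it tries to load
# a sentencepiece vocab file that this checkpoint does not include.
try:
    LlamaTokenizer.from_pretrained("VMware/xgen-7b-8k-open-instruct")
except Exception as exc:
    print(f"LlamaTokenizer failed: {exc}")
```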
Any help would be very much appreciated :)