Best Practice for Handling Variable-Length Sequences When Training an LLM on a Chatbot Dataset
I am currently training Falcon (an LLM) on a chatbot dataset, and I would appreciate some guidance on handling variable-length sequences within it. The dataset consists of around 500 examples, each a series of chat messages exchanged between user 1 and user 2. The examples contain different numbers of messages, so their sequence lengths differ.
Here are two representative data points from the dataset:
Datapoint 1 = """user 1 : How are you ?\n user 2 : I am good. \n user 1 : What do you like ? \n user 2 : Apples"""
Datapoint 2 = """user 1 : How are you ?\nuser 2 : I am good.\n user 1 : What do you like in fruits?\n user 2 : Oranges \nuser 1 : Great me too\n user 2 : But sometimes I like mangoes \nuser 1 : seems intresting \n user 2 : Yeah"""
To prepare the data for training, I tokenized the dataset with the maximum length of input_ids set to 4 tokens, and handled the overflowing tokens by padding them accordingly.
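For concreteness, here is a minimal sketch of how I read that step: a long sequence of token ids is cut into chunks of 4, and the last (overflow) chunk is right-padded. It uses plain Python lists instead of the real Falcon tokenizer; the token ids, `PAD_ID`, and the helper name `chunk_and_pad` are made up for illustration.

```python
# Illustrative only: plain lists stand in for real tokenizer output.
PAD_ID = 0  # assumed padding id; the real value comes from the tokenizer


def chunk_and_pad(token_ids, max_length=4, pad_id=PAD_ID):
    """Split token_ids into chunks of max_length and right-pad the last chunk.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    input_ids, attention_mask = [], []
    for start in range(0, len(token_ids), max_length):
        chunk = token_ids[start:start + max_length]
        n_pad = max_length - len(chunk)
        input_ids.append(chunk + [pad_id] * n_pad)
        attention_mask.append([1] * len(chunk) + [0] * n_pad)
    return input_ids, attention_mask


ids = [11, 12, 13, 14, 15, 16]  # made-up ids for one chat example
input_ids, attention_mask = chunk_and_pad(ids)
print(input_ids)       # [[11, 12, 13, 14], [15, 16, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```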
Now, my question is: in cases where a chat message contains fewer than 4 tokens, what is considered a best practice? Should I pad these shorter sequences to match the maximum length, or would it be more suitable to keep them as they are?
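To make the two options concrete, here is a small sketch with made-up token ids and an assumed `PAD_ID`. The first helper pads every short message to the fixed maximum of 4. The second pads only up to the longest message in the batch, which is the closest I can get to "keeping them as they are" if the batch still has to be rectangular. Both helper names are hypothetical.

```python
PAD_ID = 0  # assumed padding id; the real value comes from the tokenizer


def pad_to_fixed(seqs, max_length=4, pad_id=PAD_ID):
    """Option 1: right-pad every sequence to the same fixed max_length."""
    return [s + [pad_id] * (max_length - len(s)) for s in seqs]


def pad_to_longest(seqs, pad_id=PAD_ID):
    """Option 2: right-pad only up to the longest sequence in this batch."""
    longest = max(len(s) for s in seqs)
    return [s + [pad_id] * (longest - len(s)) for s in seqs]


short_batch = [[21, 22], [31, 32, 33]]  # two messages shorter than 4 tokens
fixed = pad_to_fixed(short_batch)
dynamic = pad_to_longest(short_batch)
print(fixed)    # [[21, 22, 0, 0], [31, 32, 33, 0]]
print(dynamic)  # [[21, 22, 0], [31, 32, 33]]
```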
I would appreciate any insights or suggestions on the most appropriate approach for handling variable-length sequences in this context.