Instructions to use tiiuae/falcon-40b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use tiiuae/falcon-40b with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/falcon-40b", trust_remote_code=True)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)
```
- Notebooks
- Google Colab
- Kaggle
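Loading the full model with the Transformers snippet needs enough GPU (or CPU) memory to hold the weights. As a rough guide, weight memory is the parameter count times the bytes per parameter. The minimal sketch below assumes a nominal 40 billion parameters (taken from the model name; the exact count differs somewhat) and ignores activations, the KV cache, quantization scale overhead, and framework overhead, so real requirements are higher. The function name `weight_memory_gb` is illustrative, not part of any library.

```python
# Rough weight-memory estimate for a dense model.
# Assumption: ignores activations, KV cache, quantization metadata, and framework overhead.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal parameter count from the model name (hypothetical round figure).
    n = 40e9
    for dtype in ("bfloat16", "int8", "4bit"):
        print(f"{dtype:>8}: ~{weight_memory_gb(n, dtype):.0f} GB")
```

For the nominal count this gives about 80 GB in bfloat16, 40 GB in int8, and 20 GB at 4 bits, before any runtime overhead.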
- Local Apps
- vLLM
How to use tiiuae/falcon-40b with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "tiiuae/falcon-40b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tiiuae/falcon-40b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
Use Docker
```bash
docker model run hf.co/tiiuae/falcon-40b
```
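The curl call to the vLLM server can also be made from Python. The minimal sketch below uses only the standard library and builds the same `POST /v1/completions` request with the same JSON fields (`model`, `prompt`, `max_tokens`, `temperature`) and the default port 8000 shown above. The helper names `build_completion_request` and `complete` and the 600-second timeout are illustrative assumptions, not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    model: str = "tiiuae/falcon-40b",
    max_tokens: int = 512,
    temperature: float = 0.5,
) -> urllib.request.Request:
    """Build the same POST /v1/completions request that the curl example sends."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> dict:
    """Send the request to a running server and return the decoded JSON response."""
    req = build_completion_request(prompt, **kwargs)
    # Timeout is an arbitrary choice; long generations on a 40B model can be slow.
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `complete("Once upon a time,")` requires a running server. For the SGLang server described next, the same helper should work with `base_url="http://localhost:30000"`, since it exposes the same OpenAI-compatible route.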
- SGLang
How to use tiiuae/falcon-40b with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tiiuae/falcon-40b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tiiuae/falcon-40b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
Use Docker images
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tiiuae/falcon-40b" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tiiuae/falcon-40b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
- Docker Model Runner
How to use tiiuae/falcon-40b with Docker Model Runner:
```bash
docker model run hf.co/tiiuae/falcon-40b
```
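Both the vLLM and SGLang servers above expose an OpenAI-compatible `/v1/completions` endpoint, so their JSON responses can be read the same way. The minimal sketch below assumes the response follows the OpenAI completions schema: a `choices` list whose items carry `index`, `text`, and `finish_reason`, where `finish_reason == "length"` means generation stopped at `max_tokens`. The helper names are hypothetical, and the sample response is invented for illustration, not real Falcon output.

```python
from typing import Any


def completion_texts(response: dict[str, Any]) -> list[str]:
    """Return the generated text of each choice, ordered by its "index" field."""
    choices = sorted(response.get("choices", []), key=lambda c: c.get("index", 0))
    return [choice.get("text", "") for choice in choices]


def hit_token_limit(response: dict[str, Any]) -> bool:
    """True if any choice stopped because it reached max_tokens (finish_reason == "length")."""
    return any(c.get("finish_reason") == "length" for c in response.get("choices", []))


if __name__ == "__main__":
    # Invented example response for illustration only; not real model output.
    sample = {
        "object": "text_completion",
        "model": "tiiuae/falcon-40b",
        "choices": [
            {"index": 0, "text": " there lived a falcon.", "finish_reason": "stop"},
        ],
    }
    print(completion_texts(sample))
    print(hit_token_limit(sample))
```

Checking `finish_reason` is a cheap way to notice truncated output; with `"max_tokens": 512` as in the curl examples, long continuations will often end with `"length"`.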
- #38 Falcon 40B Inference at 4bit in Google Colab (pinned; 🤗👍 27; 27 comments; opened almost 3 years ago by serin32)
- #25 Custom 4-bit Finetuning: 5-7 times faster inference than QLoRA (pinned; ❤️👍 7; 6 comments; opened almost 3 years ago by rmihaylov)
- #117 Install & run tiiuae/falcon-40b easily using llmpm (opened 3 months ago by sarthak-saxena)
- #116 Free Image Generator (opened 6 months ago by Ebuka-Great)
- #115 remove-extra-parentheses (opened almost 2 years ago by ZennyKenny)
- #114 Could not locate configuration_RW.py inside tiiuae/falcon-40b-instruct (opened about 2 years ago by cosmino)
- #113 [AUTOMATED] Model Memory Requirements (opened about 2 years ago by model-sizer-bot)
- #111 Adding Evaluation Results (opened about 2 years ago by leaderboard-pr-bot)
- #110 Could someone upload a tokenizer.model file to allow making GGUFs? (opened over 2 years ago by RonanMcGovern)
- #109 Add chat_template so that it can be used for chat out of the box (opened over 2 years ago by chujiezheng)
- #108 Problem when testing the model (opened over 2 years ago by louvivien)
- #106 Update generation_config.json (1 comment; opened over 2 years ago by nkasmanoff)
- #105 Gradio interface (opened almost 3 years ago by sequentialsystems)
- #104 Optimizing Inference Time for Chat Conversations on Falcon (👍 1; 2 comments; opened almost 3 years ago by humza-sami)
- #103 Fine-tuned Falcon40 is not working with the text-generation pipeline (opened almost 3 years ago by chelouche9)
- #102 Advice on inference over a large-ish dataset in Databricks? (opened almost 3 years ago by archonlith)
- #101 Use input attention mask instead of causal mask in attention (opened almost 3 years ago by CyberZHG)
- #99 Inference (4 comments; opened almost 3 years ago by davidhung)
- #98 Best Practice for Handling Variable-Length Sequences in Training an LLM on a Chatbot Dataset (opened almost 3 years ago by humza-sami)
- #97 Request: DOI (opened almost 3 years ago by waelTalan)
- #96 Getting HTTP Error Code 422 when using the Inference API (2 comments; opened almost 3 years ago by reetkat)
- #95 Run Falcon on Mac (2 comments; opened almost 3 years ago by corin9122)
- #94 Unable to use all cores (2 comments; opened almost 3 years ago by armx40)
- #93 Bug: the model's head dimensionality is hardcoded (opened almost 3 years ago by a deleted user)
- #92 Fine-tune on model response only? (👍 1; 1 comment; opened almost 3 years ago by mkserge)
- #91 Finetuning Base Falcon on an Unseen Language/New Data (non-instruct/RLHF) (2 comments; opened almost 3 years ago by AshBam)
- #89 Slow response time for 7B and 40B (6 comments; opened almost 3 years ago by kartik99)
- #88 configuration_RW.py missing in the latest commit (opened almost 3 years ago by ravikiran3690)
- #87 Update README.md (2 comments; opened almost 3 years ago by FelixMildon)
- #86 Falcon breaks after the second prompt of code (opened almost 3 years ago by thecowmilk)
- #85 Changes in modelling_RW.py to be able to handle past_key_values for faster model generations (8 comments; opened almost 3 years ago by puru22)
- #84 @TII Falcon is stunning, but will you continue or is the majestic bird destined to starve? (opened almost 3 years ago by cmp-nct)
- #83 Finetune error using the notebook referenced on the model page (opened almost 3 years ago by hamad)
- #82 Nvidia H100 finetuning error with bitsandbytes (2 comments; opened almost 3 years ago by ashmitbhattarai)
- #80 New here, confused which .bin file to download? (opened almost 3 years ago by kingofdelphi)
- #77 Update generation_config.json (opened almost 3 years ago by psinger)
- #76 Request: DOI (opened almost 3 years ago by winter6below618)
- #75 Seeking insights on integrating RAG with Falcon for domain-specific requirements (opened almost 3 years ago by rahul2008d)
- #74 Prevent Hallucinations (1 comment; opened almost 3 years ago by Zhaoqiong)
- #73 Deployment on Azure ML (1 comment; opened almost 3 years ago by Eliahu551818)
- #72 Access to Hidden States (opened almost 3 years ago by DJT777)
- #71 Were special tokens trained? (opened almost 3 years ago by Tron2060)
- #70 Example code from README outputs nonsense (1 comment; opened almost 3 years ago by amitgurintecom)
- #69 New language (2 comments; opened almost 3 years ago by mindplay)
- #68 GPU requirements (👍 6; 7 comments; opened almost 3 years ago by GuySerk)
- #67 CUDA out-of-memory error (2 comments; opened almost 3 years ago by ibrim)
- #66 ValueError: The following model_kwargs are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list) (1 comment; opened almost 3 years ago by yiz4869)
- #65 How to fine-tune Falcon for summarization on XSum? (1 comment; opened almost 3 years ago by uzumakiusa)
- #64 Need clarity about the adjustable model hyperparameters (opened almost 3 years ago by Someshfengde)
- #63 Update README.md (opened almost 3 years ago by Gage888)