Indic Tokenizer v2
Custom SentencePiece Unigram tokenizer trained on:
- Hindi, Tamil, Telugu corpora
- Code-mixed Hinglish data
Features
- 40–70% fewer tokens than the GPT-2 tokenizer
- Script-aware tokenization
- Better handling of Indic languages
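The token-count claim above can be checked on your own text. The following is a minimal sketch, not the author's evaluation script: `token_reduction` is a hypothetical helper (not part of this repo or of transformers) that accepts any two tokenize callables and reports the percentage drop in total token count relative to the baseline.

```python
from typing import Callable, Iterable, List


def token_reduction(
    texts: Iterable[str],
    baseline_tokenize: Callable[[str], List[str]],
    candidate_tokenize: Callable[[str], List[str]],
) -> float:
    """Return the percentage by which the candidate lowers the total token count."""
    texts = list(texts)  # materialize so the texts can be iterated twice
    baseline_total = sum(len(baseline_tokenize(t)) for t in texts)
    candidate_total = sum(len(candidate_tokenize(t)) for t in texts)
    if baseline_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return 100.0 * (baseline_total - candidate_total) / baseline_total


# Example with real tokenizers (downloads files from the Hub; the repo name
# is the placeholder used in this card, so substitute the real one):
# from transformers import AutoTokenizer
# gpt2 = AutoTokenizer.from_pretrained("gpt2")
# indic = AutoTokenizer.from_pretrained(
#     "your-username/indic-tokenizer-v2", trust_remote_code=True
# )
# print(token_reduction(["नमस्ते मित्र, कैसे हो?"], gpt2.tokenize, indic.tokenize))
```

A positive result means the candidate tokenizer emits fewer tokens than the baseline. The figure depends heavily on the language and script mix of the sample, so measure it on text that resembles your own data.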
Usage
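from transformers import AutoTokenizer

# Replace "your-username" with the account that hosts the tokenizer.
# trust_remote_code=True allows transformers to run custom code shipped in the repo.
tokenizer = AutoTokenizer.from_pretrained("your-username/indic-tokenizer-v2", trust_remote_code=True)
print(tokenizer.tokenize("नमस्ते मित्र, कैसे हो?"))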