Indic Tokenizer v2

Custom SentencePiece Unigram tokenizer trained on:

  • Hindi, Tamil, Telugu corpora
  • Code-mixed Hinglish data
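The training configuration is not given here. As a rough sketch only, a Unigram model over corpora like these could be trained with the `sentencepiece` Python package as shown below. The helper names, file names, `vocab_size`, and `character_coverage` are illustrative assumptions, not the settings used for this tokenizer.

```python
from typing import Dict, List


def unigram_training_args(corpus_files: List[str], model_prefix: str,
                          vocab_size: int = 32000) -> Dict[str, object]:
    """Build keyword arguments for SentencePieceTrainer.train (values are assumptions)."""
    return {
        "input": ",".join(corpus_files),  # comma-separated plain-text corpus files
        "model_prefix": model_prefix,     # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",          # Unigram language-model tokenizer
        "vocab_size": vocab_size,         # assumed size, not the published one
        "character_coverage": 0.9995,     # assumed fraction of characters covered
    }


def train_unigram(corpus_files: List[str], model_prefix: str,
                  vocab_size: int = 32000) -> None:
    """Train a Unigram model on the given files (requires `pip install sentencepiece`)."""
    # Imported lazily so the argument builder works without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **unigram_training_args(corpus_files, model_prefix, vocab_size)
    )
```

Calling `train_unigram(["hi.txt", "ta.txt", "te.txt", "hinglish.txt"], "indic")` with hypothetical file names would write `indic.model` and `indic.vocab` to the working directory.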

Features

  • 40–70% fewer tokens than the GPT-2 tokenizer
  • Script-aware tokenization
  • Better handling of Indic languages
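To check the token-count claim on your own text, one option is to compare total token counts from two tokenizers over the same sample. The sketch below assumes nothing about this tokenizer beyond a callable that maps a string to a list of tokens; `token_reduction` is a hypothetical helper, not part of this repository.

```python
from typing import Callable, Iterable, List


def token_reduction(texts: Iterable[str],
                    candidate: Callable[[str], List[str]],
                    baseline: Callable[[str], List[str]]) -> float:
    """Return the percentage by which `candidate` uses fewer tokens than `baseline`."""
    texts = list(texts)
    candidate_total = sum(len(candidate(t)) for t in texts)
    baseline_total = sum(len(baseline(t)) for t in texts)
    if baseline_total == 0:
        raise ValueError("baseline produced no tokens")
    # Positive values mean the candidate needs fewer tokens than the baseline.
    return 100.0 * (1.0 - candidate_total / baseline_total)
```

With `transformers` installed, `candidate` could be the `tokenize` method of this tokenizer and `baseline` the `tokenize` method of `AutoTokenizer.from_pretrained("gpt2")`. The result depends heavily on the sample, so measure on text that resembles your workload.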

Usage

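from transformers import AutoTokenizer

# trust_remote_code=True lets transformers run the custom tokenizer code
# shipped in the repository; review that code before enabling it.
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/indic-tokenizer-v2",
    trust_remote_code=True,
)

# Split a Hindi sentence into subword tokens.
print(tokenizer.tokenize("नमस्ते मित्र, कैसे हो?"))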
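Beyond `tokenize`, Hugging Face tokenizers normally expose `encode` and `decode` for converting between text and ids. The sketch below is a small round-trip check built on those two methods. `roundtrip` is a hypothetical helper, and it assumes this custom tokenizer follows the standard `encode`/`decode` signatures; whether the decoded text matches the input exactly depends on the tokenizer's normalization, which is not documented here.

```python
from typing import List, Tuple


def roundtrip(tokenizer, text: str) -> Tuple[List[int], str]:
    """Encode `text` to ids without special tokens, then decode the ids back to text."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded
```

For example, `ids, text = roundtrip(tokenizer, "नमस्ते मित्र, कैसे हो?")` returns the ids and the decoded string, which can then be compared with the original input.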
