do_lower_case=True not seeming to work

by thiagotps - opened 19 days ago

I'm testing version v1.1.1 with the following code:

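from transformers import AutoTokenizer

# Load the tokenizer at the v1.1.1 tag, explicitly requesting lowercasing
_tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", revision="v1.1.1", trust_remote_code=True, do_lower_case=True
)
_tokens = _tokenizer("Gallia est omnis divisa in partes tres.", return_tensors='pt')
_token_ids = _tokens['input_ids'][0]
# Map the IDs back to token strings to inspect what the tokenizer produced
_token_texts = _tokenizer.convert_ids_to_tokens(_token_ids)
_token_texts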

and the result is:

[
  "[CLS]",
  "\\",
  "71",
  ";",
  "allia",
  "_",
  "\\",
  "32",
  ";_",
  "est_",
  "\\",
  "32",
  ";_",
  "omnis_",
  "\\",
  "32",
  ";_",
  "divisa_",
  "\\",
  "32",
  ";_",
  "in_",
  "\\",
  "32",
  ";_",
  "partes_",
  "\\",
  "32",
  ";_",
  "tres_",
  "._",
  "[SEP]"
]

It seems that lower() is still not being applied internally: the capital G in Gallia was escaped by the tokenizer (the tokens \, 71, ; appear to encode character code 71, which is G) instead of being lowercased to g.
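For anyone who wants to check this without reading the token list by eye, here is a minimal sketch. It assumes the escape format is a backslash token, then a decimal character code, then a token starting with a semicolon, which is only inferred from the output above and is not documented behavior. The helper name `escaped_uppercase` is hypothetical and not part of the tokenizer.

```python
def escaped_uppercase(tokens):
    """Return uppercase characters that show up as escape sequences.

    Assumed escape shape (inferred from the output in this thread):
    a "\\" token, a decimal code-point token, then a token starting with ";".
    """
    found = []
    for i in range(len(tokens) - 2):
        is_escape = (
            tokens[i] == "\\"
            and tokens[i + 1].isdigit()
            and tokens[i + 2].startswith(";")
        )
        if is_escape:
            ch = chr(int(tokens[i + 1]))
            # Code 32 (space) is escaped too, so keep uppercase letters only
            if ch.isupper():
                found.append(ch)
    return found


# Shortened copy of the token list reported above
tokens = ["[CLS]", "\\", "71", ";", "allia", "_", "\\", "32", ";_", "est_", "[SEP]"]
print(escaped_uppercase(tokens))
```

If lowercasing were applied before tokenization, I would expect this to return an empty list for the same sentence.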

diyclassics

LatinCy org 19 days ago

Thank you for posting the issue. I have been able to replicate this behavior. It turned out to be a packaging error, not a code or model error, so I am going to force-update the v1.1.1 tag. The original snippet should now work, even if you do not explicitly pass do_lower_case=True, since it is the config default. Let me know if this works on your end, and thanks again for the report.

diyclassics changed discussion status to closed 19 days ago
