---
license: mit
---

# Language Detection

A lightweight language detection tool that uses character-level n-gram features and logistic regression to identify the language of a given text.

Supported languages out of the box: English, French, German, Turkish.

Model repository: https://huggingface.co/Isa0/language-detection/

## Installation

Requires Python 3.11 or higher. Install dependencies with [uv](https://github.com/astral-sh/uv):

```bash
uv sync
```

## Usage

### Train

Train the model on the datasets in the `datasets/` directory:

```bash
uv run main.py --train
```

You can point it to a different directory with `--dir`:

```bash
uv run main.py --train --dir path/to/datasets
```

Each `.txt` file in the directory should contain one sentence per line. The filename (without extension) is used as the language label.

### Detect

Detect the language of a text string:

```bash
uv run main.py --detect "Bonjour, comment allez-vous?"
```

Output includes the predicted language and a confidence score.

## Adding Languages

Add a new `.txt` file to the `datasets/` directory named after the language (e.g. `spanish.txt`), with one sentence per line, then retrain.

## How It Works

Text is converted into character-level n-gram counts (1 to 3 characters), which capture language-specific patterns like accents, letter combinations, and suffixes. A logistic regression classifier is trained on these features and saved to disk for reuse.
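
To make the feature step concrete, here is a minimal sketch of character n-gram counting for lengths 1 to 3, using only the standard library. It is an illustration, not the code in `main.py`: the helper name `char_ngrams` and its parameters are hypothetical, and the real pipeline may normalize text (for example, lowercasing) or mark word boundaries differently.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count every character n-gram of length n_min..n_max in text.

    Hypothetical helper for illustration; not taken from the project.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


features = char_ngrams("été")
print(features["é"])    # accented letters become features of their own
print(features["été"])  # so do longer letter combinations
```

In a full pipeline, counts like these would typically be mapped onto a fixed vocabulary to form one numeric vector per sentence, and those vectors are what the logistic regression classifier is trained on.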