Bilingual Children Speech Classifiers

This repository contains three trained classifiers that predict a bilingual child's first language (L1) from English child speech samples.

The three classifiers are:

linear_svm
logistic_regression
random_forest

Each classifier expects dense sentence embeddings produced with:

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
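Because the classifiers were fitted on embeddings from this specific model, vectors from a different embedding model would generally have the wrong width or come from a different vector space. A minimal sketch of a shape check that could run before prediction is given below. `check_embedding_shape` is a hypothetical helper, not part of this repository, and it assumes the loaded classifier is a fitted scikit-learn estimator that exposes the `n_features_in_` attribute.

```python
def check_embedding_shape(embeddings, classifier):
    """Raise ValueError if embeddings do not match what the classifier was fitted on.

    `embeddings` is anything with a 2-D `.shape` (for example the NumPy array
    returned by `SentenceTransformer.encode`). `classifier` is assumed to be a
    fitted scikit-learn estimator exposing `n_features_in_`.
    """
    shape = tuple(embeddings.shape)
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D array of embeddings, got shape {shape}")

    expected = getattr(classifier, "n_features_in_", None)
    if expected is None:
        # Nothing to compare against; skip the check rather than guess.
        return shape[1]

    if shape[1] != expected:
        raise ValueError(
            f"embedding width {shape[1]} does not match the {expected} features "
            "the classifier was fitted on; was a different embedding model used?"
        )
    return shape[1]
```

Calling `check_embedding_shape(embedding, classifier)` just before `classifier.predict(embedding)` turns a silent mismatch into an explicit error. The model card for this embedding model lists 384-dimensional output, so that is the width one would expect to see here.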

Files

  • linear_svm.joblib: trained linear SVM classifier.
  • logistic_regression.joblib: trained logistic regression classifier.
  • random_forest.joblib: trained scikit-learn Random Forest classifier.
  • label_encoder.joblib: fitted label encoder used to map numeric model classes back to readable L1 labels.
  • run_metadata.json: metadata for the trained classifiers.
  • model_comparison.csv: accuracy and macro-F1 comparison between the models.
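The two non-model files can be inspected with the standard library alone. The sketch below is illustrative only: the helper names are hypothetical, and because the exact keys in run_metadata.json and the column names in model_comparison.csv are not documented here, the code reads whatever the files contain rather than assuming a schema.

```python
import csv
import json
from pathlib import Path


def load_run_metadata(path="run_metadata.json"):
    """Return the parsed JSON metadata; no particular keys are assumed."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def load_model_comparison(path="model_comparison.csv"):
    """Return the comparison table as a list of dicts, one per CSV row.

    Column names come from the file's header row rather than being
    hard-coded, because the exact schema is not documented here.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def best_row(rows, metric):
    """Return the row with the highest value in column `metric`.

    `metric` must be a column that actually exists in the CSV; the name to
    pass (for example a macro-F1 column) is an assumption to verify first.
    """
    return max(rows, key=lambda row: float(row[metric]))
```

Printing `load_model_comparison()[0].keys()` first shows the real column names, after which `best_row(rows, <column>)` picks the strongest model on the chosen metric.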

Intended Use

These classifiers were created for an academic text classification assignment using cleaned CHILDES/CHAT bilingual child speech data and CodeX. They are designed to be used with the companion Hugging Face Space or with a local inference script that first embeds text using the Sentence Transformer model above.

Minimal Usage

The example uses the Random Forest classifier, but either of the other two can be substituted by loading its .joblib file instead.

import joblib
from sentence_transformers import SentenceTransformer

# Load the same embedding model the classifiers expect as input.
embedding_model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
# Load one classifier and the shared label encoder.
classifier = joblib.load("random_forest.joblib")
label_encoder = joblib.load("label_encoder.joblib")

# Embed the utterance; encode() takes a list and returns a 2-D array.
text = "I want to play with the toys and then go outside"
embedding = embedding_model.encode([text], convert_to_numpy=True)

# Predict the encoded class, then map it back to a readable L1 label.
encoded_prediction = classifier.predict(embedding)
prediction = label_encoder.inverse_transform(encoded_prediction)

print(prediction[0])
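The steps in the example above (embed, predict, decode) can be wrapped so that the classifier is easy to swap and several utterances can be classified at once. This is a sketch under stated assumptions: `CLASSIFIER_FILES`, `classifier_path`, and `predict_l1` are hypothetical names introduced here, and the function only relies on the `encode`, `predict`, and `inverse_transform` calls already used above.

```python
from pathlib import Path

# File names as listed in the Files section of this README.
CLASSIFIER_FILES = {
    "linear_svm": "linear_svm.joblib",
    "logistic_regression": "logistic_regression.joblib",
    "random_forest": "random_forest.joblib",
}


def classifier_path(name, model_dir="."):
    """Map a classifier name to its .joblib path, rejecting unknown names."""
    if name not in CLASSIFIER_FILES:
        raise ValueError(
            f"unknown classifier {name!r}; choose from {sorted(CLASSIFIER_FILES)}"
        )
    return Path(model_dir) / CLASSIFIER_FILES[name]


def predict_l1(texts, embedding_model, classifier, label_encoder):
    """Embed one or more strings, classify them, and return readable L1 labels."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single utterance for convenience
    embeddings = embedding_model.encode(list(texts), convert_to_numpy=True)
    encoded = classifier.predict(embeddings)
    return list(label_encoder.inverse_transform(encoded))
```

With the objects from the example already loaded, `classifier = joblib.load(classifier_path("linear_svm"))` followed by `predict_l1(["first utterance", "second utterance"], embedding_model, classifier, label_encoder)` returns one label per input string.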

Limitations

These classifiers were trained on a small academic dataset and should not be interpreted as general-purpose or diagnostic language-background detectors. Their predictions are best understood as exploratory machine-learning results that hold only within the scope of the training data and preprocessing pipeline.

Assignment links
