Diabetes Risk Random Forest Ensemble

This repository contains a scikit-learn random forest ensemble for binary diabetes risk classification from structured health survey inputs. The model assigns each person to one of two classes, "No diabetes" or "Prediabetes or diabetes", and returns the estimated probability of the positive class.

Model Details

  • Model type: Random forest ensemble
  • Number of classifiers: 10
  • Decision threshold: 0.60
  • Target definition: 0 = no diabetes, 1 = prediabetes or diabetes
  • Positive class label: Prediabetes or diabetes
  • Serialized artifact: models/random_forest_undersampling_ensemble_threshold_060.joblib
  • Artifact size: about 1.5 GB

The model artifact is a Python dictionary with:

  • ensemble: estimators and serving metadata
  • feature_columns: expected feature order
  • target_definition: label definition
  • test_metrics: evaluation summary
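Because the artifact is a plain dictionary, it can be sanity-checked right after loading. The sketch below is illustrative only: the helper name `summarize_artifact` is hypothetical, and it assumes the `ensemble` entry holds the `estimators` and `threshold` keys used in the Python loading example later in this README.

```python
# Top-level keys listed in this README.
EXPECTED_KEYS = ("ensemble", "feature_columns", "target_definition", "test_metrics")


def summarize_artifact(artifact: dict) -> dict:
    """Check the top-level keys and return a short summary (hypothetical helper)."""
    missing = [key for key in EXPECTED_KEYS if key not in artifact]
    if missing:
        raise KeyError(f"artifact is missing keys: {missing}")
    ensemble = artifact["ensemble"]
    return {
        "n_estimators": len(ensemble["estimators"]),
        "threshold": ensemble["threshold"],
        "n_features": len(artifact["feature_columns"]),
    }


# Typical use, assuming the artifact has already been loaded with joblib:
#   artifact = joblib.load("models/random_forest_undersampling_ensemble_threshold_060.joblib")
#   print(summarize_artifact(artifact))
```

For the artifact described here, the summary would be expected to report 10 estimators, a threshold of 0.60, and 21 features.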

Intended Use

This model is intended for educational and prototype use with tabular health-risk data shaped like the input schema below. It may be useful for demonstrating model serving, risk stratification workflows, or an API-backed machine learning project.

Do not use this model as a medical diagnostic system or as the sole basis for healthcare decisions. Diabetes screening, diagnosis, and treatment require clinical evaluation by qualified healthcare professionals.

Inputs

The model expects one row with these 21 fields:

Feature               Type     Description
HighBP                integer  High blood pressure indicator, usually 0 or 1.
HighChol              integer  High cholesterol indicator, usually 0 or 1.
CholCheck             integer  Cholesterol check indicator, usually 0 or 1.
BMI                   float    Body mass index.
Smoker                integer  Smoking history indicator, usually 0 or 1.
Stroke                integer  Stroke history indicator, usually 0 or 1.
HeartDiseaseorAttack  integer  Heart disease or heart attack history indicator, usually 0 or 1.
PhysActivity          integer  Physical activity indicator, usually 0 or 1.
Fruits                integer  Fruit consumption indicator, usually 0 or 1.
Veggies               integer  Vegetable consumption indicator, usually 0 or 1.
HvyAlcoholConsump     integer  Heavy alcohol consumption indicator, usually 0 or 1.
AnyHealthcare         integer  Healthcare coverage/access indicator, usually 0 or 1.
NoDocbcCost           integer  Could not see a doctor because of cost indicator, usually 0 or 1.
GenHlth               integer  General health category.
MentHlth              integer  Number of poor mental health days.
PhysHlth              integer  Number of poor physical health days.
DiffWalk              integer  Difficulty walking indicator, usually 0 or 1.
Sex                   integer  Encoded sex category.
Age                   integer  Encoded age category.
Education             integer  Encoded education category.
Income                integer  Encoded income category.
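Since the model depends on all 21 fields being present, it can help to validate a row before scoring it. The following is a minimal sketch and not part of the repository: `validate_row` is a hypothetical helper, and treating the indicator fields as strictly 0 or 1 is an assumption (the table only says "usually").

```python
from typing import List

# The 21 field names from the input table above.
FEATURES = (
    "HighBP", "HighChol", "CholCheck", "BMI", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
    "MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education", "Income",
)

# Fields described as indicators that are "usually 0 or 1".
# Assumption: this sketch rejects any other value for them.
INDICATOR_FEATURES = (
    "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "DiffWalk",
)


def validate_row(row: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the row looks usable."""
    problems = []
    missing = [name for name in FEATURES if name not in row]
    extra = [name for name in row if name not in FEATURES]
    if missing:
        problems.append(f"missing fields: {missing}")
    if extra:
        problems.append(f"unexpected fields: {extra}")
    for name in INDICATOR_FEATURES:
        if name in row and row[name] not in (0, 1):
            problems.append(f"{name} should be 0 or 1, got {row[name]!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets an API report every issue with a request at once.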

Output

The FastAPI /predict endpoint returns a JSON object:

{
  "prediction": 1,
  "diabetes_probability": 0.7342,
  "prediction_label": "Prediabetes or diabetes"
}

diabetes_probability is the average positive-class probability across the ensemble. The final class is 1 when this probability is greater than or equal to 0.60.
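The averaging and thresholding rule can be sketched in isolation, without loading the model. This is an illustration under stated assumptions: `build_response` is a hypothetical helper, the per-estimator probabilities are passed in as plain floats, the label for class 0 is taken to be "No diabetes", and rounding to four decimals is inferred from the sample response rather than confirmed by the source.

```python
from typing import Sequence

# Assumed label mapping; only the positive label is stated explicitly in this README.
CLASS_LABELS = {0: "No diabetes", 1: "Prediabetes or diabetes"}


def build_response(positive_probabilities: Sequence[float], threshold: float = 0.60) -> dict:
    """Average per-estimator positive-class probabilities and apply the threshold."""
    if not positive_probabilities:
        raise ValueError("at least one probability is required")
    probability = sum(positive_probabilities) / len(positive_probabilities)
    # The comparison is inclusive: a probability equal to the threshold maps to class 1.
    prediction = int(probability >= threshold)
    return {
        "prediction": prediction,
        "diabetes_probability": round(probability, 4),
        "prediction_label": CLASS_LABELS[prediction],
    }
```

The threshold is applied to the unrounded average, so rounding the reported probability cannot change the predicted class.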

Evaluation

The saved artifact includes the following test metrics:

Metric              Value
Negative precision  0.9376
Negative recall     0.7521
Positive precision  0.3559
Positive recall     0.7323
Positive F1         0.4791
Macro F1            0.6569
Balanced accuracy   0.7422

At the 0.60 threshold, positive-class recall (0.7323) is much higher than positive-class precision (0.3559). The model flags most true positives in the test set, but about 64% of its positive predictions are false positives.
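The derived metrics in the table follow from the four precision and recall values, which makes them easy to cross-check. The sketch below recomputes them using the standard definitions (F1 as the harmonic mean of precision and recall, balanced accuracy as the mean of the two recalls). Because the inputs are already rounded to four decimals, the results should match the table only to within about 0.0001.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the metrics table above.
negative_precision, negative_recall = 0.9376, 0.7521
positive_precision, positive_recall = 0.3559, 0.7323

positive_f1 = f1_score(positive_precision, positive_recall)
negative_f1 = f1_score(negative_precision, negative_recall)
macro_f1 = (positive_f1 + negative_f1) / 2
balanced_accuracy = (positive_recall + negative_recall) / 2

# Share of positive predictions that are false positives.
false_discovery_rate = 1 - positive_precision

print(f"positive F1:          {positive_f1:.4f}")
print(f"macro F1:             {macro_f1:.4f}")
print(f"balanced accuracy:    {balanced_accuracy:.4f}")
print(f"false discovery rate: {false_discovery_rate:.4f}")
```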

Local Usage

Install dependencies:

pip install -r requirements.txt

Run the API:

uvicorn main:app --host 0.0.0.0 --port 8000

Send a prediction request:

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d @sample_input.json

Or run the standalone example:

python inference.py
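The curl request above can also be issued from Python using only the standard library. This is a sketch, not code from the repository: the helper names are hypothetical, and it assumes the endpoint accepts a flat JSON object containing the 21 input fields, which is what `sample_input.json` is presumed to hold.

```python
import json
import urllib.request


def build_predict_request(row: dict, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /predict endpoint without sending it."""
    body = json.dumps(row).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(row: dict, base_url: str = "http://localhost:8000", timeout: float = 10.0) -> dict:
    """Send the request to a running API and decode the JSON response."""
    request = build_predict_request(row, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Typical use, assuming the API is already running locally:
#   with open("sample_input.json") as handle:
#       print(predict(json.load(handle)))
```

Separating request construction from sending keeps the payload easy to inspect or log before it goes over the network.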

Python Loading Example

import joblib
import pandas as pd

artifact = joblib.load("models/random_forest_undersampling_ensemble_threshold_060.joblib")
ensemble = artifact["ensemble"]
feature_columns = artifact["feature_columns"]

row = {
    "HighBP": 1,
    "HighChol": 1,
    "CholCheck": 1,
    "BMI": 31.5,
    "Smoker": 0,
    "Stroke": 0,
    "HeartDiseaseorAttack": 0,
    "PhysActivity": 1,
    "Fruits": 1,
    "Veggies": 1,
    "HvyAlcoholConsump": 0,
    "AnyHealthcare": 1,
    "NoDocbcCost": 0,
    "GenHlth": 3,
    "MentHlth": 2,
    "PhysHlth": 4,
    "DiffWalk": 0,
    "Sex": 1,
    "Age": 9,
    "Education": 5,
    "Income": 6
}

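# Build a one-row frame and reorder columns to the expected feature order.
input_df = pd.DataFrame([row])[feature_columns]

# Average the positive-class probability (index 1) over all ensemble members.
probability = sum(
    estimator.predict_proba(input_df)[0][1]
    for estimator in ensemble["estimators"]
) / len(ensemble["estimators"])

# Apply the stored decision threshold and look up the class label.
prediction = int(probability >= ensemble["threshold"])
label = ensemble["class_mapping"][prediction]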

Limitations and Bias

  • The model uses survey-style proxy variables and encoded demographic or socioeconomic fields.
  • Performance can vary across populations, data collection methods, and clinical settings.
  • The positive-class precision in the saved test metrics is low, so positive predictions should be treated as screening signals rather than conclusions.
  • The repository does not currently include the original training dataset or a full training script, so independent reproduction of the reported metrics is limited.
  • Any deployment in a real healthcare context should include dataset review, fairness analysis, calibration checks, clinical validation, privacy review, and human oversight.

Recommended Repository Layout

.
├── app/
│   └── schemas.py
├── models/
│   └── random_forest_undersampling_ensemble_threshold_060.joblib
├── src/
│   └── model_features.py
├── inference.py
├── main.py
├── requirements.txt
└── sample_input.json

License

No license has been specified yet. Add a license before sharing, reusing, or deploying the model outside your own controlled project context.
