Diabetes Risk Random Forest Ensemble

This repository contains a scikit-learn random forest ensemble for binary diabetes risk classification from structured health survey inputs. The model assigns each person to one of two classes, "No diabetes" or "Prediabetes or diabetes", and returns the estimated probability of the positive class.

Model Details

Model type: Random forest ensemble
Number of classifiers: 10
Decision threshold: 0.60
Target definition: 0 = no diabetes, 1 = prediabetes or diabetes
Positive class label: Prediabetes or diabetes
Serialized artifact: models/random_forest_undersampling_ensemble_threshold_060.joblib
Artifact size: about 1.5 GB

The model artifact is a Python dictionary with these top-level keys:

ensemble: the fitted estimators plus serving metadata such as the decision threshold and class mapping
feature_columns: expected feature order
target_definition: label definition
test_metrics: evaluation summary
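
A quick structural check after loading can catch a mismatched or truncated artifact early. The sketch below relies only on the four top-level keys listed above; the helper name `check_artifact` and the stand-in dictionary are hypothetical and not part of the repository.

```python
# Top-level keys as described in this README.
EXPECTED_KEYS = {"ensemble", "feature_columns", "target_definition", "test_metrics"}


def check_artifact(artifact):
    """Return the sorted list of expected top-level keys missing from the artifact."""
    if not isinstance(artifact, dict):
        raise TypeError("artifact should be a dict, got " + type(artifact).__name__)
    return sorted(EXPECTED_KEYS - set(artifact))


# Stand-in dictionary for illustration; the real one comes from joblib.load(...).
demo = {"ensemble": {}, "feature_columns": [], "test_metrics": {}}
print(check_artifact(demo))  # ['target_definition']
```

An empty list means all four keys are present; it says nothing about whether the values inside them are valid.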

Intended Use

This model is intended for educational and prototype use with tabular health-risk data shaped like the input schema below. It may be useful for demonstrating model serving, risk stratification workflows, or an API-backed machine learning project.

Do not use this model as a medical diagnostic system or as the sole basis for healthcare decisions. Diabetes screening, diagnosis, and treatment require clinical evaluation by qualified healthcare professionals.

Inputs

The model expects one row with these 21 fields:

| Feature | Type | Description |
| --- | --- | --- |
| `HighBP` | integer | High blood pressure indicator, usually `0` or `1`. |
| `HighChol` | integer | High cholesterol indicator, usually `0` or `1`. |
| `CholCheck` | integer | Cholesterol check indicator, usually `0` or `1`. |
| `BMI` | float | Body mass index. |
| `Smoker` | integer | Smoking history indicator, usually `0` or `1`. |
| `Stroke` | integer | Stroke history indicator, usually `0` or `1`. |
| `HeartDiseaseorAttack` | integer | Heart disease or heart attack history indicator, usually `0` or `1`. |
| `PhysActivity` | integer | Physical activity indicator, usually `0` or `1`. |
| `Fruits` | integer | Fruit consumption indicator, usually `0` or `1`. |
| `Veggies` | integer | Vegetable consumption indicator, usually `0` or `1`. |
| `HvyAlcoholConsump` | integer | Heavy alcohol consumption indicator, usually `0` or `1`. |
| `AnyHealthcare` | integer | Healthcare coverage/access indicator, usually `0` or `1`. |
| `NoDocbcCost` | integer | Indicator for being unable to see a doctor because of cost, usually `0` or `1`. |
| `GenHlth` | integer | General health category. |
| `MentHlth` | integer | Number of poor mental health days. |
| `PhysHlth` | integer | Number of poor physical health days. |
| `DiffWalk` | integer | Difficulty walking indicator, usually `0` or `1`. |
| `Sex` | integer | Encoded sex category. |
| `Age` | integer | Encoded age category. |
| `Education` | integer | Encoded education category. |
| `Income` | integer | Encoded income category. |
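
Checking a row against this schema before scoring gives clearer errors than a failed column lookup later. The following is a minimal sketch, assuming a row is a plain dictionary keyed by feature name; `FEATURES` and `validate_row` are hypothetical names, and the check covers only field presence and numeric type, not value ranges.

```python
# Field names copied from the table above; in practice, prefer the
# feature_columns list stored in the artifact as the source of truth.
FEATURES = [
    "HighBP", "HighChol", "CholCheck", "BMI", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
    "MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education", "Income",
]


def validate_row(row):
    """Hypothetical helper: return a list of problems found in one input row."""
    problems = []
    missing = [name for name in FEATURES if name not in row]
    extra = sorted(set(row) - set(FEATURES))
    if missing:
        problems.append("missing fields: " + ", ".join(missing))
    if extra:
        problems.append("unexpected fields: " + ", ".join(extra))
    for name in FEATURES:
        if name not in row:
            continue
        value = row[name]
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(name + " should be numeric")
    return problems
```

An empty list means the row has exactly the 21 expected fields and every value is an `int` or `float`.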

Output

The FastAPI /predict endpoint returns a JSON object of this form:

{
  "prediction": 1,
  "diabetes_probability": 0.7342,
  "prediction_label": "Prediabetes or diabetes"
}

`diabetes_probability` is the mean positive-class probability across the 10 classifiers in the ensemble. The predicted class is 1 when this probability is greater than or equal to 0.60, and 0 otherwise.
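
The averaging and thresholding rule can be sketched in isolation from the fitted estimators. The function name `ensemble_decision` is hypothetical, and the ten probabilities are made-up values for illustration, not model output.

```python
def ensemble_decision(probabilities, threshold=0.60):
    """Average per-classifier positive-class probabilities and apply the threshold."""
    if not probabilities:
        raise ValueError("need at least one probability")
    mean_probability = sum(probabilities) / len(probabilities)
    # Greater than or equal to the threshold counts as the positive class.
    prediction = int(mean_probability >= threshold)
    return prediction, mean_probability


# Ten made-up per-classifier probabilities, for illustration only.
example = [0.70, 0.75, 0.80, 0.65, 0.72, 0.78, 0.74, 0.69, 0.76, 0.71]
prediction, probability = ensemble_decision(example)
print(prediction, round(probability, 4))  # 1 0.73
```

Because the comparison is inclusive, a mean probability of exactly 0.60 yields class 1.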

Evaluation

The saved artifact includes the following test metrics:

| Metric | Value |
| --- | --- |
| Negative precision | 0.9376 |
| Negative recall | 0.7521 |
| Positive precision | 0.3559 |
| Positive recall | 0.7323 |
| Positive F1 | 0.4791 |
| Macro F1 | 0.6569 |
| Balanced accuracy | 0.7422 |

At the 0.60 threshold, the model trades positive-class precision for recall. On the test set it identifies about 73% of true positive cases (recall 0.7323), but only about 36% of its positive predictions are correct (precision 0.3559), so roughly 64% of flagged cases are false positives.
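
The derived metrics can be recomputed from the per-class precision and recall as a consistency check. This sketch uses only the rounded values reported above, so the recomputed F1 scores agree with the table to about three decimal places rather than exactly.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the reported test metrics.
neg_precision, neg_recall = 0.9376, 0.7521
pos_precision, pos_recall = 0.3559, 0.7323

pos_f1 = f1(pos_precision, pos_recall)
neg_f1 = f1(neg_precision, neg_recall)
macro_f1 = (pos_f1 + neg_f1) / 2                 # unweighted mean of per-class F1
balanced_accuracy = (pos_recall + neg_recall) / 2  # mean of per-class recall
false_positive_share = 1 - pos_precision          # share of positive predictions that are wrong

print(round(pos_f1, 3), round(macro_f1, 3), round(balanced_accuracy, 4))
# 0.479 0.657 0.7422
```

The value of `false_positive_share` is about 0.64, which is why positive predictions are best read as screening signals.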

Local Usage

Install dependencies:

pip install -r requirements.txt

Run the API:

uvicorn main:app --host 0.0.0.0 --port 8000

Send a prediction request:

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d @sample_input.json
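
The same request can be built from Python with the standard library. This is a sketch under assumptions: the payload is taken to be a flat JSON object keyed by feature name, which this README does not confirm, and `build_predict_request` is a hypothetical helper. The example payload is truncated; a real request would presumably need all 21 fields.

```python
import json
import urllib.request


def build_predict_request(row, url="http://localhost:8000/predict"):
    """Build (but do not send) a POST request equivalent to the curl command."""
    body = json.dumps(row).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload shape, truncated to two fields for brevity.
request = build_predict_request({"HighBP": 1, "BMI": 31.5})

# With the API running locally, the request could be sent like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.loads(response.read()))
```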

Or run the standalone example:

python inference.py

Python Loading Example

import joblib
import pandas as pd

artifact = joblib.load("models/random_forest_undersampling_ensemble_threshold_060.joblib")
ensemble = artifact["ensemble"]
feature_columns = artifact["feature_columns"]

row = {
    "HighBP": 1,
    "HighChol": 1,
    "CholCheck": 1,
    "BMI": 31.5,
    "Smoker": 0,
    "Stroke": 0,
    "HeartDiseaseorAttack": 0,
    "PhysActivity": 1,
    "Fruits": 1,
    "Veggies": 1,
    "HvyAlcoholConsump": 0,
    "AnyHealthcare": 1,
    "NoDocbcCost": 0,
    "GenHlth": 3,
    "MentHlth": 2,
    "PhysHlth": 4,
    "DiffWalk": 0,
    "Sex": 1,
    "Age": 9,
    "Education": 5,
    "Income": 6
}

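# Reorder the columns to match the order the estimators expect.
input_df = pd.DataFrame([row])[feature_columns]

# Average the positive-class probability (column 1) over all estimators.
estimators = ensemble["estimators"]
probability = sum(
    estimator.predict_proba(input_df)[0][1] for estimator in estimators
) / len(estimators)

# Apply the stored decision threshold and look up the readable label.
prediction = int(probability >= ensemble["threshold"])
label = ensemble["class_mapping"][prediction]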

Limitations and Bias

The model uses survey-style proxy variables and encoded demographic or socioeconomic fields.
Performance can vary across populations, data collection methods, and clinical settings.
The positive-class precision in the saved test metrics is low, so positive predictions should be treated as screening signals rather than conclusions.
The repository does not currently include the original training dataset or a full training script, so independent reproduction of the reported metrics is limited.
Any deployment in a real healthcare context should include dataset review, fairness analysis, calibration checks, clinical validation, privacy review, and human oversight.

Recommended Repository Layout

.
├── app/
│   └── schemas.py
├── models/
│   └── random_forest_undersampling_ensemble_threshold_060.joblib
├── src/
│   └── model_features.py
├── inference.py
├── main.py
├── requirements.txt
└── sample_input.json

License

No license has been specified yet. Add a license before sharing, reusing, or deploying the model outside your own controlled project context.
