Diabetes Risk Random Forest Ensemble
This repository contains a scikit-learn random forest ensemble for binary diabetes risk classification from structured health survey inputs. The model predicts whether a person is in the No diabetes class or the Prediabetes or diabetes class, and returns the estimated probability for the positive class.
Model Details
- Model type: Random forest ensemble
- Number of classifiers: 10
- Decision threshold: 0.60
- Target definition: 0 = no diabetes, 1 = prediabetes or diabetes
- Positive class label: Prediabetes or diabetes
- Serialized artifact: models/random_forest_undersampling_ensemble_threshold_060.joblib
- Artifact size: about 1.5 GB
The model artifact is a Python dictionary with:
- ensemble: estimators and serving metadata
- feature_columns: expected feature order
- target_definition: label definition
- test_metrics: evaluation summary
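Because the artifact is a plain dictionary, a loader could check its shape before serving. The following is a minimal sketch, not code from this repository: `validate_artifact` is a hypothetical helper, the toy dictionary stands in for the real 1.5 GB file, and the value types shown for `target_definition` and `test_metrics` are assumptions. The nested keys `estimators`, `threshold`, and `class_mapping` are the ones used in the Python loading example in this README.

```python
REQUIRED_KEYS = ("ensemble", "feature_columns", "target_definition", "test_metrics")
ENSEMBLE_KEYS = ("estimators", "threshold", "class_mapping")


def validate_artifact(artifact):
    """Return a list of problems found in a loaded artifact (empty if none)."""
    if not isinstance(artifact, dict):
        return ["artifact is not a dict"]
    problems = [f"missing top-level key: {key}" for key in REQUIRED_KEYS if key not in artifact]
    ensemble = artifact.get("ensemble", {})
    if not isinstance(ensemble, dict):
        problems.append("ensemble is not a dict")
        return problems
    problems += [f"missing ensemble key: {key}" for key in ENSEMBLE_KEYS if key not in ensemble]
    return problems


# Toy stand-in for the result of joblib.load(...); values are illustrative only.
toy_artifact = {
    "ensemble": {
        "estimators": [],  # the real artifact holds fitted classifiers here
        "threshold": 0.60,
        "class_mapping": {0: "No diabetes", 1: "Prediabetes or diabetes"},
    },
    "feature_columns": ["HighBP", "HighChol"],  # shortened for the example
    "target_definition": "0 = no diabetes, 1 = prediabetes or diabetes",  # assumed type
    "test_metrics": {},  # assumed type
}

print(validate_artifact(toy_artifact))
```

Running a check like this right after loading surfaces a mismatched or truncated artifact early, instead of failing on the first prediction request.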
Intended Use
This model is intended for educational and prototype use with tabular health-risk data shaped like the input schema below. It may be useful for demonstrating model serving, risk stratification workflows, or an API-backed machine learning project.
Do not use this model as a medical diagnostic system or as the sole basis for healthcare decisions. Diabetes screening, diagnosis, and treatment require clinical evaluation by qualified healthcare professionals.
Inputs
The model expects one row with these 21 fields:
| Feature | Type | Description |
|---|---|---|
| HighBP | integer | High blood pressure indicator, usually 0 or 1. |
| HighChol | integer | High cholesterol indicator, usually 0 or 1. |
| CholCheck | integer | Cholesterol check indicator, usually 0 or 1. |
| BMI | float | Body mass index. |
| Smoker | integer | Smoking history indicator, usually 0 or 1. |
| Stroke | integer | Stroke history indicator, usually 0 or 1. |
| HeartDiseaseorAttack | integer | Heart disease or heart attack history indicator, usually 0 or 1. |
| PhysActivity | integer | Physical activity indicator, usually 0 or 1. |
| Fruits | integer | Fruit consumption indicator, usually 0 or 1. |
| Veggies | integer | Vegetable consumption indicator, usually 0 or 1. |
| HvyAlcoholConsump | integer | Heavy alcohol consumption indicator, usually 0 or 1. |
| AnyHealthcare | integer | Healthcare coverage/access indicator, usually 0 or 1. |
| NoDocbcCost | integer | Could not see a doctor because of cost indicator, usually 0 or 1. |
| GenHlth | integer | General health category. |
| MentHlth | integer | Number of poor mental health days. |
| PhysHlth | integer | Number of poor physical health days. |
| DiffWalk | integer | Difficulty walking indicator, usually 0 or 1. |
| Sex | integer | Encoded sex category. |
| Age | integer | Encoded age category. |
| Education | integer | Encoded education category. |
| Income | integer | Encoded income category. |
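The schema above can be checked before a row reaches the model. This is a rough sketch under stated assumptions: `check_row` is a hypothetical helper, the column order simply follows the table (the authoritative order is `feature_columns` in the artifact), and flagging indicator values other than 0 or 1 is a choice made here, since the table only says "usually".

```python
FEATURE_COLUMNS = [
    "HighBP", "HighChol", "CholCheck", "BMI", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
    "MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education", "Income",
]
# Fields the table does not describe as 0/1 indicators.
NON_BINARY = {"BMI", "GenHlth", "MentHlth", "PhysHlth", "Sex", "Age", "Education", "Income"}
BINARY_INDICATORS = [name for name in FEATURE_COLUMNS if name not in NON_BINARY]


def check_row(row):
    """Return a list of schema problems for one input row (empty if none)."""
    problems = [f"missing field: {name}" for name in FEATURE_COLUMNS if name not in row]
    problems += [f"unexpected field: {name}" for name in row if name not in FEATURE_COLUMNS]
    for name in BINARY_INDICATORS:
        # Assumption: treat any value other than 0 or 1 as worth flagging.
        if name in row and row[name] not in (0, 1):
            problems.append(f"{name} is usually 0 or 1, got {row[name]!r}")
    return problems


# Same values as the row in the Python loading example.
sample_row = {
    "HighBP": 1, "HighChol": 1, "CholCheck": 1, "BMI": 31.5, "Smoker": 0,
    "Stroke": 0, "HeartDiseaseorAttack": 0, "PhysActivity": 1, "Fruits": 1,
    "Veggies": 1, "HvyAlcoholConsump": 0, "AnyHealthcare": 1, "NoDocbcCost": 0,
    "GenHlth": 3, "MentHlth": 2, "PhysHlth": 4, "DiffWalk": 0, "Sex": 1,
    "Age": 9, "Education": 5, "Income": 6,
}

print(check_row(sample_row))
```

Range checks for the encoded categories (GenHlth, Age, Education, Income) are left out on purpose, because this README does not state their valid codes.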
Output
The FastAPI endpoint returns:
{
  "prediction": 1,
  "diabetes_probability": 0.7342,
  "prediction_label": "Prediabetes or diabetes"
}
diabetes_probability is the average positive-class probability across the ensemble. The final class is 1 when this probability is greater than or equal to 0.60.
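The averaging and thresholding rule can be written as a small pure function. This is a sketch rather than the repository's serving code: `build_response` is a hypothetical name, the ten per-estimator probabilities are made up for illustration, and rounding the probability to four decimals is an assumption based on the sample response.

```python
from statistics import fmean

THRESHOLD = 0.60
LABELS = {0: "No diabetes", 1: "Prediabetes or diabetes"}


def build_response(per_estimator_probabilities, threshold=THRESHOLD):
    """Average positive-class probabilities and apply the decision threshold."""
    if not per_estimator_probabilities:
        raise ValueError("at least one estimator probability is required")
    probability = fmean(per_estimator_probabilities)
    # The class is decided on the unrounded average; rounding is for display only.
    prediction = int(probability >= threshold)
    return {
        "prediction": prediction,
        "diabetes_probability": round(probability, 4),
        "prediction_label": LABELS[prediction],
    }


# Hypothetical positive-class probabilities from the 10 classifiers.
example_probabilities = [0.70, 0.75, 0.72, 0.78, 0.71, 0.74, 0.73, 0.76, 0.69, 0.762]
print(build_response(example_probabilities))
```

Keeping this logic separate from model loading makes the threshold behavior easy to unit test without the 1.5 GB artifact.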
Evaluation
The saved artifact includes the following test metrics:
| Metric | Value |
|---|---|
| Negative precision | 0.9376 |
| Negative recall | 0.7521 |
| Positive precision | 0.3559 |
| Positive recall | 0.7323 |
| Positive F1 | 0.4791 |
| Macro F1 | 0.6569 |
| Balanced accuracy | 0.7422 |
The threshold favors positive-class recall (0.7323) over positive-class precision (0.3559). The model flags about 73% of the true positive cases in the test set, but roughly 64% of its positive predictions are false positives.
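The derived metrics in the table can be cross-checked from the four precision and recall values. The sketch below uses only the standard definitions of F1, macro F1, and balanced accuracy; the inputs are copied from the table, so results agree with the reported values only up to rounding of those inputs.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


negative_precision, negative_recall = 0.9376, 0.7521
positive_precision, positive_recall = 0.3559, 0.7323

positive_f1 = f1(positive_precision, positive_recall)
negative_f1 = f1(negative_precision, negative_recall)
macro_f1 = (positive_f1 + negative_f1) / 2
balanced_accuracy = (positive_recall + negative_recall) / 2
# Share of positive predictions that are false positives.
false_discovery_rate = 1 - positive_precision

print(f"positive F1       ~ {positive_f1:.4f}")
print(f"macro F1          ~ {macro_f1:.4f}")
print(f"balanced accuracy ~ {balanced_accuracy:.4f}")
print(f"false discovery   ~ {false_discovery_rate:.4f}")
```

The recomputed values land within about 0.0001 of the table, which suggests the reported metrics are internally consistent.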
Local Usage
Install dependencies:
pip install -r requirements.txt
Run the API:
uvicorn main:app --host 0.0.0.0 --port 8000
Send a prediction request:
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d @sample_input.json
Or run the standalone example:
python inference.py
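The curl request above can also be sent from Python using only the standard library. This is an illustrative sketch: `build_request` and `predict` are hypothetical helpers, and it assumes the endpoint accepts a single JSON object with the 21 input fields, as `sample_input.json` presumably does.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"


def build_request(row, url=API_URL):
    """Build a POST request carrying one input row as JSON."""
    body = json.dumps(row).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(row, url=API_URL, timeout=10):
    """Send the row to the running API and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(row, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running locally, calling `predict(row)` with a dictionary of the 21 input fields should return a dictionary with the `prediction`, `diabetes_probability`, and `prediction_label` keys described in the Output section.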
Python Loading Example
import joblib
import pandas as pd

# Load the serialized artifact (a plain dict) and pull out what inference needs.
artifact = joblib.load("models/random_forest_undersampling_ensemble_threshold_060.joblib")
ensemble = artifact["ensemble"]
feature_columns = artifact["feature_columns"]
row = {
    "HighBP": 1,
    "HighChol": 1,
    "CholCheck": 1,
    "BMI": 31.5,
    "Smoker": 0,
    "Stroke": 0,
    "HeartDiseaseorAttack": 0,
    "PhysActivity": 1,
    "Fruits": 1,
    "Veggies": 1,
    "HvyAlcoholConsump": 0,
    "AnyHealthcare": 1,
    "NoDocbcCost": 0,
    "GenHlth": 3,
    "MentHlth": 2,
    "PhysHlth": 4,
    "DiffWalk": 0,
    "Sex": 1,
    "Age": 9,
    "Education": 5,
    "Income": 6,
}
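# Reorder the columns to match the expected feature order stored in the artifact.
input_df = pd.DataFrame([row])[feature_columns]

# Average the positive-class probability (column 1) across all estimators.
probability = sum(
    estimator.predict_proba(input_df)[0][1]
    for estimator in ensemble["estimators"]
) / len(ensemble["estimators"])

# Apply the stored decision threshold, then map the class to its label.
prediction = int(probability >= ensemble["threshold"])
label = ensemble["class_mapping"][prediction]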
Limitations and Bias
- The model uses survey-style proxy variables and encoded demographic or socioeconomic fields.
- Performance can vary across populations, data collection methods, and clinical settings.
- The positive-class precision in the saved test metrics is low, so positive predictions should be treated as screening signals rather than conclusions.
- The repository does not currently include the original training dataset or a full training script, so independent reproduction of the reported metrics is limited.
- Any deployment in a real healthcare context should include dataset review, fairness analysis, calibration checks, clinical validation, privacy review, and human oversight.
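One of the checks listed above, calibration, can be illustrated with a simple binned comparison of predicted probability against observed outcome rate. The sketch below is illustrative only: `calibration_table` is a hypothetical helper and the probabilities and labels are made-up toy values, not results from this model. The artifact name suggests the estimators were trained on undersampled data, which often shifts predicted probabilities away from real-world base rates, so a check of this kind on representative data may be especially relevant.

```python
def calibration_table(probabilities, labels, n_bins=5):
    """Compare mean predicted probability with observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for probability, label in zip(probabilities, labels):
        # Equal-width bins over [0, 1]; a probability of exactly 1.0 goes in the last bin.
        index = min(int(probability * n_bins), n_bins - 1)
        bins[index].append((probability, label))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        rows.append({
            "bin": index,
            "count": len(members),
            "mean_predicted": sum(p for p, _ in members) / len(members),
            "observed_rate": sum(y for _, y in members) / len(members),
        })
    return rows


# Toy data for illustration only.
toy_probabilities = [0.10, 0.15, 0.30, 0.65, 0.70, 0.90, 0.95]
toy_labels = [0, 0, 0, 0, 1, 1, 1]

for row in calibration_table(toy_probabilities, toy_labels):
    print(row)
```

In a well-calibrated model, `mean_predicted` and `observed_rate` are close in every bin. scikit-learn's `sklearn.calibration.calibration_curve` computes a similar binned comparison and is the more usual tool when scikit-learn is already a dependency.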
Recommended Repository Layout
.
├── app/
│   └── schemas.py
├── models/
│   └── random_forest_undersampling_ensemble_threshold_060.joblib
├── src/
│   └── model_features.py
├── inference.py
├── main.py
├── requirements.txt
└── sample_input.json
License
No license has been specified yet. Add a license before sharing, reusing, or deploying the model outside your own controlled project context.