🎓 Student Performance Predictor

Model Description

A model for predicting students' academic performance (the final grade, G3) from demographic, social, and behavioral factors. It was trained on data from Portuguese schools using a Random Forest Regressor.

  • Developed by: SergeyR256
  • Model type: Random Forest Regressor
  • Framework: scikit-learn 1.5.2
  • License: MIT

Intended Use

Primary Use

Predicting a student's final grade on a 20-point scale in order to:

  • Identify at-risk students
  • Plan educational interventions
  • Study the factors that affect academic performance

Out-of-Scope Use

  • Do not use to make final decisions about expelling students
  • Do not apply to students younger than 15 or older than 22
  • Do not use outside the Portuguese educational system without additional validation
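The age restriction above can be enforced with a small guard that runs before any prediction. This is a minimal sketch: the function name `is_in_scope` and the constant names are hypothetical, and only the 15-22 bounds come from this card.

```python
# Age bounds taken from the out-of-scope list; the names are illustrative.
MIN_AGE = 15
MAX_AGE = 22


def is_in_scope(record: dict) -> bool:
    """Return True if the record has an age within the supported range."""
    age = record.get("age")
    if not isinstance(age, (int, float)):
        return False  # missing or non-numeric age is treated as out of scope
    return MIN_AGE <= age <= MAX_AGE
```

Records that fail the check are better routed to a human reviewer than silently scored.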

Training Data

Data Description

  • Source: UCI Machine Learning Repository
  • Authors: P. Cortez and A. Silva (2008)
  • Samples: 649 students
  • Features: 30 original features + 7 engineered features

Feature Categories

| Category | Features |
|---|---|
| Demographics | age, sex, address, famsize, Pstatus |
| Family | Medu, Fedu, Mjob, Fjob, guardian, famrel |
| School | school, studytime, traveltime, failures, absences, reason |
| Support | schoolsup, famsup, paid, activities, nursery, higher, internet |
| Lifestyle | freetime, goout, Dalc, Walc, health, romantic |

Target Variable

  • G3: Final grade (0-20 scale)
  • Note: G1 and G2 (interim grades) were excluded to avoid data leakage

Data Splits

| Split | Samples | Percentage |
|---|---|---|
| Train | 519 | 80% |
| Test | 130 | 20% |
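The target definition and the 80/20 split can be sketched as follows. The function name `split_student_data` and the `random_state` value are assumptions; the card does not say which seed or splitting utility was used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_student_data(df: pd.DataFrame, test_size: float = 0.2, random_state: int = 42):
    """Drop the interim grades, separate the G3 target, and split 80/20."""
    # G1 and G2 are removed so the model never sees the interim grades
    X = df.drop(columns=["G1", "G2", "G3"])
    y = df["G3"]
    # random_state is an assumed value, used here only for reproducibility
    return train_test_split(X, y, test_size=test_size, random_state=random_state)
```

With 649 rows and `test_size=0.2`, scikit-learn rounds the test set up to 130 rows and leaves 519 for training, which matches the counts in the table.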

Training Procedure

Preprocessing

  • Numeric features: Median imputation + StandardScaler
  • Nominal features: Most frequent imputation + OneHotEncoder
  • Ordinal features: Most frequent imputation + OrdinalEncoder

Feature Engineering

df['total_alcohol'] = df['Dalc'] + df['Walc']
df['parent_edu_avg'] = (df['Medu'] + df['Fedu']) / 2
df['goout_studytime_ratio'] = df['goout'] / (df['studytime'] + 1)
df['alcohol_study_interaction'] = df['total_alcohol'] * df['studytime']
df['failures_squared'] = df['failures'] ** 2
df['age_squared'] = df['age'] ** 2
df['health_freetime_ratio'] = df['health'] / (df['freetime'] + 1)
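Wrapping the seven lines above in a function makes it easier to apply exactly the same transformation at training and at prediction time. The function name `add_engineered_features` is hypothetical; the formulas are copied unchanged from the listing above.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the 7 engineered columns appended."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["total_alcohol"] = out["Dalc"] + out["Walc"]
    out["parent_edu_avg"] = (out["Medu"] + out["Fedu"]) / 2
    out["goout_studytime_ratio"] = out["goout"] / (out["studytime"] + 1)
    out["alcohol_study_interaction"] = out["total_alcohol"] * out["studytime"]
    out["failures_squared"] = out["failures"] ** 2
    out["age_squared"] = out["age"] ** 2
    out["health_freetime_ratio"] = out["health"] / (out["freetime"] + 1)
    return out
```

For a student with `Dalc=1`, `Walc=2`, `studytime=2`, `goout=3`, `health=5`, and `freetime=3`, this gives `total_alcohol = 3`, `goout_studytime_ratio = 3 / (2 + 1) = 1.0`, and `health_freetime_ratio = 5 / (3 + 1) = 1.25`.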

Hyperparameters

Grid search with 5-fold cross-validation:

| Parameter | Search Space | Best Value |
|---|---|---|
| n_estimators | 100-500 | 300 |
| max_depth | 10-25 | 20 |
| min_samples_split | 2-10 | 5 |
| min_samples_leaf | 1-4 | 2 |
| max_features | sqrt, log2, 0.5 | sqrt |
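A grid search of this shape could be set up as in the sketch below. The individual grid points, the MAE scoring metric, the `random_state`, and the names `PARAM_GRID` and `make_grid_search` are all assumptions; the card gives only the ranges and the best values.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Assumed grid points inside the ranges from the table; the exact values
# that were searched are not listed in this card.
PARAM_GRID = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 15, 20, 25],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", 0.5],
}


def make_grid_search(param_grid=PARAM_GRID, cv=5, random_state=42, n_jobs=None):
    """Build a cross-validated grid search over a Random Forest regressor."""
    return GridSearchCV(
        RandomForestRegressor(random_state=random_state),
        param_grid=param_grid,
        cv=cv,
        scoring="neg_mean_absolute_error",  # assumed metric
        n_jobs=n_jobs,
    )
```

Calling `make_grid_search().fit(X_train, y_train)` and reading `best_params_` returns the selected combination. This assumed grid has 3 × 4 × 3 × 3 × 3 = 324 combinations, so 5-fold cross-validation means 1,620 model fits; passing `n_jobs=-1` spreads them across all cores.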

Training Script

The complete training pipeline is available in the GitHub repository and Jupyter notebook.

Evaluation Results

Test Set Metrics

| Metric | Value | Interpretation |
|---|---|---|
| MAE | 2.0 points | Average error of 2 points out of 20 |
| RMSE | 2.8 points | Penalizes large errors more heavily than MAE |
| R² | 0.25 | The model explains 25% of the variance in G3 |
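The three test-set metrics can be computed with scikit-learn as sketched below. The helper name `regression_report` is hypothetical, and the numbers in the usage note are a toy example, not the model's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and R² for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    # RMSE is the square root of the mean squared error
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}
```

For true grades `[10, 12, 14, 16]` and predictions `[11, 12, 13, 18]`, the absolute errors are 1, 0, 1, and 2, so MAE is 1.0, RMSE is √1.5 ≈ 1.22, and R² is 1 − 6/20 = 0.7. RMSE exceeds MAE because the single 2-point error is squared before averaging.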

Cross-Validation

  • 5-fold CV MAE: 1.97 ± 0.12 points
  • Stability: The low variation across folds indicates a stable model
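A cross-validated MAE of this form can be obtained with `cross_val_score`. The sketch assumes that the ± value is the standard deviation across folds, which the card does not state; the helper name `cv_mae` is hypothetical.

```python
from sklearn.model_selection import cross_val_score


def cv_mae(model, X, y, cv=5):
    """Return the mean and standard deviation of MAE across CV folds."""
    # scikit-learn maximizes scores, so MAE is exposed as a negative value
    scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    return scores.mean(), scores.std()
```

Passing the full pipeline (preprocessing plus regressor) as `model` ensures each fold refits the preprocessing on its own training portion.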

📝 Grade Categories

| Score | Category | Emoji |
|---|---|---|
| 0-7 | Unsatisfactory | 🔴 |
| 8-9 | Below average | 🟠 |
| 10-11 | Average | 🟡 |
| 12-13 | Good | 🟢 |
| 14-15 | Very good | 🔵 |
| 16-20 | Excellent | 🟣 |
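Because the model outputs a continuous value, a predicted grade has to be mapped onto these integer bands. The sketch below uses English category names and assumes that predictions are rounded half up to the nearest whole point and clamped to 0-20; the card does not specify a rounding rule, and the names `GRADE_CATEGORIES` and `grade_category` are hypothetical.

```python
import math

# (inclusive upper bound, category, emoji), in ascending order, from the table
GRADE_CATEGORIES = [
    (7, "Unsatisfactory", "🔴"),
    (9, "Below average", "🟠"),
    (11, "Average", "🟡"),
    (13, "Good", "🟢"),
    (15, "Very good", "🔵"),
    (20, "Excellent", "🟣"),
]


def grade_category(score: float) -> tuple[str, str]:
    """Map a predicted grade to its (category, emoji) pair."""
    # Assumption: round half up to a whole point, then clamp to the 0-20 scale
    points = min(20, max(0, math.floor(score + 0.5)))
    for upper, name, emoji in GRADE_CATEGORIES:
        if points <= upper:
            return name, emoji
    raise AssertionError("unreachable: points is clamped to 0-20")
```

Under this rounding rule a prediction of 7.4 falls in the lowest band, while 7.5 rounds to 8 and lands in the next one.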

Error Analysis by Grade Range

| Actual Grade Range | Count | Mean Absolute Error |
|---|---|---|
| 0-8 | 15 | 3.2 |
| 9-10 | 28 | 2.5 |
| 11-12 | 35 | 2.1 |
| 13-14 | 30 | 1.8 |
| 15-16 | 15 | 2.0 |
| 17-20 | 7 | 2.8 |

Note: The model is most accurate on mid-range grades (11-16, MAE 1.8-2.1) and least accurate at the extremes (0-8: MAE 3.2; 17-20: MAE 2.8).
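A per-range breakdown like the one above can be produced by binning the actual grades and averaging the absolute error in each bin. This is a sketch of one way to do it with pandas; the names `BIN_EDGES`, `BIN_LABELS`, and `mae_by_grade_range` are hypothetical.

```python
import numpy as np
import pandas as pd

# Right-closed bin edges that reproduce the ranges 0-8, 9-10, ..., 17-20
BIN_EDGES = [-1, 8, 10, 12, 14, 16, 20]
BIN_LABELS = ["0-8", "9-10", "11-12", "13-14", "15-16", "17-20"]


def mae_by_grade_range(y_true, y_pred) -> pd.DataFrame:
    """Return sample count and mean absolute error per actual-grade range."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    frame = pd.DataFrame({
        "range": pd.cut(y_true, bins=BIN_EDGES, labels=BIN_LABELS),
        "abs_error": np.abs(y_true - y_pred),
    })
    # observed=True keeps only the ranges that actually occur in y_true
    return frame.groupby("range", observed=True)["abs_error"].agg(["count", "mean"])
```

Small bins deserve caution: with only 7 test samples in the 17-20 range, a single badly predicted student shifts that row's MAE noticeably.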

Limitations and Biases

Known Limitations

  • Regional Specificity: Trained only on Portuguese schools, may not generalize to other educational systems
  • Grade Range: As averages of training targets, Random Forest predictions always stay within [0, 20], but they are pulled toward the mean, so accuracy drops for very low and very high grades
  • Temporal Validity: Data from 2008, may not reflect current educational context
  • Missing G1/G2: Excluding interim grades lowers accuracy, but it lets the model be used before any interim grades exist

How to Use

Load the Model

import joblib

# Load from local file
model = joblib.load('path/to/model.joblib')

Make a Prediction

import pandas as pd

# Example input
input_data = {
    'age': 18, 'Medu': 3, 'Fedu': 2, 'traveltime': 2,
    'studytime': 2, 'failures': 0, 'famrel': 4, 'freetime': 3,
    'goout': 3, 'Dalc': 1, 'Walc': 2, 'health': 5, 'absences': 4,
    'school': 'GP', 'sex': 'F', 'address': 'U', 'famsize': 'GT3',
    'Pstatus': 'T', 'Mjob': 'teacher', 'Fjob': 'other',
    'reason': 'course', 'guardian': 'mother', 'schoolsup': 'no',
    'famsup': 'yes', 'paid': 'no', 'activities': 'yes',
    'nursery': 'yes', 'higher': 'yes', 'internet': 'yes',
    'romantic': 'no'
}

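# Predict. input_df must contain the columns the saved pipeline was fitted on;
# if the pipeline does not compute the 7 engineered features itself, add them
# to input_df first (see Feature Engineering)
input_df = pd.DataFrame([input_data])
prediction = model.predict(input_df)[0]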
print(f"Predicted grade: {prediction:.2f}/20")

API Deployment

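# FastAPI example
from fastapi import FastAPI
import joblib
import pandas as pd

app = FastAPI()
# Load the model once at startup rather than on every request
model = joblib.load('best_model_random_forest.joblib')

@app.post("/predict")
def predict(data: dict):
    # Plain `def`: FastAPI runs it in a worker thread, so the blocking
    # model.predict call does not stall the event loop
    df = pd.DataFrame([data])
    prediction = model.predict(df)[0]
    return {"predicted_grade": round(float(prediction), 2)}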

📚 Data Source

The model was trained on the Student Performance Data Set from the UCI Machine Learning Repository.

Dataset authors: P. Cortez and A. Silva, 2008

Description: Academic performance data for students of two Portuguese schools in the subject "Portuguese language".

Resources

👤 Contact

SergeyR256
