🎓 Student Performance Predictor
Model Description
A model for predicting students' academic performance (the final grade, G3) from demographic, social, and behavioral factors. It was trained on data from Portuguese schools using a Random Forest Regressor.
- Developed by: SergeyR256
- Model type: Random Forest Regressor
- Framework: scikit-learn 1.5.2
- License: MIT
Intended Use
Primary Use
Predicting a student's final grade on a 20-point scale in order to:
- Identify at-risk students
- Plan educational interventions
- Study the factors that influence academic performance
Out-of-Scope Use
- Do not use to make final decisions about expulsion
- Do not apply to students younger than 15 or older than 22
- Do not use outside the Portuguese educational system without additional validation
Training Data
Data Description
- Source: UCI Machine Learning Repository
- Authors: P. Cortez and A. Silva (2008)
- Samples: 649 students
- Features: 30 original features + 7 engineered features
Feature Categories
| Category | Features |
|---|---|
| Demographics | age, sex, address, famsize, Pstatus |
| Family | Medu, Fedu, Mjob, Fjob, guardian, famrel |
| School | school, studytime, traveltime, failures, absences, reason |
| Support | schoolsup, famsup, paid, activities, nursery, higher, internet |
| Lifestyle | freetime, goout, Dalc, Walc, health, romantic |
Target Variable
- G3: Final grade (0-20 scale)
- Note: G1 and G2 (interim grades) were excluded to avoid data leakage
Data Splits
| Split | Samples | Percentage |
|---|---|---|
| Train | 519 | 80% |
| Test | 130 | 20% |
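The 519/130 counts are what an 80/20 split of 649 rows gives when the test size is rounded up. The sketch below reproduces that arithmetic with the standard library only; the helper name `split_indices` and the seed are illustrative assumptions and are not taken from the training script.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    n_test = math.ceil(n_samples * test_fraction)  # test size rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(649)
print(len(train_idx), len(test_idx))  # 519 130
```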
Training Procedure
Preprocessing
- Numeric features: Median imputation + StandardScaler
- Nominal features: Most frequent imputation + OneHotEncoder
- Ordinal features: Most frequent imputation + OrdinalEncoder
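The three preprocessing branches above could be wired together with a scikit-learn `ColumnTransformer`, roughly as sketched here. The card does not say which columns were treated as numeric, nominal, or ordinal, so the column groupings and the toy data are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Assumed column groupings; the real assignment is not stated in this card.
numeric_cols = ["age", "absences"]
nominal_cols = ["school", "Mjob"]
ordinal_cols = ["Medu", "studytime"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("nom", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), nominal_cols),
        ("ord", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OrdinalEncoder()),
        ]), ordinal_cols),
    ]
)

# Tiny made-up frame, only to show the transformer running end to end.
toy = pd.DataFrame({
    "age": [15, 17, 18],
    "absences": [0, 4, 10],
    "school": ["GP", "MS", "GP"],
    "Mjob": ["teacher", "other", "other"],
    "Medu": [1, 3, 4],
    "studytime": [1, 2, 3],
})
features = preprocessor.fit_transform(toy)
print(features.shape)
```

In a full pipeline, this transformer would typically be the first step and the regressor the second, so that imputation and scaling statistics are learned from the training folds only.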
Feature Engineering
df['total_alcohol'] = df['Dalc'] + df['Walc']
df['parent_edu_avg'] = (df['Medu'] + df['Fedu']) / 2
df['goout_studytime_ratio'] = df['goout'] / (df['studytime'] + 1)
df['alcohol_study_interaction'] = df['total_alcohol'] * df['studytime']
df['failures_squared'] = df['failures'] ** 2
df['age_squared'] = df['age'] ** 2
df['health_freetime_ratio'] = df['health'] / (df['freetime'] + 1)
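The card does not state whether the saved model computes these features internally. If it does not, the same formulas would need to be applied to every input before prediction. A minimal sketch for a single record held as a dict follows; the helper name `add_engineered_features` is hypothetical.

```python
def add_engineered_features(record):
    """Return a copy of one student record with the 7 derived features added."""
    out = dict(record)
    out["total_alcohol"] = out["Dalc"] + out["Walc"]
    out["parent_edu_avg"] = (out["Medu"] + out["Fedu"]) / 2
    # The +1 in the denominators keeps the ratios defined if a value is 0.
    out["goout_studytime_ratio"] = out["goout"] / (out["studytime"] + 1)
    out["alcohol_study_interaction"] = out["total_alcohol"] * out["studytime"]
    out["failures_squared"] = out["failures"] ** 2
    out["age_squared"] = out["age"] ** 2
    out["health_freetime_ratio"] = out["health"] / (out["freetime"] + 1)
    return out


# Example values only; any record with these keys works.
student = {"age": 18, "Medu": 3, "Fedu": 2, "studytime": 2, "failures": 0,
           "freetime": 3, "goout": 3, "Dalc": 1, "Walc": 2, "health": 5}
engineered = add_engineered_features(student)
print(engineered["total_alcohol"], engineered["parent_edu_avg"])  # 3 2.5
```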
Hyperparameters
Grid search with 5-fold cross-validation:
| Parameter | Search Space | Best Value |
|---|---|---|
| n_estimators | 100-500 | 300 |
| max_depth | 10-25 | 20 |
| min_samples_split | 2-10 | 5 |
| min_samples_leaf | 1-4 | 2 |
| max_features | sqrt, log2, 0.5 | sqrt |
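A search of this kind can be run with scikit-learn's `GridSearchCV`. The sketch below uses synthetic data and a deliberately reduced grid so that it finishes quickly; the table gives ranges rather than the exact values tried, so the grid values and the scoring metric here are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real search ran on the student dataset.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

# Reduced, hypothetical grid (2 x 2 x 2 = 8 candidate settings).
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [10, 20],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation, as in the card
    scoring="neg_mean_absolute_error",  # assumed metric; not stated in the card
)
search.fit(X, y)
print(search.best_params_)
print(f"CV MAE of best setting: {-search.best_score_:.2f}")
```

The number of model fits grows as the product of the grid sizes times the number of folds, which is why wide grids over five parameters become expensive.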
Training Script
The complete training pipeline is available in the GitHub repository and Jupyter notebook.
Evaluation Results
Test Set Metrics
| Metric | Value | Interpretation |
|---|---|---|
| MAE | 2.0 points | Average error of 2 points out of 20 |
| RMSE | 2.8 points | Large errors are penalized more heavily |
| R² | 0.25 | The model explains 25% of the variance |
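For reference, the three metrics in the table can be computed from actual and predicted grades as sketched below. The function name and the sample grades are made up for illustration and are not values from the test set.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of grades."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up grades for illustration only.
actual = [10, 12, 14, 8]
predicted = [11, 12, 12, 9]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")  # MAE=1.00 RMSE=1.22 R2=0.70
```

RMSE is never smaller than MAE, and the gap between them widens when a few predictions are far off.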
Cross-Validation
- 5-fold CV MAE: 1.97 ± 0.12 points
- Stability: Low variation across folds suggests the model is stable
📝 Grade Categories
| Points | Category | Emoji |
|---|---|---|
| 0-7 | Unsatisfactory | 🔴 |
| 8-9 | Below average | 🟠 |
| 10-11 | Average | 🟡 |
| 12-13 | Good | 🟢 |
| 14-15 | Very good | 🔵 |
| 16-20 | Excellent | 🟣 |
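Because the model outputs fractional grades, a prediction has to be mapped onto these bands somehow. One possible approach, sketched below, rounds half up to a whole grade and then looks up the band; the rounding rule and the function name `grade_category` are assumptions, not something the card specifies.

```python
import math
from bisect import bisect_right

# Lower bounds of every band after the first, taken from the table above.
BAND_STARTS = [8, 10, 12, 14, 16]
BAND_LABELS = ["Unsatisfactory", "Below average", "Average",
               "Good", "Very good", "Excellent"]


def grade_category(predicted_grade):
    """Map a predicted grade to its category label."""
    # Assumption: round half up to a whole grade and clamp to 0-20 first.
    whole = min(20, max(0, math.floor(predicted_grade + 0.5)))
    return BAND_LABELS[bisect_right(BAND_STARTS, whole)]


for value in (7.4, 9.5, 13.2, 16.0):
    print(value, grade_category(value))
```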
Error Analysis by Grade Range
| Actual Grade Range | Count | Mean Absolute Error |
|---|---|---|
| 0-8 | 15 | 3.2 |
| 9-10 | 28 | 2.5 |
| 11-12 | 35 | 2.1 |
| 13-14 | 30 | 1.8 |
| 15-16 | 15 | 2.0 |
| 17-20 | 7 | 2.8 |
Note: The model is most accurate on mid-to-upper grades (MAE 1.8 for 13-14) and least accurate at the extremes (MAE 3.2 for 0-8 and 2.8 for 17-20)
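A breakdown like the one in this table can be produced by grouping absolute errors by the actual grade. The following is a minimal sketch with made-up grades; the bin edges mirror the table, while the function name is hypothetical.

```python
from collections import defaultdict

# Bin edges mirror the table: (low, high), inclusive, on the actual grade.
BINS = [(0, 8), (9, 10), (11, 12), (13, 14), (15, 16), (17, 20)]


def mae_by_grade_range(y_true, y_pred):
    """Return {bin_label: (count, mean absolute error)} grouped by actual grade."""
    grouped = defaultdict(list)
    for actual, predicted in zip(y_true, y_pred):
        for low, high in BINS:
            if low <= actual <= high:
                grouped[f"{low}-{high}"].append(abs(actual - predicted))
                break
    return {label: (len(errs), sum(errs) / len(errs))
            for label, errs in grouped.items()}


# Made-up grades for illustration only.
actual = [5, 8, 11, 12, 18]
predicted = [9.0, 10.0, 11.5, 10.5, 15.0]
for label, (count, mae) in mae_by_grade_range(actual, predicted).items():
    print(f"{label}: n={count}, MAE={mae:.2f}")
```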
Limitations and Biases
Known Limitations
- Regional Specificity: Trained only on Portuguese schools, may not generalize to other educational systems
- Grade Range: Predictions fall within [0, 20], but the model is less reliable at the extremes of the scale
- Temporal Validity: Data from 2008, may not reflect current educational context
- Missing G1/G2: Excluding interim grades makes prediction harder, but it lets the model be used before any interim grades exist, which is when early intervention is most useful
How to Use
Load the Model
import joblib
# Load from local file
model = joblib.load('path/to/model.joblib')
Make a Prediction
import pandas as pd
# Example input
input_data = {
'age': 18, 'Medu': 3, 'Fedu': 2, 'traveltime': 2,
'studytime': 2, 'failures': 0, 'famrel': 4, 'freetime': 3,
'goout': 3, 'Dalc': 1, 'Walc': 2, 'health': 5, 'absences': 4,
'school': 'GP', 'sex': 'F', 'address': 'U', 'famsize': 'GT3',
'Pstatus': 'T', 'Mjob': 'teacher', 'Fjob': 'other',
'reason': 'course', 'guardian': 'mother', 'schoolsup': 'no',
'famsup': 'yes', 'paid': 'no', 'activities': 'yes',
'nursery': 'yes', 'higher': 'yes', 'internet': 'yes',
'romantic': 'no'
}
# Predict
input_df = pd.DataFrame([input_data])
prediction = model.predict(input_df)[0]
print(f"Predicted grade: {prediction:.2f}/20")
API Deployment
# FastAPI example
from fastapi import FastAPI
import joblib
import pandas as pd

app = FastAPI()
# Load the trained model once at startup
model = joblib.load('best_model_random_forest.joblib')

@app.post("/predict")
async def predict(data: dict):
    # Build a one-row DataFrame from the JSON body
    df = pd.DataFrame([data])
    prediction = model.predict(df)[0]
    return {"predicted_grade": round(float(prediction), 2)}
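The endpoint above accepts any dict, so a malformed request would only fail inside `model.predict`. A small validation step could catch missing fields and out-of-scope ages first. The sketch below is one way to do that under stated assumptions: the function name `validate_record` is hypothetical, and the field list is copied from the example input in this card.

```python
REQUIRED_FIELDS = {
    "age", "Medu", "Fedu", "traveltime", "studytime", "failures", "famrel",
    "freetime", "goout", "Dalc", "Walc", "health", "absences", "school",
    "sex", "address", "famsize", "Pstatus", "Mjob", "Fjob", "reason",
    "guardian", "schoolsup", "famsup", "paid", "activities", "nursery",
    "higher", "internet", "romantic",
}


def validate_record(record):
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    missing = sorted(REQUIRED_FIELDS - set(record))
    if missing:
        problems.append(f"missing fields: {', '.join(missing)}")
    age = record.get("age")
    # Age bounds follow the Out-of-Scope Use section (15 to 22 inclusive).
    if isinstance(age, (int, float)) and not 15 <= age <= 22:
        problems.append(f"age {age} is outside the supported 15-22 range")
    return problems


print(validate_record({"age": 30}))
```

Inside the endpoint, a non-empty result could be turned into an HTTP 422 response (for example with FastAPI's `HTTPException`) instead of calling the model.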
📚 Data Source
The model was trained on the Student Performance Data Set from the UCI Machine Learning Repository.
Dataset authors: P. Cortez and A. Silva, 2008
Description: Performance data for students at two Portuguese schools in the subject "Portuguese language".
Resources
- 📓 Jupyter Notebook: Training and Analysis
- 📂 GitHub Repository: Full Project
👤 Contact
SergeyR256
- GitHub: https://github.com/Reactivity512
- Hugging Face: https://huggingface.co/SergeyR256