---
license: mit
language:
- en
tags:
- tabular-regression
- student-performance
- education
- scikit-learn
- random-forest
datasets:
- larsen0966/student-performance-data-set
metrics:
- mae
- rmse
- r2
model-index:
- name: Student Performance Predictor
  results:
  - task:
      type: tabular-regression
      name: Student Grade Prediction
    dataset:
      name: Student Performance Data Set
      type: tabular
    metrics:
    - type: mae
      value: 2.0
      name: Mean Absolute Error
    - type: rmse
      value: 2.8
      name: Root Mean Squared Error
    - type: r2
      value: 0.25
      name: R² Score
---

# 🎓 Student Performance Predictor

## Model Description

A model that predicts students' academic performance (the final grade, G3) from demographic, social, and behavioral factors. It was trained on data from Portuguese schools using a Random Forest Regressor.

- **Developed by:** SergeyR256
- **Model type:** Random Forest Regressor
- **Framework:** scikit-learn 1.5.2
- **License:** MIT

## Intended Use

### Primary Use

Predicting a student's final grade on a 20-point scale in order to:

- Identify at-risk students
- Plan educational interventions
- Study the factors that affect academic performance

### Out-of-Scope Use

- Do not use for making final decisions about expulsion
- Do not apply to students younger than 15 or older than 22
- Do not use outside the Portuguese educational system without additional validation

## Training Data

### Data Description

- **Source:** UCI Machine Learning Repository
- **Authors:** P. Cortez and A. Silva (2008)
- **Samples:** 649 students
- **Features:** 30 original features + 7 engineered features

### Feature Categories

| Category | Features |
|----------|----------|
| Demographics | age, sex, address, famsize, Pstatus |
| Family | Medu, Fedu, Mjob, Fjob, guardian, famrel |
| School | school, studytime, traveltime, failures, absences, reason |
| Support | schoolsup, famsup, paid, activities, nursery, higher, internet |
| Lifestyle | freetime, goout, Dalc, Walc, health, romantic |

### Target Variable

- **G3:** Final grade (0-20 scale)
- **Note:** G1 and G2 (interim grades) were excluded to avoid data leakage

### Data Splits

| Split | Samples | Percentage |
|-------|---------|------------|
| Train | 519 | 80% |
| Test | 130 | 20% |

## Training Procedure

### Preprocessing

- **Numeric features:** Median imputation + StandardScaler
- **Nominal features:** Most frequent imputation + OneHotEncoder
- **Ordinal features:** Most frequent imputation + OrdinalEncoder

### Feature Engineering

```python
df['total_alcohol'] = df['Dalc'] + df['Walc']
df['parent_edu_avg'] = (df['Medu'] + df['Fedu']) / 2
df['goout_studytime_ratio'] = df['goout'] / (df['studytime'] + 1)
df['alcohol_study_interaction'] = df['total_alcohol'] * df['studytime']
df['failures_squared'] = df['failures'] ** 2
df['age_squared'] = df['age'] ** 2
df['health_freetime_ratio'] = df['health'] / (df['freetime'] + 1)
```

### Hyperparameters

Grid search with 5-fold cross-validation:

| Parameter | Search Space | Best Value |
|-----------|--------------|------------|
| n_estimators | 100-500 | 300 |
| max_depth | 10-25 | 20 |
| min_samples_split | 2-10 | 5 |
| min_samples_leaf | 1-4 | 2 |
| max_features | sqrt, log2, 0.5 | sqrt |

## Training Script

The complete training pipeline is available in the [GitHub repository](https://github.com/Reactivity512/student-performance-prediction) and [Jupyter notebook](https://github.com/Reactivity512/student-performance-prediction/blob/main/notebooks/01_eda_and_training.ipynb).
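
At inference time inputs often arrive as single records rather than DataFrames, so the seven engineered features listed under Feature Engineering can also be computed on a plain dictionary. The following is only a minimal sketch: the helper name `add_engineered_features` is hypothetical and does not come from the repository, and it simply mirrors the pandas expressions shown in that section.

```python
from typing import Any, Dict


def add_engineered_features(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with the seven engineered features added.

    Hypothetical helper (not part of the repository); it mirrors the
    pandas expressions from the Feature Engineering section.
    """
    out = dict(record)  # shallow copy so the caller's dict is not mutated
    out["total_alcohol"] = out["Dalc"] + out["Walc"]
    out["parent_edu_avg"] = (out["Medu"] + out["Fedu"]) / 2
    # Same +1 offset in the denominator as in the original expressions
    out["goout_studytime_ratio"] = out["goout"] / (out["studytime"] + 1)
    out["alcohol_study_interaction"] = out["total_alcohol"] * out["studytime"]
    out["failures_squared"] = out["failures"] ** 2
    out["age_squared"] = out["age"] ** 2
    out["health_freetime_ratio"] = out["health"] / (out["freetime"] + 1)
    return out


# Only the raw columns needed by the engineered features are shown here
example = {
    "Dalc": 1, "Walc": 2, "Medu": 3, "Fedu": 2, "goout": 3,
    "studytime": 2, "failures": 0, "age": 18, "health": 5, "freetime": 3,
}
features = add_engineered_features(example)
print(features["total_alcohol"], features["parent_edu_avg"])
```

Whether this step is needed depends on how the model was saved: if the serialized object already computes these columns internally, the raw features alone are enough.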
## Evaluation Results

### Test Set Metrics

| Metric | Value | Interpretation |
|--------|-------|----------------|
| MAE | 2.0 points | Average error of 2 points out of 20 |
| RMSE | 2.8 points | Penalizes large errors more heavily |
| R² | 0.25 | The model explains 25% of the variance |

### Cross-Validation

* 5-fold CV MAE: 1.97 ± 0.12 points
* Stability: The low variation across folds indicates a stable model

### 📝 Grade Categories

| Points | Category | Emoji |
|-------|-----------|--------|
| 0-7 | Unsatisfactory | 🔴 |
| 8-9 | Below average | 🟠 |
| 10-11 | Average | 🟡 |
| 12-13 | Good | 🟢 |
| 14-15 | Very good | 🔵 |
| 16-20 | Excellent | 🟣 |

### Error Analysis by Grade Range

| Actual Grade Range | Count | Mean Absolute Error |
|--------------------|-------|---------------------|
| 0-8 | 15 | 3.2 |
| 9-10 | 28 | 2.5 |
| 11-12 | 35 | 2.1 |
| 13-14 | 30 | 1.8 |
| 15-16 | 15 | 2.0 |
| 17-20 | 7 | 2.8 |

**Note:** The model performs better on mid-range grades and struggles at the extremes.

### Limitations and Biases

Known limitations:

* Regional Specificity: Trained only on data from Portuguese schools; may not generalize to other educational systems
* Grade Range: Predictions are bounded to [0, 20], but accuracy degrades at the extremes (see the error analysis above)
* Temporal Validity: The data is from 2008 and may not reflect the current educational context
* Missing G1/G2: Excluding interim grades makes prediction harder but more useful, since a prediction can be made before any interim grade exists

## How to Use

### Load the Model

```py
import joblib

# Load the trained model from a local file
model = joblib.load('path/to/model.joblib')
```

### Make a Prediction

```py
import pandas as pd

# Example input: one student described by the 30 raw features
input_data = {
    'age': 18, 'Medu': 3, 'Fedu': 2, 'traveltime': 2, 'studytime': 2,
    'failures': 0, 'famrel': 4, 'freetime': 3, 'goout': 3, 'Dalc': 1,
    'Walc': 2, 'health': 5, 'absences': 4, 'school': 'GP', 'sex': 'F',
    'address': 'U', 'famsize': 'GT3', 'Pstatus': 'T', 'Mjob': 'teacher',
    'Fjob': 'other', 'reason': 'course', 'guardian': 'mother',
    'schoolsup': 'no', 'famsup': 'yes', 'paid': 'no', 'activities': 'yes',
    'nursery': 'yes', 'higher': 'yes', 'internet': 'yes', 'romantic': 'no'
}

# Predict (`model` is the object loaded in the previous step)
input_df = pd.DataFrame([input_data])
prediction = model.predict(input_df)[0]
print(f"Predicted grade: {prediction:.2f}/20")
```

### API Deployment

```py
# FastAPI example
import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
# Load the model once at startup rather than on every request
model = joblib.load('best_model_random_forest.joblib')


@app.post("/predict")
async def predict(data: dict):
    # Build a one-row DataFrame from the JSON request body
    df = pd.DataFrame([data])
    prediction = model.predict(df)[0]
    return {"predicted_grade": round(float(prediction), 2)}
```

## 📚 Data Source

The model was trained on the [Student Performance Data Set](https://archive.ics.uci.edu/ml/datasets/student+performance) from the UCI Machine Learning Repository.

**Dataset authors:** P. Cortez and A. Silva, 2008

**Description:** Performance data for students at two Portuguese schools in the subject "Portuguese language".

## Resources

* 📓 Jupyter Notebook: [Training and Analysis](https://github.com/Reactivity512/student-performance-prediction/blob/main/notebooks/01_eda_and_training.ipynb)
* 📂 GitHub Repository: [Full Project](https://github.com/Reactivity512/student-performance-prediction)

## 👤 Contact

SergeyR256

* GitHub: https://github.com/Reactivity512
* Hugging Face: https://huggingface.co/SergeyR256