# Programming Paradigm Classifier
Two specialized machine learning classifiers for identifying programming paradigms in natural language queries:
- Conceptual Classifier: For short, natural language questions about programming concepts
- Technical Classifier: For long, code-heavy StackOverflow-style questions with technical details
See the `logs` directory for detailed experiment runs with each classifier.
## When to Use Each Classifier
| Classifier | Input Type | Use Case | Example |
|---|---|---|---|
| Conceptual | Short (10-50 words), natural language, no code | Design decisions, patterns, paradigm questions | "How do I make this function pure?" |
| Technical | Long (100-250 tokens), title+body, with code/errors | StackOverflow-style debugging, implementation | "What happens in heap when class A inherits B?" + [code] |
### Why two classifiers?
- Different signals: Conceptual relies on semantic meaning; Technical relies on code patterns + syntax
- Different models: Conceptual = BAAI embeddings + SVM, Technical = CodeBERT + XGBoost
- Open-ended problem statement: the problem statement did not specify what kind of natural-language queries the system will receive, so I built it for both kinds of queries: short and simple, and long and technical
## Datasets
| Classifier | Source | Size | Characteristics |
|---|---|---|---|
| Conceptual | Curated synthetic questions | 246 samples | Short (10-50 words), natural language, paradigm-focused, no code |
| Technical | StackOverflow | 57,235 train | Long format (100-250 tokens), code snippets, errors, real-world posts |
Rationale: Conceptual uses a dedicated dataset to capture paradigm phrasing without code noise; Technical uses StackOverflow for authentic debugging/implementation queries with rich technical context.
## Quick Start

### 1. Install

```bash
pip install -r requirements.txt
```

### 2. Run Inference

**Conceptual Classifier:** append your queries to the `test_texts` list in `concept-classifier/`, then run:

```bash
cd concept-classifier
python inference.py
```

**Interactive mode:** run the following and enter queries at the command line, one at a time, for as long as you like:

```bash
python inference.py --interactive
```

**Technical Classifier:** change the `text` input variable in the `main` of `technical-classifier/inference.py`, then run:

```bash
cd technical-classifier
python inference.py
```
## Inference

### Concept Classifier

Architecture: BAAI/bge-base-en-v1.5 (768D) → Linear SVM → Calibrated probabilities

Classes: functional, oop, procedural, none

Uncertainty: Returns "X or Y" if margin < 0.05 | Returns "unclear" if 0.05 ≤ margin < 0.15 | Otherwise returns the top predicted class
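The uncertainty rule can be sketched as follows (thresholds are the ones stated above; `resolve_prediction` is an illustrative name, not the repository's actual API):

```python
def resolve_prediction(probs, ambiguous=0.05, unclear=0.15):
    """Apply the margin thresholds to calibrated class probabilities.

    probs: dict mapping class name -> probability.
    Returns "X or Y" when the top two classes are nearly tied,
    "unclear" for a small-but-nonzero margin, else the top class.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top, p1), (second, p2) = ranked[0], ranked[1]
    margin = p1 - p2
    if margin < ambiguous:
        return f"{top} or {second}"
    if margin < unclear:
        return "unclear"
    return top

# Confident prediction: margin 0.880 -> "oop"
print(resolve_prediction({"oop": 0.920, "procedural": 0.040,
                          "functional": 0.030, "none": 0.010}))
```

Note how this matches the sample output below: a margin of 0.120 falls between the two thresholds, so the result is "unclear" even though procedural has the highest probability.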
Output:

```text
Text: Does using namespaces make this object-oriented?
Result: oop
Max: oop (0.920), 2nd: procedural (0.040), Margin: 0.880

Text: If everything is technically procedural at runtime, do paradigms matter?
Result: procedural
Max: procedural (0.665), 2nd: oop (0.282), Margin: 0.383

Text: Is this code functional just because it uses lambdas?
Result: functional
Max: functional (0.761), 2nd: procedural (0.105), Margin: 0.656

Text: Does avoiding classes automatically make code procedural?
Result: unclear
Max: procedural (0.551), 2nd: oop (0.431), Margin: 0.120
```
### Technical Classifier

Architecture: Length-aware gating ensemble of CodeBERT + XGBoost (TF-IDF + handcrafted features)
- Short input (< 60 tokens): 80% CodeBERT + 20% XGBoost
- Medium input (60–150 tokens): 65% CodeBERT + 35% XGBoost
- Long input (> 150 tokens): 50% CodeBERT + 50% XGBoost
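A minimal sketch of the gating logic above (function names are illustrative, not the repository's actual API):

```python
def ensemble_weights(token_length):
    """Length-aware gate: pick (CodeBERT weight, XGBoost weight)
    from the input's token length, per the table above."""
    if token_length < 60:        # short input
        return 0.80, 0.20
    if token_length <= 150:      # medium input
        return 0.65, 0.35
    return 0.50, 0.50            # long input

def soft_vote(cb_probs, xgb_probs, token_length):
    """Weighted soft vote over aligned per-class probability lists."""
    w_cb, w_xgb = ensemble_weights(token_length)
    return [w_cb * p + w_xgb * q for p, q in zip(cb_probs, xgb_probs)]
```

This matches the debug run below: a 151-token input falls in the "long" bucket, so both models are weighted 0.50 and the ensemble probability for Procedural is 0.5 × 0.9994 + 0.5 × 0.9947 ≈ 0.9971.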
Classes: Same as conceptual
Output:

```text
Input text: 'Query: memory overhead of pointers in c/c++\t"I\'m on a 64bit platform, so all memory adrs are 8 bytes.\n\nSo to get an estimate of the memory usage of an array, should I add 8 bytes to the sizeof(DATATYPE) for each entry in the array.\n\nExample:\n\nshort unsigned int *ary = new short unsigned int[1000000]; //length 1mio\n//sizeof(short unsinged int) = 2bytes \n//sizeof(short unsinged int*) = 8 bytes\n\n\nSo does each entry take up 10bytes? and will my 1mio length array therefore use atleast 10megabytes?\n\nthanks\n\n'
============================================================
DEBUG: Model Outputs
============================================================
Token length: 151
Weights: CB=0.50, XGB=0.50

CodeBERT class probabilities:
  Functional   : 0.0002
  Non-Paradigm : 0.0002
  Oop          : 0.0002
  Procedural   : 0.9994
  → Predicted: Procedural

XGBoost class probabilities:
  Functional   : 0.0018
  Non-Paradigm : 0.0016
  Oop          : 0.0018
  Procedural   : 0.9947
  → Predicted: Procedural

Ensemble class probabilities:
  Functional   : 0.0010
  Non-Paradigm : 0.0009
  Oop          : 0.0010
  Procedural   : 0.9971  ← FINAL
```
## Project Structure

```text
krkn-assi/
├── concept-classifier/        # Ready to use (BAAI + SVM)
│   ├── inference.py
│   ├── svm_classifier.pkl
│   └── sentence_model_name.txt
├── technical-classifier/      # Ready to use (CodeBERT + XGBoost)
│   ├── inference.py
│   ├── codebert_model/
│   ├── xgboost_model.pkl
│   └── tfidf_vectorizer.pkl
├── training/                  # Training scripts (optional)
│   ├── conceptual/
│   └── technical/
├── README.md
└── requirements.txt
```
## Performance

### Concept Classifier
```text
Splitting data (80/20 train/test split)...
Training samples: 196
Test samples: 50

Training set class distribution:
functional    52
none          24
oop           68
procedural    52
Name: count, dtype: int64

Test Accuracy: 0.9600 (96.00%)

Classification Report:
======================================================================
              precision    recall  f1-score   support

  functional       1.00      0.92      0.96        13
        none       1.00      1.00      1.00         6
         oop       0.90      1.00      0.95        18
  procedural       1.00      0.92      0.96        13

    accuracy                           0.96        50
   macro avg       0.97      0.96      0.97        50
weighted avg       0.96      0.96      0.96        50

Confusion Matrix:
======================================================================
            functional  none  oop  procedural
functional          12     0    1           0
none                 0     6    0           0
oop                  0     0   18           0
procedural           0     0    1          12
(Rows = True labels, Columns = Predicted labels)

Performing 5-fold cross-validation...
Cross-validation scores: [0.9        0.91836735 0.97959184 0.83673469 0.79591837]
Mean CV accuracy: 0.8861 (+/- 0.1282)
```
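The 5-fold numbers above can be reproduced with scikit-learn's `cross_val_score`. A runnable sketch, with random placeholder data standing in for the real 768-D BAAI embeddings and labels (the exact calibration setup is an assumption inferred from the "Linear SVM → Calibrated probabilities" architecture described earlier):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Placeholder data: in the real pipeline, X holds 768-D BAAI
# embeddings and y holds the four paradigm labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(246, 768))
y = rng.integers(0, 4, size=246)

# Linear SVM wrapped for calibrated probabilities.
clf = CalibratedClassifierCV(LinearSVC(), cv=3)

scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
```

With random labels the accuracies are near chance; the point is only the shape of the evaluation loop.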
### Technical Classifier

Data: 57k train → 19.6k balanced samples | Accuracy (test): XGBoost 90.51%, CodeBERT 94.72%, Ensemble 95.25%
Architecture:
- XGBoost: TF-IDF (1k) + 10 features (keywords, structure) | 200 trees
- CodeBERT: microsoft/codebert-base | 3 epochs, 27 min | FP16
- Ensemble: length-aware soft voting
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Functional | 0.88 | 0.96 | 0.92 | 1,352 |
| Non-Paradigm | 0.96 | 0.95 | 0.96 | 4,539 |
| OOP | 0.97 | 0.94 | 0.95 | 5,379 |
| Procedural | 0.97 | 0.99 | 0.98 | 995 |
CodeBERT training: Epoch 1: 92.64% → Epoch 2: 93.66% → Epoch 3: 94.72%
## Model Separability & Embedding Quality

BAAI/bge-base-en-v1.5 Embedding Space Analysis:

The 246-sample conceptual dataset shows strong class separability in embedding space:
| Class Pair | Separation Distance |
|---|---|
| None ↔ OOP | 0.4198 |
| None ↔ Procedural | 0.3467 |
| OOP ↔ Procedural | 0.3355 |
| Functional ↔ None | 0.3370 |
| Functional ↔ OOP | 0.2957 |
| Functional ↔ Procedural | 0.2511 |
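Separation distances like these can be computed as pairwise distances between per-class centroid embeddings. A sketch, assuming cosine distance between unit-normalized centroids (the metric actually used is not stated in this README):

```python
from itertools import combinations

import numpy as np

def centroid_separations(embeddings, labels):
    """Pairwise cosine distance between class centroids.

    embeddings: (n, d) array of sentence embeddings.
    labels: length-n sequence of class names.
    Returns a dict mapping (class_a, class_b) -> distance.
    """
    labels = np.asarray(labels)
    centroids = {}
    for cls in np.unique(labels):
        c = embeddings[labels == cls].mean(axis=0)
        centroids[cls] = c / np.linalg.norm(c)  # unit-normalize
    return {
        (a, b): float(1.0 - centroids[a] @ centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }
```

Larger distances mean easier separation; per the table, Functional vs. Procedural (0.2511) is the hardest pair for the conceptual classifier to distinguish.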