
# Programming Paradigm Classifier

Two specialized machine learning classifiers for identifying programming paradigms in natural language queries:

  • Conceptual Classifier: For short, natural language questions about programming concepts
  • Technical Classifier: For long, code-heavy StackOverflow-style questions with technical details

See the `logs/` directory for detailed experiment runs for each of these classifiers.


## When to Use Each Classifier

| Classifier | Input Type | Use Case | Example |
|---|---|---|---|
| Conceptual | Short (10-50 words), natural language, no code | Design decisions, patterns, paradigm questions | "How do I make this function pure?" |
| Technical | Long (100-250 tokens), title+body, with code/errors | StackOverflow-style debugging, implementation | "What happens in heap when class A inherits B?" + [code] |

### Why two classifiers?

  • Different signals: the conceptual classifier relies on semantic meaning, while the technical classifier relies on code patterns and syntax
  • Different models: Conceptual = BAAI embeddings + SVM, Technical = CodeBERT + XGBoost
  • Open-ended problem statement: the problem statement did not specify what kind of natural language queries the system will receive, so I built it for both kinds of queries: simple and short, and long and technical
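Since both kinds of queries may arrive, one option is to route each query to the appropriate classifier up front. A minimal sketch of such a router (this helper is hypothetical and not part of the shipped code; the length threshold and code-detection heuristic are assumptions):

```python
import re


def route_query(text: str) -> str:
    """Pick a classifier for a query.

    Heuristic (an illustrative assumption): short, code-free natural
    language goes to the conceptual classifier; long or code-bearing
    text goes to the technical classifier.
    """
    # Crude code/error detection: braces, semicolons, common keywords.
    has_code = bool(re.search(r"[{};]|def |class |#include|Traceback", text))
    word_count = len(text.split())
    if has_code or word_count > 50:
        return "technical"
    return "conceptual"
```

A real router could also use token counts from the CodeBERT tokenizer instead of whitespace word counts.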

## Datasets

| Classifier | Source | Size | Characteristics |
|---|---|---|---|
| Conceptual | Curated synthetic questions | 246 samples | Short (10-50 words), natural language, paradigm-focused, no code |
| Technical | StackOverflow | 57,235 train | Long format (100-250 tokens), code snippets, errors, real-world posts |

Rationale: The conceptual classifier uses a dedicated dataset to capture paradigm phrasing without code noise. The technical classifier uses StackOverflow for authentic debugging/implementation queries with rich technical context.


## Quick Start

### 1. Install

```bash
pip install -r requirements.txt
```

### 2. Run Inference

**Conceptual Classifier:**

Append your queries to the `test_texts` list in `concept-classifier/inference.py`, then run:

```bash
cd concept-classifier
python inference.py
```

Interactive mode (keep entering queries on the command line, one at a time):

```bash
python inference.py --interactive
```

**Technical Classifier:**

Change the `text` input variable in the `main` of `technical-classifier/inference.py`, then run:

```bash
cd technical-classifier
python inference.py
```

## Inference

### Concept Classifier

Architecture: BAAI/bge-base-en-v1.5 (768D) → Linear SVM → Calibrated probabilities

Classes: functional, oop, procedural, none

Uncertainty: Returns "X or Y" if margin < 0.05 | Returns "unclear" if 0.05 ≤ margin < 0.15 | Otherwise returns the top predicted class
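The margin rule above can be sketched as follows (a minimal illustration of the decision logic, not the shipped `inference.py`):

```python
def resolve_prediction(probs: dict) -> str:
    """Apply the margin-based uncertainty rule.

    probs maps class name -> calibrated probability.
    margin = P(top class) - P(second class).
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top, p1), (second, p2) = ranked[0], ranked[1]
    margin = p1 - p2
    if margin < 0.05:
        return f"{top} or {second}"   # too close to call
    if margin < 0.15:
        return "unclear"              # moderately ambiguous
    return top                        # confident prediction
```

For example, the `procedural (0.551)` vs `oop (0.431)` case below has a margin of 0.120, which falls in the `unclear` band.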

Output:

```
Text: Does using namespaces make this object-oriented?
Result: oop
Max: oop (0.920), 2nd: procedural (0.040), Margin: 0.880

Text: If everything is technically procedural at runtime, do paradigms matter?
Result: procedural
Max: procedural (0.665), 2nd: oop (0.282), Margin: 0.383

Text: Is this code functional just because it uses lambdas?
Result: functional
Max: functional (0.761), 2nd: procedural (0.105), Margin: 0.656

Text: Does avoiding classes automatically make code procedural?
Result: unclear
Max: procedural (0.551), 2nd: oop (0.431), Margin: 0.120
```

### Technical Classifier

Architecture: Length-aware gating ensemble of CodeBERT and XGBoost (TF-IDF + handcrafted features):

  • Short input (< 60 tokens): 80% CodeBERT + 20% XGBoost
  • Medium input (60–150 tokens): 65% CodeBERT + 35% XGBoost
  • Long input (> 150 tokens): 50% CodeBERT + 50% XGBoost
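The gating scheme above amounts to a token-length-dependent weighted average of the two models' class probabilities. A minimal sketch (function name and array layout are illustrative, not the shipped code):

```python
import numpy as np


def ensemble_probs(codebert_p, xgb_p, token_len):
    """Blend the two models' class probabilities with
    length-aware weights: shorter inputs trust CodeBERT more."""
    if token_len < 60:
        w = 0.80   # short: 80% CodeBERT
    elif token_len <= 150:
        w = 0.65   # medium: 65% CodeBERT
    else:
        w = 0.50   # long: equal weighting
    return w * np.asarray(codebert_p) + (1 - w) * np.asarray(xgb_p)
```

In the debug output below, the 151-token input falls into the long bucket, so the final probabilities are the plain average of the CodeBERT and XGBoost distributions.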

Classes: Same as conceptual

Output:

```
Input text: 'Query: memory overhead of pointers in c/c++\t"I\'m on a 64bit platform, so all memory adrs are 8 bytes.\n\nSo to get an estimate of the memory usage of an array, should I add 8 bytes to the sizeof(DATATYPE) for each entry in the array.\n\nExample:\n\nshort unsigned int *ary = new short unsigned int[1000000]; //length 1mio\n//sizeof(short unsinged int) = 2bytes \n//sizeof(short unsinged int*) = 8 bytes\n\n\nSo does each entry take up 10bytes? and will my 1mio length array therefore use atleast 10megabytes?\n\nthanks\n\n'

============================================================
DEBUG: Model Outputs
============================================================
Token length: 151
Weights: CB=0.50, XGB=0.50

CodeBERT class probabilities:
  Functional     : 0.0002
  Non-Paradigm   : 0.0002
  Oop            : 0.0002
  Procedural     : 0.9994
  → Predicted: Procedural

XGBoost class probabilities:
  Functional     : 0.0018
  Non-Paradigm   : 0.0016
  Oop            : 0.0018
  Procedural     : 0.9947
  → Predicted: Procedural

Ensemble class probabilities:
  Functional     : 0.0010
  Non-Paradigm   : 0.0009
  Oop            : 0.0010
  Procedural     : 0.9971 ← FINAL
```

## Project Structure

```
krkn-assi/
 concept-classifier/          Ready to use (BAAI + SVM)
    inference.py
    svm_classifier.pkl
    sentence_model_name.txt

 technical-classifier/        Ready to use (CodeBERT + XGBoost)
    inference.py
    codebert_model/
    xgboost_model.pkl
    tfidf_vectorizer.pkl

 training/                    Training scripts (optional)
    conceptual/
    technical/

 README.md
 requirements.txt
```

## Performance

### Concept Classifier

```
Splitting data (80/20 train/test split)...
Training samples: 196
Test samples: 50

Training set class distribution:
functional    52
none          24
oop           68
procedural    52
Name: count, dtype: int64


Test Accuracy: 0.9600 (96.00%)

Classification Report:
======================================================================
              precision    recall  f1-score   support

  functional       1.00      0.92      0.96        13
        none       1.00      1.00      1.00         6
         oop       0.90      1.00      0.95        18
  procedural       1.00      0.92      0.96        13

    accuracy                           0.96        50
   macro avg       0.97      0.96      0.97        50
weighted avg       0.96      0.96      0.96        50


Confusion Matrix:
======================================================================
            functional  none  oop  procedural
functional          12     0    1           0
none                 0     6    0           0
oop                  0     0   18           0
procedural           0     0    1          12

(Rows = True labels, Columns = Predicted labels)

Performing 5-fold cross-validation...
Cross-validation scores: [0.9 0.91836735 0.97959184 0.83673469 0.79591837]
Mean CV accuracy: 0.8861 (+/- 0.1282)
```
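The 5-fold cross-validation can be reproduced with scikit-learn's `cross_val_score`. A sketch on stand-in random features (the real pipeline uses 768-D BAAI embeddings and the calibrated SVM from `svm_classifier.pkl`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Stand-in data with the same shape as the conceptual dataset:
# 246 samples, 768-D vectors, 4 classes.
X, y = make_classification(
    n_samples=246, n_features=768, n_classes=4,
    n_informative=16, random_state=0,
)

# 5-fold CV, reporting accuracy per fold.
scores = cross_val_score(LinearSVC(max_iter=5000), X, y, cv=5)
print(scores, scores.mean())
```

The fairly wide spread between folds in the report above (0.80 to 0.98) is expected with only ~49 test samples per fold.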

### Technical Classifier

Data: 57k train → 19.6k balanced samples | Test accuracy: XGBoost 90.51%, CodeBERT 94.72%, Ensemble 95.25%

Architecture:

  1. XGBoost: TF-IDF (1k) + 10 features (keywords, structure) | 200 trees
  2. CodeBERT: microsoft/codebert-base | 3 epochs, 27 min | FP16
  3. Ensemble: length-aware soft voting
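The handcrafted-feature side of the XGBoost model can be sketched like this. The exact 10 keyword/structure features are not listed in this card, so the ones below are illustrative assumptions, not the shipped feature set:

```python
import re


def handcrafted_features(text: str) -> list:
    """Illustrative keyword/structure features to concatenate with
    TF-IDF before feeding XGBoost (feature choice is an assumption)."""
    lowered = text.lower()
    return [
        float(len(text.split())),                                # length
        float(lowered.count("class ")),                          # OOP signal
        float(lowered.count("lambda")),                          # functional signal
        float(lowered.count("for ") + lowered.count("while ")),  # loop/procedural signal
        float(bool(re.search(r"traceback|error", lowered))),     # error-text flag
    ]
```

In the real pipeline these numeric features would be horizontally stacked with the 1k-dimensional TF-IDF vector before training the 200-tree XGBoost model.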

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Functional | 0.88 | 0.96 | 0.92 | 1,352 |
| Non-Paradigm | 0.96 | 0.95 | 0.96 | 4,539 |
| OOP | 0.97 | 0.94 | 0.95 | 5,379 |
| Procedural | 0.97 | 0.99 | 0.98 | 995 |

CodeBERT training: Epoch 1: 92.64% → Epoch 2: 93.66% → Epoch 3: 94.72%


## Model Separability & Embedding Quality

BAAI/bge-base-en-v1.5 Embedding Space Analysis:

The 246-sample conceptual dataset shows strong class separability in embedding space:

| Class Pair | Separation Distance |
|---|---|
| None ↔ OOP | 0.4198 |
| None ↔ Procedural | 0.3467 |
| OOP ↔ Procedural | 0.3355 |
| Functional ↔ None | 0.3370 |
| Functional ↔ OOP | 0.2957 |
| Functional ↔ Procedural | 0.2511 |
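Separation distances like those above can be computed as distances between class centroids in embedding space. A sketch assuming cosine distance between mean embeddings (the card does not state the metric, so this is an assumption):

```python
import numpy as np


def centroid_separation(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two classes.

    emb_a, emb_b: arrays of shape (n_samples, dim), one row per
    embedded question in that class.
    """
    ca, cb = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cos = np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb))
    return 1.0 - float(cos)
```

Applied to BAAI/bge-base-en-v1.5 embeddings of each class's questions, larger values indicate classes that are easier for the linear SVM to separate.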