---
language:
- en
tags:
- regression
- similarity
- sql
- natural-language
- reward-model
license: mit
datasets:
- custom
metrics:
- mse
- mae
- rmse
model-index:
- name: BERT Reward Model for CoT Filtering
  results:
  - task:
      type: regression
      name: Similarity Score Prediction
    dataset:
      name: Custom CoT Dataset
      type: custom
    metrics:
    - type: mse
      value: 0.0238
    - type: mae
      value: 0.1229
    - type: rmse
      value: 0.1543
---

# BERT Reward Model for CoT Filtering

A BERT-based regression model fine-tuned to score how well a predicted natural language description matches a SQL query and its Chain-of-Thought reasoning, for use in filtering CoT data.

## Model Description

This model is based on `bert-base-uncased` and has been fine-tuned for regression to predict similarity scores in the range [0, 1]. The model takes as input a concatenation of:
- SQL query
- Reasoning/Chain-of-Thought explanation
- Predicted natural language description

The output is a single similarity score indicating how well the predicted NL matches the ground truth.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("DarianNLP/bert_sequel_beagles")
model = AutoModelForSequenceClassification.from_pretrained(
    "DarianNLP/bert_sequel_beagles",
    num_labels=1,
    problem_type="regression",
)
model.eval()

# Prepare input: SQL, reasoning, and predicted NL joined in one string
sql = "SELECT movie_title FROM movies WHERE movie_release_year = 1945"
reasoning = "think: The SQL selects the movie title..."
predicted_nl = "What was the most popular movie released in 1945?"

input_text = f"SQL: {sql}\nReasoning: {reasoning}\nNL: {predicted_nl}"

# Tokenize and predict
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    # Map the raw logit to a score in [0, 1]
    similarity_score = torch.sigmoid(outputs.logits).item()

print(f"Predicted similarity: {similarity_score:.3f}")
```
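
For filtering, you typically want to score many (SQL, reasoning, NL) triples at once and keep only those above a cutoff. A minimal batched sketch, reusing the `tokenizer` and `model` loaded above; the `score_and_filter` helper and the 0.5 threshold are illustrative, not part of the released model:

```python
import torch

def score_and_filter(triples, threshold=0.5):
    """Score (sql, reasoning, nl) triples; keep those scoring >= threshold."""
    texts = [f"SQL: {s}\nReasoning: {r}\nNL: {n}" for s, r, n in triples]
    inputs = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=512
    )
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(-1)
    scores = torch.sigmoid(logits).tolist()
    return [(t, sc) for t, sc in zip(triples, scores) if sc >= threshold]
```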

## Training Details

- **Base Model**: bert-base-uncased
- **Training Dataset**: custom CoT dataset with corruptions (7,342 examples)
- **Train/Val/Test Split**: 75% / 12.5% / 12.5%
- **Training Loss**: MSE (mean squared error)
- **Evaluation Metrics**:
  - MSE: 0.0238
  - MAE: 0.1229
  - RMSE: 0.1543
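
A minimal sketch of how this kind of regression fine-tune is set up with the Hugging Face `Trainer`; the dataset class, example data, and hyperparameters here are illustrative assumptions, not the published training configuration:

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# With num_labels=1 and problem_type="regression", the model trains
# with MSELoss, matching the MSE objective reported above.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

class TripleDataset(torch.utils.data.Dataset):
    """Tokenized (SQL, reasoning, NL) strings paired with float targets."""

    def __init__(self, texts, scores):
        self.enc = tokenizer(texts, truncation=True, max_length=512, padding="max_length")
        self.scores = scores

    def __len__(self):
        return len(self.scores)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.scores[i], dtype=torch.float)
        return item

# Toy example; the real dataset has 7,342 (text, score) pairs
train_ds = TripleDataset(
    ["SQL: SELECT 1\nReasoning: think: trivial\nNL: select the constant one"],
    [1.0],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="reward-model"),
    train_dataset=train_ds,
)
trainer.train()
```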

## Limitations

- Maximum input length: 512 tokens (BERT's limit); longer inputs are silently truncated, as shown in the check below
- Trained on a specific domain (SQL-to-NL translation with CoT)
- Performance may vary on out-of-domain data
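
If truncation is a concern, you can check whether a triple fits before scoring. The `fits_in_context` helper below is a hypothetical convenience reusing the `tokenizer` from the Usage section:

```python
def fits_in_context(sql, reasoning, nl, max_length=512):
    """Return True if the concatenated triple fits in BERT's context window."""
    text = f"SQL: {sql}\nReasoning: {reasoning}\nNL: {nl}"
    return len(tokenizer(text)["input_ids"]) <= max_length
```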

## Citation

If you use this model, please cite:

```bibtex
@misc{bert_cot_reward_model,
  title={BERT Reward Model for Chain-of-Thought Filtering},
  author={Darian Lee},
  year={2025},
}
```