Commit 79f114e
Parent(s): 5d40291
feat: Add citation
app.py CHANGED
@@ -42,13 +42,12 @@ available](https://scandeval.com).
 The generative models are evaluated using in-context learning with few-shot prompts.
 The few-shot examples are sampled randomly from the training split, and we benchmark
 the models 10 times with bootstrapped test sets and different few-shot examples in each
-iteration. This allows us to better measure the uncertainty of the results.
-
-
-percentage of other models that a model beats on a task
-
-
-model significantly beats another model.
+iteration. This allows us to better measure the uncertainty of the results. We use the
+uncertainty in the radial plot when we compute the win ratios (i.e., the percentage of
+other models that a model beats on a task). Namely, we compute the win ratio as the
+percentage of other models that a model _significantly_ beats on a task, where we use a
+paired t-test with a significance level of 0.05 to determine whether a model
+significantly beats another model.
 
 ## The Benchmark Datasets
 
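The reworded paragraph in this hunk pins down a concrete computation: a model's win ratio on a task is the percentage of the other models it beats in a paired t-test at the 0.05 significance level, run over the per-iteration scores. Below is a minimal sketch of that computation, assuming each model's scores from the 10 iterations are available as equal-length arrays; the names and data layout are illustrative, not ScandEval's actual implementation.

```python
# Sketch of the win-ratio computation described in the hunk above.
# Assumption: `scores` maps model name -> array of per-iteration scores on one task.
import numpy as np
from scipy.stats import ttest_rel


def significantly_beats(a: np.ndarray, b: np.ndarray, alpha: float = 0.05) -> bool:
    """True if the scores in `a` are significantly higher than those in `b` (paired t-test)."""
    _, p_value = ttest_rel(a, b)
    return a.mean() > b.mean() and p_value < alpha


def win_ratio(model: str, scores: dict[str, np.ndarray]) -> float:
    """Percentage of the other models that `model` significantly beats on this task."""
    others = [name for name in scores if name != model]
    wins = sum(significantly_beats(scores[model], scores[name]) for name in others)
    return 100 * wins / len(others)


# Toy usage: per-iteration scores for three hypothetical models, 10 iterations each.
rng = np.random.default_rng(0)
scores = {
    "model-a": rng.normal(0.75, 0.02, size=10),
    "model-b": rng.normal(0.70, 0.02, size=10),
    "model-c": rng.normal(0.72, 0.02, size=10),
}
print({name: win_ratio(name, scores) for name in scores})
```

A paired test presumes the per-iteration scores of two models can be matched up across the 10 iterations, which is what the use of a paired t-test in the text implies.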
@@ -104,6 +103,22 @@ classification, we use the probabilities of the answer letter (a, b, c or d) to
 the answer. The datasets in this task are machine translated versions of the
 [HellaSwag](https://rowanzellers.com/hellaswag/) dataset. We use the Matthews
 Correlation Coefficient (MCC) as the evaluation metric.
+
+
+## Citation
+
+If you use the ScandEval benchmark in your work, please cite [the
+paper](https://aclanthology.org/2023.nodalida-1.20):
+
+```
+@inproceedings{nielsen2023scandeval,
+title={ScandEval: A Benchmark for Scandinavian Natural Language Processing},
+author={Nielsen, Dan},
+booktitle={Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
+pages={185--201},
+year={2023}
+}
+```
 """
 
 
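The unchanged context around this hunk also describes how the multiple-choice task is scored: the answer letter (a, b, c or d) with the highest probability is taken as the prediction, and the predictions are scored with the Matthews Correlation Coefficient. A rough sketch under those assumptions; the per-question probability layout and the example data are illustrative, not the benchmark's internal representation.

```python
# Illustrative only: pick the most probable answer letter per question, then
# score the predictions with MCC, the metric named in the context lines above.
from sklearn.metrics import matthews_corrcoef


def predict_letters(letter_probs: list[dict[str, float]]) -> list[str]:
    """Choose the answer letter with the highest probability for each question."""
    return [max(probs, key=probs.get) for probs in letter_probs]


gold = ["a", "c", "b", "d"]  # made-up gold labels for the sketch
letter_probs = [
    {"a": 0.60, "b": 0.20, "c": 0.10, "d": 0.10},
    {"a": 0.10, "b": 0.20, "c": 0.50, "d": 0.20},
    {"a": 0.30, "b": 0.40, "c": 0.20, "d": 0.10},
    {"a": 0.25, "b": 0.25, "c": 0.20, "d": 0.30},
]
print(matthews_corrcoef(gold, predict_letters(letter_probs)))
```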