Buckets:

DylanJHJ
/

APRIL

4.35 GB

17,662 files

Updated about 16 hours ago

Name	Size	Uploaded	Xet hash
.claude		about 16 hours ago	1 items
__pycache__		about 16 hours ago	3 items
autoqrels		about 16 hours ago	420 items
autoqrels-full		about 16 hours ago	210 items
autoqrels-quantile		about 16 hours ago	231 items
autoqrels-rand		about 16 hours ago	350 items
autoqrels-rank		about 16 hours ago	875 items
dense-70b		about 16 hours ago	8,433 items
dense-7b		about 16 hours ago	26 items
eval_results		about 16 hours ago	420 items
eval_results-full		about 16 hours ago	210 items
eval_results-quantile		about 16 hours ago	301 items
eval_results-rand		about 16 hours ago	350 items
eval_results-rank		about 16 hours ago	625 items
README.md	5.73 kB xet	about 16 hours ago	6e7c4d67
beir_stat.md	1.34 kB xet	about 16 hours ago	f7fbc885
diverse_pooling.py	2.95 kB xet	about 16 hours ago	fa427221
eval_autoqrels.py	17.3 kB xet	about 16 hours ago	4a242295
eval_autoqrels_old.py	16.7 kB xet	about 16 hours ago	e5c3b78c
eval_autoqrels_sample.py	14 kB xet	about 16 hours ago	d9ac16fa
get_beir_stats.py	1.86 kB xet	about 16 hours ago	74e13142
output_autoqrel.py	4.42 kB xet	about 16 hours ago	acd7fe7a
qrel_stats.py	8.78 kB xet	about 16 hours ago	4c5beff4
sanity_check_first_stage.py	7.82 kB xet	about 16 hours ago	3ca49c10
sanity_check_pooling.py	9.8 kB xet	about 16 hours ago	e37c3442
sanity_check_pooling_dense.py	9.8 kB xet	about 16 hours ago	e37c3442

README.md

qrel-analysis

A pipeline for evaluating how well LLM-judge-derived relevance judgments (qrels) align with human judgments, across multiple thresholding strategies and retrieval systems.

Overview

This pipeline:

Takes LLM judge scores (e.g., reranking scores) and converts them into binary qrels using various thresholding strategies
Evaluates a retrieval run against those derived qrels using nDCG@10
Aggregates results across datasets and judge systems into pivot CSVs
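To make the metric concrete, here is a minimal from-scratch sketch of nDCG@10 over nested-dict runs and qrels. It assumes linear gain, a log2 rank discount, and a score of 0 for queries with no relevant documents. The function name `ndcg_at_k` is hypothetical; the scripts themselves list `ir_measures` as the metrics dependency and may behave differently on edge cases.

```python
import math

def ndcg_at_k(run, qrels, k=10):
    """Mean nDCG@k. run: {qid: {docid: score}}, qrels: {qid: {docid: grade}}."""
    per_query = []
    for qid, grades in qrels.items():
        # Rank by descending score; break ties by docid for determinism.
        ranked = sorted(run.get(qid, {}).items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        dcg = sum(max(grades.get(docid, 0), 0) / math.log2(i + 2)
                  for i, (docid, _) in enumerate(ranked))
        # Ideal ranking: the k highest positive grades in descending order.
        ideal = sorted((g for g in grades.values() if g > 0), reverse=True)[:k]
        idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
        # Assumption: a query with no relevant documents contributes 0.
        per_query.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0

qrels = {"q1": {"d1": 1, "d2": 0}}
run = {"q1": {"d2": 2.0, "d1": 1.0}}  # the relevant document sits at rank 2
print(round(ndcg_at_k(run, qrels), 4))  # 0.6309
```

With binary qrels the choice between linear and exponential gain makes no difference, which is the case for most strategies described below.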

Pipeline

Human QRels + LLM Judge Runs + Retrieval Runs
            ↓
    eval_autoqrels.py          ← convert judge scores to qrels, evaluate
            ↓
    Per-strategy nDCG@10 results (JSONL)
            ↓
    collect_results.py         ← aggregate across judges/datasets
            ↓
    Pivot CSVs (one per retriever)

Scripts

1. `eval_autoqrels.py` — Core evaluation

Converts LLM judge scores to binary qrels using a thresholding strategy, then evaluates a retrieval run against those qrels.

Arguments:

Argument	Required	Default	Description
`--dataset_name`	Yes	—	Dataset name (e.g., `beir/nfcorpus/test`)
`--loader_type`	No	`irds`	Loader module type
`--judge_run`	Yes	—	Path to LLM judge JSONL file (used as qrel source)
`--evaluate_run`	No	same as `judge_run`	Path to retrieval run JSONL to evaluate
`--strategies`	No	—	One or more strategies (see below); repeatable
`--threshold`	No	`0.5`	Score cutoff for `thresholding` strategy
`--rank_cutoff`	No	`10`	Top-k docs treated as relevant for `rank` strategy
`--gap_k`	No	`1`	k-th largest score gap for `largest_gap` strategy
`--quantile_cutoff`	No	`0.75`	Quantile threshold for `quantile` strategy
`--min_relevance`	No	`1`	Min human relevance grade for oracle strategies
`--exp`	No	—	Optional experiment tag added to output records
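The four cutoff arguments above (`--threshold`, `--rank_cutoff`, `--gap_k`, `--quantile_cutoff`) each drive one simple rule. The sketch below shows one plausible reading of each rule for a single topic. The function names are hypothetical, and applying the quantile and gap rules per topic (rather than globally), the linear-interpolation quantile, and the tie-breaking are all assumptions, not confirmed behaviour of `eval_autoqrels.py`.

```python
import math

def by_threshold(scores, threshold=0.5):
    """`thresholding`: relevant if score >= threshold."""
    return {d: int(s >= threshold) for d, s in scores.items()}

def by_rank(scores, rank_cutoff=10):
    """`rank`: the top-k documents by score are relevant."""
    top = set(sorted(scores, key=lambda d: (-scores[d], d))[:rank_cutoff])
    return {d: int(d in top) for d in scores}

def by_quantile(scores, quantile_cutoff=0.75):
    """`quantile`: relevant if score >= the q-quantile of this topic's scores."""
    if not scores:
        return {}
    ordered = sorted(scores.values())
    pos = quantile_cutoff * (len(ordered) - 1)  # linear interpolation (assumed)
    lo, hi = math.floor(pos), math.ceil(pos)
    cut = ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    return {d: int(s >= cut) for d, s in scores.items()}

def by_largest_gap(scores, gap_k=1):
    """`largest_gap`: split the ranking at the k-th largest score gap."""
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    gaps = [(scores[ranked[i]] - scores[ranked[i + 1]], i)
            for i in range(len(ranked) - 1)]
    if not gaps:  # fewer than two documents: no gap to split on
        return {d: 0 for d in scores}
    gaps.sort(key=lambda g: (-g[0], g[1]))  # largest gap first
    _, split = gaps[min(gap_k, len(gaps)) - 1]
    # Everything ranked at or above the chosen gap is relevant.
    return {d: int(i <= split) for i, d in enumerate(ranked)}

scores = {"a": 0.9, "b": 0.85, "c": 0.3, "d": 0.1}
print(by_threshold(scores))    # a and b relevant
print(by_rank(scores, 1))      # only a relevant
print(by_quantile(scores))     # cut is 0.8625, so only a relevant
print(by_largest_gap(scores))  # largest gap is between b and c
```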

Thresholding strategies:

Strategy	Oracle?	Description
`direct`	No	Round LLM scores to nearest integer
`thresholding`	No	Binary threshold at `--threshold` (default 0.5)
`rank`	No	Top-k documents are relevant (k = `--rank_cutoff`)
`largest_gap`	No	Threshold at the k-th largest score gap (k = `--gap_k`)
`quantile`	No	Documents scoring at or above the q-quantile are relevant (q = `--quantile_cutoff`)
`optimal_per_topic`	Yes	Per-topic threshold maximizing F1 vs human qrels
`optimal_global`	Yes	Single global threshold maximizing macro-avg F1
`all`	—	Apply all strategies at once
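The two oracle strategies in the table can be read as a search over candidate thresholds, scored by F1 against human qrels binarized at `--min_relevance`. The sketch below is one way to do that, not the script's confirmed implementation: the function names are hypothetical, unjudged documents are assumed non-relevant, only observed scores are tried as thresholds, and ties go to the higher threshold.

```python
def f1(predicted, gold):
    """F1 between two sets of relevant docids."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def optimal_threshold_for_topic(scores, human, min_relevance=1):
    """Return (threshold, f1) maximizing F1 for one topic."""
    if not scores:
        return None, 0.0
    gold = {d for d, grade in human.items() if grade >= min_relevance}
    best_t, best_f1 = None, -1.0
    # Only observed scores can change the predicted set, so try each one.
    for t in sorted(set(scores.values()), reverse=True):
        predicted = {d for d, s in scores.items() if s >= t}
        value = f1(predicted, gold)
        if value > best_f1:  # strict '>' keeps the higher threshold on ties
            best_t, best_f1 = t, value
    return best_t, best_f1

def optimal_global_threshold(judge, human, min_relevance=1):
    """Single threshold maximizing the mean of per-topic F1 (macro average)."""
    candidates = sorted({s for sc in judge.values() for s in sc.values()},
                        reverse=True)

    def macro_f1(t):
        per_topic = []
        for qid, scores in judge.items():
            gold = {d for d, g in human.get(qid, {}).items() if g >= min_relevance}
            per_topic.append(f1({d for d, s in scores.items() if s >= t}, gold))
        return sum(per_topic) / len(per_topic)

    return max(candidates, key=macro_f1) if candidates else None

scores = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}
human = {"a": 2, "b": 0, "c": 1}  # d is unjudged, treated as non-relevant
threshold, best = optimal_threshold_for_topic(scores, human)
print(threshold, round(best, 2))  # 0.4 0.8
```

Because these strategies read the human labels they are trying to approximate, they are best treated as an upper bound for the non-oracle strategies rather than as a usable method.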

Input format (both `--judge_run` and `--evaluate_run`): JSONL, one record per line:

{"qid": "query_id", "docid": "doc_id", "score": 3.14}
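Records in this shape are convenient to hold as a nested mapping from query to document to score. A minimal loader sketch follows; the name `load_run` is hypothetical, and casting ids to strings and scores to floats is an assumption about how the script normalizes its input.

```python
import json
from collections import defaultdict

def load_run(lines):
    """Parse JSONL records into {qid: {docid: score}}."""
    run = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        run[str(record["qid"])][str(record["docid"])] = float(record["score"])
    return dict(run)

sample = ['{"qid": "query_id", "docid": "doc_id", "score": 3.14}']
print(load_run(sample))  # {'query_id': {'doc_id': 3.14}}
```

Since `load_run` accepts any iterable of lines, an open file handle can be passed directly, e.g. `with open(path) as f: run = load_run(f)`.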

Output format: JSONL to stdout, one record per strategy:

{"dataset": "beir/nfcorpus/test", "exp": "judge:eval", "strategy": "direct", "nDCG@10": 0.5559}

Example usage:

# Evaluate using a single strategy
python eval_autoqrels.py \
  --dataset_name beir/nfcorpus/test \
  --judge_run /path/to/judge_run.jsonl \
  --evaluate_run /path/to/bm25_run.jsonl \
  --strategies rank

# Multiple strategies
python eval_autoqrels.py \
  --dataset_name beir/nfcorpus/test \
  --judge_run /path/to/judge_run.jsonl \
  --evaluate_run /path/to/bm25_run.jsonl \
  --strategies direct --strategies rank --strategies largest_gap

# All strategies with experiment tag, save to file
python eval_autoqrels.py \
  --dataset_name beir/nfcorpus/test \
  --judge_run /path/to/judge_run.jsonl \
  --evaluate_run /path/to/bm25_run.jsonl \
  --strategies all \
  --exp "my_experiment" \
  > results/my_judge/raaj-nfcorpus.jsonl

2. `collect_results.py` — Aggregate results

Scans result directories for JSONL files matching `*/raaj*.jsonl`, parses them, and produces pivot-table CSVs grouped by retrieval system.

Arguments:

Argument	Required	Description
`--results_dir`	Yes	Base directory containing per-judge subdirectories
`--output_dir`	Yes	Directory to write one CSV per retrieval system

Expected input directory structure:

results_dir/
├── bm25-rerank-judge/
│   ├── raaj-nfcorpus.jsonl
│   ├── raaj-trec-covid.jsonl
│   └── ...
├── colbert-small-rerank-judge/
│   └── ...
└── splade-v3-rerank-judge/
    └── ...
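A directory laid out like this can be walked with a single glob. The sketch below pairs each parsed record with the name of the subdirectory it came from; `read_results` is a hypothetical helper, not a function taken from `collect_results.py`.

```python
import json
from pathlib import Path

def read_results(results_dir):
    """Yield (subdirectory_name, record) for every */raaj*.jsonl file."""
    for path in sorted(Path(results_dir).glob("*/raaj*.jsonl")):
        with path.open() as f:
            for line in f:
                if line.strip():  # skip blank lines
                    yield path.parent.name, json.loads(line)

# Usage (assumes ./results exists and follows the layout above):
# for subdir, record in read_results("./results"):
#     print(subdir, record["strategy"], record["nDCG@10"])
```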

Output: one pipe-delimited CSV per retrieval system (e.g., `bm25.csv`, `colbert-small.csv`), each a pivot table comparing nDCG@10 across judge systems and strategies.
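The exact pivot layout is not spelled out here, so the following is only a sketch under an assumed layout: one row per dataset and one column per strategy, written with `|` as the delimiter. It uses the standard library to stay dependency-free; the name `pivot_to_csv` is hypothetical, and with pandas the same shape could come from `DataFrame.pivot_table` followed by `to_csv(sep="|")`.

```python
import csv
import io

def pivot_to_csv(records, index="dataset", columns="strategy", value="nDCG@10"):
    """Pivot flat result records into a pipe-delimited table (as a string)."""
    rows = sorted({r[index] for r in records})
    cols = sorted({r[columns] for r in records})
    cell = {(r[index], r[columns]): r[value] for r in records}
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow([index] + cols)
    for row in rows:
        # Missing (row, column) combinations are left empty.
        writer.writerow([row] + [cell.get((row, c), "") for c in cols])
    return out.getvalue()

records = [
    {"dataset": "beir/nfcorpus/test", "strategy": "direct", "nDCG@10": 0.5559},
    # Illustrative value only, not a real result:
    {"dataset": "beir/nfcorpus/test", "strategy": "rank", "nDCG@10": 0.5},
]
print(pivot_to_csv(records))
```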

Example usage:

python collect_results.py \
  --results_dir ./results \
  --output_dir ./new

3. `get_beir_stats.py` — Dataset statistics

Prints a tab-separated statistics table for BEIR and MS MARCO datasets. No arguments needed.

Datasets covered: msmarco-passage/trec-dl-2019, msmarco-passage/trec-dl-2020, and 13 BEIR datasets (arguana, climate-fever, dbpedia-entity, fever, fiqa, hotpotqa, nfcorpus, nq, quora, scidocs, scifact, trec-covid, webis-touche2020).

Statistics reported: number of queries, corpus size, total judgments, average query and document length, average positives and negatives per query, and the range of judgment levels.
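The judgment-related statistics can be derived from a qrels mapping alone. The sketch below covers only that part (corpus size and text lengths need the dataset itself). The name `qrel_stats` and the rule that a grade of at least `min_relevance` counts as positive are assumptions, not taken from `get_beir_stats.py`.

```python
def qrel_stats(qrels, min_relevance=1):
    """Summarize {qid: {docid: grade}} judgments."""
    grades = [g for docs in qrels.values() for g in docs.values()]
    n_queries = len(qrels)
    positives = sum(g >= min_relevance for g in grades)
    negatives = len(grades) - positives
    return {
        "num_queries": n_queries,
        "total_judgments": len(grades),
        "avg_pos_per_query": positives / n_queries if n_queries else 0.0,
        "avg_neg_per_query": negatives / n_queries if n_queries else 0.0,
        "judgment_range": (min(grades), max(grades)) if grades else None,
    }

example = {"q1": {"d1": 2, "d2": 0}, "q2": {"d3": 1}}
print(qrel_stats(example))
```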

python get_beir_stats.py

Results Directory

Current results are in results/, organized by {retriever}-rerank-{judge}/, e.g.:

bm25-rerank-judge/
colbert-small-rerank-judge/
splade-v3-rerank-judge/
nomicai-modernbert-embed-rerank-judge/
qwen3-embed-600m-rerank-judge/

Each subdirectory contains raaj-{dataset}.jsonl files.
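Because the retriever, judge, and dataset are all encoded in the path, they can be recovered with two regular expressions. This is a sketch with hypothetical names (`DIR_PATTERN`, `FILE_PATTERN`, `parse_result_path`); it assumes forward-slash paths and that the last `-rerank-` in the directory name separates retriever from judge.

```python
import re

DIR_PATTERN = re.compile(r"^(?P<retriever>.+)-rerank-(?P<judge>.+)$")
FILE_PATTERN = re.compile(r"^raaj-(?P<dataset>.+)\.jsonl$")

def parse_result_path(path):
    """Split '<retriever>-rerank-<judge>/raaj-<dataset>.jsonl' into parts."""
    directory, _, filename = path.rstrip("/").rpartition("/")
    # Only the last directory component carries the naming convention.
    dir_match = DIR_PATTERN.match(directory.rsplit("/", 1)[-1])
    file_match = FILE_PATTERN.match(filename)
    if not dir_match or not file_match:
        return None
    return {**dir_match.groupdict(), **file_match.groupdict()}

print(parse_result_path("results/bm25-rerank-judge/raaj-nfcorpus.jsonl"))
# {'retriever': 'bm25', 'judge': 'judge', 'dataset': 'nfcorpus'}
```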

Dependencies

autollmrerank — internal module for dataset loading (loader_dev.irds)
ir_measures — IR evaluation metrics
pandas — result aggregation

Last updated: Jun 15
