The headline that was within noise

2026-05-31T00:00:00+00:00

Before running the ablation on my retrieval pilot, I wrote down three predictions about which of the three arms — dense vector search, lexical BM25, or their reciprocal-rank fusion — would retrieve best. Writing them first was the useful part: it stopped me from reading the results as a ranking when the data could not support one.

No ranking the data supports

The obvious question — which arm wins Recall@10 — has a numerical answer and no defensible one. Across the same twenty-nine hand-reviewed queries from the previous note, where Recall@10, MRR and reciprocal-rank fusion are defined, dense search has the highest Recall@10 at 0.724 and the hybrid the lowest at 0.690. But the paired-difference intervals reported in the ablation all cross zero¹: dense over hybrid is +0.034, with an interval from −0.069 to +0.121. The two metrics do not even agree on direction — the hybrid is lowest on Recall@10 and highest on MRR, and that gap is within noise too. At twenty-nine queries there is no ordering to report.
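A minimal sketch of how a paired bootstrap interval of this kind could be computed, in Python. The resample count and seed are the ones given in the footnote; the percentile method, the 95% level, and the name paired_bootstrap_ci are assumptions for illustration, not a record of the code that produced the numbers above.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000,
                        seed=20260531, alpha=0.05):
    """Percentile bootstrap interval for the mean per-query difference (a - b).

    scores_a and scores_b are per-query scores for two arms, in the same
    query order. Returns (observed_mean_difference, lower, upper).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both arms must be scored on the same queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)

    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Assumed: a plain percentile interval at level 1 - alpha.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-query Recall@10 values, for illustration only.
arm_a = [1.0, 0.5, 1.0, 0.0, 1.0, 1.0]
arm_b = [1.0, 1.0, 0.5, 0.0, 1.0, 0.5]
print(paired_bootstrap_ci(arm_a, arm_b))
```

Resampling the per-query differences, rather than each arm separately, keeps the pairing: every resample compares both arms on the same drawn queries. If the returned interval contains zero, the sample is consistent with no difference.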

My three predictions were that lexical search would beat the hybrid on queries with one strong single-arm answer, that dense would beat lexical on broad topic questions, and that the hybrid would win on average while losing on the tails. The middle one held in direction only: dense edged lexical on the broad queries, 0.625 to 0.583, a gap of 0.042 that is well within noise. The “wins on average” prediction is exactly the averaged claim this sample cannot settle. The first prediction is the one that survived, and it survived as a single query.

Two results that survive a small sample

First, the depth-ten ranking does not survive to depth fifty. Measured at Recall@50 the order inverts: the hybrid moves from last to first, 0.966, and dense from first to last, 0.897. I am not claiming the hybrid significantly wins at depth — same small sample — but a ranking that flips when you move the cutoff is not one to trust at either cutoff.

Second, and more durable because it does not lean on an average, is query 12. Its landmark paper for WASP-96b sits at rank 4 in lexical search and rank 338 in dense, effectively invisible to the dense arm. Fusion then buries it anyway. Reciprocal-rank fusion at k=60 rewards papers that both arms return: a paper that one arm ranks, say, 8th and the other 15th scores about 0.028 (1/68 + 1/75), enough to outrank a paper that only one arm rates highly, such as the WASP-96b result at about 0.016 (1/64) from its lexical rank of 4. So for query 12, a paper that lexical search alone would have placed fourth never reaches the hybrid’s top ten. That one query is the cleanest case for keeping a lexical arm, and a concrete reminder that fusion can bury a paper that only one arm is sure of.
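The arithmetic in that paragraph can be checked with a few lines of Python. This is a sketch, not the pilot's code: rrf_score is a hypothetical helper, and it assumes a paper missing from an arm's returned list contributes nothing for that arm, which is how the 0.016 figure treats the dense arm here.

```python
def rrf_score(ranks, k=60):
    """Reciprocal-rank fusion score for one paper.

    ranks holds the paper's 1-based rank in each arm, or None where the
    arm did not return it (assumed to contribute nothing).
    """
    return sum(1.0 / (k + rank) for rank in ranks if rank is not None)


both_arms = rrf_score([8, 15])        # 1/68 + 1/75
lexical_only = rrf_score([4, None])   # 1/64, dense arm absent

print(round(both_arms, 3), round(lexical_only, 3))  # 0.028 0.016
```

Even if the dense list were deep enough to include rank 338, that term would add only 1/398, about 0.0025, leaving the paper near 0.018 and still below the two-arm example.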

What I will actually change

The complementarity is real but thin. Pool the top fifty from each arm and 42 of the 49 relevant documents are found by both; only five are unique to one arm, and of those only query 12’s paper is genuinely invisible to the other arm. The rest sit just past the other arm’s cutoff, at ranks in the fifties to nineties, reachable rather than missed. The remaining two relevant papers were surfaced by neither arm at depth fifty.
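The bookkeeping behind those counts is plain set arithmetic. A sketch in Python, with a hypothetical helper name (pool_overlap) and made-up document IDs in the usage lines; only the four-way split itself reflects the text, where 42 + 5 + 2 accounts for all 49 relevant documents.

```python
def pool_overlap(relevant, dense_top, lexical_top):
    """Split the relevant documents by which arm's top-N list contains them."""
    relevant = set(relevant)
    dense_hits = relevant & set(dense_top)
    lexical_hits = relevant & set(lexical_top)
    return {
        "both": dense_hits & lexical_hits,
        "dense_only": dense_hits - lexical_hits,
        "lexical_only": lexical_hits - dense_hits,
        "neither": relevant - dense_hits - lexical_hits,
    }


# Illustrative IDs only; not documents from the pilot corpus.
split = pool_overlap(
    relevant={"a", "b", "c", "d"},
    dense_top=["a", "b", "x"],
    lexical_top=["b", "c", "y"],
)
print({name: sorted(ids) for name, ids in split.items()})
```

The four groups are disjoint and cover every relevant document, so their sizes should always sum to the size of the answer key.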

I am keeping the hybrid as the stage-one candidate generator regardless. The job of stage one is to land the relevant papers somewhere in the pool a reranker will read, not to order them perfectly — so the number I should be optimising is candidate-set recall at the pool depth, not Recall@10. Recall@10 was the wrong target for a stage that feeds a second stage; it was measuring the reranker’s job before the reranker existed.

This is all still 500 abstracts. The question I am carrying into the corpus-widening work is whether the dense-blind class — papers like query 12’s — grows as the corpus grows, or stays a handful of edge cases. That is the next note.

¹ Each interval comes from ten thousand bootstrap resamples of the per-query scores (seed 20260531). An interval that spans zero means the data are consistent with no difference between the arms.

The label review that lowered my score

2026-05-31T00:00:00+00:00

The first pass at my retrieval pilot scored Recall@10 = 0.812 over sixteen graded queries, against 500 real exoplanet-atmosphere abstracts. Then I reviewed the relevance labels those scores were measured against, and the figure fell to 0.690 across twenty-nine queries. The drop is the part of the exercise I trust.

What the pilot does

AstroLLM is meant to cite real papers instead of inventing them, which makes retrieval the grounding layer: given an astronomy question, return the abstracts most likely to answer it. The pilot corpus is 500 abstracts from NASA ADS — the query abs:"exoplanet atmosphere", 2018 onward, 500 of 4,845 matches. Two engines run in parallel and are then fused: a dense vector search (BGE-small, 384 dimensions, in pgvector) that matches on meaning, and a lexical BM25 index (SQLite FTS5) that matches on words. Reciprocal-rank fusion¹ merges their two ranked lists. I grade the merged list with Recall@10² and MRR³.
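For concreteness, a minimal Python sketch of the two metrics as the footnotes define them. The function names and the list-of-IDs input shape are assumptions, not the pilot's actual code.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the labelled-relevant papers that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        # A query with no labelled answer cannot be scored; drop it upstream.
        raise ValueError("query has no relevant papers")
    return len(relevant & set(ranked[:k])) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant hit, or 0.0 if none is returned."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Average per-query scores into the corpus-wide figure."""
    values = list(values)
    return sum(values) / len(values)
```

In this sketch a query whose labelled papers all miss the cutoff scores 0.0 and still counts toward the mean, while a query with no labelled papers at all is rejected rather than silently scored.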

Why the number moved down

Three things happened in the review, and only the first was flattering. Re-reading query 06, I found I’d labelled a transmission-spectrum paper relevant to a question about dayside thermal emission — a different observable. Removing it pushed the headline up to 0.844. Had I stopped there, the review would have “confirmed” a score better than the one I started with, which is exactly the kind of review to distrust.

Continuing honestly pulled it back. I graded the conceptual queries I’d skipped on the first pass, and I stopped rounding away real misses: query 11 (WASP-121b) has no correct paper in its top ten, which is a 0.00 and belongs in the denominator. Twenty-nine of thirty queries ended up scored — one had no correct paper anywhere in the corpus — and the corpus-wide figure settled at Recall@10 0.690 and MRR 0.623.

Labels track relevance, not rank

The rule I held to: a label records whether a paper answers the query, not whether the ranker found it. Query 08’s top fused result — ranked first by the system — got marked irrelevant, because the abstract did not address what was asked. Queries 03, 05, and 18 stayed at 0.50, genuine partial misses I could have rationalised away. One correction went the other way: I’d overlooked that a sodium-detection paper was a valid second answer for query 07, which raised its reciprocal rank from 0.17 to 1.00. Corrections in both directions, judged from the abstract, never from where the ranker had placed it.

Where it is actually weak

Split by query type, the result that matters shows up. On named-target queries — a specific planet, a specific measurement — Recall@10 is 0.794. On broad known-item queries — topic-shaped questions whose answer key is one or two landmark papers — it is 0.542. Lexical search is good with identifiers like WASP-39b; dense search handles paraphrase; the current hybrid first-stage baseline is weaker on topic-shaped questions with no single string to match. A cross-encoder reranker is the next hypothesis for closing that gap, because it reads the query and abstract together instead of scoring them apart.

Two caveats keep me honest about the 0.542. There are only twelve broad known-item queries and seventeen named-target ones — small on both sides. And because those answer keys are landmark papers rather than exhaustive topical relevance, 0.542 is the optimistic reading; true topical recall is likely worse. The corpus is 500 abstracts, too: an earlier 14-document synthetic fixture scored a perfect 1.000/1.000, which only demonstrates that a number without a real, confusable corpus is theatre.
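One quick consistency check, sketched in Python: if the two subsets partition the twenty-nine scored queries, their query-weighted mean should reproduce the headline Recall@10. The variable names are illustrative, and the inputs are the rounded figures quoted above, so agreement is only expected to about three decimals.

```python
# Subset sizes and Recall@10 values as quoted in the text.
named_n, named_recall = 17, 0.794
broad_n, broad_recall = 12, 0.542

# Query-weighted mean across both subsets.
overall = (named_n * named_recall + broad_n * broad_recall) / (named_n + broad_n)
print(f"{overall:.3f}")  # 0.690
```

The subsets reconcile with the corpus-wide 0.690, which suggests the split is a clean partition of the scored queries rather than an overlapping or partial one.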

Before building the reranker, though, I wanted to know whether hybrid fusion was even the right first-stage baseline. So I ran an ablation, switching each engine off in turn to see what it contributed. That is the next note.

¹ Reciprocal-rank fusion: combine two ranked lists using only each document’s rank in each, scoring 1 ÷ (60 + rank) and summing. Because it ignores the raw similarity scores, it needs no normalisation between the dense and lexical engines.
² Recall@10: of the papers labelled relevant for a query, the fraction that land in the top ten results, averaged over queries. With a single target paper per query it reduces to whether the right paper made the top ten.
³ MRR (mean reciprocal rank): 1 ÷ (rank of the first relevant hit), averaged over queries. First place scores 1.0, third place 0.33, not found 0. It rewards ranking the answer high rather than merely surfacing it.

Nandan Joshi — Notes