<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator>
  <link href="https://nandan.me/writing/notes/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://nandan.me/writing/notes/" rel="alternate" type="text/html" />
  <updated>2026-06-02T12:12:05+00:00</updated>
  <id>https://nandan.me/writing/notes/feed.xml</id>
  <title type="html">Nandan Joshi — Notes</title>
  <subtitle>Shorter informal pieces — paper reading notes, conference reflections, project-status updates.</subtitle>
  
  
  
    <entry>
      <title type="html">The deeper pool that recovered nothing</title>
      <link href="https://nandan.me/writing/notes/deeper-pool-that-recovered-nothing/" rel="alternate" type="text/html" title="The deeper pool that recovered nothing" />
      <published>2026-06-01T00:00:00+00:00</published>
      <updated>2026-06-01T00:00:00+00:00</updated>
      <id>https://nandan.me/writing/notes/deeper-pool-that-recovered-nothing/</id>
      <content type="html" xml:base="https://nandan.me/writing/notes/deeper-pool-that-recovered-nothing/"><![CDATA[<p><a href="/writing/notes/widening-that-lowered-every-score/">The last note</a> ended on a question: when the corpus grew, the relevant papers were still in the candidate pool, but the fused ranking had stopped surfacing them — was that recall lost, or just sitting below a pool fixed too shallow? A deeper pool was the cheap thing to try, so I swept it: 50, 100, 200, 500 candidates per arm, on the same frozen index, everything else held fixed. It recovered nothing where it mattered: the fused top-10 did not move by a single query, even as the candidate union climbed from 0.948 to a perfect 1.000.</p>

<!--more-->

<figure class="figure">
  <img src="/images/notes/astrollm-pool-depth.svg" alt="A retrieval pipeline — dense vector search and lexical BM25 feeding a candidate union, RRF fusion, and the top-10 results a reader sees, with a dashed not-yet-built second-stage reranker — shown beside a pool-depth sweep from 50 to 500 candidates per arm. Candidate-union recall climbs from 0.948 to 1.000 while the fused top-10 stays flat at 0.592 and the fused top-50 moves only slightly and non-monotonically." />
  <figcaption>
    The pool-depth sweep on the frozen 2,500-abstract index. As the pool deepens from 50 to 500 candidates per arm, the candidate union rises to a perfect 1.000 against the frozen labels, while the fused top-10 a reader sees does not move by a single query — 0.592 at every depth — and the fused top-50 shifts only slightly, and not monotonically. The relevant papers are in the pool; the ranking is what keeps them off the top of the list, which is why the next lever is a second-stage reranker rather than a deeper pool.
  </figcaption>
</figure>

<h2 id="what-i-changed">What I changed</h2>

<p>This is the mirror image of the widening experiment. There I changed the corpus and held the method fixed; here I change one number — the per-arm candidate pool depth — and hold the corpus, the embedding, the index, and the fusion code fixed. No re-ingest, no re-embed, no re-index: the 2,500-abstract index from the last note is queried as-is at four depths. The metrics are the ones from <a href="/writing/notes/label-review-that-lowered-my-score/">the first note</a>.</p>

<p>Two guards matter. The harness refuses to write results unless pool 50 reproduces the last note’s numbers exactly, so the pipeline cannot have drifted. And the scored cutoffs, top-10 and top-50, are held fixed while the pool varies, so “fused top-50 recall at pool 200” is measurable; that decoupling is the whole point. I also folded in the deferred reporting split: named targets are scored as recall and topical ones are reported as coverage, by a rule committed before any of these numbers existed.</p>

<h2 id="what-depth-bought">What depth bought</h2>

<p>Across all 29 queries:</p>

<table>
  <thead>
    <tr>
      <th>pool</th>
      <th>candidate union</th>
      <th>fused top-10</th>
      <th>fused top-50</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>50</td>
      <td>0.948</td>
      <td>0.592</td>
      <td>0.793</td>
    </tr>
    <tr>
      <td>100</td>
      <td>0.966</td>
      <td>0.592</td>
      <td>0.862</td>
    </tr>
    <tr>
      <td>200</td>
      <td>0.966</td>
      <td>0.592</td>
      <td>0.828</td>
    </tr>
    <tr>
      <td>500</td>
      <td>1.000</td>
      <td>0.592</td>
      <td>0.845</td>
    </tr>
  </tbody>
</table>

<p>The union — what the two arms find between them — rises with depth, as it must: a deeper pool admits more candidates. The fused top-10 is the flat column. It is 0.592 at every depth; not one query’s top-10 changes between pool 50 and pool 500. The fused top-50 barely moves, and not even monotonically — it rises at pool 100, then falls back at 200. So the gap between what the candidate set contains and what the fusion surfaces does not close as the pool deepens. At pool 500 the candidate set holds every labelled paper, and the fused top-50 still leaves the same 0.155 it left at pool 50. The only real gain is the small fused top-50 bump at pool 100; past that, depth buys nothing.</p>
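
<p>The gap can be recomputed directly from the table. The sketch below is only a consistency check on the published numbers; the names <code class="language-plaintext highlighter-rouge">sweep</code> and <code class="language-plaintext highlighter-rouge">fusion_gap</code> are illustrative and not part of the harness.</p>

```python
# Values copied from the pool-depth table above (all 29 queries).
sweep = {
    50: {"union": 0.948, "top10": 0.592, "top50": 0.793},
    100: {"union": 0.966, "top10": 0.592, "top50": 0.862},
    200: {"union": 0.966, "top10": 0.592, "top50": 0.828},
    500: {"union": 1.000, "top10": 0.592, "top50": 0.845},
}


def fusion_gap(row):
    """Recall the candidate set holds but the fused top-50 does not surface."""
    return round(row["union"] - row["top50"], 3)


gaps = {pool: fusion_gap(row) for pool, row in sweep.items()}
for pool, gap in gaps.items():
    print(f"pool {pool}: union minus fused top-50 = {gap:.3f}")
```

<p>The gap is 0.155 at pool 50 and 0.155 again at pool 500; it narrows only at pool 100 (0.104) before widening at pool 200 (0.138).</p>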

<h2 id="why-deeper-didnt-help">Why deeper didn’t help</h2>

<p>The last note asked whether sweeping the pool would recover the lost recall or merely relocate the question. It relocated it. The candidates are present; the fusion demotes them. A gold paper sitting mid-rank in both arms earns only small reciprocal-rank terms from RRF and lands below the cut no matter how many candidates are admitted around it.</p>
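
<p>A toy version of that mechanism, as a sketch under stated assumptions: the rankings are made up, the two arms share no documents other than the gold paper, and <code class="language-plaintext highlighter-rouge">rrf_fuse</code> is a hypothetical helper rather than the fusion code used here. It keeps the pool depth and the scored cutoff as separate knobs.</p>

```python
def rrf_fuse(dense_ranked, lexical_ranked, pool, k=60):
    """Fuse the top `pool` results of each arm by reciprocal rank (k=60)."""
    scores = {}
    for ranked in (dense_ranked, lexical_ranked):
        for rank, doc in enumerate(ranked[:pool], start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; the sort is stable, so ties keep insertion order.
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings (made up): the gold paper sits at rank 151 in both arms,
# and the two arms otherwise return disjoint distractors.
dense = [f"d{i}" for i in range(150)] + ["gold"] + [f"d{i}" for i in range(150, 499)]
lexical = [f"x{i}" for i in range(150)] + ["gold"] + [f"x{i}" for i in range(150, 499)]

for pool in (50, 100, 200, 500):
    fused = rrf_fuse(dense, lexical, pool)
    in_union = "gold" in fused        # candidate-set membership
    in_top50 = "gold" in fused[:50]   # what the scored cutoff sees
    print(pool, in_union, in_top50)
```

<p>In this toy the gold paper enters the candidate set once the pool reaches 200, yet it never reaches the fused top-50: two terms of 1/211 sum to less than the single term earned by any distractor ranked 45th or better in one arm.</p>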

<p>At pool 500, seven queries have their relevant paper in the candidate set yet missing from the fused top-50. One is the WASP-96b paper that lexical alone surfaced in <a href="/writing/notes/headline-that-was-within-noise/">the ablation note</a> — still in the pool, still left out. Several of the others sit in <em>both</em> arms’ top-500 and are demoted anyway, the cleanest version of the problem: no single-arm blindness to blame. The complementarity counts do shift with depth — papers once exclusive to one arm migrate into both as the pool grows — but that redistribution does nothing for what fusion surfaces. The candidate set and the ranking are decoupled, and it is the ranking that binds.</p>

<h2 id="the-arm-edges-didnt-move-either">The arm edges didn’t move either</h2>

<p>I had a second prediction: as the pool deepened and fusion mattered less, the margins between arms would compress. They did not compress; they did not move at all. The dense−hybrid and lexical−hybrid Recall@10 margins, with their bootstrap intervals and leave-one-out checks, are identical at every depth.<sup id="fnref:ci" role="doc-noteref"><a href="#fn:ci" class="footnote" rel="footnote">1</a></sup> The fragile edge from the last note stays fragile: hybrid over dense survives the leave-one-out drop for 25 of the 29 queries and collapses when any one of the same four is dropped. The robust one stays robust: hybrid over lexical survives all 29 drops. Those margins live in the top-10 fused ranking, which a deeper pool did not touch.</p>

<h2 id="the-lever">The lever</h2>

<p>So the lever is not the pool. Against the frozen labels, the candidate set is already saturated — a perfect 1.000 for the named queries at every depth, and 1.000 across all 29 by pool 500. What decides whether a user sees the relevant paper in the top ten is RRF’s ordering, which a deeper pool did not change at the top-10. The intervention that can turn candidate recall into top-k recall is a second stage that reranks the candidates a fixed pool already holds — the cross-encoder reranker I flagged as the next hypothesis in the first note. Whether it actually recovers the demoted papers is the next experiment, not a foregone conclusion.</p>

<p>If I stay on RRF in the meantime, the sweep sets an operating point: pool 100, which captures the one recoverable piece of fused recall — a query or two — at about 157 candidates per query. Pool 200 and 500 fuse three to seven hundred candidates and buy nothing, sometimes less. That is the whole return on a deeper pool, and the case for not spending the next week on a bigger net. The papers the labels call relevant are already in the net; the next thing worth building is the stage that reads them.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:ci" role="doc-endnote">
      <p>Paired-difference bootstrap, 10,000 resamples, seed 20260531; leave-one-out drops each query in turn and recomputes the interval to check it still excludes zero. <a href="#fnref:ci" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content>
      <author>
        <name>Nandan Joshi</name>
      </author>
      <category term="AstroLLM" /><category term="Retrieval" /><category term="Evaluation" /><category term="Fusion" />
      <summary type="html">When the corpus grew, the relevant papers stayed in the candidate pool but the fused ranking stopped surfacing them. I deepened the pool from 50 to 500 candidates per arm; the candidate union reached a perfect 1.000 and the fused top-10 did not change by a single query.</summary>
    </entry>
  
    <entry>
      <title type="html">The widening that lowered every score</title>
      <link href="https://nandan.me/writing/notes/widening-that-lowered-every-score/" rel="alternate" type="text/html" title="The widening that lowered every score" />
      <published>2026-06-01T00:00:00+00:00</published>
      <updated>2026-06-01T00:00:00+00:00</updated>
      <id>https://nandan.me/writing/notes/widening-that-lowered-every-score/</id>
      <content type="html" xml:base="https://nandan.me/writing/notes/widening-that-lowered-every-score/"><![CDATA[<p>I expected the bigger corpus to help. <a href="/writing/notes/headline-that-was-within-noise/">The ablation note</a> closed on a falsifiable question: as the corpus grows, does the class of papers only one retrieval arm can find grow with it, or stay the singleton I had seen at 500 abstracts? To check, I widened the corpus fivefold, held the method fixed, and re-ran the same three arms on the same queries with the same labels. Every score on those queries got worse. One thing did move in the architecture’s favour: the gap a second retrieval stage feeds on.</p>

<!--more-->

<h2 id="what-i-changed">What I changed</h2>

<p>The pilot drew 500 abstracts from a single ADS phrase query. I kept the query, the 2018 year floor, and the embedding, index, and fusion code all identical, and only took a deeper citation-ranked cut: 2,500 abstracts, with the original 500 preserved as a byte-identical subset. The metrics are the ones defined in <a href="/writing/notes/label-review-that-lowered-my-score/">the first note</a>.</p>

<p>One honest caveat up front: a deeper cut of the same query admits younger, less-cited papers, so corpus <em>size</em> and corpus <em>recency</em> move together here by construction. The 2,000 new abstracts skew hard towards 2024–2026. This experiment holds the method fixed; it does not separate those two axes from each other.</p>

<h2 id="what-got-worse">What got worse</h2>

<p>On the same 29 queries, with the same gold labels, every arm lost recall and rank. Recall@10 fell from 0.724 to 0.529 for dense, 0.713 to 0.437 for lexical, and 0.690 to 0.592 for hybrid; MRR and Recall@50 fell in step. The proximate cause is mechanical: 2,000 more papers push the relevant ones deeper, and a candidate pool fixed at 50 is now 2% of the index rather than 10%.</p>

<p>The arm ordering flips. In the pilot, hybrid had the <em>lowest</em> Recall@10; on the widened corpus it has the highest on every split and degrades least. Lexical collapses hardest — what you would expect when BM25’s OR-of-tokens matching meets five times as many token-matching distractors.</p>

<p>These are honest deltas, because the queries and labels did not change. What I did <em>not</em> do is re-review the gold set for the larger corpus, so these are not the widened corpus’s final recall — only its recall against the pilot’s labels.<sup id="fnref:relabel" role="doc-noteref"><a href="#fn:relabel" class="footnote" rel="footnote">1</a></sup> That relabel is deferred.</p>

<h2 id="what-got-better">What got better</h2>

<p>One thing moved the other way. The premium the pooled candidate set holds over dense alone — the recall a reranker would gain from seeing both arms rather than one — grew from +0.069 to +0.172 at Recall@50. In absolute terms the union barely moved, from 0.966 to 0.948, while each single arm fell much further, towards ~0.78–0.81. The class of papers only one arm catches grew from 5 to 15, in both directions.</p>

<p>So the case for a two-stage design — both arms feeding a later reranker — got stronger, not weaker, as the corpus grew. I want to be precise about how much. Hybrid beating lexical at Recall@10 is robust: the gap survives dropping any single query.<sup id="fnref:ci" role="doc-noteref"><a href="#fn:ci" class="footnote" rel="footnote">2</a></sup> Hybrid beating <em>dense</em> is not — that edge rests on four queries and dissolves if you drop any one of them, so I read it as suggestive only. The durable claim is the pool-level union premium, which does not rest on a handful of top-10 query wins. And the mechanism behind the growth was not the one I had guessed: the single-arm class grew because recent newcomers displace older relevant papers out of one arm’s pool, not because new papers arrive that one arm is blind to.</p>

<h2 id="a-premise-that-broke">A premise that broke</h2>

<p>I had expected widening to pull two earlier coverage misses into the corpus. It cannot, and not for lack of depth. One target’s abstracts never contain the exact phrase the query requires; the other’s canonical papers predate the 2018 floor. The whole phrase universe is 6,492 papers, and 2,500 already samples over half of the post-2018 slice. These misses are a query-recall problem, not a corpus-size one, and they need a different lever: widening the query, not the corpus.</p>

<h2 id="the-open-question">The open question</h2>

<p>The pilot had hybrid winning at depth 50; that inversion is gone. Hybrid’s fused top-50 now sits at 0.793 while the union it draws from holds 0.948 — the fusion is leaving recall on the table that the candidate set still contains. With the pool fixed at 2% of the index, this experiment cannot say whether that recall is genuinely lost or just sitting below a pool that is now too shallow; the pool was held fixed by design. Whether sweeping it recovers the recall or merely relocates the question is <a href="/writing/notes/deeper-pool-that-recovered-nothing/">the next note</a>.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:relabel" role="doc-endnote">
      <p>Recall is reported only where the relevant set is much smaller than the cutoff; where a named entity matches dozens of in-corpus papers, Recall@k stops measuring ranking and is reported as coverage instead. <a href="#fnref:relabel" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:ci" role="doc-endnote">
      <p>10,000 bootstrap resamples, seed 20260531; “robust” means the 95% paired-difference interval excludes zero under every leave-one-out drop. <a href="#fnref:ci" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content>
      <author>
        <name>Nandan Joshi</name>
      </author>
      <category term="AstroLLM" /><category term="Retrieval" /><category term="Evaluation" /><category term="Corpus" />
      <summary type="html">I made the corpus five times bigger expecting better coverage. Every score on the same queries got worse, and the gap I had already decided to build on grew wider.</summary>
    </entry>
  
    <entry>
      <title type="html">The headline that was within noise</title>
      <link href="https://nandan.me/writing/notes/headline-that-was-within-noise/" rel="alternate" type="text/html" title="The headline that was within noise" />
      <published>2026-05-31T00:00:00+00:00</published>
      <updated>2026-05-31T00:00:00+00:00</updated>
      <id>https://nandan.me/writing/notes/headline-that-was-within-noise/</id>
      <content type="html" xml:base="https://nandan.me/writing/notes/headline-that-was-within-noise/"><![CDATA[<p>Before running the ablation on my retrieval pilot, I wrote down three predictions about which of the three arms — dense vector search, lexical BM25, or their reciprocal-rank fusion — would retrieve best. Writing them first was the useful part: it stopped me from reading the results as a ranking when the data could not support one.</p>

<!--more-->

<h2 id="no-ranking-the-data-supports">No ranking the data supports</h2>

<p>The obvious question — which arm wins Recall@10 — has a numerical answer and no defensible one. Across the same twenty-nine hand-reviewed queries from <a href="/writing/notes/label-review-that-lowered-my-score/">the previous note</a>, where Recall@10, MRR and reciprocal-rank fusion are defined, dense search has the highest Recall@10 at 0.724 and the hybrid the lowest at 0.690. But the paired-difference intervals reported in the ablation all cross zero<sup id="fnref:ci" role="doc-noteref"><a href="#fn:ci" class="footnote" rel="footnote">1</a></sup>: dense over hybrid is +0.034, with an interval from −0.069 to +0.121. The two metrics do not even agree on direction — the hybrid is lowest on Recall@10 and highest on MRR, and that gap is within noise too. At twenty-nine queries there is no ordering to report.</p>
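
<p>For readers who want the shape of that interval, here is a minimal sketch of a paired-difference bootstrap. The resample count and seed match the footnote; the percentile method, the standard-library random generator, the function name <code class="language-plaintext highlighter-rouge">paired_bootstrap_ci</code>, and the per-query scores are all assumptions for illustration, not the pilot’s code or data.</p>

```python
import random
from statistics import fmean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, seed=20260531, alpha=0.05):
    """Percentile bootstrap interval for the mean per-query difference a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample queries with replacement, keeping each query's pair together.
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return fmean(diffs), lo, hi


# Made-up per-query Recall@10 values for two arms (not the pilot's data).
dense = [1.0, 1.0, 0.5, 0.0, 1.0, 1.0, 0.5, 1.0, 0.0, 1.0]
hybrid = [1.0, 0.5, 0.5, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.5]

mean_diff, lo, hi = paired_bootstrap_ci(dense, hybrid)
print(f"dense minus hybrid: {mean_diff:+.3f}, 95% interval [{lo:+.3f}, {hi:+.3f}]")
```

<p>Resampling the per-query differences, rather than each arm separately, is what makes the interval paired: every resample compares the two arms on the same queries.</p>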

<figure class="figure">
  <img src="/images/notes/recall-forest-within-noise.svg" alt="Forest plot of pairwise Recall@10 differences over 29 queries with 95% bootstrap confidence intervals. Dense minus hybrid is +0.034 and lexical minus hybrid is +0.023; both intervals straddle the zero line." />
  <figcaption>
    Pairwise Recall@10 differences on the 500-abstract pilot, 29 queries, with 95% paired-difference bootstrap intervals (10,000 resamples, seed 20260531). Dense edges hybrid by +0.034 and lexical edges hybrid by +0.023, but every interval crosses zero — at this sample size the aggregate ranking of the three arms is within noise, which is why the decision below rests on the findings that survive it rather than on which arm sits highest.
  </figcaption>
</figure>

<p>My three predictions were that lexical search would beat the hybrid on queries with one strong single-arm answer, that dense would beat lexical on broad topic questions, and that the hybrid would win on average while losing on the tails. The middle one held in direction only — dense edged lexical on the broad queries, 0.625 to 0.583, well inside its interval. The “wins on average” prediction is exactly the averaged claim this sample cannot settle. The first prediction is the one that survived, and it survived as a single query.</p>

<h2 id="two-results-that-survive-a-small-sample">Two results that survive a small sample</h2>

<p>First, the depth-ten ranking does not survive to depth fifty. Measured at Recall@50 the order inverts: the hybrid moves from last to first, 0.966, and dense from first to last, 0.897. I am not claiming the hybrid significantly wins at depth — same small sample — but a ranking that flips when you move the cutoff is not one to trust at either cutoff.</p>

<p>Second, and more durable because it does not lean on an average, is query 12. Its landmark paper for WASP-96b sits at rank 4 in lexical search and rank 338 in dense — effectively invisible to the dense arm. Fusion then buries it anyway. Reciprocal-rank fusion at k=60 rewards papers that both arms return: a paper one arm ranks, say, 8th and the other 15th scores about 0.028 (1/68 + 1/75), enough to outrank a paper only one arm rates highly — like the WASP-96b result, about 0.016 from its lexical rank of 4. So for query 12 a paper lexical search alone would have placed fourth never reaches the hybrid’s top ten. That one query is the cleanest case for keeping a lexical arm, and a concrete reminder that fusion can bury a paper exactly one arm is sure of.</p>
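
<p>The arithmetic in that paragraph can be checked in a few lines. This is a sketch, with <code class="language-plaintext highlighter-rouge">rrf_score</code> as an illustrative name; it assumes the dense rank of 338 falls outside the candidate pool and so contributes nothing to the fused score.</p>

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (k + rank) over the arms that returned the paper."""
    return sum(1.0 / (k + rank) for rank in ranks)


both_arms = rrf_score([8, 15])  # ranked 8th by one arm, 15th by the other
lexical_only = rrf_score([4])   # rank 4 in lexical, absent from the dense pool

print(f"both arms:    {both_arms:.3f}")     # 0.028
print(f"lexical only: {lexical_only:.3f}")  # 0.016
```

<p>A paper needs only middling ranks in both arms to outscore one that a single arm places fourth.</p>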

<h2 id="what-i-will-actually-change">What I will actually change</h2>

<p>The complementarity is real but thin. Pool the top fifty from each arm and 42 of the 49 relevant documents are found by both; only five are unique to one arm, and of those only query 12’s paper is genuinely invisible to the other arm. The rest sit just past the cutoff, at ranks in the fifties to nineties, reachable rather than missed. The remaining two relevant papers were surfaced by neither arm at depth fifty.</p>

<p>I am keeping the hybrid as the stage-one candidate generator regardless. The job of stage one is to land the relevant papers somewhere in the pool a reranker will read, not to order them perfectly — so the number I should be optimising is candidate-set recall at the pool depth, not Recall@10. Recall@10 was the wrong target for a stage that feeds a second stage; it was measuring the reranker’s job before the reranker existed.</p>

<p>This is all still 500 abstracts. The question I am carrying into the corpus-widening work is whether the dense-blind class — papers like query 12’s — grows as the corpus grows, or stays a handful of edge cases. That is <a href="/writing/notes/widening-that-lowered-every-score/">the next note</a>.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:ci" role="doc-endnote">
      <p>Each interval comes from ten thousand bootstrap resamples of the per-query scores (seed 20260531). An interval that spans zero means the data are consistent with no difference between the arms. <a href="#fnref:ci" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content>
      <author>
        <name>Nandan Joshi</name>
      </author>
      <category term="AstroLLM" /><category term="Retrieval" /><category term="Evaluation" /><category term="Ablation" />
      <summary type="html">I wrote down three predictions about which retrieval arm would win, then watched the aggregate Recall@10 ranking dissolve into noise at twenty-nine queries. The findings that held up were single queries, not averages.</summary>
    </entry>
  
    <entry>
      <title type="html">The label review that lowered my score</title>
      <link href="https://nandan.me/writing/notes/label-review-that-lowered-my-score/" rel="alternate" type="text/html" title="The label review that lowered my score" />
      <published>2026-05-31T00:00:00+00:00</published>
      <updated>2026-05-31T00:00:00+00:00</updated>
      <id>https://nandan.me/writing/notes/label-review-that-lowered-my-score/</id>
      <content type="html" xml:base="https://nandan.me/writing/notes/label-review-that-lowered-my-score/"><![CDATA[<p>The first pass at my retrieval pilot scored Recall@10 = 0.812 over sixteen graded queries, against 500 real exoplanet-atmosphere abstracts. Then I reviewed the relevance labels those scores were measured against, and the figure fell to 0.690 across twenty-nine queries. The drop is the part of the exercise I trust.</p>

<!--more-->

<h2 id="what-the-pilot-does">What the pilot does</h2>

<p>AstroLLM is meant to cite real papers instead of inventing them, which makes retrieval the grounding layer: given an astronomy question, return the abstracts most likely to answer it. The pilot corpus is 500 abstracts from NASA ADS — the query <code class="language-plaintext highlighter-rouge">abs:"exoplanet atmosphere"</code>, 2018 onward, 500 of 4,845 matches. Two engines run in parallel and are then fused: a dense vector search (BGE-small, 384 dimensions, in pgvector) that matches on meaning, and a lexical BM25 index (SQLite FTS5) that matches on words. Reciprocal-rank fusion<sup id="fnref:rrf" role="doc-noteref"><a href="#fn:rrf" class="footnote" rel="footnote">1</a></sup> merges their two ranked lists. I grade the merged list with Recall@10<sup id="fnref:recall" role="doc-noteref"><a href="#fn:recall" class="footnote" rel="footnote">2</a></sup> and MRR<sup id="fnref:mrr" role="doc-noteref"><a href="#fn:mrr" class="footnote" rel="footnote">3</a></sup>.</p>
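
<p>The two metrics are small enough to sketch in full. The functions below follow the definitions in the footnotes; the function names and the toy document ids are made up for illustration and are not the pilot’s harness.</p>

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the labelled-relevant papers that appear in the top k."""
    hits = set(ranked[:k]).intersection(relevant)
    return len(hits) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant hit, or 0.0 if none is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Toy run (made-up ids): each query has a ranked list and a gold set.
runs = [
    (["p3", "p7", "p1"], {"p1"}),        # gold paper in third place
    (["p9", "p2", "p5"], {"p4", "p2"}),  # one of two gold papers found
]
print("Recall@10:", mean(recall_at_k(r, g) for r, g in runs))
print("MRR:", mean(reciprocal_rank(r, g) for r, g in runs))
```

<p>Both metrics are averaged over queries, so a query with no relevant paper in its top ten contributes a zero rather than dropping out of the denominator.</p>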

<h2 id="why-the-number-moved-down">Why the number moved down</h2>

<p>Three things happened in the review, and only the first was flattering. Re-reading query 06, I found I’d labelled a transmission-spectrum paper relevant to a question about dayside thermal emission — a different observable. Removing it pushed the headline up to 0.844. Had I stopped there, the review would have “confirmed” a score better than the one I started with, which is exactly the kind of review to distrust.</p>

<p>Continuing honestly pulled it back. I graded the conceptual queries I’d skipped on the first pass, and I stopped rounding away real misses: query 11 (WASP-121b) has no correct paper in its top ten, which is a 0.00 and belongs in the denominator. Twenty-nine of thirty queries ended up scored — one had no correct paper anywhere in the corpus — and the corpus-wide figure settled at Recall@10 0.690 and MRR 0.623.</p>

<h2 id="labels-track-relevance-not-rank">Labels track relevance, not rank</h2>

<p>The rule I held to: a label records whether a paper answers the query, not whether the ranker found it. Query 08’s top fused result — ranked first by the system — got marked irrelevant, because the abstract did not address what was asked. Queries 03, 05, and 18 stayed at 0.50, genuine partial misses I could have rationalised away. One correction went the other way: I’d overlooked that a sodium-detection paper was a valid second answer for query 07, which raised its reciprocal rank from 0.17 to 1.00. Corrections in both directions, judged from the abstract, never from where the ranker had placed it.</p>

<h2 id="where-it-is-actually-weak">Where it is actually weak</h2>

<p>Split by query type, the result that matters shows up. On named-target queries — a specific planet, a specific measurement — Recall@10 is 0.794. On broad known-item queries — topic-shaped questions whose answer key is one or two landmark papers — it is 0.542. Lexical search is good with identifiers like WASP-39b; dense search handles paraphrase; the current hybrid first-stage baseline is weaker on topic-shaped questions with no single string to match. A cross-encoder reranker is the next hypothesis for closing that gap, because it reads the query and abstract together instead of scoring them apart.</p>

<p>Two caveats keep me honest about the 0.542. There are only twelve broad known-item queries and seventeen named-target ones — small on both sides. And because those answer keys are landmark papers rather than exhaustive topical relevance, 0.542 is the optimistic reading; true topical recall is likely worse. The corpus is 500 abstracts, too: an earlier 14-document synthetic fixture scored a perfect 1.000/1.000, which only demonstrates that a number without a real, confusable corpus is theatre.</p>
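
<p>As a consistency check, and assuming the two splits partition the 29 scored queries, the query-weighted mean of the split figures should land on the corpus-wide Recall@10. The variable names below are illustrative.</p>

```python
# Split sizes and Recall@10 values as reported in this note.
splits = {
    "named-target": (17, 0.794),
    "broad known-item": (12, 0.542),
}

total_queries = sum(n for n, _ in splits.values())
pooled = sum(n * recall for n, recall in splits.values()) / total_queries

print(total_queries, round(pooled, 3))
```

<p>That gives 29 queries and a pooled value that rounds to 0.690, matching the headline figure.</p>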

<p>Before building the reranker, though, I wanted to know whether hybrid fusion was even the right first-stage baseline. So I ran an ablation, switching each engine off in turn to see what it contributed. That is the <a href="/writing/notes/headline-that-was-within-noise/">next note</a>.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:rrf" role="doc-endnote">
      <p>Reciprocal-rank fusion — combine two ranked lists using only each document’s rank in each, scoring 1 ÷ (60 + rank) and summing. Because it ignores the raw similarity scores, it needs no normalisation between the dense and lexical engines. <a href="#fnref:rrf" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:recall" role="doc-endnote">
      <p>Recall@10 — of the papers labelled relevant for a query, the fraction that land in the top ten results, averaged over queries. With a single target paper per query it reduces to whether the right paper made the top ten. <a href="#fnref:recall" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:mrr" role="doc-endnote">
      <p>MRR, mean reciprocal rank — 1 ÷ (rank of the first relevant hit), averaged over queries: first place scores 1.0, third place 0.33, not found 0. It rewards ranking the answer high rather than merely surfacing it. <a href="#fnref:mrr" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content>
      <author>
        <name>Nandan Joshi</name>
      </author>
      <category term="AstroLLM" /><category term="Retrieval" /><category term="Evaluation" />
      <summary type="html">A retrieval pilot over 500 real exoplanet papers scored Recall@10 in the low 0.8s; reviewing my own relevance labels pulled it down to 0.69. The drop is the part worth trusting.</summary>
    </entry>
  
  
</feed>
