The widening that lowered every score

I made the corpus five times bigger expecting better coverage. Every score on the same queries got worse, and the gap I had already decided to build on grew wider.

The deeper pool that recovered nothing

When the corpus grew, the relevant papers stayed in the candidate pool but the fused ranking stopped surfacing them. I deepened the pool from 50 to 500 candidates per arm; the candidate union reached a perfect 1.000 and the fused top-10 did not change by a single query.

The label review that lowered my score

A retrieval pilot over 500 real exoplanet papers scored Recall@10 in the low 0.8s; reviewing my own relevance labels pulled it down to 0.69. The drop is the part worth trusting.

The headline that was within noise

I wrote down three predictions about which retrieval arm would win, then watched the aggregate Recall@10 ranking dissolve into noise at twenty-nine queries. The findings that held up were single queries, not averages.