Shorter, lower-overhead pieces: paper reading notes, conference reflections, and project-status updates from QMI Lab and AstroLLM. Long-form essays live at /writing/.

Subscribe via the notes feed.

  • The deeper pool that recovered nothing

    When the corpus grew, the relevant papers stayed in the candidate pool but the fused ranking stopped surfacing them. I deepened the pool from 50 to 500 candidates per arm; the candidate union reached a perfect 1.000, and the fused top-10 did not change for a single query.

  • The widening that lowered every score

    I made the corpus five times bigger, expecting better coverage. Every score on the same queries got worse, and the gap I had already decided to build on grew wider.

  • The headline that was within noise

    I wrote down three predictions about which retrieval arm would win, then watched the aggregate Recall@10 ranking dissolve into noise at twenty-nine queries. The findings that held up were single queries, not averages.

  • The label review that lowered my score

    A retrieval pilot over 500 real exoplanet papers scored Recall@10 in the low 0.8s; reviewing my own relevance labels pulled it down to 0.69. The drop is the part worth trusting.