The paper introduces MOOSE-Star, a framework for directly training LLMs to model the generative reasoning process in scientific discovery, $P(h|b)$, which is typically intractable due to combinatorial complexity. MOOSE-Star achieves tractability by decomposing the problem into subtasks, using motivation-guided hierarchical search for logarithmic retrieval complexity, and employing bounded composition for robustness. The authors also release TOMATO-Star, a dataset of 108,717 decomposed scientific papers, and demonstrate that MOOSE-Star exhibits continuous test-time scaling, overcoming the complexity wall faced by brute-force sampling.
LLMs can now directly model the generative reasoning process for scientific discovery, thanks to a complexity-breaking framework that reduces exponential search to logarithmic.
While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework enabling tractable training and scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Furthermore, we show that while brute-force sampling hits a "complexity wall," MOOSE-Star exhibits continuous test-time scaling.
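To make the gap between $O(N^k)$ and $O(\log N)$ concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes a knowledge base of `n` items, a hypothesis composed from `k` inspirations, and, purely for illustration, a balanced search tree with a fixed `branching` factor in which each retrieval is a greedy root-to-leaf descent. The function names and the tree layout are hypothetical; the abstract does not specify how MOOSE-Star organizes its hierarchy.

```python
def brute_force_candidates(n: int, k: int) -> int:
    """Count ordered k-tuples of inspirations drawn from n items: n**k."""
    return n ** k


def tree_depth(n: int, branching: int) -> int:
    """Smallest depth d with branching**d >= n, using integer arithmetic."""
    depth, capacity = 0, 1
    while capacity < n:
        capacity *= branching
        depth += 1
    return depth


def hierarchical_evaluations(n: int, k: int, branching: int) -> int:
    """Count node scores for k sequential greedy descents of the tree.

    Assumption: each descent scores `branching` children per level and
    follows only the best one, so one retrieval costs branching * depth.
    """
    return k * branching * tree_depth(n, branching)


if __name__ == "__main__":
    k, branching = 3, 10  # illustrative values, not taken from the paper
    for n in (1_000, 1_000_000):
        brute = brute_force_candidates(n, k)
        hier = hierarchical_evaluations(n, k, branching)
        print(f"n={n}: brute force={brute}, hierarchical={hier}")
```

Under these assumptions, growing the knowledge base from a thousand to a million items multiplies the brute-force count by a factor of $10^9$ but only doubles the hierarchical count, which is the kind of scaling the abstract describes as the best case. Real retrieval would also have to handle wrong turns in the descent, which is presumably where the bounded-composition component comes in.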