Mar 18, 2026

arXiv:2603.17737

Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design

Oksana Kolomenko, Ricardo Knauer, Erik Rodner

AI Summary

This paper benchmarks 256 LLM-based embedding pipelines for tabular prediction, varying preprocessing, embedding models, and downstream models. The study finds that pipeline design significantly impacts performance, with concatenation of embeddings generally outperforming replacement of original columns. Larger embedding models tend to perform better, while leaderboard rankings are unreliable performance indicators, and gradient boosting decision trees are effective downstream models.

Key Contribution

Larger embedding models and concatenating embeddings with the original columns tend to improve LLM-based tabular prediction, whereas public leaderboard rankings and model popularity are poor guides for choosing an embedding model.

Abstract

Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet, there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends strongly on the specific pipeline design. In general, concatenating embeddings with the original columns tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor performance indicators. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.
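
The two design choices the abstract compares, concatenating embeddings versus replacing the original columns, and the size of the benchmark grid (8 x 16 x 2 = 256) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: `toy_embed` is a hash-based stand-in for a real LLM embedding model, `serialize` is one possible way to turn a row into text, and the grid entries are placeholder labels rather than the strategies and models actually benchmarked.

```python
import hashlib
from itertools import product


def toy_embed(text: str, dim: int = 4) -> list[float]:
    """Hypothetical stand-in for an LLM embedding model.

    Hashes the text and scales the first `dim` bytes to [0, 1].
    A real pipeline would call an actual embedding model here.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def serialize(row: dict[str, float]) -> str:
    """Assumed serialization: render each column as 'name is value'."""
    return ", ".join(f"{name} is {value}" for name, value in row.items())


def build_features(row: dict[str, float], mode: str, dim: int = 4) -> list[float]:
    """Build the feature vector passed to the downstream model."""
    embedding = toy_embed(serialize(row), dim)
    if mode == "replace":
        # Original columns are dropped; only the embedding is kept.
        return embedding
    if mode == "concatenate":
        # Original columns are kept and the embedding is appended.
        return list(row.values()) + embedding
    raise ValueError(f"unknown mode: {mode}")


# Placeholder labels only; the paper's actual strategies and models differ.
preprocessing = [f"preprocessing_{i}" for i in range(8)]
embedders = [f"embedding_model_{i}" for i in range(16)]
downstream = ["downstream_model_a", "downstream_model_b"]
grid = list(product(preprocessing, embedders, downstream))

row = {"age": 52.0, "creatinine": 1.2}
replaced = build_features(row, "replace")
concatenated = build_features(row, "concatenate")
print(len(grid), len(replaced), len(concatenated))
```

In the concatenated variant the downstream model still sees the raw values alongside the embedding, which is one plausible reason it would be the more robust option: the original signal is never discarded.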

Eval Frameworks & Benchmarks, Natural Language Processing, Recommendation & Information Retrieval

Year: 2026

Venue: N/A
