ZTab, a domain-based zero-shot framework, tackles the problem of automatically detecting semantic column types in relational tables without relying on user-provided labeled data. The method generates pseudo-tables based on predefined semantic types and sample schemas within a specified domain, then fine-tunes an annotation LLM on this synthetic data. Experiments demonstrate that ZTab achieves a balance between zero-shot capabilities and annotation performance by allowing for either a "universal domain" or a more "specialized domain" tailored to specific applications.
Achieve strong zero-shot performance in semantic column type detection by fine-tuning LLMs on synthetically generated pseudo-tables tailored to specific domains.
This study addresses the challenge of automatically detecting semantic column types in relational tables, a key task in many real-world applications. Zero-shot modeling eliminates the need for user-provided labeled training data, making it ideal for scenarios where data collection is costly or restricted due to privacy concerns. However, existing zero-shot models suffer from poor performance when the number of semantic column types is large, limited understanding of tabular structure, and privacy risks arising from dependence on high-performance closed-source LLMs. We introduce ZTab, a domain-based zero-shot framework that addresses both performance and zero-shot requirements. Given a domain configuration consisting of a set of predefined semantic types and sample table schemas, ZTab generates pseudo-tables for the sample schemas and fine-tunes an annotation LLM on them. ZTab is domain-based zero-shot in that it does not depend on user-specific labeled training data; therefore, no retraining is needed for a test table from a similar domain. We describe three cases of domain-based zero-shot. The domain configuration of ZTab provides a trade-off between the extent of zero-shot and annotation performance: a "universal domain" that contains all semantic types approaches "pure" zero-shot, while a "specialized domain" that contains semantic types for a specific application enables better zero-shot performance within that domain. Source code and datasets are available at https://github.com/hoseinzadeehsan/ZTab
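The pipeline described in the abstract (domain configuration, then pseudo-tables, then fine-tuning data) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `DomainConfig`, `generate_pseudo_table`, `to_training_examples`, and `build_dataset` are hypothetical, the prompt format is invented for illustration, and cell values are drawn from small hand-written pools, whereas the abstract does not say how ZTab produces cell values. The fine-tuning step itself is omitted.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DomainConfig:
    # Hypothetical structure: each predefined semantic type maps to a small
    # pool of example values (an assumption made only for this sketch).
    semantic_types: Dict[str, List[str]]
    # Sample table schemas: each schema is an ordered list of semantic types.
    sample_schemas: List[List[str]]


def generate_pseudo_table(schema, config, n_rows, rng):
    """Build one pseudo-table (list of rows) whose columns follow `schema`."""
    for sem_type in schema:
        if sem_type not in config.semantic_types:
            raise ValueError(f"unknown semantic type: {sem_type}")
    return [
        [rng.choice(config.semantic_types[sem_type]) for sem_type in schema]
        for _ in range(n_rows)
    ]


def to_training_examples(schema, rows):
    """Turn each column of a pseudo-table into a (prompt, label) pair."""
    examples = []
    for col_idx, label in enumerate(schema):
        column = [row[col_idx] for row in rows]
        prompt = "Column values: " + ", ".join(column) + "\nSemantic type:"
        examples.append({"prompt": prompt, "label": label})
    return examples


def build_dataset(config, tables_per_schema=2, n_rows=3, seed=0):
    """Generate synthetic fine-tuning data for every sample schema."""
    rng = random.Random(seed)
    dataset = []
    for schema in config.sample_schemas:
        for _ in range(tables_per_schema):
            rows = generate_pseudo_table(schema, config, n_rows, rng)
            dataset.extend(to_training_examples(schema, rows))
    return dataset


# A tiny "specialized domain": few semantic types, two sample schemas.
config = DomainConfig(
    semantic_types={
        "name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
        "city": ["Paris", "Tokyo", "Lagos"],
        "country": ["France", "Japan", "Nigeria"],
        "birth_date": ["1815-12-10", "1912-06-23", "1906-12-09"],
    },
    sample_schemas=[["name", "city", "country"], ["name", "birth_date"]],
)
dataset = build_dataset(config)
```

In this sketch, moving toward a "universal domain" would amount to enlarging `semantic_types` and `sample_schemas`, while a "specialized domain" keeps both small and application-specific. No user-labeled tables enter the dataset at any point, which is the sense in which the setup is zero-shot.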