This paper benchmarks the performance of several proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) large language models on the task of temporal text classification (TTC), which involves estimating the publication date of texts. The study evaluates zero-shot, few-shot prompting, and fine-tuning approaches. Results show that proprietary models excel, particularly with few-shot prompting, while fine-tuning significantly enhances open-source models, although they still underperform compared to proprietary alternatives.
Despite their general prowess, open-source LLMs still lag behind proprietary models in the nuanced task of dating texts, even after fine-tuning.
Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models, but that they still fail to match the performance delivered by proprietary LLMs.
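As a rough illustration of the prompting settings the abstract describes, the sketch below shows how a zero-shot or few-shot TTC prompt might be assembled and how a decade-level answer could be parsed from a model's reply. This is a minimal, hypothetical sketch: the prompt wording, the decade-level granularity, and the helper names are assumptions for illustration, not the paper's actual protocol.

```python
import re

def build_ttc_prompt(text, examples=()):
    """Assemble a TTC prompt. With no examples this is zero-shot;
    passing (passage, decade) pairs makes it few-shot.
    Hypothetical wording, not taken from the paper."""
    lines = [
        "Task: estimate the decade in which the passage was written.",
        "Answer with a single decade, e.g. 1870s.",
        "",
    ]
    for passage, decade in examples:  # few-shot exemplars, if any
        lines += [f"Passage: {passage}", f"Decade: {decade}", ""]
    lines += [f"Passage: {text}", "Decade:"]
    return "\n".join(lines)

def parse_decade(completion):
    """Extract the first decade-like token (e.g. '1920s') from a reply."""
    m = re.search(r"\b(1[0-9]{3}|20[0-9]{2})s\b", completion)
    return m.group(0) if m else None
```

The prompt string would then be sent to whichever LLM is under evaluation, with `parse_decade` normalizing free-form replies into a single label for scoring.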