This paper investigates methods for measuring the relatedness of scientific publications using controlled vocabularies, addressing a limitation of Salton's cosine similarity: it only considers exact term matches. The authors introduce and compare two alternatives, soft cosine and maximum term similarities, which account for semantic similarity between non-matching terms. Results on the TREC 2006 Genomics Track dataset show that soft cosine similarity is more accurate than Salton's cosine at assigning high relatedness scores to publication pairs within the same topic and low scores to pairs from separate topics.
Ditch exact-match cosine similarity for controlled vocabularies: soft cosine similarity gives you a more accurate measure of scientific publication relatedness.
Measuring the relatedness between scientific publications is essential in many areas of bibliometrics and science policy. Controlled vocabularies provide a promising basis for measuring relatedness and are widely used in combination with Salton's cosine similarity. The latter is problematic because it only considers exact matches between terms. This article introduces two alternative methods - soft cosine and maximum term similarities - that account for the semantic similarity between non-matching terms. The article compares the accuracy of all three methods using the assignment of publications to topics in the TREC 2006 Genomics Track and the assumption that accurate relatedness measures should assign high relatedness scores to publication pairs within the same topic and low scores to pairs from separate topics. Results show that soft cosine is the most accurate method, while the most widely used version of Salton's cosine is markedly less accurate than the other methods tested. These findings have implications for how controlled vocabularies should be used to measure relatedness.
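To make the contrast between exact-match and soft matching concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used form of soft cosine, in which the dot products in the cosine formula are weighted by a term-term similarity matrix. The three-term vocabulary, the similarity values, and the function names are hypothetical and chosen only for illustration; the article may define and estimate term similarity differently.

```python
from math import sqrt


def weighted_dot(a, b, sim):
    """Sum of a[i] * sim[i][j] * b[j] over all term pairs (i, j)."""
    return sum(
        a[i] * sim[i][j] * b[j]
        for i in range(len(a))
        for j in range(len(b))
    )


def soft_cosine(a, b, sim):
    """Soft cosine: cosine similarity with a term-term similarity matrix."""
    return weighted_dot(a, b, sim) / sqrt(
        weighted_dot(a, a, sim) * weighted_dot(b, b, sim)
    )


def salton_cosine(a, b):
    """Salton's cosine is the special case where only identical terms match."""
    identity = [[1.0 if i == j else 0.0 for j in range(len(a))]
                for i in range(len(a))]
    return soft_cosine(a, b, identity)


# Hypothetical vocabulary of three controlled terms.
# Terms 0 and 1 are assumed to be semantically close (similarity 0.8);
# term 2 is assumed unrelated to both.
term_similarity = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Each publication is indexed with exactly one term.
pub_a = [1.0, 0.0, 0.0]
pub_b = [0.0, 1.0, 0.0]
pub_c = [0.0, 0.0, 1.0]

print("Salton a-b:", salton_cosine(pub_a, pub_b))
print("Soft   a-b:", soft_cosine(pub_a, pub_b, term_similarity))
print("Soft   a-c:", soft_cosine(pub_a, pub_c, term_similarity))
```

In this toy setup, publications a and b share no term, so Salton's cosine treats them as unrelated, while soft cosine gives them credit for being indexed with similar terms. Publications a and c remain unrelated under both measures.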