May 5, 2026

arXiv:2605.04265

Benchmarking open-source tools for in silico antiviral drug discovery

AI Summary

This paper surveys open-source datasets and computational tools, including AI-based systems and docking tools, for in silico antiviral drug discovery. The authors curated a custom dataset of 43,005 viral protein-ligand binding measurements, addressing data quality issues such as unsplit polyprotein sequences. A benchmark of 15 open-source binding affinity prediction tools found that Boltz-2 and DrugFormDTA performed best among ML-based approaches and GNINA performed best among docking approaches, with performance varying across viral proteins. Fine-tuning DrugFormDTA on the custom dataset improved its correlation from 0.5 to 0.7.

Key Contribution

A substantial share of public antiviral binding data (31% of the viral protein binding data in BindingDB) is unsuitable for training or testing ML models until polyprotein sequences are carefully split. Fine-tuning DrugFormDTA on the cleaned dataset raised its correlation from 0.5 to 0.7.
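To make the idea of polyprotein splitting concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each mature protein is annotated with 1-based, inclusive start and end positions on the polyprotein. The `Chain` class, the `split_polyprotein` function, and the toy sequence and coordinates are all hypothetical placeholders, not real viral data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    """Hypothetical annotation of one mature protein within a polyprotein."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def split_polyprotein(sequence: str, chains: list[Chain]) -> dict[str, str]:
    """Return the sub-sequence for each annotated chain.

    Raises ValueError if a chain's coordinates fall outside the sequence.
    """
    mature = {}
    for chain in chains:
        if not (1 <= chain.start <= chain.end <= len(sequence)):
            raise ValueError(f"Chain {chain.name!r} has invalid coordinates")
        # Convert 1-based inclusive coordinates to a Python slice.
        mature[chain.name] = sequence[chain.start - 1 : chain.end]
    return mature


# Toy placeholder data (made up for illustration only).
toy_polyprotein = "MAAAAKKKKKCCCCCCDDD"
toy_chains = [
    Chain("protein_A", 1, 5),
    Chain("protein_B", 6, 10),
    Chain("protein_C", 11, 16),
]

mature = split_polyprotein(toy_polyprotein, toy_chains)
for name, seq in mature.items():
    print(name, seq)
```

With a split like this, a binding measurement can be paired with the sequence of the mature protein the ligand actually targets instead of the full polyprotein.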

Abstract

Antivirals are uniquely positioned to be deployed quickly during a new outbreak, especially when repurposed from approved drugs. Yet there are no FDA-approved antivirals for the majority of viral families with pandemic potential. Here we lay out the case for investing in technologies and techniques for antiviral drug discovery and designing antiviral combinations. We present a survey of open-source datasets and computational tools for in silico antiviral drug discovery, with a particular focus on the latest AI-based systems and docking tools. We then present our custom dataset of 43,005 viral protein-ligand binding measurements that we curated from BindingDB and other sources. Importantly, we found that 31% of viral protein binding data in BindingDB required polyprotein sequences to be carefully split before the data were suitable for training or testing ML models. Using our custom dataset we fine-tuned the DrugFormDTA binding affinity prediction model (Khokhlov et al. 2025). We then benchmarked 15 open-source binding affinity prediction tools on a custom test set of 853 antiviral compounds spread across 16 different protein targets from 10 virus species. Models tested include Boltz-2, GNINA, FlowDock, Interformer, AutoDock-GPU, and others. We found that Boltz-2 and DrugFormDTA ranked highest overall among ML-based approaches, and GNINA did best among docking approaches, with notable variance across specific viral proteins. Fine-tuning DrugFormDTA on our custom cleaned antiviral dataset boosted performance from $r=0.5$ to $r=0.7$. As part of this work we also compiled a library of approved drugs and a comprehensive list of investigational and approved antiviral drugs that can be viewed at https://antivirals-database.radvac.org. Together, this work provides a foundation for future work towards new tools and platforms for rapid drug repurposing and rapid design of antiviral combinations.
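Because the abstract reports both an overall correlation and variance across viral proteins, a per-target breakdown is a natural way to read such a benchmark. The sketch below is an illustration only. It assumes $r$ denotes the Pearson correlation between measured and predicted affinities, which the abstract does not state explicitly. The function names and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def per_target_r(records: list[tuple[str, float, float]]) -> dict[str, float]:
    """Group (target, measured, predicted) records and score each target."""
    grouped: dict[str, tuple[list[float], list[float]]] = defaultdict(
        lambda: ([], [])
    )
    for target, measured, predicted in records:
        grouped[target][0].append(measured)
        grouped[target][1].append(predicted)
    return {t: pearson_r(m, p) for t, (m, p) in grouped.items()}


# Toy placeholder records: (target name, measured affinity, predicted affinity).
toy_records = [
    ("target_1", 5.0, 5.5), ("target_1", 6.0, 6.5), ("target_1", 7.0, 7.5),
    ("target_2", 5.0, 7.0), ("target_2", 6.0, 6.0), ("target_2", 7.0, 5.0),
]

scores = per_target_r(toy_records)
for target, r in scores.items():
    print(f"{target}: r = {r:.2f}")
```

In this toy example one target is predicted well and the other poorly, so a pooled score over all records would hide the difference that the per-target scores expose.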

Eval Frameworks & Benchmarks, Open-Source Models & Weights, Scientific Discovery & Drug Design


Year: 2026

Venue: N/A
