This paper introduces the Multi-Scale Reversible Chaos Game Representation (MS-RCGR), an encoding framework that transforms biological sequences into multi-resolution geometric representations while guaranteeing reversibility. Using rational arithmetic and hierarchical k-mer decomposition, MS-RCGR generates scale-invariant features that improve classification performance across traditional machine learning, computer vision, and hybrid approaches. In experiments on synthetic DNA and protein datasets, the hybrid method combining MS-RCGR features with pre-trained language model embeddings outperforms either component used alone.
MS-RCGR preserves complete sequence information and improves classification performance across the three analytical paradigms evaluated: feature-based machine learning, image-based computer vision, and hybrid models.
Interpretable classification of biological sequences remains a challenging task. To address it, we introduce a novel encoding framework, Multi-Scale Reversible Chaos Game Representation (MS-RCGR), that transforms biological sequences into multi-resolution geometric representations with guaranteed reversibility. Unlike traditional sequence encoding methods, MS-RCGR employs rational arithmetic and hierarchical k-mer decomposition to generate scale-invariant features that preserve complete sequence information while enabling diverse analytical approaches. Our framework bridges three distinct paradigms for sequence analysis: (1) traditional machine learning using extracted geometric features, (2) computer vision models operating on CGR-generated images, and (3) hybrid approaches combining protein language model embeddings with CGR features. Through comprehensive experiments on synthetic DNA and protein datasets encompassing seven distinct sequence classes, we demonstrate that MS-RCGR features consistently enhance classification performance across all three paradigms. Notably, our hybrid approach combining pre-trained language model embeddings (ESM2, ProtT5) with MS-RCGR features outperforms either method alone. Because the encoding is reversible, no information is lost during transformation, while multi-scale analysis captures patterns ranging from individual nucleotides to complex motif structures. Our results indicate that MS-RCGR provides a flexible, interpretable, and high-performing foundation for biological sequence analysis.
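To make the reversibility claim concrete, the following is a minimal sketch of a classical Chaos Game Representation computed with exact rational arithmetic, plus a simple multi-resolution grid count. It is not the authors' MS-RCGR implementation: the corner assignment, the function names (`cgr_encode`, `cgr_decode`, `multiscale_counts`), and the choice of 2^k x 2^k grids as the "scales" are all assumptions made for illustration. The point it shows is that, with `Fraction` instead of floating point, the final CGR coordinate together with the sequence length is enough to recover the original DNA sequence exactly.

```python
from fractions import Fraction

# Assumed corner layout (a common CGR convention, not taken from the paper).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
CORNER_TO_BASE = {corner: base for base, corner in CORNERS.items()}
HALF = Fraction(1, 2)


def cgr_encode(seq):
    """Return the final CGR point of `seq` as exact fractions, plus its length."""
    x = y = HALF
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y, len(seq)


def cgr_decode(x, y, length):
    """Invert cgr_encode: recover the sequence from the final point and length."""
    bases = []
    for _ in range(length):
        # The quadrant containing the point identifies the most recent base.
        cx = 1 if x > HALF else 0
        cy = 1 if y > HALF else 0
        bases.append(CORNER_TO_BASE[(cx, cy)])
        # Undo the halving step to obtain the previous point.
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(bases))


def multiscale_counts(seq, scales=(1, 2, 3)):
    """For each k, count CGR points per cell of a 2^k x 2^k grid.

    A point is counted only once at least k bases have been read, so each
    occupied cell corresponds to one k-mer (illustrative multi-scale view).
    """
    grids = {}
    for k in scales:
        size = 2 ** k
        counts = {}
        x = y = HALF
        for i, base in enumerate(seq, start=1):
            cx, cy = CORNERS[base]
            x, y = (x + cx) / 2, (y + cy) / 2
            if i >= k:
                cell = (int(x * size), int(y * size))
                counts[cell] = counts.get(cell, 0) + 1
        grids[k] = counts
    return grids


if __name__ == "__main__":
    x, y, n = cgr_encode("ACGT")
    print(x, y, n)
    print(cgr_decode(x, y, n))
    print(multiscale_counts("ACGTACGT"))
```

Floating-point coordinates would lose low-order bits after a few dozen bases, which is why this sketch relies on exact fractions; the cost is that numerators and denominators grow by one bit per base, so a practical encoder would likely need to chunk long sequences.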