Search papers, labs, and topics across Lattice.
TurboMem is a lock-free memory pool designed to improve packet processing performance in DPDK by addressing scalability limits of traditional memory allocators. It leverages atomic stacks, per-core local caches, and Transparent Huge Page (THP) auto-merging via `madvise(MADV_HUGEPAGE)` to reduce lock contention, cache-coherence overhead, and TLB pressure. Mock benchmarks show TurboMem achieves up to 28% higher throughput and 41% fewer TLB misses compared to standard DPDK mempools with explicit huge pages, suggesting THP auto-merging can outperform manual huge page management.
Ditch manual huge page configuration: TurboMem's lock-free design and transparent huge page auto-merging can boost packet throughput by up to 28% in DPDK.
High-speed packet processing on multicore CPUs places extreme demands on memory allocators. In systems like DPDK, fixed-size memory pools back packet buffers (mbufs) to avoid costly dynamic allocation. However, even DPDK's optimized mempool faces scalability limits: lock contention on the shared ring, cache-coherence ping-pong between cores, and heavy TLB pressure from thousands of small pages. To mitigate these issues, DPDK typically uses explicit huge pages (2 MB or 1 GB) for its memory pools. This reduces TLB misses but requires manual configuration and can lead to fragmentation and inflexibility. We propose TurboMem, a novel C++ template-based memory pool that addresses these challenges. TurboMem combines a fully lock-free design (using atomic stacks and per-core local caches) with Transparent Huge Page (THP) auto merging. By automatically promoting pools to 2 MB pages via madvise(MADV_HUGEPAGE), TurboMem achieves the benefits of huge pages without manual setup. We also enforce strict NUMA locality and CPU affinity, so each core allocates and frees objects from its local node. Using Intel VTune on a single-socket 100 Gbps testbed, we show that TurboMem boosts packet throughput by up to 28% while reducing TLB misses by 41% compared to a standard DPDK mempool with explicit huge pages. These results demonstrate that THP auto-merging can outperform manually reserved huge pages in low-fragmentation scenarios, and that modern C++ lock-free programming yields practical gains in data-plane software. Note: The performance claims reported in this preliminary version (up to 28% higher throughput and 41% fewer TLB misses) are based on mock benchmarks. Comprehensive real-system evaluations using Intel VTune are currently underway and will be presented in a future revision.