D Gaussian filter to the raw joint sequences. The resulting processed robot dataset is represented as:

$$\mathbf{D}^{\mathrm{robot\_clip}}=\{\boldsymbol{\theta},\dot{\boldsymbol{\theta}},\mathbf{v},\boldsymbol{\omega},\mathbf{c},T\} \tag{8}$$

where $T$ denotes the duration of a single gait cycle.

III-B2 Reference Gait Generation

Given the commanded base velocity $\mathbf{v}=[v_{x},v_{y},\omega]$ and the gait phase $\phi\in[0,1)$, the reference joint trajectory is synthesized through a unified weighted interpolation framework. For each velocity channel $x\in\{v_{x},v_{y},\omega\}$, we determine an interpolation coefficient based on the magnitude of the commanded velocity. Assume that the dataset contains a set of motion templates $\{\theta_{i}(\phi),T_{i}\}$, where $\theta_{i}(\phi)$ denotes the phase-dependent joint trajectory and $T_{i}$ is the corresponding gait period. Given a commanded velocity $u_{x}$, we select its two neighboring nominal velocities $u_{l}$ and $u_{u}$, and define the interpolation factor as

$$\alpha=\mathrm{clip}\left(\frac{|u_{x}|-u_{l}}{u_{u}-u_{l}+\varepsilon},\,0,\,1\right) \tag{9}$$

where $\varepsilon$ is a small constant for numerical stability. Let $T_{l}$ and $T_{u}$ be the gait periods of the templates at $u_{l}$ and $u_{u}$. The commanded gait period $T_{\mathrm{cmd}}$ is obtained via linear interpolation:

$$T_{\mathrm{cmd}}=(1-\alpha)\,T_{l}+\alpha\,T_{u}. \tag{10}$$

The gait phase is then updated according to the normalized phase progression rule:

$$\phi_{t+1}=\left(\phi_{t}+\frac{\Delta t}{T_{\mathrm{cmd}}}\right)\bmod 1. \tag{11}$$

After obtaining the updated phase $\phi$, the reference joint trajectory is synthesized by blending the neighboring motion templates $\theta_{l}$ and $\theta_{u}$:

$$\theta_{d}(\phi)=(1-\alpha)\,\theta_{l}(\phi)+\alpha\,\theta_{u}(\phi). \tag{12}$$

To handle near-zero velocity commands, we introduce a standing threshold $v_{\text{th}}$:

$$\theta_{d}(\phi)=\begin{cases}\theta_{\text{stand}},&\|\mathbf{v}\|\leq v_{\text{th}},\\ (1-\alpha)\,\theta_{l}(\phi)+\alpha\,\theta_{u}(\phi),&\text{otherwise}.\end{cases} \tag{13}$$

In addition, a velocity-dependent stance ratio $\rho(\mathbf{v})$ is defined to construct phase-based contact indicators $r^{L}(\phi)$ and $r^{R}(\phi)$, which are later used for contact supervision and reward formulation. This velocity-conditioned interpolation strategy ensures smooth transitions across different commanded speeds and maintains temporal consistency via unified phase evolution.

III-B3 Gait-aware Reward Design

To encourage the policy to stay consistent with the reference gait over complex terrains, we construct a set of exponential tracking reward terms based on the target joint trajectories and commanded base velocities generated by the reference gait module. Each reward term follows a unified exponential form:

$$r_{i}=\exp(-\lambda_{i}e_{i}), \tag{14}$$

where $e_{i}$ denotes the tracking error of the corresponding physical quantity and $\lambda_{i}$ is a scaling coefficient. All gait-related reward terms are combined as a weighted sum:

$$r_{\text{gait}}=\sum_{i}w_{i}r_{i}, \tag{15}$$

where $w_{i}$ is the weight of each component. The definitions of the individual gait reward terms are summarized in Table II. These reward components constrain the policy from four complementary aspects: pose consistency, velocity matching, dynamic motion trend tracking, and key support joint stabilization. This design allows the robot to preserve the periodic structure imposed by the reference gait generator while maintaining adaptability to complex environments.
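As a rough illustration of Eqs. (9)-(15), the following Python sketch walks through one control step for a single velocity channel: interpolation factor, commanded gait period, phase update, blended joint reference, and the weighted exponential reward. It is a minimal sketch under stated assumptions, not the authors' implementation. The two-joint sinusoidal templates, the nominal speeds and periods, the standing threshold value, and the scaling coefficients in `SCALES` are all hypothetical; only the weights in `WEIGHTS` are taken from Table II. The standing test uses the magnitude of the single channel as a stand-in for the norm of the full command in Eq. (13).

```python
import math

def interpolation_factor(u_cmd, u_lo, u_hi, eps=1e-6):
    """Eq. (9): clipped blend weight between two neighboring nominal speeds."""
    alpha = (abs(u_cmd) - u_lo) / (u_hi - u_lo + eps)
    return min(max(alpha, 0.0), 1.0)

def gait_period(alpha, period_lo, period_hi):
    """Eq. (10): linearly interpolated gait period for the command."""
    return (1.0 - alpha) * period_lo + alpha * period_hi

def advance_phase(phi, dt, period):
    """Eq. (11): normalized phase update, wrapped into [0, 1)."""
    return (phi + dt / period) % 1.0

def reference_joints(phi, u_cmd, alpha, template_lo, template_hi, theta_stand, v_th):
    """Eqs. (12)-(13): blended joint targets, or the standing pose near zero speed."""
    if abs(u_cmd) <= v_th:  # single-channel simplification of ||v|| <= v_th
        return list(theta_stand)
    q_lo, q_hi = template_lo(phi), template_hi(phi)
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(q_lo, q_hi)]

def gait_reward(errors, scales, weights):
    """Eqs. (14)-(15): weighted sum of exponential tracking terms."""
    return sum(weights[k] * math.exp(-scales[k] * errors[k]) for k in weights)

def make_template(amplitude):
    """Hypothetical two-joint sinusoidal template, for illustration only."""
    def template(phi):
        angle = 2.0 * math.pi * phi
        return [amplitude * math.sin(angle), amplitude * math.cos(angle)]
    return template

WEIGHTS = {"pos": 0.10, "vel": 0.05, "delta": 0.05, "ankle": 0.05}  # Table II
SCALES = {"pos": 5.0, "vel": 2.0, "delta": 1.0, "ankle": 5.0}       # hypothetical lambda_i

# One control step for a 0.75 m/s command between templates at 0.5 and 1.0 m/s.
alpha = interpolation_factor(0.75, 0.5, 1.0)
period = gait_period(alpha, 1.0, 0.8)
phi = advance_phase(0.2, 0.02, period)
theta_d = reference_joints(phi, 0.75, alpha, make_template(0.2),
                           make_template(0.4), [0.0, 0.0], 0.05)

# Pretend the measured joints are 0.01 rad off the reference on every joint.
theta = [q + 0.01 for q in theta_d]
errors = {
    "pos": sum((a - b) ** 2 for a, b in zip(theta, theta_d)),
    "vel": 0.0, "delta": 0.0, "ankle": 0.0,
}
r_gait = gait_reward(errors, SCALES, WEIGHTS)
print(alpha, period, phi, theta_d, r_gait)
```

Because every term in Eq. (14) is bounded by 1, the combined reward is bounded by the sum of the weights (0.25 for the Table II values) and reaches that bound only under perfect tracking.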
TABLE II: GAIT-AWARE REWARD COMPONENTS

| Reward | Equation ($e_{i}$) | Weight ($w_{i}$) |
| --- | --- | --- |
| $r_{\text{pos}}$ | $\lVert\boldsymbol{\theta}-\boldsymbol{\theta}_{d}(\phi)\rVert^{2}$ | 0.10 |
| $r_{\text{vel}}$ | $\lVert\mathbf{v}_{b}-\mathbf{v}\rVert^{2}$ | 0.05 |
| $r_{\Delta}$ | $\lVert\Delta\boldsymbol{\theta}-\Delta\boldsymbol{\theta}_{d}(\phi)\rVert_{1}$ | 0.05 |
| $r_{\text{ankle}}$ | $\sum_{j\in\{L,R\}}\left(\theta_{j}-\theta_{d,j}(\phi)\right)^{2}$ | 0.05 |

III-C High-Throughput Training Infrastructure

To address the computational overhead and GPU memory (VRAM) bottlenecks caused by high-dimensional depth perception in massively parallel environments, we developed a systematic engineering optimization framework on the NVIDIA Isaac Sim and Isaac Lab platform. Based on our self-developed ZERITH Z1 humanoid robot model, we constructed the training environment and improved overall system throughput from two perspectives: memory management and rendering strategy.

III-C1 Heterogeneous Observation Buffer Management

VRAM capacity is a primary factor limiting the degree of parallelism (i.e., $N_{\mathrm{env}}$) in reinforcement learning. To relieve the storage pressure of the high-dimensional tensors introduced by depth images, we propose a heterogeneous memory management scheme that decouples physics computation from data caching. Under this mechanism, VRAM serves only as a transient buffer for rendering outputs, while the generated observation tensors are asynchronously transferred to host memory (RAM) for storage and indexing.

Experimental results show that this strategy frees a substantial amount of VRAM. As shown in Table III, on an RTX 4090 (24 GB) platform, the maximum number of parallel environments for vision-based tasks increases from the baseline of 512 to 1024. In a 48 GB VRAM configuration, the parallel scale further extends to 1536 environments, substantially increasing sample throughput during training.
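The idea of keeping VRAM as a transient buffer while observations live in host RAM can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation. The class `HostObservationBuffer`, its ring-buffer layout, the float16 storage type, and the tiny tensor sizes are all hypothetical. It relies only on standard PyTorch behavior: pinned (page-locked) host memory allows `copy_(..., non_blocking=True)` to run asynchronously with respect to the host when the source tensor is on a CUDA device. The sketch falls back to plain CPU tensors when no GPU is available.

```python
import torch

class HostObservationBuffer:
    """Hypothetical ring buffer that keeps depth observations in host RAM."""

    def __init__(self, horizon, num_envs, height, width):
        # Pinned host memory enables asynchronous GPU-to-CPU copies; it is
        # only requested when a CUDA device is present.
        self.storage = torch.empty(
            (horizon, num_envs, height, width),
            dtype=torch.float16,
            pin_memory=torch.cuda.is_available(),
        )
        self.horizon = horizon
        self.step = 0

    def add(self, depth):
        """Copy one step of rendered depth, shape (num_envs, H, W), out of VRAM."""
        slot = self.step % self.horizon
        self.storage[slot].copy_(depth, non_blocking=True)
        self.step += 1

    def sample(self, step_idx, env_idx, device):
        """Gather a minibatch on the host and move only that batch to `device`."""
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # ensure pending async copies have landed
        batch = self.storage[step_idx, env_idx]
        return batch.to(device, dtype=torch.float32, non_blocking=True)

# Toy usage with small, made-up sizes.
device = "cuda" if torch.cuda.is_available() else "cpu"
buf = HostObservationBuffer(horizon=4, num_envs=8, height=6, width=10)
for _ in range(4):
    rendered = torch.rand(8, 6, 10, device=device)  # stand-in for renderer output
    buf.add(rendered)
batch = buf.sample(torch.tensor([0, 3]), torch.tensor([1, 7]), device)
print(batch.shape, buf.storage.device)
```

In this arrangement the GPU holds only the current rendering output and the sampled minibatch, so the rollout horizon no longer scales VRAM usage. The trade-off is extra host-device traffic on every sampling step.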
TABLE III: PARALLEL SCALE UNDER DIFFERENT MEMORY STRATEGIES

GPU Max $N_{\mathrm{env}}$ Storage VRAM 4090 (