This paper introduces SentGuard, a sentence-level streaming guardrail for large language models that moderates content in real time by grouping streamed tokens into sentence chunks. A lightweight waiting buffer lets SentGuard assess safety risk at sentence boundaries, so it can intervene earlier than response-level methods, which wait for the full output, and more stably than token-level methods, which act on incomplete semantics. In experiments, SentGuard detects 90.5% of unsafe cases within two sentences while keeping the streaming false-positive rate at 7.41%.
SentGuard detects 90.5% of unsafe cases within two sentences, enabling real-time moderation of streaming large language models.
Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
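The waiting-buffer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stream_with_guard`, the `is_safe` callback, and the punctuation-based sentence boundary rule are all assumptions made for illustration, and the guard here runs sequentially rather than in parallel with decoding.

```python
import re
from typing import Callable, Iterable, Iterator

# Assumed boundary rule (not from the paper): a sentence ends at
# '.', '!' or '?' followed by one whitespace character.
_SENTENCE_END = re.compile(r"[.!?]\s")


def stream_with_guard(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
) -> Iterator[str]:
    """Yield sentence chunks that passed the guard; stop at the first unsafe one.

    `is_safe` is a hypothetical stand-in for the guard model. It receives
    the full prefix (all released text plus the candidate chunk).
    """
    buffer = ""  # tokens waiting to form a complete sentence
    prefix = ""  # text already verified and released
    for token in tokens:
        buffer += token
        # Release every complete sentence currently held in the buffer.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            chunk, buffer = buffer[: match.end()], buffer[match.end():]
            if not is_safe(prefix + chunk):
                return  # intervene: nothing further reaches the user
            prefix += chunk
            yield chunk
    # End of stream: check whatever remains as a final chunk.
    if buffer and is_safe(prefix + buffer):
        yield buffer


def toy_guard(text: str) -> bool:
    """Toy guard for demonstration only: flags any prefix containing 'bad'."""
    return "bad" not in text


released = list(
    stream_with_guard(
        ["Hello", " world", ".", " This", " is", " bad", ".", " More", "."],
        toy_guard,
    )
)
print(released)
```

In this sketch the user only ever sees chunks that the guard has already approved, which mirrors the "releases only verified chunks" behavior. A faithful version would run the guard concurrently with generation so that verification overlaps with decoding of the next sentence.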