Feb 25, 2026 | arXiv:2602.21627

Tokenizing Semantic Segmentation with RLE

Abhineet Singh, Justin Rozeboom, Nilanjan Ray

AI Summary

This paper introduces a novel approach to semantic and panoptic segmentation by framing it as a language modeling task, where segmentation masks are represented as sequences of discrete tokens derived from Run Length Encoding (RLE). The authors modify the Pix2Seq architecture to autoregressively generate these RLE tokens, incorporating strategies to compress token sequence length for efficient video processing and panoptic segmentation. Experiments on image and video segmentation datasets demonstrate competitive performance despite computational limitations.

Key Contribution

Autoregressive language models can now perform semantic and panoptic segmentation by directly predicting run-length encoded mask tokens, offering a unified approach for images and videos.

Abstract

This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq \cite{p2s} to output these RLE tokens through autoregression. We propose novel tokenization strategies to compress the length of the token sequence to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in spite of being bottlenecked by our limited computational resources.
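As an illustration of the general idea in the abstract, the following is a minimal sketch of how a class-id mask could be run-length encoded and mapped to discrete tokens. It is not the paper's tokenization scheme: the function names, the (start, length, class) triple per run, the vocabulary layout, and the `max_len` cap on run length are all assumptions made for this example.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # (flat start index, run length, class id)


def mask_to_runs(mask: List[List[int]], background: int = 0) -> List[Run]:
    """Flatten a 2D class-id mask in row-major order and collect foreground runs."""
    flat = [c for row in mask for c in row]
    runs: List[Run] = []
    i = 0
    while i < len(flat):
        cls = flat[i]
        j = i
        while j < len(flat) and flat[j] == cls:
            j += 1
        if cls != background:  # background pixels are implicit
            runs.append((i, j - i, cls))
        i = j
    return runs


def runs_to_tokens(runs: List[Run], n_pixels: int, max_len: int) -> List[int]:
    """Map runs to integer tokens using a hypothetical vocabulary layout:
    [0, n_pixels)                     start tokens
    [n_pixels, n_pixels + max_len)    length tokens (length 1..max_len)
    [n_pixels + max_len, ...)         class tokens
    Runs longer than max_len are split into several triples."""
    tokens: List[int] = []
    for start, length, cls in runs:
        while length > 0:
            chunk = min(length, max_len)
            tokens += [start, n_pixels + chunk - 1, n_pixels + max_len + cls]
            start += chunk
            length -= chunk
    return tokens


def tokens_to_mask(tokens: List[int], height: int, width: int,
                   max_len: int, background: int = 0) -> List[List[int]]:
    """Invert runs_to_tokens: rebuild the 2D mask from the token triples."""
    n_pixels = height * width
    flat = [background] * n_pixels
    for k in range(0, len(tokens), 3):
        start = tokens[k]
        length = tokens[k + 1] - n_pixels + 1
        cls = tokens[k + 2] - n_pixels - max_len
        flat[start:start + length] = [cls] * length
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Small round-trip check on a 3x4 mask with two foreground classes.
example_mask = [
    [0, 1, 1, 0],
    [0, 0, 2, 2],
    [2, 2, 0, 0],
]
example_tokens = runs_to_tokens(mask_to_runs(example_mask), n_pixels=12, max_len=3)
assert tokens_to_mask(example_tokens, height=3, width=4, max_len=3) == example_mask
```

In this sketch the sequence length grows with the number of runs rather than the number of pixels, which is why shortening the token sequence matters when moving from single images to videos.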

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
