Huang Yizhuo

02 Aug 2025

Baby_llm

Intro

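The SGLang Runtime consists of three main components: TokenizerManager, Scheduler, and DetokenizerManager. The TokenizerManager tokenizes incoming requests, the Scheduler batches them and drives model execution, and the DetokenizerManager turns output token IDs back into text.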

sglang_arch.png
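
To make the flow between the three components concrete, here is a minimal sketch of a request passing through them. Everything in it is an assumption for illustration: the `Toy*` classes, the two-word vocabulary, and the "model" that simply echoes its input are all hypothetical, and the sketch runs in a single process for simplicity. It is not SGLang's actual code.

```python
from collections import deque

# All names below are hypothetical toys, not SGLang's real classes or API.
VOCAB = {"hello": 1, "world": 2}
INV_VOCAB = {tok_id: word for word, tok_id in VOCAB.items()}


class ToyTokenizerManager:
    """Turns request text into token IDs."""

    def tokenize(self, rid, text):
        return rid, [VOCAB[word] for word in text.split()]


class ToyScheduler:
    """Queues tokenized requests and runs them in batches."""

    def __init__(self):
        self.waiting = deque()

    def add_request(self, request):
        self.waiting.append(request)

    def run_batch(self):
        # Drain the queue into one batch. The "model" here just echoes the
        # input IDs, standing in for a real forward pass.
        batch = list(self.waiting)
        self.waiting.clear()
        return [(rid, list(token_ids)) for rid, token_ids in batch]


class ToyDetokenizerManager:
    """Turns output token IDs back into text."""

    def detokenize(self, rid, token_ids):
        return rid, " ".join(INV_VOCAB[t] for t in token_ids)


def serve(prompts):
    tokenizer = ToyTokenizerManager()
    scheduler = ToyScheduler()
    detokenizer = ToyDetokenizerManager()
    for rid, text in enumerate(prompts):
        scheduler.add_request(tokenizer.tokenize(rid, text))
    outputs = scheduler.run_batch()
    return dict(detokenizer.detokenize(rid, ids) for rid, ids in outputs)


print(serve(["hello world", "world"]))  # {0: 'hello world', 1: 'world'}
```

The point of the split is that text handling (tokenize, detokenize) stays out of the scheduling loop, so the Scheduler only ever deals with token IDs.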

Overview

sglang_req1.png

image.png

[WIP] Key Features

  • Dynamic Feedback Schedule
  • Chunked Prefill
  • DP/TP Implementation
  • Computation & Communication Overlap
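
As a rough illustration of the chunked prefill idea from the list above: instead of prefilling a long prompt in one large step, the prompt is split into chunks so that each step stays within a token budget shared with requests that are already decoding. The sketch below is a simplified model under stated assumptions: the function name `plan_prefill_chunks`, the fixed per-step `token_budget`, and the rule that each decoding request costs one token per step are all hypothetical choices for illustration, not SGLang's actual scheduling policy.

```python
def plan_prefill_chunks(prompt_len, num_decoding, token_budget):
    """Split a prompt of `prompt_len` tokens into per-step prefill chunks.

    Assumption: each step may process at most `token_budget` tokens, and
    every request that is currently decoding uses one token of that budget.
    """
    room = token_budget - num_decoding
    if room <= 0:
        raise ValueError("token budget leaves no room for prefill")

    chunks = []
    remaining = prompt_len
    while remaining > 0:
        take = min(room, remaining)  # fill whatever room is left this step
        chunks.append(take)
        remaining -= take
    return chunks


# A 10-token prompt, 2 requests already decoding, 6 tokens per step:
print(plan_prefill_chunks(10, 2, 6))  # [4, 4, 2]
```

With this split, the two decoding requests keep producing a token on every step while the new prompt is prefilled over three steps, rather than stalling behind one large prefill.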

Takeaways

  • Challenge: Need to allocate variable-length memory space for dynamic workloads
  • Solution: Treat memory as a pageable, shareable asset
  • Challenge: Need to batch requests that finish at different times
  • Solution: Continuous batching and dynamic scheduling
  • Challenge: Need to reuse computation across requests with overlapping or shared prefixes
  • Solution: Prefix caching and radix-based sharing
  • Challenge: Need to accelerate processing under uncertainty
  • Solution: Speculative decoding and fused operations
  • Challenge: Need to scale across distributed resources efficiently
  • Solution: Tensor and pipeline parallelism with minimal communication
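
To illustrate the prefix caching takeaway, here is a minimal sketch of looking up how much of a new request is already cached. It uses a plain token-level trie, which is an uncompressed simplification of a radix tree, and it only tracks which token sequences have been seen; a real cache would also hold the KV data for each node and evict entries. The names `ToyPrefixCache`, `insert`, and `match_prefix` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token ID -> TrieNode


class ToyPrefixCache:
    """Token-level trie; a simplified stand-in for a radix tree."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a token sequence whose computation is now cached."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4])               # e.g. a shared system prompt
print(cache.match_prefix([1, 2, 3, 9]))  # 3
```

A request that matches 3 tokens only needs to prefill the remainder of its prompt, which is where the saving comes from when many requests share a long prefix.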

Reference