02 Aug 2025

Baby_llm

Intro

The SGLang Runtime consists of three main components: TokenizerManager, Scheduler, and DetokenizerManager.
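To make the data flow concrete, here is a toy sketch of a three-stage pipeline: text is tokenized, handed to a scheduler that would run the model, and the output token ids are turned back into text. This is not SGLang code. The function names, the tiny vocabulary, and the "reverse the input" stand-in for the model are all assumptions made for illustration.

```python
from queue import Queue

# Hypothetical toy vocabulary; a real tokenizer is far richer.
VOCAB = {"<eos>": 0, "hello": 1, "world": 2}
INV_VOCAB = {idx: word for word, idx in VOCAB.items()}


def tokenizer_stage(text, to_scheduler):
    """Turn text into token ids and hand them to the scheduler."""
    token_ids = [VOCAB[word] for word in text.split()]
    to_scheduler.put(token_ids)


def scheduler_stage(to_scheduler, to_detokenizer):
    """Take one request, 'run the model', and forward the output ids."""
    token_ids = to_scheduler.get()
    # Stand-in for a model forward pass: reverse the input and append EOS.
    output_ids = list(reversed(token_ids)) + [VOCAB["<eos>"]]
    to_detokenizer.put(output_ids)


def detokenizer_stage(to_detokenizer):
    """Turn output token ids back into text, dropping EOS."""
    output_ids = to_detokenizer.get()
    return " ".join(INV_VOCAB[i] for i in output_ids if i != VOCAB["<eos>"])


def run_pipeline(text):
    """Run one request through the three stages in order."""
    to_scheduler, to_detokenizer = Queue(), Queue()
    tokenizer_stage(text, to_scheduler)
    scheduler_stage(to_scheduler, to_detokenizer)
    return detokenizer_stage(to_detokenizer)


print(run_pipeline("hello world"))
```

The stages only talk through queues, so each one could in principle live in its own process without changing the others.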

Overall

[WIP] Key Features

Dynamic Feedback Schedule
Chunked Prefill
DP/TP Implementation
Computation & Communication Overlap
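As a rough illustration of the chunked prefill idea from the list above: a long prompt is prefilled in fixed-size pieces instead of one large forward pass, so a single long request does not block everything else. The sketch below only computes the chunk boundaries. `chunk_prefill` and `chunk_size` are hypothetical names, not SGLang's API.

```python
def chunk_prefill(prompt_len, chunk_size):
    """Split a prompt of `prompt_len` tokens into [start, end) chunks.

    Each chunk has at most `chunk_size` tokens; the last may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


# A 10-token prompt with a 4-token chunk size needs three prefill steps.
print(chunk_prefill(10, 4))
```

Between two chunks, a scheduler is free to run decode steps for other requests, which is where the latency benefit comes from.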

Takeaways

Challenge: Need to allocate variable-length memory space for dynamic workloads
Solution: Treat memory as a pageable, shareable asset
Challenge: Need to batch process requests with different completion times
Solution: Continuous batching and dynamic scheduling
Challenge: Need to reuse computation across requests that share a prefix or otherwise overlap
Solution: Prefix caching and radix-based sharing
Challenge: Need to accelerate processing under uncertainty
Solution: Speculative decoding and fused operations
Challenge: Need to scale across distributed resources efficiently
Solution: Tensor and pipeline parallelism with minimal communication
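The prefix-caching takeaway can be sketched with a small lookup structure: store the token sequences already processed, then ask how many leading tokens of a new request are already cached. For simplicity this uses a plain token-level trie built from nested dicts, not a compressed radix tree, and it tracks no KV memory at all. `PrefixCache` and its methods are hypothetical names for illustration.

```python
class PrefixCache:
    """Token-level trie that remembers which prefixes have been seen."""

    def __init__(self):
        # Each node is a dict mapping a token id to its child node.
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse its prefix."""
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])           # e.g. a shared system prompt
hit = cache.match_prefix([1, 2, 3, 9, 9])
print(hit)                           # only the uncached tail needs prefill
```

A request with `hit` cached tokens only has to prefill the remaining `len(tokens) - hit` tokens, which is the saving that prefix sharing is after.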

Reference

SGLang source code up to commit 0346489: https://github.com/sgl-project/sglang
https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/sglang/kvcache-code-walk-through/readme-CN.md