Baby_llm
Intro
The SGLang Runtime consists mainly of three components: TokenizerManager, Scheduler, and DetokenizerManager.
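To make the three-stage flow concrete, here is a minimal toy sketch of a tokenize, schedule, detokenize pipeline. It does not mirror the SGLang code: the stage functions, the toy vocabulary, the "model" that simply reverses its input, and the use of threads with `queue.Queue` are all assumptions chosen to keep the example self-contained.

```python
import queue
import threading

VOCAB = ["hello", "world", "again"]  # toy vocabulary (assumption)
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}
STOP = None  # sentinel that tells the next stage to shut down


def tokenizer_stage(requests, to_scheduler):
    """Text -> token ids (stand-in for TokenizerManager)."""
    for text in requests:
        to_scheduler.put([TOKEN_ID[w] for w in text.split()])
    to_scheduler.put(STOP)


def scheduler_stage(to_scheduler, to_detokenizer):
    """Token ids -> output ids (stand-in for Scheduler).

    The hypothetical 'model' here just reverses the input ids.
    """
    while (ids := to_scheduler.get()) is not STOP:
        to_detokenizer.put(list(reversed(ids)))
    to_detokenizer.put(STOP)


def detokenizer_stage(to_detokenizer, results):
    """Output ids -> text (stand-in for DetokenizerManager)."""
    while (ids := to_detokenizer.get()) is not STOP:
        results.append(" ".join(VOCAB[i] for i in ids))


def run_pipeline(requests):
    """Run the three stages concurrently, connected by FIFO queues."""
    to_scheduler, to_detokenizer, results = queue.Queue(), queue.Queue(), []
    stages = [
        threading.Thread(target=tokenizer_stage, args=(requests, to_scheduler)),
        threading.Thread(target=scheduler_stage, args=(to_scheduler, to_detokenizer)),
        threading.Thread(target=detokenizer_stage, args=(to_detokenizer, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results


outputs = run_pipeline(["hello world", "world again hello"])
```

Because each stage only talks to its neighbor through a queue, tokenization of the next request can proceed while the scheduler stage is still busy with the previous one.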
Overall
[WIP] Key Features
- Dynamic Feedback Schedule
- Chunked Prefill
- DP/TP Implementation
- Computation & Communication Overlap
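As an illustration of the chunked prefill idea, the sketch below splits a long prompt into chunks so that each step stays within a fixed token budget while ongoing decode requests keep producing one token per step. The function name `schedule_steps` and its parameters are hypothetical, and the budget policy is an assumption for illustration, not SGLang's actual scheduler logic.

```python
def schedule_steps(prompt_len, num_decoding, token_budget):
    """Plan steps until a prompt of `prompt_len` tokens is fully prefilled.

    Each step spends one token per decoding request and gives the rest of
    the budget to the next prefill chunk. Returns a list of
    (decode_tokens, (chunk_start, chunk_end)) tuples, chunk end exclusive.
    """
    if token_budget <= num_decoding:
        raise ValueError("budget must leave room for at least one prefill token")
    steps = []
    done = 0
    while done < prompt_len:
        chunk = min(token_budget - num_decoding, prompt_len - done)
        steps.append((num_decoding, (done, done + chunk)))
        done += chunk
    return steps


plan = schedule_steps(prompt_len=10, num_decoding=2, token_budget=6)
for decode_tokens, (start, end) in plan:
    print(f"decode {decode_tokens} tokens, prefill prompt[{start}:{end}]")
```

With these numbers the 10-token prompt is prefilled in three steps of 4, 4, and 2 tokens, and the two decoding requests are never stalled behind a single large prefill.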
Takeaways
- Challenge: Need to allocate variable-length memory space for dynamic workloads
- Solution: Treat memory as a pageable, shareable asset
- Challenge: Need to batch process requests with different completion times
- Solution: Continuous batching and dynamic scheduling
- Challenge: Need to reuse computations for overlapping or prefixed data
- Solution: Prefix caching and radix-based sharing
- Challenge: Need to accelerate processing under uncertainty
- Solution: Speculative decoding and fused operations
- Challenge: Need to scale across distributed resources efficiently
- Solution: Tensor and pipeline parallelism with minimal communication
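The prefix caching takeaway above can be sketched with a small token-level trie. This is a simplification: a plain trie with one node per token rather than a compressed radix tree, with no eviction and no real KV tensors. The class and method names (`PrefixCache`, `match_prefix`, `insert`) are hypothetical and only illustrate how a shared prefix lets a later request skip recomputation.

```python
class TrieNode:
    """One cached token; `children` maps the next token id to its node."""

    def __init__(self):
        self.children = {}


class PrefixCache:
    """Toy token-level prefix cache (plain trie, not a compressed radix tree)."""

    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Cache `tokens`; return how many tokens had to be newly added."""
        node, new_tokens = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = TrieNode()
                new_tokens += 1
            node = node.children[tok]
        return new_tokens


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]  # shared prefix, e.g. a common system prompt

first = cache.insert(system_prompt + [10, 11])  # nothing cached yet
second_request = system_prompt + [20, 21, 22]
reused = cache.match_prefix(second_request)     # tokens served from the cache
second = cache.insert(second_request)           # only the new suffix is added
```

Here the first request adds all 6 tokens, while the second reuses the 4-token shared prefix and adds only its 3-token suffix. A real radix tree would additionally merge single-child chains into one edge and track which entries can be evicted.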
References
- SGLang source code up to commit 0346489: https://github.com/sgl-project/sglang
- https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/sglang/kvcache-code-walk-through/readme-CN.md