Why record_stream Matters: A Real CUDA Memory Bug in SGLang
Background
PyTorch’s CUDA memory caching allocator is designed to be fast — it recycles GPU memory aggressively based on Python reference counting. This works transparently most of the time, but it has a blind spot: it only tracks the stream on which a tensor was originally allocated.
When a tensor is passed to a different CUDA stream, the allocator has no idea. If Python references to that tensor drop to zero at the wrong moment, the allocator can reclaim the memory while the second stream’s kernel is still reading it. The result is a silent use-after-free on the GPU: corrupted data, out-of-bounds crashes, or worse — wrong answers with no error at all.
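A minimal sketch of that blind spot (assuming PyTorch with a CUDA device; the names here are illustrative, not from SGLang):

```python
import torch

def handoff_without_record_stream():
    """Allocate on the default stream, consume on a side stream:
    the allocator only knows about the default stream."""
    side = torch.cuda.Stream()
    x = torch.randn(1 << 20, device="cuda")  # allocator ties x to the default stream

    with torch.cuda.stream(side):
        y = x * 2  # side-stream kernel reads x; the allocator is unaware

    # If the last Python reference to x drops here, the allocator may hand
    # x's block to the next default-stream allocation while the side-stream
    # kernel is still reading it.
    return y

if torch.cuda.is_available():
    handoff_without_record_stream()
```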
This exact bug was recently caught in SGLang’s speculative decoding (EAGLE v2), and fixed with a single call to record_stream.
The Bug: Index Out of Bounds in Speculative Decoding
Symptom
SGLang users running DeepSeek-R1 on 8× B200/GB200 GPUs with speculative decoding (EAGLE, 2 steps) at high concurrency (≥ 1024) saw intermittent crashes like:
```
vectorized_gather_kernel: block: [0,1,0], thread: [64,0,0]
```
The indices tensor used to gather from topk_p_buf, topk_index_buf, hidden_states_buf, etc. contained garbage — both negative values and values far exceeding the buffer size.
Root Cause
The race condition unfolds in this sequence:
1. An `indices` tensor is allocated on PyTorch's default stream.
2. The object holding that tensor (`spec_info`) is replaced: `batch.spec_info` is assigned a new value, dropping the last Python reference to the old one.
3. PyTorch's caching allocator sees the refcount hit zero and reclaims the GPU memory.
4. Meanwhile, the forward/compute stream enqueues kernels that read from those same indices.
5. The forward stream reads from memory that has already been freed and partially overwritten → corrupted indices → out-of-bounds crash.
The key insight: steps 3 and 4 can race because they happen on different streams with no synchronization between them.
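The sequence can be sketched with a hypothetical stand-in for the `spec_info` holder (illustrative names only, not SGLang's actual code; assumes a CUDA device):

```python
import torch

class SpecInfo:
    """Hypothetical stand-in for the object that owns the indices tensor."""
    def __init__(self, n: int):
        self.indices = torch.arange(n, device="cuda")  # step 1: default stream

def step_through_race(batch, forward_stream):
    with torch.cuda.stream(forward_stream):
        # step 4: the forward stream enqueues a kernel that reads the indices
        out = batch["spec_info"].indices + 1

    # step 2: replacing spec_info drops the last Python reference to the old
    # holder, so step 3 (the allocator reclaiming its memory) can now race
    # with the kernel enqueued above.
    batch["spec_info"] = SpecInfo(8)
    return out

if torch.cuda.is_available():
    step_through_race({"spec_info": SpecInfo(8)}, torch.cuda.Stream())
```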
The Fix (PR #18958)
```python
# Before (broken): the indices tensor is handed to the forward stream
# with no cross-stream bookkeeping, so the allocator may reclaim it early.

# After (fixed):
indices.record_stream(stream)
```
One line. indices.record_stream(stream) tells the caching allocator: “the current stream is also using this tensor — don’t free it until this stream has passed this point.” The race is eliminated.
What record_stream Does
```python
tensor.record_stream(stream)
```
Registers `stream` as a consumer of `tensor`'s underlying GPU memory. The caching allocator will not reuse that memory until all operations enqueued on `stream` at the time of the call have completed.
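A minimal usage sketch (illustrative names, assuming a CUDA device):

```python
import torch

def handoff_with_record_stream():
    side = torch.cuda.Stream()
    x = torch.randn(1 << 20, device="cuda")  # allocated on the default stream

    with torch.cuda.stream(side):
        y = x * 2              # side-stream kernel consumes x
        x.record_stream(side)  # register side as a consumer of x's memory

    # The last reference to x can now drop at any time: the allocator will
    # not recycle the block until side's enqueued work has completed.
    return y

if torch.cuda.is_available():
    handoff_with_record_stream()
```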
When you need it: any time a tensor is:
- allocated (or last used) on stream A
- then handed off to stream B for further GPU work
- and Python references may drop to zero before stream B finishes
When you don’t need it: if you hold a Python reference to the tensor until after stream B is done (e.g., you explicitly synchronize before releasing it), the reference itself keeps the memory alive.
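The reference-holding alternative can be sketched as follows (illustrative names, assuming a CUDA device):

```python
import torch

def handoff_with_sync():
    side = torch.cuda.Stream()
    x = torch.randn(1 << 20, device="cuda")

    with torch.cuda.stream(side):
        y = x * 2  # side-stream kernel consumes x

    # Instead of record_stream: keep x referenced until the side stream is
    # known to be done, then release it.
    side.synchronize()
    del x  # safe: no pending side-stream work can still read the memory
    return y

if torch.cuda.is_available():
    handoff_with_sync()
```

The trade-off is a host-side stall: `synchronize` blocks the CPU, whereas `record_stream` defers the cost to the allocator's bookkeeping.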
Minimal Reproducer
```python
import gc

import torch

# show_bug.py: allocate on the default stream, consume on a side stream,
# then drop the last reference while the side stream may still be reading.
side = torch.cuda.Stream()
src = torch.randn(1 << 20, device="cuda")

with torch.cuda.stream(side):
    dst = src * 2  # side-stream kernel reads src asynchronously

# Uncomment the fix:
# src.record_stream(side)

del src        # last Python reference drops; allocator can reclaim the block
gc.collect()

# A fresh allocation on the default stream can land in the reclaimed block,
# overwriting the data the side stream is still reading.
junk = torch.zeros(1 << 20, device="cuda")

torch.cuda.synchronize()
print(dst[:4])
```
Running with `CUDA_LAUNCH_BLOCKING=1 python show_bug.py` hides the bug, because synchronous kernel launches close the race window; a plain `python show_bug.py` shows the difference. Adding `src.record_stream(side)` fixes it. The race is timing-dependent, so it may not reproduce on every machine.
Key Takeaways
- The PyTorch caching allocator tracks the allocation stream, not the usage stream. Cross-stream tensor handoffs without `record_stream` are a latent use-after-free.
- The bug is silent until it isn't. At low concurrency or on slower hardware, the race window may never open. It took 8× B200s at 1024 concurrent requests to expose it reliably.
- The fix is cheap. `record_stream` costs essentially nothing — it just registers a stream with the allocator's internal bookkeeping.
- Speculative decoding amplifies the risk. SGLang's overlap scheduler runs prefill and decode on separate streams concurrently, and `spec_info` objects have short, non-obvious lifetimes — a perfect storm for this class of bug.
References
- SGLang issue #18744 — original bug report
- SGLang PR #18958 — the fix
- PyTorch docs: `Tensor.record_stream`
- PyTorch docs: CUDA memory management