I built an eBPF profiler to find where vLLM cold-start spends its time. It wasn’t the disk.

A vLLM server loading Mistral-7B takes about eighteen seconds before it answers the first request. The model is 14 GB on disk. The obvious assumption — the one I started with — is that those eighteen seconds are mostly spent reading 14 GB off the disk.

They are not. Disk I/O is about 7% of it. This post is about the tool I built to measure that, and where the other 93% actually goes.

The assumption: cold start is the disk

Cold start is the time between “process starts” and “server can serve a token.” It matters more than it used to. Scale-to-zero inference, autoscaling under bursty load, spot-instance preemption recovery — all of them pay the cold-start cost every time a replica comes up, and on a GPU you are paying by the second for hardware that is doing no useful work during that window. An 18-second cold start on an A10 at $1.29/hr is not much money on its own, but multiply it by every scale event across a fleet and it becomes a line item.
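To put a number on that line item, here is a minimal sketch of the arithmetic. The function names are mine, and the fleet-wide event count in the note below is a made-up illustration, not a measurement.

```rust
/// Cost in USD of one cold start: GPU time billed while no tokens are served.
fn cold_start_cost_usd(cold_start_s: f64, price_per_hour_usd: f64) -> f64 {
    cold_start_s * price_per_hour_usd / 3600.0
}

/// Cost in USD across `scale_events` cold starts.
/// `scale_events` is a hypothetical fleet-wide count, not a figure from this post.
fn fleet_cost_usd(scale_events: u64, cold_start_s: f64, price_per_hour_usd: f64) -> f64 {
    scale_events as f64 * cold_start_cost_usd(cold_start_s, price_per_hour_usd)
}
```

With the figures above, cold_start_cost_usd(18.0, 1.29) comes to about $0.0065, roughly two-thirds of a cent. At a hypothetical 10,000 scale events that is about $64.50 of GPU time spent serving nothing.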

The intuitive model of cold start is I/O-bound: you have to move 14 GB of weights from disk into GPU memory, so the cost is the bandwidth of that path. If that were true, the fix would be faster disks, prefetching, or keeping weights in page cache. So the first question worth answering precisely is: how much of cold start is actually I/O?

To answer it I needed something that could see, on one clock, both the kernel-side I/O (the read/mmap syscalls moving bytes) and the userspace/GPU work (the CUDA calls setting up the device). I had neither tool on hand. So I built one.

Why /proc sampling and OTel weren’t enough here

I have written before about sampling /proc/<pid> to profile a running inference server, and before that about why OpenTelemetry spans show you nothing when the work is compute-bound and happening below your instrumentation. Both tools share a boundary: they stop at the process. /proc sampling tells you RSS and CPU and thread count at some interval; OTel tells you about spans you chose to instrument. Cold start lives below that boundary. The interesting work is individual syscalls (microsecond-scale, far faster than any sampler) and calls into the CUDA driver (which no Python-level tracer sees). A 50 ms sampling loop averages right over the thing you want to see.

eBPF is the right instrument for this. You attach a small program to a kernel tracepoint or to a function in a userspace library; it runs on every hit, in kernel context, and pushes an event into a ring buffer that userspace drains. No sampling interval — you see every call. The cost is per-event, not per-interval, so it scales with how busy the thing is rather than how long you watch it.

What the probe attaches to

The tool — vllm-coldstart-probe, Rust, Apache-licensed, built on aya — attaches two families of probe.

On the kernel side, four syscall tracepoints, each traced on both enter and exit so I can compute how long each call spent in the kernel: openat, read, mmap, close. These are the syscalls a model loader actually uses to get bytes off disk.

On the userspace side, four entry points in the CUDA driver library libcuda.so, each traced with a uprobe on entry and a uretprobe on return, so I get the duration spent inside each driver call: cuInit, cuModuleLoadData, cuMemAlloc_v2, cuLaunchKernel. (cuMemAlloc_v2 is the real exported ABI symbol; the unsuffixed name you write in C is a header macro.)

Both families write the same fixed-layout event into one ring buffer:

#[repr(C)]
pub struct SyscallEvent {
    timestamp_ns: u64,
    pid: u32,
    tid: u32,
    syscall_nr: u32, // real syscall number, or >= 1000 for a uprobe id
    kind: u8,        // 0 = enter, 1 = exit
    ret: i64,
}

Reusing one event type for both keeps the userspace side simple: one ring buffer, one reader, one JSONL format. The syscall_nr field doubles as a discriminator — anything >= 1000 is a libcuda call, and the analysis splits on that boundary. The kernel-side functions are generated by a small macro so adding a syscall or a CUDA symbol is one line, not a copy-pasted function body:

macro_rules! define_uprobe {
    ($short:ident, $event_id:expr) => {
        ::paste::paste! {
            #[uprobe]
            pub fn [<probe_ $short>](_ctx: ProbeContext) -> u32 {
                emit_uprobe($event_id)
            }
            #[uretprobe]
            pub fn [<probe_ $short _ret>](_ctx: RetProbeContext) -> u32 {
                emit_uprobe_ret($event_id)
            }
        }
    };
}
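On the userspace side, turning that stream of enter and exit events into durations could look roughly like the sketch below. This is not the probe's actual analysis code: the Event mirror, the Source and Call types, and pair_events are names I am introducing for illustration. It assumes each thread's events arrive in order and that a thread never has two calls with the same id in flight at once.

```rust
use std::collections::HashMap;

/// Userspace mirror of the ring-buffer record described above (illustrative).
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
struct Event {
    timestamp_ns: u64,
    pid: u32,
    tid: u32,
    syscall_nr: u32, // real syscall number, or >= 1000 for a uprobe id
    kind: u8,        // 0 = enter, 1 = exit
    ret: i64,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Source {
    Kernel,
    Libcuda,
}

#[derive(Debug, PartialEq, Eq)]
struct Call {
    tid: u32,
    syscall_nr: u32,
    source: Source,
    duration_ns: u64,
}

/// Pair each exit with the pending enter on the same thread for the same id.
fn pair_events(events: &[Event]) -> Vec<Call> {
    // (tid, syscall_nr) -> timestamp of the enter that has not exited yet
    let mut open: HashMap<(u32, u32), u64> = HashMap::new();
    let mut calls = Vec::new();
    for ev in events {
        let key = (ev.tid, ev.syscall_nr);
        match ev.kind {
            0 => {
                open.insert(key, ev.timestamp_ns);
            }
            1 => {
                // An exit with no recorded enter (probe attached mid-call) is dropped.
                if let Some(start_ns) = open.remove(&key) {
                    let source = if ev.syscall_nr >= 1000 {
                        Source::Libcuda
                    } else {
                        Source::Kernel
                    };
                    calls.push(Call {
                        tid: ev.tid,
                        syscall_nr: ev.syscall_nr,
                        source,
                        duration_ns: ev.timestamp_ns.saturating_sub(start_ns),
                    });
                }
            }
            _ => {} // unknown kind: ignore
        }
    }
    calls
}
```

Keying on (tid, syscall_nr) rather than on the pid keeps concurrent threads from stealing each other's enter events, and the >= 1000 split is the same discriminator the event layout defines.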

There were two non-obvious fights to get here. A kernel-side PID filter using a BPF map kept getting dead-code-eliminated by the Rust toolchain regardless of the usual workarounds, so filtering happens in userspace on the drained stream instead. And #[uretprobe] takes a RetProbeContext, a different type from the #[uprobe] entry side’s ProbeContext — a one-line fix once the compiler tells you, but the kind of thing that isn’t in the examples.

The capture: Mistral-7B on an A10, cache dropped

The capture host is a Lambda Labs A10 (24 GB, Ampere), Ubuntu 22.04, kernel 6.8. RunPod, which I’d used for earlier work, turns out not to grant the capabilities eBPF needs in its consumer pods (no CAP_BPF, no CAP_PERFMON, no /sys/kernel/tracing), so a full VM was required. The probe is a 4.3 MB statically linked musl binary, scp’d onto the box and run under sudo.

Two things had to be right for the numbers to mean anything.

First, the page cache. If you load the model once to download it and then measure the second load, the 14 GB is already in RAM and read returns in microseconds from cache — you measure nothing about disk. So I drop the page cache (echo 3 > /proc/sys/vm/drop_caches) immediately before the measured run, forcing real SSD reads.

Second — and this was the surprise that cost me a run — the PID. vLLM’s v1 architecture does the actual model load and CUDA work in a separate EngineCore subprocess, spawned after my probe has already attached. My first capture filtered on the launcher PID and recorded zero libcuda events, despite all four uprobes attaching cleanly. The CUDA calls were all coming from a child process I wasn’t watching. The fix was a --pid 0 mode that captures every process and filters to the target’s process tree afterward — the robust way to handle a target that forks its real workers after you attach.
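The after-the-fact filter can be sketched as a walk over a child-to-parent map. This is an illustration under assumptions, not the probe's code: process_tree and filter_to_tree are hypothetical names, and I am assuming the pid-to-parent map was recorded while the target was still running, since the links disappear once the processes exit.

```rust
use std::collections::{HashMap, HashSet};

/// Return `root` plus every descendant, given a child-pid -> parent-pid map.
fn process_tree(root: u32, parent_of: &HashMap<u32, u32>) -> HashSet<u32> {
    // Invert the map so we can walk downwards from the root.
    let mut children: HashMap<u32, Vec<u32>> = HashMap::new();
    for (&pid, &ppid) in parent_of {
        children.entry(ppid).or_default().push(pid);
    }
    let mut tree = HashSet::new();
    let mut stack = vec![root];
    while let Some(pid) = stack.pop() {
        // `insert` returns false for a pid already visited, which guards against cycles.
        if tree.insert(pid) {
            if let Some(kids) = children.get(&pid) {
                stack.extend(kids.iter().copied());
            }
        }
    }
    tree
}

/// Keep only the events whose pid is in the tree. `events` holds (pid, payload) pairs.
fn filter_to_tree<T>(events: Vec<(u32, T)>, tree: &HashSet<u32>) -> Vec<(u32, T)> {
    events
        .into_iter()
        .filter(|(pid, _)| tree.contains(pid))
        .collect()
}
```

Rooting the walk at the launcher PID picks up a worker like EngineCore however many levels down it was spawned, which is what a single-PID filter misses.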

Where the time isn’t, part one: kernel I/O is 7%

With the cache dropped and the whole process tree captured, here is the kernel-side cost over the ~18-second cold start:

syscall	calls	total time	p50	max
read	23,137	1037 ms	1.1 us	44 ms
openat	19,454	154 ms	2.6 us	1.4 ms
mmap	1,846	16 ms	4.8 us	1.0 ms
close	16,803	16 ms	0.8 us	0.3 ms

Total kernel I/O time: about 1.22 seconds. Out of eighteen. read dominates the kernel side at 1037 ms — that is the 14 GB actually coming off the SSD — but even that is under 6% of cold start. The whole I/O story, every byte moved through every syscall, is 7%.
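The percentages follow directly from the table. A small sketch of that arithmetic, with the totals copied from the rows above and the wall time taken as a round 18 s (the constant and function names are mine):

```rust
/// Total time per syscall in milliseconds, copied from the table above.
const KERNEL_IO_MS: [(&str, f64); 4] = [
    ("read", 1037.0),
    ("openat", 154.0),
    ("mmap", 16.0),
    ("close", 16.0),
];

/// Approximate cold-start wall time ("about eighteen seconds").
const COLD_START_MS: f64 = 18_000.0;

/// Sum the per-syscall totals.
fn total_ms(rows: &[(&str, f64)]) -> f64 {
    rows.iter().map(|(_, ms)| ms).sum()
}

/// Express `part_ms` as a percentage of `whole_ms`.
fn share_pct(part_ms: f64, whole_ms: f64) -> f64 {
    100.0 * part_ms / whole_ms
}
```

total_ms(&KERNEL_IO_MS) is 1223 ms, which is about 6.8% of 18 s; read alone is about 5.8%.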

One detail worth noting: only 1,846 mmap calls against 23,137 read calls. Mistral’s safetensors are being streamed in with read() in chunks, not memory-mapped. That is a vLLM/safetensors loading-strategy choice, and it is the kind of thing you only see when you count the syscalls.

So the disk is not where cold start goes. If you bought faster NVMe to fix your cold-start time, you bought at most 7% of headroom.

Where the time isn’t, part two: the volume of CUDA calls

If it isn’t the disk, the next suspect is the CUDA setup. vLLM makes a lot of driver calls during cold start. Here is the libcuda side, durations measured entry-to-return:

function	calls	total time	p50	max
cuLaunchKernel	947	1337 ms	4.2 us	1287 ms
cuInit	6	123 ms	5 us	123 ms
cuMemAlloc_v2	179	34 ms	106 us	2 ms
cuModuleLoadData	1	1 ms	780 us	1 ms

cuLaunchKernel is called 947 times — vLLM is launching kernels for every shape it profiles during warmup. The total is 1337 ms. So the kernels are the cost, right?

Look at the p50: 4.2 microseconds. The median kernel launch is instant. 947 launches at 4 us each is about 4 ms. The 1337 ms total is not spread across the launches — it is in the tail.

Where the time actually is: one kernel launch, and the gaps

The max column is the tell. One single cuLaunchKernel took 1287 ms — 1.29 seconds in one call, against a median of 4 microseconds. So I went looking at what that one call was, in the capture data.
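The gap between p50 and total is easy to check mechanically. Below is a sketch of the per-function summary and of the share of total time that sits in the single slowest call. The names are illustrative, and the lower-median definition of p50 is my assumption; the analysis script may define it differently.

```rust
#[derive(Debug, PartialEq)]
struct Summary {
    calls: usize,
    total_ns: u64,
    p50_ns: u64,
    max_ns: u64,
}

/// Summarise a set of call durations; returns None for an empty input.
fn summarize(durations_ns: &[u64]) -> Option<Summary> {
    if durations_ns.is_empty() {
        return None;
    }
    let mut sorted = durations_ns.to_vec();
    sorted.sort_unstable();
    Some(Summary {
        calls: sorted.len(),
        total_ns: sorted.iter().sum(),
        // Lower median: the middle element, or the lower of the two middle ones.
        p50_ns: sorted[(sorted.len() - 1) / 2],
        max_ns: sorted[sorted.len() - 1],
    })
}

/// Fraction of the total time spent in the single slowest call.
fn tail_share(s: &Summary) -> f64 {
    if s.total_ns == 0 {
        0.0
    } else {
        s.max_ns as f64 / s.total_ns as f64
    }
}
```

Applied to the cuLaunchKernel row, 1287 ms of the 1337 ms total is in one call, about 96%. The other 946 launches share the remaining 50 ms.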

It is not the first launch. It is the 516th of 947, about 7.2 seconds into the warmup, with 515 fast launches before it. During its 1.29 seconds the process issues almost no syscalls — two reads, nothing else — and the one cuModuleLoadData of the whole run happens almost two seconds after it, not before. So it is not disk I/O, not first-call JIT compilation writing a kernel cache, and not a module load feeding this launch. Those were my first three guesses and the data ruled out all three.

What it most likely is: a synchronization point. CUDA kernel launches are asynchronous — cuLaunchKernel returns as soon as the kernel is queued, and the GPU works through the queue behind it. The 515 fast launches before this one queued work that the GPU was still chewing on. At launch 516 something forces the host to wait — a full queue, an allocation that needs prior work to finish, an implicit sync — and this single call pays the bill for all the accumulated GPU compute at once. The 1.29 seconds is not spent inside the launch; it is spent waiting for the GPU, accounted to the call that happened to block.
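A toy model makes the accounting effect concrete. It is a simulation, not a measurement: the host enqueues kernels at a fixed cost, the GPU runs them one after another, and a single launch is forced to wait for the queue to drain. Every parameter is hypothetical, and it models only the implicit-sync case, not a full queue.

```rust
/// Toy model of asynchronous launches. Returns the host-side duration of each
/// launch call, in the same time unit as the inputs.
/// All parameters are hypothetical; this does not reproduce the capture.
fn simulate_launches(
    n: usize,
    host_enqueue: u64,
    gpu_kernel: u64,
    sync_at: Option<usize>,
) -> Vec<u64> {
    let mut host_now = 0u64; // host clock
    let mut gpu_free_at = 0u64; // when the GPU finishes everything queued so far
    let mut durations = Vec::with_capacity(n);
    for i in 0..n {
        let start = host_now;
        if sync_at == Some(i) {
            // Implicit sync: the host blocks until all queued work has drained.
            host_now = host_now.max(gpu_free_at);
        }
        host_now += host_enqueue;
        // The kernel starts once it is enqueued and the GPU is free.
        gpu_free_at = gpu_free_at.max(host_now) + gpu_kernel;
        durations.push(host_now - start);
    }
    durations
}
```

With simulate_launches(10, 4, 1000, Some(5)), every launch takes 4 time units on the host except launch 5, which takes 4988: it absorbs the GPU time of the five kernels queued before it. Without the sync point, no launch ever shows that cost on the host side.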

I can’t prove that from this capture alone: I’m not tracing cuStreamSynchronize or cuCtxSynchronize, the driver API’s counterpart to the runtime’s cudaDeviceSynchronize, which is exactly what would confirm it. What the data does prove is the negative: this cost is not I/O, not the first launch, not compilation-to-disk. cuInit is cleaner: one 123 ms call plus five no-op repeats, the driver initialising once.

Now add it up. Kernel I/O: 1.22 s. libcuda calls: 1.49 s. Together, about 2.7 seconds of an 18-second cold start. Everything the probe can directly attribute is about 15% of the wall time.

The other ~85% is in the gaps — the wall-clock time that elapses between the traced calls, where the process is neither in a syscall I’m watching nor inside a libcuda function I’m watching. That time is GPU compute that has already been launched and is running asynchronously, CUDA runtime and framework work above the driver, and plain Python: importing torch, building the model object, moving tensors. The probe sees the boundaries of that work, not its interior.
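Summing per-call totals, as above, can double-count when two threads are inside traced calls at the same moment. A stricter way to measure the gaps is to take the union of all traced intervals and subtract it from the wall time. The sketch below shows that variant; it is not the method behind the numbers in this post, the function names are mine, and it assumes every interval lies inside the wall-clock window.

```rust
/// Time covered by at least one (start_ns, end_ns) interval, merging overlaps.
fn covered_ns(mut intervals: Vec<(u64, u64)>) -> u64 {
    intervals.sort_unstable();
    let mut covered = 0u64;
    let mut current: Option<(u64, u64)> = None;
    for (start, end) in intervals {
        if end <= start {
            continue; // skip empty or inverted intervals
        }
        match current {
            // Overlaps or touches the run being built: extend it.
            Some((cs, ce)) if start <= ce => current = Some((cs, ce.max(end))),
            // Disjoint: close the previous run and start a new one.
            Some((cs, ce)) => {
                covered += ce - cs;
                current = Some((start, end));
            }
            None => current = Some((start, end)),
        }
    }
    if let Some((cs, ce)) = current {
        covered += ce - cs;
    }
    covered
}

/// Wall-clock time not covered by any traced call.
fn gap_ns(intervals: Vec<(u64, u64)>, wall_start_ns: u64, wall_end_ns: u64) -> u64 {
    (wall_end_ns - wall_start_ns).saturating_sub(covered_ns(intervals))
}
```

For example, intervals (0, 10), (5, 20) and (30, 40) cover 30 ns, so over a 100 ns window the gap is 70 ns.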

Does it hold as the model grows?

One model is an anecdote. If the claim is “cold start is not I/O and not the volume of calls,” the obvious test is whether it survives a bigger model. So I ran the same probe against Qwen2.5-Instruct in AWQ at three sizes — 7B, 14B, 32B — same family, same quantization, only the parameter count changing. (The 7B and 14B fit on the A10; the 32B needed an A100 40 GB.) Keeping quantization fixed matters: AWQ changes the kernel count dramatically versus FP16 — the 7B-AWQ issues 3,475 cuLaunchKernel calls where Mistral-7B in FP16 issued 947 — so mixing precisions would confound size with quantization. Holding it constant isolates size.

The load times:

model	params	load time	kernel I/O	cuLaunchKernel calls	cuMemAlloc calls
7B-AWQ	7B	18.97 s	~1.86 s	3,475	161
14B-AWQ	14B	25.02 s	~2.27 s	6,547	372
32B-AWQ	32B	28.36 s	~1.93 s	8,691	418

The headline number: parameters grow 4.6x from 7B to 32B, load time grows 1.5x. Cold start scales strongly sub-linearly with model size.

Three things hold across the sweep, and one is worth being careful about.

cuInit and cuModuleLoadData are fixed costs: they do not move with model size (cuInit is ~80–140 ms at every size; the variation is run-to-run noise). The driver does not care how big your model is.

Kernel I/O does not grow with size either. read is in the 1.5–1.9 s band at all three sizes: the 32B has roughly 4.6 times the parameters of the 7B and reads them in the same wall-clock time. AWQ weights are small enough that the byte-count difference disappears into SSD bandwidth; the load is not I/O-bound at any size in this range.

The number of operations does scale with size: cuLaunchKernel count rises monotonically 3,475 → 6,547 → 8,691, cuMemAlloc 161 → 372 → 418. More model, more kernels and more allocations, but sub-linearly over the full range: 4.6x the parameters brings about 2.5x the launches and 2.6x the allocations.

The careful part: the time inside those launches does not scale cleanly. Total cuLaunchKernel time was 1.77 s / 3.41 s / 2.81 s — the 14B spent more total launch time than the larger 32B. That is the same effect from the single-model section: the launch time is dominated by a few synchronization outliers, and where the sync tax lands is noisy run-to-run, not a function of model size. So the honest statement is precise: the count of operations scales sub-linearly with size; the time stays dominated by fixed init and the noisy sync tax, not by the volume of weights. If you size your cold-start budget by parameter count, you will over-provision for the big models and mis-predict the variance on all of them.
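The scaling ratios can be recomputed from the table. The sketch below does that and also fits a rough two-point exponent k in "load time ∝ params^k". That exponent is my own summary statistic, derived from only the 7B and 32B endpoints, so treat it as indicative rather than a fitted result.

```rust
/// One row of the sweep table above.
struct Run {
    params_b: f64, // nominal parameter count, in billions
    load_s: f64,
    launches: f64, // cuLaunchKernel calls
    allocs: f64,   // cuMemAlloc calls
}

const SWEEP: [Run; 3] = [
    Run { params_b: 7.0, load_s: 18.97, launches: 3475.0, allocs: 161.0 },
    Run { params_b: 14.0, load_s: 25.02, launches: 6547.0, allocs: 372.0 },
    Run { params_b: 32.0, load_s: 28.36, launches: 8691.0, allocs: 418.0 },
];

/// Exponent k such that y_ratio = x_ratio^k; k < 1 means sub-linear growth.
fn scaling_exponent(x_ratio: f64, y_ratio: f64) -> f64 {
    y_ratio.ln() / x_ratio.ln()
}
```

From 7B to 32B the ratios are about 4.57x parameters, 1.49x load time, 2.50x launches and 2.60x allocations, and the two-point exponent for load time comes out near 0.26.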

What this means for cold start

The actionable version, and it held across every model I tried: making cold start faster is not a storage problem and not a “make fewer CUDA calls” problem. Driver init is one fixed cost, flat across sizes. The single 1.29 s launch and the ~85% in the gaps are both, most likely, the GPU actually computing the warmup — vLLM profiling every batch shape it might see — plus the Python and framework work that drives it.

That points the optimisation work away from the disk and toward the warmup itself. If the dominant cost is the GPU grinding through warmup compute, the levers are reducing or caching that warmup: CUDA graph capture and replay so the shapes are not re-profiled every cold start, persisting a compiled or warmed state, or keeping a replica warm rather than paying cold start at all. The Python and framework time is partly what enforce_eager=True — which I used, to keep the trace readable — deliberately leaves on the table; a production config trades some of it back. None of these are disk fixes. If your mental model said “cold start is I/O,” you’d have spent your effort in the 7%.

What the probe can’t see yet

The honest limit: 85% of cold start is in the gaps, and “GPU compute, runtime, and Python” is a bucket, not an attribution. Narrowing it means more probe points — more libcuda entry points, the CUDA runtime layer, uprobes into libtorch — to push the boundary of what’s measured further into that gap. Each one converts some of the “between the calls” time into named, timed work. That is the next round.

There is also a structural ceiling, and the 1.29 s launch is the clearest example of it. A uretprobe times wall-clock duration inside a driver call, but a cuLaunchKernel returns when the kernel is queued, not when it finishes on the GPU. Most launches return in microseconds because the queueing is all that happens on the host; the one that blocked for 1.29 s did so because something made the host wait for the device. To attribute that wait properly — to say how much is queued compute versus an explicit sync — you have to correlate host-side launches with device-side completion, which needs CUPTI or NVML on the same clock. That is exactly the boundary my earlier /proc-plus-NVML profiler worked at, from the other side.

Which is the through-line. The process-level profiler stops at the process boundary and looks down; this one starts at the kernel and driver and looks up. Cold start is one of the places where you need both, because the cost is split across exactly the seam where most tools stop. The probe is on GitHub, Apache-licensed, with the analysis script that produced every number above. If you run inference cold starts at scale and have measured them differently — or found the device-side cost somewhere I haven’t looked — I’d be interested to hear about it.