Profiling LLM inference: what your /proc sampler isn’t telling you

An LLM inference profiler that reads /proc/<pid> to track memory and CPU usage can quietly monitor the wrong process. The number it shows is correct; it just describes an idle shell wrapper, not the long-lived worker doing the actual inference. The output looks plausible, the GPU panel says the GPU is hot, and nothing breaks. The data just doesn’t mean what the chart label says it means.

This is a class of bug, not a one-off. I ran into it while validating inferscope v0.2 on an NVIDIA L4, and the shape of the failure is general enough that it likely shows up in any pipeline that captures a PID via $! after a bash background launch. This post walks through what I observed, why it happens at the kernel level, and what a defensible fix looks like at the tooling layer.

The signal that didn’t match the reality

inferscope is a Rust profiler I’m building for LLM inference engines: it drives an engine over its OpenAI-compatible HTTP API, captures per-token timing, and (in v0.2) also samples GPU resources via NVML. The CPU-side sampler reads /proc/<pid>/status and /proc/<pid>/stat every 50 ms for the duration of the probe, the cadence fixed by the project’s sampling ADR.
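As a rough illustration of what that CPU-side read involves, here is a minimal Rust sketch of turning the text of /proc/<pid>/stat into a CPU figure. It is not inferscope’s actual code: the function names cpu_jiffies_from_stat and cpu_percent are hypothetical, and the clock-tick rate is passed in as a parameter (100 is typical on Linux, but treat that as an assumption and confirm it with getconf CLK_TCK).

```rust
/// Total CPU jiffies (utime + stime) from the text of /proc/<pid>/stat.
fn cpu_jiffies_from_stat(stat: &str) -> Option<u64> {
    // Field 2 (comm) is parenthesised and may contain spaces or ')',
    // so cut at the LAST ')' instead of splitting the whole line.
    let rest = &stat[stat.rfind(')')? + 1..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // `rest` begins at field 3 (state), so utime (field 14) and
    // stime (field 15) sit at indices 11 and 12.
    let utime: u64 = fields.get(11)?.parse().ok()?;
    let stime: u64 = fields.get(12)?.parse().ok()?;
    Some(utime + stime)
}

/// CPU utilisation between two readings, as a percentage of one core.
/// `clk_tck` is the kernel tick rate (assumed, not queried here).
fn cpu_percent(jiffies_before: u64, jiffies_after: u64, elapsed_secs: f64, clk_tck: u64) -> f64 {
    if elapsed_secs <= 0.0 || clk_tck == 0 {
        return 0.0;
    }
    let delta = jiffies_after.saturating_sub(jiffies_before) as f64;
    100.0 * delta / (clk_tck as f64 * elapsed_secs)
}
```

At a 50 ms cadence and 100 ticks per second, one sampling window is only 5 jiffies per core, so a per-window figure is coarse; a mean over the whole probe is the more meaningful number.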

For the v0.2 validation I rented an NVIDIA L4 on RunPod, brought up a llama-server (llama.cpp) instance serving Qwen 2.5 0.5B Q4 on the GPU, and ran inferscope --gpu against it. The first output looked like this:

Probe summary
  Tokens generated      80
  Generation rate       375.7 tokens/s

Process resource usage (6 samples)
  RSS                peak 2 MiB  mean 2 MiB
  CPU utilization    mean 0%
  Threads            1 throughout

GPU resource usage (6 samples)
  VRAM               peak 1.25 GiB
  SM utilization     peak 83%  mean 41%

The GPU panel reports a workload doing real work: 1.25 GiB of VRAM resident, SM utilisation peaking at 83%, sustained throughput of 375 tokens per second. The process panel describes a process barely alive: 2 MiB of RSS, zero CPU activity, one thread.

These two views cannot both be true of the same process. A program serving an LLM at 375 tokens/s has to do some host work — tokenisation, request handling, HTTP serialisation, memory management around the KV cache. It will not fit in 2 MiB of RSS or run in a single idle thread. So either the GPU panel is wrong, or the process panel is monitoring something other than the inference worker.

A quick check confirmed the second:

$ nvidia-smi
NVIDIA L4 ... 1282 MiB / 23034 MiB ... 0% util  <-- idle
              ... 1.25 GiB ...                  <-- matches inferscope GPU
... no other processes on the GPU ...

The GPU side was accurate. The process side was not the inference worker.

What the kernel was actually telling us

The way I had launched the server matched the pattern most tutorials suggest:

./build/bin/llama-server -m model.gguf --port 8080 \
  --n-gpu-layers 99 > /tmp/llama-server.log 2>&1 &
SERVER_PID=$!

That $! is the source of the bug. Bash’s $! returns the PID of the process bash forked for the most recent background job, and that process is not guaranteed to be the binary you launched. In the simplest case the forked child sets up the redirection and then execs in place into the real binary, so the two PIDs coincide. In other cases the forked child stays a bash process and starts the binary with a further fork+exec, leaving the shell around as a parent that does nothing but wait on it. My launch fell into the second case.

In the Qwen 2.5 0.5B run, the difference between the two PIDs was directly visible in /proc:

$ cat /proc/2252/status | grep -E 'Name|VmRSS|Threads'
Name:   bash
VmRSS:    2144 kB
Threads:    1

$ cat /proc/2254/status | grep -E 'Name|VmRSS|Threads'
Name:   llama-server
VmRSS:  523804 kB
Threads:    52

$ cat /proc/2254/status | grep PPid
PPid:   2252

PID 2252, the value $! had handed me, was bash. PID 2254 was the actual llama-server, with the bash wrapper as its parent. Both processes were running; both were valid /proc entries; the sampler was faithfully reading /proc/2252/status and reporting exactly what was there.

This is the failure mode I want to be explicit about: the sampler was not broken, the data was not wrong, the contract was not violated. The PID I supplied was a real PID of a real, currently-running process. The tooling can’t tell that the user intended a different PID. From the sampler’s point of view, asking it to monitor PID 2252 and getting back “2 MiB, 0% CPU, 1 thread” is the correct answer to the question that was asked.

This is why the bug is interesting. It can’t be fixed by being more careful with the sampler. The sampler is already correct.

Why this shows up in more pipelines than just mine

If you supply a PID to your profiling tooling using $! after a background launch with output redirection, you are likely seeing a similar pattern. The shape varies — some shells, some configurations, and some binaries do an in-place exec that keeps the PID consistent, so the bug stays hidden until you move environments. Some inference servers (vLLM, Ollama, llama.cpp built differently) fork additional workers, so even a correct top-level PID misses the workers that hold the model weights. Some Docker entrypoints add another wrapper layer.

The general statement: in a pipeline where you launch an inference server in the background, capture its PID programmatically, and hand that PID to anything that samples /proc, you should verify at least once on real hardware that the PID actually points at the worker doing the inference. The check is one command, and the answer is in its two lines of output:

$ cat /proc/$SERVER_PID/status | grep -E 'VmRSS|Threads'
VmRSS:    523804 kB
Threads:  52

A real LLM inference worker has VmRSS at least in the hundreds of MiB (the model weights and runtime live there even with GPU offload) and threads well above one. Anything else is a wrapper or an idle process and should not be the PID you’re sampling.
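The same sanity check can be expressed as code. The sketch below is a hypothetical helper, not part of inferscope: status_value and plausible_worker are names invented here, and the 100 MiB cut-off is an assumed reading of “hundreds of MiB” that you would tune for your own models.

```rust
/// Assumed cut-off for "hundreds of MiB"; tune for your models.
const MIN_WORKER_RSS_KIB: u64 = 100 * 1024;

/// Numeric value of a `Key:   123 [kB]` line in /proc/<pid>/status.
fn status_value(status: &str, key: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let value = line.strip_prefix(key)?.strip_prefix(':')?;
        value.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Some(true) if the status text is plausible for an inference worker,
/// Some(false) if it looks like a wrapper, None if a field is missing
/// (kernel threads, for example, have no VmRSS line).
fn plausible_worker(status: &str) -> Option<bool> {
    let rss_kib = status_value(status, "VmRSS")?;
    let threads = status_value(status, "Threads")?;
    Some(rss_kib >= MIN_WORKER_RSS_KIB && threads > 1)
}
```

Fed the two status excerpts from the run above, this returns Some(false) for the bash wrapper (2144 kB, 1 thread) and Some(true) for llama-server (523804 kB, 52 threads).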

The fix, in two pieces

There are two reasonable places to handle this kind of failure: the documentation and the runtime.

The documentation fix is the easy half. The runbook now tells the user not to use $!, and gives them the alternative:

SERVER_PID=$(pgrep -x llama-server | head -1)

pgrep -x matches on the exact process name rather than on the command line, so it returns the long-lived worker rather than a wrapper or a grep of the command itself. The runbook also includes the sanity check above as a pre-flight step before invoking inferscope.
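If a tool wanted to do this lookup itself instead of shelling out to pgrep, the exact-name match amounts to comparing each /proc/<pid>/comm against the binary name. The following is a minimal sketch under that assumption; pids_by_exact_comm is a hypothetical name, and the proc_root parameter exists only so the function can be pointed at a test directory instead of /proc.

```rust
use std::fs;
use std::path::Path;

/// PIDs under `proc_root` whose `comm` equals `name` exactly, ascending.
/// The kernel truncates comm to 15 characters, so longer binary names
/// must be compared against their truncated form.
fn pids_by_exact_comm(proc_root: &Path, name: &str) -> Vec<u32> {
    let mut pids = Vec::new();
    let entries = match fs::read_dir(proc_root) {
        Ok(entries) => entries,
        Err(_) => return pids,
    };
    for entry in entries.flatten() {
        // Only numeric directory names are processes.
        let pid: u32 = match entry.file_name().to_string_lossy().parse() {
            Ok(pid) => pid,
            Err(_) => continue,
        };
        // A process can exit between read_dir and this read; skip it then.
        if let Ok(comm) = fs::read_to_string(entry.path().join("comm")) {
            if comm.trim_end() == name {
                pids.push(pid);
            }
        }
    }
    pids.sort_unstable();
    pids
}
```

Calling pids_by_exact_comm(Path::new("/proc"), "llama-server") and taking the first element corresponds to the pgrep -x ... | head -1 line above. If the server forks workers under the same name, more than one PID comes back and the caller still has to choose.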

The runtime fix is the more interesting half. After the probe completes, inferscope now runs a single pessimistic heuristic against the collected timeline. If every sample shows RSS below 10 MiB and exactly one thread and zero CPU jiffies (user plus system), it emits a stderr warning:

inferscope: warning: monitored PID looks idle across all 6
samples (RSS < 10 MiB, 1 thread, 0 CPU jiffies). The --pid
argument may point to a wrapper shell rather than the
actual workload. Verify with:
cat /proc/<pid>/status | grep -E 'VmRSS|Threads'.

Each of the three conditions on its own would be a poor signal. A small daemon can legitimately use less than 10 MiB; a single-threaded program is normal; a momentarily idle process will register 0 jiffies in a 50 ms window. Requiring all three to hold across every sample in the timeline is what keeps the false-positive rate down. A real inference worker never matches all three for the entire duration of a 200-ms probe run; an idle bash wrapper matches all three trivially.
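The rule is small enough to sketch in full. This is an illustration of the pessimistic AND as described, not inferscope’s source: the Sample struct and its field names are hypothetical, cpu_jiffies is assumed to be the jiffies consumed within one sampling window, and treating an empty timeline as “no warning” is my assumption.

```rust
/// One sampler tick. Field names are illustrative, not inferscope's.
#[derive(Debug, Clone, Copy)]
struct Sample {
    rss_kib: u64,
    threads: u32,
    /// CPU jiffies (user + system) consumed in this sampling window.
    cpu_jiffies: u64,
}

/// 10 MiB, the RSS bound quoted in the warning text.
const IDLE_RSS_LIMIT_KIB: u64 = 10 * 1024;

/// True only if EVERY sample is small, single-threaded, and CPU-idle.
fn looks_like_idle_wrapper(timeline: &[Sample]) -> bool {
    // `all` is vacuously true on an empty iterator, so guard against
    // warning on a timeline that contains no evidence at all.
    !timeline.is_empty()
        && timeline.iter().all(|s| {
            s.rss_kib < IDLE_RSS_LIMIT_KIB && s.threads == 1 && s.cpu_jiffies == 0
        })
}
```

A single sample that breaks any one condition, such as one jiffy of CPU in one window, is enough to suppress the warning, which is what makes the check conservative.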

The warning is informational only. It does not abort the run or change the report — the user still sees the GPU section, the timing section, and the (wrong) process section. The point is not to refuse to produce output, but to make sure the user can’t accept the misleading output without being told it might be misleading.

Empirical validation, A/B

The two runs below were taken back-to-back on the same L4 pod, against the same llama-server instance, with the same prompt and --max-tokens value. The only difference is the PID supplied to --pid.

PID 2252 (the bash wrapper, the wrong PID, the case the warning should catch):

inferscope: warning: monitored PID looks idle across all 6
samples (RSS < 10 MiB, 1 thread, 0 CPU jiffies). ...

Process resource usage (6 samples)
  RSS                peak 2 MiB  mean 2 MiB
  CPU utilization    mean 0%
  Threads            1 throughout

GPU resource usage (6 samples)
  VRAM               peak 1.25 GiB
  SM utilization     peak 91%  mean 45%

PID 2254 (the real llama-server, the right PID, no warning):

Process resource usage (7 samples)
  RSS                peak 511 MiB  mean 510 MiB
  CPU utilization    mean 73%
  Threads            52 throughout

GPU resource usage (7 samples)
  VRAM               peak 1.25 GiB
  SM utilization     peak 81%  mean 23%

The GPU panels match: the distributions differ slightly because each run captures a different slice of work, but they are in the same ballpark (1.25 GiB resident, peak utilisation between 81% and 91%). The process panel for PID 2254 is now what you would expect from a real inference server: half a gigabyte resident for the model weights and runtime, 52 threads from the worker pool and HTTP handlers, and sustained CPU utilisation during the request, because even GPU-offloaded inference does meaningful host-side work.

The warning fires for the wrong PID and stays silent for the right one. The validation cost was about $0.40 of RunPod credit across two sessions on the L4 ($0.39/hr).

What this means beyond inferscope

Three things are worth taking away even if you never use inferscope.

The first is that $! is not a reliable PID-capture pattern when the launch uses background plus output redirection. A more robust one is pgrep -x <binary-name> after a short sleep to let the worker stabilise. This applies to llama.cpp, vLLM, Ollama, TGI, and any other server you launch this way from a shell script.

The second is that “the process panel and the GPU panel disagree” is a high-signal symptom. If your profiling dashboard shows the GPU is hot and the process is idle, the process being monitored is not the one talking to the GPU. This is true even when the profiling tooling is correct — particularly when the profiling tooling is correct, because a correct tool will faithfully report the state of whatever process you pointed it at.

The third is more general. Validation on real hardware surfaces a class of bugs that no amount of unit testing in a clean room will catch. The inferscope test suite (109 tests green with --features gpu-nvidia enabled) passed in CI from day one. The PID bug was not visible until the tool was running against a real LLM workload on a real GPU. The fix was inexpensive (about $0.40 of RunPod time, two commits, and a runbook entry), but it was only possible because the validation budget existed. If your tooling has no real-hardware validation step in its release process, every release is on the honour system.

What I’d do differently next time

If I were starting the v0.2 NVIDIA path from scratch, I would still not add a process-tree-aware sampler in v0.2. That is the correct long-term fix (the sampler should aggregate /proc/<pid>/task/*/stat for the supplied PID plus its descendants), but it is meaningful design work and has its own correctness gotchas: cgroup boundaries, namespace crossings, descendant PID enumeration races. The pessimistic-AND warning is the right thing for v0.2 because it is small, defensible, and informational. The full process-tree sampler is on the v0.3 list.
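To make the shape of that future sampler concrete, here is a minimal sketch of the aggregation step alone. It assumes a point-in-time snapshot of (pid, ppid, RSS, threads) rows has already been read from /proc; the ProcInfo struct and aggregate_tree function are hypothetical names, and none of the gotchas listed above (cgroups, namespaces, enumeration races) are handled.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// One row of a hypothetical point-in-time process snapshot.
#[derive(Debug, Clone, Copy)]
struct ProcInfo {
    pid: u32,
    ppid: u32,
    rss_kib: u64,
    threads: u32,
}

/// Sum RSS and thread counts over `root` and all of its descendants.
/// Returns None if `root` is not present in the snapshot.
fn aggregate_tree(root: u32, snapshot: &[ProcInfo]) -> Option<(u64, u32)> {
    let by_pid: HashMap<u32, &ProcInfo> = snapshot.iter().map(|p| (p.pid, p)).collect();
    if !by_pid.contains_key(&root) {
        return None;
    }
    // Index children by parent PID so the walk is a plain BFS.
    let mut children: HashMap<u32, Vec<u32>> = HashMap::new();
    for p in snapshot {
        children.entry(p.ppid).or_default().push(p.pid);
    }
    let (mut rss_kib, mut threads) = (0u64, 0u32);
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([root]);
    while let Some(pid) = queue.pop_front() {
        // Guard against cycles that PID reuse could create in a racy snapshot.
        if !seen.insert(pid) {
            continue;
        }
        if let Some(p) = by_pid.get(&pid) {
            rss_kib += p.rss_kib;
            threads += p.threads;
        }
        if let Some(kids) = children.get(&pid) {
            queue.extend(kids.iter().copied());
        }
    }
    Some((rss_kib, threads))
}
```

With the two processes from this post in the snapshot, starting from the wrapper (2252) yields the combined totals of bash and llama-server, so the wrong PID would no longer produce a misleading panel. Note that summing RSS across processes double-counts shared pages, which is one more reason this is design work and not a quick patch.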

I would also write the validation runbook earlier — before the first real-hardware test, not after. The runbook I wrote post-hoc captures all the steps I had to figure out the first time; if it had existed up front, the PID bug would have been documented in the runbook before it ever fired in inferscope output.

Roadmap

v0.3 adds AMD GPU sampling via amd-smi, the successor tool AMD is positioning as the long-term replacement for rocm-smi. rocm-smi’s JSON output is documented as unstable across ROCm upgrades, so investing parser work in it now would mean refactoring that parser within months; ADR-005 records that decision in full.

v0.4 and beyond is engine-internal instrumentation: phase-level attribution of time and memory inside the inference engine itself (KV cache management, attention, sampling), which is where black-box profiling stops being useful and engine-aware tooling has to take over.

inferscope is Apache-2.0, on GitHub, and v0.2.0 is the current release.