The DevOpsBeast Blog

Production engineering notes.

Field notes on Kubernetes, GPUs, Linux, and the rest of the production stack, from engineers who run real infrastructure.

GPU Infrastructure · 17 min read

Your GPU Dashboard Says 100% Utilized. It's Lying. Welcome to DCGM.

Every post about GPU incidents starts with 'the dashboards looked fine.' That's the problem. nvidia-smi GPU utilization tells you only that a kernel was running during the sample window, not whether the silicon is doing useful work. This post covers the metrics that actually matter, the DCGM + Prometheus stack that exposes them, and the queries and alerts that catch real GPU failures.
