Your GPU Dashboard Says 100% Utilized. It's Lying. Welcome to DCGM.

Every post about GPU incidents starts with 'the dashboards looked fine.' That's the problem. nvidia-smi GPU utilization tells you a kernel ran, not whether the silicon is doing work. This post covers the metrics that actually matter, the DCGM + Prometheus stack that exposes them, and the queries and alerts that catch real GPU failures.

By Sharon Sahadevan · 17 min read

You roll out a new inference deployment. The Grafana panel says GPU utilization is pinned at 100%. The capacity-planning meeting concludes the obvious: the GPUs are saturated, you need more of them, file the procurement ticket.

Then someone measures actual throughput. The H100 is serving 40% of the tokens-per-second it should. The GPU that is "100% utilized" is mostly waiting. You did not have a capacity problem. You had a measurement problem, and it was about to cost you a six-figure GPU order.

This is the most expensive lie in GPU infrastructure, and almost every team believes it. nvidia-smi utilization, the metric most dashboards graph, does not mean what you think it means. Every post in this series opens with the same sentence: the dashboards looked fine. This is the post about why the dashboards lie, and what to put on them instead.

Why GPU utilization is the wrong number

The metric everyone watches is utilization.gpu from nvidia-smi, surfaced as DCGM_FI_DEV_GPU_UTIL in the exporter. NVIDIA's own definition: the percent of time over the past sample period during which one or more kernels was executing on the GPU.

Read that again. It is a time-based occupancy flag, not a work measurement. If a single tiny kernel runs on 1 of an H100's 132 streaming multiprocessors for the entire sampling window, this metric reads 100%. The other 131 SMs sit idle. The dashboard says "fully utilized." The silicon says "mostly asleep."

This is not a rare edge case. It is the normal state of LLM inference. Autoregressive decode launches a stream of small kernels for every generated token. Each one barely touches the GPU's compute capacity, but together they keep a kernel resident almost the whole time. GPU_UTIL saturates at 100% while the SMs run at 15% occupancy and the tensor cores, the units that actually cost you money, are nearly idle.
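
To make the gap concrete, here is a toy model of the two definitions. It is a sketch of the arithmetic only, not DCGM's implementation: the sampling window is split into a hypothetical number of time slices, and each slice records how many SMs had work. The function names and slice count are illustrative assumptions.

```python
# Toy model, not DCGM's implementation: a sampling window is split into
# time slices, and each slice records how many SMs were busy.
NUM_SMS = 132        # H100 SM count, as cited above
NUM_SLICES = 1000    # hypothetical number of slices per sampling window


def gpu_util(busy_sms_per_slice):
    """Percent of slices in which at least one SM was busy (a kernel was resident)."""
    busy = sum(1 for n in busy_sms_per_slice if n > 0)
    return 100.0 * busy / len(busy_sms_per_slice)


def sm_active(busy_sms_per_slice, num_sms=NUM_SMS):
    """Fraction of total SM-time that was busy, averaged across all SMs."""
    return sum(busy_sms_per_slice) / (num_sms * len(busy_sms_per_slice))


# One tiny kernel occupying a single SM for the entire window.
one_sm_always = [1] * NUM_SLICES

print(gpu_util(one_sm_always))             # 100.0
print(round(sm_active(one_sm_always), 4))  # 0.0076
```

The same timeline reads as 100% on the occupancy flag and under 1% on the breadth metric, which is the whole argument in two numbers.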

KEY CONCEPT

DCGM_FI_DEV_GPU_UTIL answers "was a kernel running?" not "how much of the GPU was working?" For LLM inference it is almost always pinned near 100% and tells you nothing. The metrics that answer the real question all live in the DCGM profiling (PROF) namespace. If your GPU dashboard only graphs GPU_UTIL, you are flying blind at full instrument confidence.

The chain of "utilization" metrics, from least to most honest:

  • DCGM_FI_DEV_GPU_UTIL — was any kernel active in the window. The liar. Coarse, binary-ish, near-100% under any steady load.
  • DCGM_FI_PROF_GR_ENGINE_ACTIVE — ratio of time the graphics/compute engine was active. Slightly better, still coarse.
  • DCGM_FI_PROF_SM_ACTIVE — fraction of time at least one warp was resident on an SM, averaged across all SMs. Now you see breadth: are you using all the SMs or one of them?
  • DCGM_FI_PROF_SM_OCCUPANCY — fraction of resident warps relative to the SM maximum, averaged. Now you see depth: are the SMs you touch actually full?
  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE — fraction of cycles the tensor pipe was active. For LLM serving and training, this is the number that correlates with the dollar value you extract from the GPU. A tensor-active of 0.6 on an H100 doing matmul-heavy work is healthy; 0.05 means you are paying for a Ferrari to sit in traffic.

A worked example. An H100 inference pod reports:

DCGM_FI_DEV_GPU_UTIL          = 100   # "fully utilized"
DCGM_FI_PROF_SM_ACTIVE        = 0.18  # only ~18% of SM-time has a live warp
DCGM_FI_PROF_SM_OCCUPANCY     = 0.11  # average warp occupancy across SMs is only ~11%
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE = 0.04 # tensor cores almost entirely idle
DCGM_FI_PROF_DRAM_ACTIVE      = 0.71  # memory bus is the busy part

The story those five numbers tell: this workload is memory-bandwidth-bound, not compute-bound. The GPU is spending its time shuttling weights and KV cache between HBM and the SMs, and the compute units are starved waiting for data. Throwing more GPUs at it adds throughput only by replicating the bottleneck, so most of every new GPU is wasted. The right move is a memory-side optimization: bigger batches to amortize the weight reads, quantization to shrink the bytes moved, or the KV-cache architecture changes from the disaggregation post. GPU_UTIL alone would have sent you to the procurement portal instead.

Compute-bound vs memory-bound: the distinction that decides your bill

This is the single most valuable thing GPU observability buys you, so it deserves its own section.

LLM inference has two phases with opposite resource profiles:

  • Prefill (processing the prompt) is compute-bound. Large matmuls over the whole prompt. You will see high PIPE_TENSOR_ACTIVE, high SM_OCCUPANCY. Here, more compute (or higher batch parallelism) helps.
  • Decode (generating tokens one at a time) is memory-bandwidth-bound. Each token reads the entire model's weights and the growing KV cache from HBM to produce one token of output. Arithmetic intensity is terrible. You will see high DRAM_ACTIVE, low PIPE_TENSOR_ACTIVE. Here, more compute does nothing; only batching (to reuse the weight reads across more sequences) or reducing bytes-moved helps.

If you cannot see DRAM_ACTIVE and PIPE_TENSOR_ACTIVE side by side, you cannot tell which phase dominates your traffic, and you cannot tell whether your next optimization should target compute or memory. Most production inference is decode-dominated and therefore memory-bound — which is exactly why the vLLM gpu_memory_utilization and KV cache levers move the needle and "buy a faster GPU" often does not.
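
A back-of-envelope roofline check makes the decode argument concrete. The sketch below assumes approximate H100 SXM spec-sheet figures (about 989 TFLOP/s dense BF16 and 3.35 TB/s of HBM bandwidth) and a deliberately simplified cost model: roughly 2 FLOPs per parameter per sequence, with the weights read once per batch. The function names are illustrative, and KV-cache traffic is ignored.

```python
# Back-of-envelope roofline for decode. Hardware figures are approximate
# spec-sheet assumptions, not measurements from any real fleet.
PEAK_FLOPS = 989e12       # assumed dense BF16 tensor throughput, FLOP/s
HBM_BANDWIDTH = 3.35e12   # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2       # FP16/BF16 weights


def decode_intensity(batch_size, bytes_per_param=BYTES_PER_PARAM):
    """FLOPs per byte of weights read during decode.

    Assumes ~2 FLOPs per parameter per sequence and one weight read shared
    by the whole batch. KV-cache reads are ignored in this sketch.
    """
    return 2.0 * batch_size / bytes_per_param


def ridge_point():
    """Arithmetic intensity at which compute, not memory, becomes the limit."""
    return PEAK_FLOPS / HBM_BANDWIDTH


def is_memory_bound(batch_size):
    return decode_intensity(batch_size) < ridge_point()


for batch in (1, 32, 512):
    print(batch, decode_intensity(batch), is_memory_bound(batch))
```

Under these assumptions the ridge point sits near 295 FLOPs per byte, so decode at batch size 1 is two orders of magnitude short of it. That is why batching is the lever and a faster compute part is not.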

PRO TIP

A 30-second triage rule for any GPU workload: graph DCGM_FI_PROF_PIPE_TENSOR_ACTIVE against DCGM_FI_PROF_DRAM_ACTIVE. Tensor high, DRAM moderate → compute-bound, you are using the GPU well. DRAM high, tensor low → memory-bound, optimize bytes-moved (batching, quantization, KV cache) before buying hardware. Both low while GPU_UTIL is 100% → you have a stall (small kernels, launch overhead, host-side bottleneck, or CPU starvation in the dataloader).
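
The triage rule can be written down as a small function. This is a minimal sketch: the thresholds for "high" and "low" are hypothetical cut-offs chosen for illustration, not values from NVIDIA, and you would tune them for your own workloads.

```python
# Hypothetical thresholds; tune them for your own fleet.
TENSOR_HIGH = 0.4   # tensor pipe busy enough to call the GPU well used
DRAM_HIGH = 0.6     # memory interface busy enough to be the bottleneck
LOW = 0.1           # effectively idle


def triage(gpu_util, tensor_active, dram_active):
    """Classify a GPU from GPU_UTIL (percent) and two PROF ratios (0..1)."""
    if tensor_active >= TENSOR_HIGH:
        return "compute-bound"
    if dram_active >= DRAM_HIGH:
        return "memory-bound"
    if gpu_util >= 90 and tensor_active < LOW and dram_active < LOW:
        return "stalled"
    return "inconclusive"


# The worked example from earlier in the post.
print(triage(gpu_util=100, tensor_active=0.04, dram_active=0.71))  # memory-bound
```

Anything that lands in "inconclusive" deserves a look at SM_ACTIVE and SM_OCCUPANCY before drawing conclusions.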

The DCGM stack

DCGM (Data Center GPU Manager) is NVIDIA's GPU telemetry and health library. The piece you deploy is dcgm-exporter: a daemon that reads DCGM's metrics and exposes them in Prometheus format on :9400/metrics. It runs as a DaemonSet — one per GPU node — and on Kubernetes it is shipped and managed by the NVIDIA GPU Operator, the same operator that manages the driver stack, device plugin, and MIG configuration.

The data path:

NVIDIA driver + DCGM library
        │  (samples SM/tensor/memory/thermal counters)
        ▼
dcgm-exporter  (DaemonSet, one pod per GPU node, :9400/metrics)
        │  (Prometheus exposition format, one series per GPU per metric)
        ▼
Prometheus  (scrapes via ServiceMonitor / PodMonitor)
        │
        ▼
Grafana + Alertmanager

The killer feature on Kubernetes: with the kubernetes mapping enabled, dcgm-exporter attaches pod, namespace, and container labels to every GPU metric by correlating the GPU device with the pod that holds the device-plugin allocation. That turns "GPU 3 is hot" into "the llama-serve pod in team-b is cooking GPU 3" — per-tenant GPU attribution, which is the foundation of both incident response and cost allocation.
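
Here is roughly what that attribution looks like once the labels are present. The sample lines are hypothetical and trimmed (a real exporter emits more labels), and the regex parser is a simplification that does not handle escaped quotes. Treat it as a sketch of the grouping idea rather than a replacement for a real Prometheus client.

```python
import re
from collections import defaultdict

# Hypothetical, trimmed exposition lines; real output carries more labels.
SAMPLE = """\
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="2",Hostname="node-a",namespace="team-b",pod="llama-serve-0"} 0.04
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="3",Hostname="node-a",namespace="team-b",pod="llama-serve-0"} 0.06
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",Hostname="node-b",namespace="team-a",pod="trainer-0"} 0.55
"""

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(labels)), float(value)


def mean_by_pod(text, metric):
    """Average one metric across GPUs, keyed by (namespace, pod)."""
    values = defaultdict(list)
    for name, labels, value in parse(text):
        if name == metric:
            key = (labels.get("namespace", ""), labels.get("pod", ""))
            values[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in values.items()}


for key, mean in mean_by_pod(SAMPLE, "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE").items():
    print(key, round(mean, 3))
```

In production you would do the same grouping in PromQL; the point is that the pod and namespace labels are what make the grouping possible at all.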

The metrics that actually matter

You do not want all of DCGM's ~100 fields on every dashboard. Here is the production-grade shortlist, grouped by what question it answers.

Real compute work (not the lie):

DCGM_FI_PROF_SM_ACTIVE          # breadth: fraction of SM-time with a live warp
DCGM_FI_PROF_SM_OCCUPANCY       # depth: how full those SMs are
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE # tensor core activity — the money metric
DCGM_FI_PROF_DRAM_ACTIVE        # HBM bandwidth utilization

Memory capacity (ties directly to OOM and KV cache):

DCGM_FI_DEV_FB_USED             # framebuffer (HBM) used, MiB
DCGM_FI_DEV_FB_FREE             # framebuffer free, MiB
DCGM_FI_DEV_FB_TOTAL            # total HBM, MiB

Thermal, power, and throttling (the silent throughput killers):

DCGM_FI_DEV_GPU_TEMP                  # core temp, °C
DCGM_FI_DEV_MEMORY_TEMP               # HBM temp, °C (often the first to throttle)
DCGM_FI_DEV_POWER_USAGE               # watts drawn
DCGM_FI_DEV_ENFORCED_POWER_LIMIT      # the cap it is being held to
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS    # bitmask: WHY clocks were reduced

Health and hardware faults (the page-the-human metrics):

DCGM_FI_DEV_XID_ERRORS                # last XID error code — hardware/driver faults
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL         # correctable ECC errors (single-bit)
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL         # uncorrectable ECC errors (double-bit) — serious

Interconnect (matters for multi-GPU / tensor parallelism):

DCGM_FI_PROF_PCIE_TX_BYTES / RX_BYTES     # host <-> GPU traffic
DCGM_FI_PROF_NVLINK_TX_BYTES / RX_BYTES   # GPU <-> GPU traffic

Two of these deserve special attention because they catch failure modes nothing else does.

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is a bitmask. When a GPU mysteriously loses throughput with no code change, this is usually why: the clocks got pulled back. The bits that matter:

  • HW_THERMAL / SW_THERMAL — the card is too hot and is protecting itself. Common cause: a failing fan, a hot rack, or a noisy neighbor in an adjacent slot. Your tokens-per-second drops 30% and your latency SLO breaks, with zero application-level signal.
  • HW_POWER_BRAKE / SW_POWER_CAP — hitting the power limit. Often a datacenter power-capping policy you did not know existed.
  • SYNC_BOOST — held back to stay in lockstep with peer GPUs in a sync group.
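
Because the value is a bitmask, it helps to decode it before acting on it. The sketch below uses the bit positions from NVML's clock throttle reason constants; the helper names and the choice of which bits count as benign are assumptions for illustration. Note that the mask is also non-zero when a GPU is simply idle, so "non-zero" and "throttled" are not the same thing.

```python
# Bit values follow NVML's nvmlClocksThrottleReason* constants.
THROTTLE_BITS = {
    0x001: "GPU_IDLE",
    0x002: "APPLICATIONS_CLOCKS_SETTING",
    0x004: "SW_POWER_CAP",
    0x008: "HW_SLOWDOWN",
    0x010: "SYNC_BOOST",
    0x020: "SW_THERMAL",
    0x040: "HW_THERMAL",
    0x080: "HW_POWER_BRAKE",
    0x100: "DISPLAY_CLOCK_SETTING",
}

# Assumption for this sketch: these two reasons are expected, not faults.
BENIGN = {"GPU_IDLE", "APPLICATIONS_CLOCKS_SETTING"}


def decode(mask):
    """Return the names of all reason bits set in the mask."""
    return [name for bit, name in THROTTLE_BITS.items() if int(mask) & bit]


def is_real_throttle(mask):
    """True when at least one non-benign reason bit is set."""
    return any(reason not in BENIGN for reason in decode(mask))


print(decode(0x44))             # ['SW_POWER_CAP', 'HW_THERMAL']
print(is_real_throttle(0x01))   # False: the GPU is only idle
print(is_real_throttle(0x40))   # True: hardware thermal slowdown
```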

DCGM_FI_DEV_XID_ERRORS surfaces NVIDIA XID codes — the GPU's equivalent of a kernel oops. A few you must recognize:

  • XID 79 — "GPU has fallen off the bus." The GPU is gone. The node needs to be cordoned and the hardware checked. Without this metric, you find out when pods start CrashLooping with opaque CUDA errors.
  • XID 48 / 63 / 64 — double-bit ECC error and page retirement. The memory is degrading.
  • XID 13 — graphics engine exception, frequently an application bug (illegal memory access) but sometimes hardware.

XID errors are the difference between "we proactively drained a dying node" and "an entire training run wasted 14 hours before someone noticed the loss had gone NaN."
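
A runbook helper for the codes above might look like the following. It is a minimal sketch: the table contains only the XIDs discussed in this post, the wording of each action is my own summary, and anything else is deliberately left unrecognized rather than guessed at.

```python
# Only the XID codes discussed above; everything else is left unrecognized.
XID_TRIAGE = {
    13: "graphics engine exception: suspect an application bug first, then hardware",
    48: "double-bit ECC error: memory is degrading",
    63: "ECC page retirement event: memory is degrading",
    64: "ECC page retirement event: memory is degrading",
    79: "GPU has fallen off the bus: cordon the node and check the hardware",
}


def triage_xid(value):
    """Map the DCGM_FI_DEV_XID_ERRORS gauge value to a runbook hint."""
    code = int(value)  # Prometheus sample values arrive as floats
    if code == 0:
        return "no XID reported"
    fallback = f"XID {code}: not in this table, check NVIDIA's XID documentation"
    return XID_TRIAGE.get(code, fallback)


print(triage_xid(79.0))
print(triage_xid(0))
```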

WARNING

The PROF (profiling) metrics — SM_ACTIVE, SM_OCCUPANCY, PIPE_TENSOR_ACTIVE, DRAM_ACTIVE — are the most valuable ones, and they are not in dcgm-exporter's default counter set on every install. They require DCGM's profiling subsystem, which carries a small sampling overhead and, on some older GPUs, cannot run concurrently with an external profiler like Nsight. You must explicitly include them in the exporter's counters CSV. If your PIPE_TENSOR_ACTIVE series is missing, this is why — the metric isn't broken, it was never enabled.

Deploying it on Kubernetes

If you run the GPU Operator, dcgm-exporter is already there as a DaemonSet. You customize which metrics it exports with a ConfigMap of counters and a reference in the ClusterPolicy.

The counters ConfigMap (CSV: field, type, help text). This is where you add the profiling metrics that aren't on by default:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-custom-counters
  namespace: gpu-operator
data:
  # Key name follows the GPU Operator docs, which mount this file at
  # /etc/dcgm-exporter/dcgm-metrics.csv. Verify against your operator version.
  dcgm-metrics.csv: |
    # A custom file replaces the default counter set, so list every field you query.
    # Coarse utilization (kept so it can be contrasted with the PROF metrics)
    DCGM_FI_DEV_GPU_UTIL,            gauge, percent of time a kernel was executing
    # Real compute work
    DCGM_FI_PROF_SM_ACTIVE,          gauge, fraction of SM-time with an active warp
    DCGM_FI_PROF_SM_OCCUPANCY,       gauge, fraction of resident warps vs max
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, fraction of cycles the tensor pipe was active
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, fraction of cycles the memory interface was active
    # Memory capacity
    DCGM_FI_DEV_FB_USED,             gauge, framebuffer memory used (MiB)
    DCGM_FI_DEV_FB_FREE,             gauge, framebuffer memory free (MiB)
    # Thermal / power / throttle
    DCGM_FI_DEV_GPU_TEMP,            gauge, GPU temperature (C)
    DCGM_FI_DEV_MEMORY_TEMP,         gauge, HBM temperature (C)
    DCGM_FI_DEV_POWER_USAGE,         gauge, power draw (W)
    DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, gauge, current clock throttle reasons bitmask
    # Health
    DCGM_FI_DEV_XID_ERRORS,          gauge, last XID error code
    DCGM_FI_DEV_ECC_DBE_VOL_TOTAL,   counter, total uncorrectable ECC errors
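
Since a missing profiling field fails silently (the series simply never appears), a small pre-deploy check on the counters CSV can save a confused afternoon. The sketch below is a hypothetical helper, not part of dcgm-exporter: it parses the field, type, help text layout described above and reports which of the four profiling fields are absent.

```python
REQUIRED_PROF_FIELDS = {
    "DCGM_FI_PROF_SM_ACTIVE",
    "DCGM_FI_PROF_SM_OCCUPANCY",
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
    "DCGM_FI_PROF_DRAM_ACTIVE",
}


def parse_counters(csv_text):
    """Return {field_name: metric_type} from a counters CSV, skipping comments."""
    fields = {}
    for raw in csv_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) < 2:
            continue  # malformed row; ignored in this sketch
        fields[parts[0]] = parts[1]
    return fields


def missing_profiling_fields(csv_text):
    """List required PROF fields that the CSV does not enable."""
    return sorted(REQUIRED_PROF_FIELDS - set(parse_counters(csv_text)))


example = """\
# Real compute work
DCGM_FI_PROF_SM_ACTIVE,          gauge, fraction of SM-time with an active warp
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, fraction of cycles the tensor pipe was active
DCGM_FI_DEV_GPU_TEMP,            gauge, GPU temperature (C)
"""
print(missing_profiling_fields(example))
```

Running it in CI against the ConfigMap contents is one way to keep a later edit from quietly dropping the metrics the dashboards depend on.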

Point the GPU Operator at it:

# In the NVIDIA GPU Operator ClusterPolicy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  dcgmExporter:
    enabled: true
    config:
      name: dcgm-custom-counters   # the ConfigMap above
    # Map GPU metrics to the pod that holds the GPU — per-tenant attribution
    env:
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"

Scrape it with a Prometheus Operator ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    release: kube-prometheus-stack   # match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: gpu-metrics
    interval: 15s
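    metricRelabelings:
    # exported_* labels exist only on scraped samples, so promote them with
    # metric relabeling (target relabelings run before the scrape).
    - sourceLabels: [exported_pod]
      regex: (.+)
      targetLabel: pod
    - sourceLabels: [exported_namespace]
      regex: (.+)
      targetLabel: namespace

The exported_pod / exported_namespace relabeling matters: by default the exporter's pod attribution lands on exported_* labels, because Prometheus already attaches its own pod and namespace labels for the exporter daemon itself and renames the conflicting scraped ones. Those exported_* labels exist only on scraped samples, which is why the promotion uses metricRelabelings rather than target relabelings. Promote them so your queries can group GPU work by the workload pod, not the exporter pod.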

The PromQL that earns its keep

A handful of queries cover most of what you need.

Find GPUs that are "100% utilized" but doing no real work — the procurement-ticket trap:

# High GPU_UTIL but tensor cores idle: a stall or a memory-bound workload
DCGM_FI_DEV_GPU_UTIL > 90
  and on (gpu, Hostname) DCGM_FI_PROF_PIPE_TENSOR_ACTIVE < 0.1

HBM pressure per pod — the early warning for the CUDA OOMs from the fragmentation post:

# Fraction of HBM used, grouped by the workload pod
sum by (namespace, pod) (DCGM_FI_DEV_FB_USED)
  / sum by (namespace, pod) (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

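Detect thermal/power throttling. The bitmask is also non-zero when a GPU is merely idle (bit 0x1) or held to application clocks (bit 0x2), so compare against 4 to require at least one higher reason bit:

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS >= 4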

Fleet-wide real utilization for capacity planning — average tensor activity, the honest version of "how busy are the GPUs":

avg(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)

Alerts that page on real failures

Most teams alert on the wrong GPU signals (or none). The ones worth a page:

groups:
- name: gpu-health
  rules:
  # A GPU has hard-faulted. Drain the node.
  - alert: GPUXidError
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    for: 1m
    labels: { severity: critical }
    annotations:
      summary: "XID {{ $value }} on GPU {{ $labels.gpu }} / {{ $labels.Hostname }}"
      description: "Hardware/driver fault. Cordon the node and investigate."

  # Uncorrectable memory errors — the GPU is degrading.
  - alert: GPUUncorrectableECC
    expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0
    for: 1m
    labels: { severity: critical }

  # Silent throughput killer: clocks pulled back for thermal/power reasons.
  # >= 4 skips the idle (0x1) and application-clocks (0x2) bits, which are not faults.
  - alert: GPUThrottling
    expr: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS >= 4
    for: 5m
    labels: { severity: warning }
    annotations:
      summary: "GPU {{ $labels.gpu }} throttled: throughput is silently degraded"

  # Paying for idle silicon: high GPU_UTIL, near-zero real compute, for a while.
  - alert: GPUExpensiveAndIdle
    expr: |
      avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) > 80
        and on (gpu, Hostname)
      avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[15m]) < 0.05
    for: 15m
    labels: { severity: warning }
    annotations:
      summary: "GPU looks busy but tensor cores are idle — investigate stall or right-size"

That last alert, GPUExpensiveAndIdle, is the one that pays for the whole stack. It is the difference between a $40K/month H100 doing work and a $40K/month H100 keeping a kernel warm while it waits. The cost framing of that gap is the entire prompt economics story, viewed from the hardware side.

The MIG wrinkle

If you run MIG, GPU-level monitoring is actively misleading. nvidia-smi's top-line numbers report at the physical-GPU level, but your workloads live in MIG instances. DCGM exposes per-instance metrics with GPU_I_ID / GPU_I_PROFILE labels, so you can see that the 1g.20gb instance serving team A is at 80% tensor-active while team C's instance next door is idle. Without per-instance DCGM metrics, multi-tenant MIG is a black box and per-tenant cost attribution — the main reason to run MIG — is impossible. This is the monitoring half of that decision.
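
A tiny example shows how a physical-GPU average hides the per-instance picture. The samples below are hypothetical values shaped like the scenario above, using the GPU_I_ID and GPU_I_PROFILE label names; the helper functions are illustrative.

```python
from collections import defaultdict

# Hypothetical per-instance samples for one physical GPU split with MIG.
samples = [
    {"gpu": "0", "GPU_I_ID": "1", "GPU_I_PROFILE": "1g.20gb", "tensor_active": 0.80},
    {"gpu": "0", "GPU_I_ID": "2", "GPU_I_PROFILE": "1g.20gb", "tensor_active": 0.00},
]


def per_gpu_mean(rows):
    """Average across instances: what a physical-GPU view would show."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["gpu"]].append(row["tensor_active"])
    return {gpu: sum(vals) / len(vals) for gpu, vals in grouped.items()}


def per_instance(rows):
    """Keep each MIG instance separate, keyed by (gpu, GPU_I_ID)."""
    return {(row["gpu"], row["GPU_I_ID"]): row["tensor_active"] for row in rows}


print(per_gpu_mean(samples))   # one blended number per physical GPU
print(per_instance(samples))   # one number per tenant instance
```

The blended figure describes neither tenant: one instance is busy, the other is idle, and only the per-instance view says which is which.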

WAR STORY

A team I worked with ran a 24-GPU H100 inference fleet and graphed exactly one GPU metric on their main dashboard: DCGM_FI_DEV_GPU_UTIL. It sat at 95–100% around the clock, so the standing assumption was "we are GPU-bound, we need more capacity." A $1.2M expansion was drafted. Before signing, someone added the profiling metrics. PIPE_TENSOR_ACTIVE averaged 0.06. DRAM_ACTIVE averaged 0.74. The fleet was memory-bandwidth-bound and the compute units were starving — the batch sizes were tiny because a misconfigured client was sending one request at a time with no batching window. They fixed the batching, tensor-active climbed to 0.55, throughput nearly tripled on the same hardware, and the expansion was shelved. Separately, the new dashboard immediately caught two GPUs that had been silently thermal-throttling for weeks behind a failing chassis fan — slow, invisible, and quietly breaking the p99 SLO on whatever landed on those nodes. One metric hid all of it. Four metrics exposed all of it.

Common mistakes

Graphing only GPU_UTIL. It is near-100% under any steady load and tells you nothing about whether the GPU is doing useful work. It is the most-watched and least-informative GPU metric in existence.

Not enabling the profiling counters. SM_ACTIVE, PIPE_TENSOR_ACTIVE, and DRAM_ACTIVE are the entire point and they are not always on by default. A dashboard without them is a speedometer that only shows "engine on."

No throttle-reason alert. Thermal and power throttling silently steal 20–40% of throughput with zero application-level signal. Latency SLOs break and nobody knows why. The signal is one gauge away.

No XID alerting. A GPU "falling off the bus" (XID 79) manifests as cryptic CUDA errors and CrashLoops minutes later, far from the root cause. The XID metric names the fault directly, at the moment it happens.

Losing pod attribution. Without the kubernetes mapping and the exported_pod relabeling, GPU metrics are anonymous. "A GPU is hot" is not actionable; "the team-b/llama-serve pod is hot" is.

Forgetting MIG instances. GPU-level metrics on a MIG node average across isolated tenants and tell you nothing useful. Use the per-instance series.

Scraping too aggressively. A 1-second scrape on the profiling metrics adds measurable overhead and rarely buys insight over 15s. GPU thermals and utilization trends do not need sub-second resolution.

Confusing power draw with utilization. A GPU can pull near its power limit while doing low-value work, and a memory-bound workload can be the bottleneck while drawing moderate power. Power is a useful corroborating signal, not a work measurement.

The mental model

nvidia-smi utilization is a presence detector: it tells you a kernel was in the room, not whether it was doing anything. For interactive desktop GPU work that distinction rarely mattered, which is why the metric was good enough for a decade. For datacenter LLM serving and training, where the workload deliberately keeps a stream of small kernels resident, the distinction is everything — and the metric is wrong in the specific, expensive direction of "looks busy, isn't."

The fix is not more dashboards. It is the right four numbers: how broadly the SMs are engaged (SM_ACTIVE), how deeply (SM_OCCUPANCY), whether the tensor cores — the part you actually pay for — are working (PIPE_TENSOR_ACTIVE), and whether memory bandwidth is the real bottleneck (DRAM_ACTIVE). Those four, plus HBM-used, throttle reasons, and XID errors, are a complete production GPU observability posture. Everything else is detail.

Once you can see real compute versus memory bandwidth, the rest of the GPU stack stops being guesswork. You know whether to batch or buy. You know whether a slow node is throttling or dying. You know which tenant is using what, and you can bill it. And you stop signing procurement orders for GPUs that are already there — they were just never doing the work the dashboard claimed.

The dashboards looked fine. That was always the problem. Now you can build dashboards that are actually fine.


GPU telemetry with DCGM, the profiling metrics that distinguish compute-bound from memory-bound workloads, per-MIG-instance monitoring, throttle and XID fault detection, and wiring dcgm-exporter into Prometheus and Grafana on Kubernetes are covered in depth in the Production GPU Infrastructure on Kubernetes course. The serving-side metrics (queue depth, batch occupancy, TTFT) that sit on top of these hardware signals are the spine of the LLM Inference on Kubernetes course, and the cost-attribution side lives in GPU Cost Optimization.

Related reading:

  • MIG vs Time-Slicing: why per-instance metrics are non-negotiable on shared GPUs.
  • Tuning vLLM gpu_memory_utilization: the HBM-pressure tuning these metrics feed.
  • GPU Memory Fragmentation: why FB_USED and free memory diverge.
  • Prompt Economics: the dollar cost of every idle tensor core.
  • "Your HPA Scales LLM Pods on CPU. They're Either Idle or On Fire.": why GPU utilization is the wrong autoscaling signal and which serving-layer metrics to scale on instead.
