Every post about GPU incidents starts with 'the dashboards looked fine.' That's the problem. nvidia-smi GPU utilization tells you a kernel ran — not whether the silicon is doing work. The metrics that actually matter, the DCGM + Prometheus stack that exposes them, and the queries and alerts that catch real GPU failures.
MIG is hardware partitioning. Time-slicing is software multiplexing. They are not interchangeable. The production decision walk-through, the H100 profile math, the GPU Operator config, and the migration path most teams hit.
The model weights are 16GB. The KV cache is 20GB. The A100 has 80GB. nvidia-smi shows 50GB free. The next request OOMs. The CUDA memory allocator's fragmentation story most ML engineers never learn.