Your H100 Serves Three Teams Now. MIG or Time-Slicing? Pick Wrong and the Answer Hurts.
MIG is hardware partitioning. Time-slicing is software multiplexing. They are not interchangeable. The production decision walk-through, the H100 profile math, the GPU Operator config, and the migration path most teams hit.
You have one H100 and three teams that want to use it.
Team A runs a 7B inference service that needs about 20 GB of HBM and steady throughput. Team B runs ad-hoc Jupyter notebooks for ML experiments. Team C runs nightly batch jobs that occasionally need the full GPU.
The Kubernetes answer to "share a GPU" is two completely different mechanisms with the same goal. Most platform teams pick one without understanding the other, then get bitten by the choice six months later when a production incident exposes the assumption.
MIG (Multi-Instance GPU) is hardware partitioning. The silicon is sliced into smaller GPUs. Workloads in different slices cannot touch each other.
Time-slicing is software multiplexing. The device plugin advertises one GPU as several, and the scheduler places multiple pods onto it. They take turns. They share memory. There is no isolation.
The same word ("share") describes radically different things. This post is the mechanics, the production decision, the migration path most teams hit, and the operational gotchas that turn this from a config knob into an incident.
What MIG actually does
MIG, introduced on the A100 in 2020 and expanded on H100/H200/B200, is hardware-level GPU partitioning. NVIDIA's silicon designers carved the GPU into independent execution units (SM slices, memory channels, L2 cache slices) that can be assigned to up to seven separate instances.
On an 80GB H100, the available MIG profiles are:
- 1g.10gb: 1 compute slice, 10 GB HBM. Smallest unit; 7 of these per GPU.
- 1g.20gb: 1 compute slice, 20 GB HBM. Same compute, more memory.
- 2g.20gb: 2 compute slices, 20 GB HBM.
- 3g.40gb: 3 compute slices, 40 GB HBM.
- 4g.40gb: 4 compute slices, 40 GB HBM.
- 7g.80gb: Full GPU (degenerate case; effectively no MIG).
You configure a GPU into a fixed partition layout (e.g., "seven 1g.10gb instances," or "one 3g.40gb plus one 4g.40gb"). The layout cannot be changed without a GPU mode reset (disruptive: any running workload dies). Newer NVIDIA GPU Operator versions automate the reset via labels, but the disruption is real.
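The layout arithmetic can be sketched in a few lines of Python. This is a simplified model, not NVIDIA's placement logic: it assumes an 80 GB H100 exposes 7 compute slices and eight 10 GB memory slices, and it ignores the placement rules the driver also enforces. A layout that passes this check is plausible, not guaranteed. The names `PROFILES`, `layout_fits`, and `max_instances` are illustrative.

```python
# Slice cost per profile on an 80 GB H100: (compute slices, 10 GB memory slices).
PROFILES = {
    "1g.10gb": (1, 1),
    "1g.20gb": (1, 2),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7  # assumption: 7 compute slices per GPU
MAX_MEMORY_SLICES = 8   # assumption: 8 memory slices of 10 GB each


def layout_fits(layout: dict) -> bool:
    """Return True if the layout stays within the GPU's slice budget.

    `layout` maps profile name -> instance count.
    Simplified: ignores the driver's placement constraints.
    """
    compute = sum(PROFILES[p][0] * n for p, n in layout.items())
    memory = sum(PROFILES[p][1] * n for p, n in layout.items())
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


def max_instances(profile: str) -> int:
    """Largest number of identical instances of `profile` on one GPU."""
    compute, memory = PROFILES[profile]
    return min(MAX_COMPUTE_SLICES // compute, MAX_MEMORY_SLICES // memory)


if __name__ == "__main__":
    for name in PROFILES:
        print(name, max_instances(name))
    print(layout_fits({"3g.40gb": 1, "4g.40gb": 1}))
```

Under this model the memory budget, not the compute budget, is what caps 1g.20gb: each instance takes two memory slices, so only four fit on one GPU even though seven compute slices exist.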
Inside each instance:
- Memory is isolated. A workload in instance 1 cannot read or write memory in instance 2. A memory leak in one workload cannot starve another.
- Compute is isolated. A workload in instance 1 cannot consume more than its allocated SM slices, even if other instances are idle.
- L2 cache is partitioned. Cache misses in one instance do not pollute another's working set.
- Failures are isolated. A CUDA kernel that hangs in instance 1 does not affect instances 2 through 7.
The Kubernetes representation (with the device plugin's "mixed" MIG strategy) is a separate resource per profile. Pods request nvidia.com/mig-1g.10gb: 1 (or mig-2g.20gb, etc.). The scheduler picks a node with that profile available.
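As a minimal sketch of what such a request looks like, the helper below builds a Pod manifest as a Python dict and prints it as JSON, which kubectl accepts. It assumes the per-profile resource names described above are being advertised. The pod name, image, and the function name `mig_pod_manifest` are hypothetical.

```python
import json


def mig_pod_manifest(name: str, image: str, profile: str) -> dict:
    """Build a Pod manifest that requests exactly one MIG instance of `profile`."""
    resource = f"nvidia.com/mig-{profile}"  # e.g. nvidia.com/mig-1g.20gb
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested via limits.
                    "resources": {"limits": {resource: 1}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical names; pipe the output to `kubectl apply -f -`.
    manifest = mig_pod_manifest(
        "team-a-inference", "example.com/team-a/llm-server:latest", "1g.20gb"
    )
    print(json.dumps(manifest, indent=2))
```

The only line that differs from a full-GPU pod is the resource name, which is why the scheduler can route by profile without any extra logic.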
This is the strongest isolation guarantee NVIDIA offers on a single GPU. Two tenants running in two MIG instances are, from the silicon's perspective, on different GPUs.
What time-slicing actually does
Time-slicing is implemented by the NVIDIA device plugin, not the GPU silicon. The plugin advertises a single physical GPU as N "shares" (typically 2 to 16). The scheduler then places up to N pods onto the GPU. Each pod sees the full GPU and the full HBM.
When multiple pods run kernels, the GPU's compute scheduler interleaves them, roughly round-robin. One pod runs for a few milliseconds, then yields. The next runs. And so on.
What time-slicing does NOT do:
- No memory isolation. All pods share the same HBM. Pod A can allocate 60 GB; pod B then fails with CUDA OOM. The pods do not know about each other; the OOM looks like the pod's own fault.
- No compute isolation. Pod A can launch enormous kernels that monopolize the GPU. Pod B's latency spikes during pod A's burst.
- No fairness guarantee. The "round-robin" is naive. Pods with longer kernels effectively get more compute. NVIDIA's MPS (Multi-Process Service) helps but is a separate feature with its own complications.
- No failure isolation. A CUDA crash in pod A can affect pod B (rare but documented; depends on driver version and crash type).
The Kubernetes representation is unchanged: pods still request nvidia.com/gpu: 1. The plugin just lies about how many GPUs exist (one physical GPU shows up as N).
The pattern this resembles is CPU oversubscription in a container runtime without cgroup quotas. Multiple processes run, the OS interleaves them, nobody is isolated, and "it works on my laptop" is the typical experience.
The most important sentence in this post: MIG isolates at the silicon level. Time-slicing does not isolate at all. Treat them as different products, not different settings of the same product. The word "share" hides the difference and is the source of most production incidents in this space.
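The memory half of that difference can be shown with a toy model. This is not CUDA and it does not talk to a GPU. It only contrasts one shared pool (time-slicing) with fixed per-tenant caps (MIG), using nominal sizes. The pod names and function names are made up for illustration.

```python
def allocate_shared(total_gb: float, requests: list) -> dict:
    """Time-slicing model: one HBM pool, first come first served, no quotas."""
    free = total_gb
    results = {}
    for pod, gb in requests:
        if gb <= free:
            free -= gb
            results[pod] = "ok"
        else:
            # The pod that arrives second fails, even if its request was modest.
            results[pod] = "CUDA OOM"
    return results


def allocate_partitioned(instance_gb: float, requests: list) -> dict:
    """MIG model: every pod has its own instance with a fixed memory cap."""
    return {pod: ("ok" if gb <= instance_gb else "CUDA OOM") for pod, gb in requests}


if __name__ == "__main__":
    requests = [("batch", 60), ("inference", 25)]
    print(allocate_shared(80, requests))       # the well-behaved pod is the one that fails
    print(allocate_partitioned(40, requests))  # the greedy pod fails inside its own instance
```

In the shared model the failure lands on the neighbor. In the partitioned model it lands on the workload that asked for too much, which is where you want it.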
When MIG is the right answer
MIG wins for any production workload where one tenant's behavior must not affect another tenant.
Multi-tenant inference platforms. You serve 7B-class models for five product teams. Each team gets a 1g.20gb MIG instance on shared H100s. Team A's traffic spike is invisible to teams B-E. p99 latency for each team is independent. SLOs are achievable per team.
Latency-sensitive workloads with hard SLOs. A real-time inference service that must hit 50ms p99 cannot tolerate compute interference from a co-located batch job. MIG guarantees the SMs and memory are dedicated.
Cost attribution. Each MIG instance is a discrete resource. Team A consumed 1000 hours of mig-1g.20gb; bill them for it. Time-slicing makes per-team cost impossible to compute (no isolation, no way to know who used what compute).
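One way to turn instance-hours into a bill is sketched below. The pricing rule is an assumption, not something the platform provides: it charges each instance the GPU's hourly price divided by the maximum number of instances of that profile that fit on one 80 GB H100, so a fully packed GPU recovers its full cost. The 4.00 USD hourly price and the team names are placeholders.

```python
# Maximum instances of each profile on one 80 GB H100.
MAX_INSTANCES_PER_GPU = {
    "1g.10gb": 7,
    "1g.20gb": 4,
    "2g.20gb": 3,
    "3g.40gb": 2,
    "4g.40gb": 1,
    "7g.80gb": 1,
}


def instance_hour_rate(profile: str, gpu_hourly_usd: float) -> float:
    """Price of one instance-hour under the 'fully packed GPU' assumption."""
    return gpu_hourly_usd / MAX_INSTANCES_PER_GPU[profile]


def team_bills(usage_hours: dict, gpu_hourly_usd: float) -> dict:
    """usage_hours maps team -> {profile: instance-hours consumed}."""
    return {
        team: sum(
            hours * instance_hour_rate(profile, gpu_hourly_usd)
            for profile, hours in per_profile.items()
        )
        for team, per_profile in usage_hours.items()
    }


if __name__ == "__main__":
    usage = {"team-a": {"1g.20gb": 1000}, "team-b": {"3g.40gb": 200}}
    print(team_bills(usage, gpu_hourly_usd=4.0))  # placeholder price
```

Other rules are defensible (for example, weighting by memory fraction); the point is that the input, instance-hours per profile per team, only exists when each tenant has a discrete instance.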
Small-model fleets. A 7B model at FP16 needs about 14 GB for weights; a 1g.20gb MIG instance holds it with room for KV cache. An 80 GB H100 fits four 1g.20gb instances, so one model per MIG instance gives you 4x density compared to one model per full GPU (7x is reachable only with 1g.10gb, which is too small for a 7B FP16 model). The cost economics are dramatic.
Predictable capacity planning. You know exactly how many concurrent workloads fit per node (the MIG profile decides). Capacity is countable, not approximate.
Compliance and isolation requirements. Regulated industries (finance, healthcare) often need demonstrable workload isolation. MIG provides hardware-level proof. Time-slicing provides nothing.
When time-slicing is the right answer
Time-slicing wins for workloads where isolation is not required and opportunistic GPU sharing matters more.
Development and experimentation clusters. Five ML engineers share a couple of GPUs for ad-hoc work. None of them needs the full GPU; they want to start a Jupyter notebook and get a GPU now. Time-slicing lets the scheduler pack pods onto whatever GPU has a free share, with no upfront partition planning.
Bursty batch workloads that rarely overlap. CI/CD model evaluation that runs for 10 minutes every hour. Three jobs occasionally need a GPU; rarely at the same time. Time-slicing lets them share without artificially partitioning.
Workloads that need the full GPU sometimes. A team that trains 70B models occasionally needs a full H100 for a multi-hour run, and the rest of the time runs smaller experiments. With time-slicing, a pod that runs alone gets the whole GPU, and the GPU is shared only when several pods are active. MIG cannot do this (the partition layout is fixed; you cannot dynamically merge MIG instances).
Internal tooling with relaxed SLOs. Internal LLM-powered tools where occasional latency spikes are acceptable. Cost matters more than predictability.
Cost-sensitive single-team environments. One team owns the GPU. Isolation between their own workloads is not needed. Time-slicing maximizes utilization across their internal workload mix.
The trap most teams fall into: using time-slicing for production multi-tenant inference because it is the easier default. Then two workloads spike at the same time, one OOMs the GPU, both fail. The other team's incident becomes your incident.
The GPU Operator angle
Both modes are configured via the NVIDIA GPU Operator (the production deployment pattern for managing the NVIDIA driver stack on Kubernetes). The relevant settings:
MIG configuration (per node label):
# Set the MIG profile via node label
$ kubectl label node $NODE nvidia.com/mig.config=all-balanced
# The GPU Operator's MIG manager reads the label and reconfigures
# the GPU(s) on that node to the requested layout.
# "all-balanced" on an 80 GB H100 = two 1g.10gb + one 2g.20gb + one 3g.40gb (default mig-parted config).
# "all-1g.10gb" = seven 1g.10gb instances (maximum density).
# Custom layouts are supported via mig-parted ConfigMap.
Time-slicing configuration:
# In the GPU Operator's ClusterPolicy
devicePlugin:
  config:
    name: time-slicing-config
    default: any
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4 # Advertise each GPU as 4 shares
The hybrid pattern most production fleets converge on:
- Node pool A: H100s with mig.config=all-1g.20gb for small-model inference (4 tenants per GPU).
- Node pool B: H100s with mig.config=all-3g.40gb for medium-model inference (2 tenants per GPU).
- Node pool C: H100s with no MIG, no time-slicing, for full-GPU workloads (large-model inference, training).
- Node pool D: A100s or smaller GPUs with time-slicing enabled for dev/CI workloads.
Pods request the specific resource they need. The scheduler routes them to the right pool via the resource name. Heterogeneity is the production reality.
The production migration path most teams hit
The typical evolution I see:
Phase 1: One team, one GPU each. Initial deployment. Each workload gets a dedicated GPU. Utilization is 20 to 40 percent. The bill arrives. Someone asks "can we share GPUs?"
Phase 2: Time-slicing turned on naively. Easy config change. Density doubles or triples. Utilization climbs. For weeks, everything looks great.
Phase 3: The first production incident. A batch job on the same GPU as a critical inference service allocates 60 GB of HBM during a training experiment. The inference service hits CUDA OOM. p99 latency for the inference service goes from 200ms to "the pod is restarting." Customers see 503s.
Phase 4: Realization that time-slicing has no isolation. Postmortem reads "we treated time-slicing as if it provided isolation; it does not." Team migrates critical workloads to MIG. Time-slicing stays for dev.
Phase 5: The hybrid steady state. Production pools on MIG. Dev/CI/batch pools on time-slicing or full-GPU. Sometimes a third pool for "full GPU, no sharing" for the largest workloads. Most production GPU fleets I have seen end up here.
If your team is in Phase 2, the incident is coming. The earlier you skip to Phase 5, the fewer postmortems you write.
A team I worked with ran a multi-tenant inference platform on 30 H100s with time-slicing (4 shares per GPU). They served 12 internal product teams. Average GPU utilization was 65 percent. Looked great on dashboards. Then a new team onboarded with a workload that occasionally allocated 50 GB of HBM during cold starts. Existing tenants started seeing CUDA OOMs at random. They blamed their own code. After two weeks of confusion (and three escalations), the platform team realized the new tenant's allocations were starving their neighbors. The fix: migrate the inference fleet to MIG with all-1g.20gb profiles, isolate each tenant in a dedicated 20 GB instance. Time-slicing kept for dev and CI. Incidents stopped. Per-tenant cost attribution became trivial. The lesson: time-slicing without isolation is not a sharing strategy, it is an undetected dependency between unrelated workloads.
Common mistakes
Treating time-slicing as "MIG-lite." They are different products solving different problems. Time-slicing is opportunistic packing without isolation; MIG is partitioning with isolation. Substituting one for the other is the source of most production incidents.
Choosing a MIG profile that is too small or too large. A 1g.10gb instance cannot hold a 13B FP16 model (26 GB). A 7g.80gb instance is the whole GPU. Pick the profile that fits the workload's memory and compute needs plus 20 to 30 percent headroom for KV cache, allocator overhead, and burst.
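The sizing rule above can be written down as a short calculation. This sketch uses nominal profile sizes on an 80 GB H100 (the usable memory inside an instance is somewhat lower in practice), counts only model weights, and applies a single headroom factor for everything else. The 25 percent default and the function names are illustrative choices.

```python
# (profile, nominal memory in GB), ordered from smallest to largest footprint.
PROFILE_MEMORY_GB = [
    ("1g.10gb", 10),
    ("1g.20gb", 20),
    ("2g.20gb", 20),
    ("3g.40gb", 40),
    ("4g.40gb", 40),
    ("7g.80gb", 80),
]


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes). FP16 is 2 bytes per parameter."""
    return params_billion * bytes_per_param


def smallest_profile(params_billion: float, bytes_per_param: float = 2,
                     headroom: float = 0.25):
    """Return the first profile whose nominal memory covers weights plus headroom.

    Returns None when even the full GPU is too small.
    """
    needed = weights_gb(params_billion, bytes_per_param) * (1 + headroom)
    for profile, memory_gb in PROFILE_MEMORY_GB:
        if memory_gb >= needed:
            return profile
    return None


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B FP16 ->", smallest_profile(size))
```

This only sizes memory. A workload that fits in 20 GB but needs more than one compute slice to meet its latency target should move up to 2g.20gb, which the memory check alone will not tell you.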
Putting too many tenants on time-slicing. Beyond about 4 shares per GPU, the round-robin overhead starts to hurt p99 latency for everyone. The configurable replicas number lies about how many workloads you can actually pack.
Forgetting that MIG profile changes require a GPU reset. Reconfiguring a node's MIG layout kills every workload on that node. Plan it like a node drain.
No DCGM per-instance monitoring. Standard nvidia-smi metrics report at the GPU level. For MIG, you need per-instance metrics (DCGM has them). Without per-instance monitoring, you cannot tell which tenant is consuming what.
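As a rough illustration of what "per-instance" means in practice, the sketch below groups framebuffer usage by MIG instance from Prometheus-style text. It assumes the exporter emits a `DCGM_FI_DEV_FB_USED` series carrying `gpu`, `GPU_I_ID`, and `GPU_I_PROFILE` labels; check the label names against your own exporter output. The sample lines and values are invented for the example.

```python
import re
from collections import defaultdict

# Matches lines of the form: metric_name{label="value",...} 123
LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Invented sample in Prometheus text format (values in MiB).
SAMPLE = """
DCGM_FI_DEV_FB_USED{gpu="0",GPU_I_ID="1",GPU_I_PROFILE="1g.20gb"} 15360
DCGM_FI_DEV_FB_USED{gpu="0",GPU_I_ID="2",GPU_I_PROFILE="1g.20gb"} 4096
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 55
"""


def fb_used_per_instance(exposition: str) -> dict:
    """Sum framebuffer usage per (gpu, MIG instance id, MIG profile)."""
    used = defaultdict(float)
    for raw in exposition.splitlines():
        match = LINE.match(raw.strip())
        if not match or match.group("name") != "DCGM_FI_DEV_FB_USED":
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        key = (labels.get("gpu"), labels.get("GPU_I_ID"), labels.get("GPU_I_PROFILE"))
        used[key] += float(match.group("value"))
    return dict(used)


if __name__ == "__main__":
    for key, mib in fb_used_per_instance(SAMPLE).items():
        print(key, mib)
```

In a real deployment the same grouping would normally be a Prometheus query rather than a parser; the key point is that the instance ID label has to be part of the grouping, or all tenants collapse into one GPU-level number.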
Mixing MIG and non-MIG workloads on the same node by accident. Once MIG is enabled on a GPU, that GPU is partitioned, period. You cannot also schedule a full-GPU workload on it. Plan node pools accordingly.
Trying to use NCCL across MIG instances. MIG instances are isolated; NCCL all-reduce across them does not work well (or at all, depending on the workload). Tensor parallelism for large models cannot span MIG instances on the same GPU.
Underestimating the cost-attribution win from MIG. In multi-tenant environments, the ability to bill per tenant by MIG instance hour is often worth the migration on its own. Time-slicing makes cost attribution effectively impossible.
The decision framework
Four questions, in order. Each answer narrows the choice.
Q1: Is this production with externally facing SLOs?
- Yes → MIG (or dedicated full GPU).
- No → either works; keep evaluating.
Q2: Do multiple tenants share the GPU?
- Yes, and they have independent SLOs → MIG.
- Yes, but they trust each other and accept interference → time-slicing OK.
- No, single tenant → time-slicing fine for opportunistic packing.
Q3: Does any single workload need more than one MIG instance worth of memory or compute?
- Yes → that workload needs a full GPU, not MIG. Either dedicated GPU or time-slicing where the workload can opportunistically claim the whole GPU.
- No → MIG profile sized to the workload.
Q4: Do you need per-workload cost attribution?
- Yes → MIG. Each instance is a billable unit.
- No → either works.
The summary that fits on a sticky note: production multi-tenant = MIG; dev/CI = time-slicing; large single workloads = dedicated GPU.
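The four questions can be encoded as a small function. This is one possible reading of the framework, not a definitive policy: it treats Q3 as a hard constraint and checks it first, because a workload that does not fit in a MIG instance cannot use MIG regardless of the other answers, and it collapses the "dedicated GPU or time-slicing" option for that case to a dedicated GPU. The parameter names are illustrative.

```python
def choose_sharing_mode(
    external_slo: bool,
    multi_tenant: bool,
    tenants_accept_interference: bool,
    exceeds_mig_instance: bool,
    needs_cost_attribution: bool,
) -> str:
    """Return 'MIG', 'time-slicing', or 'dedicated GPU'."""
    # Q3 (checked first as a hard constraint): the workload does not fit in an instance.
    if exceeds_mig_instance:
        return "dedicated GPU"
    # Q1: production with externally facing SLOs.
    if external_slo:
        return "MIG"
    # Q2: multiple tenants that need independent behavior.
    if multi_tenant and not tenants_accept_interference:
        return "MIG"
    # Q4: per-workload billing needs a discrete, countable unit.
    if needs_cost_attribution:
        return "MIG"
    # Everything else: opportunistic packing is fine.
    return "time-slicing"


if __name__ == "__main__":
    # Production multi-tenant inference.
    print(choose_sharing_mode(True, True, False, False, True))
    # Single-team dev cluster.
    print(choose_sharing_mode(False, False, True, False, False))
    # Occasional full-GPU training run.
    print(choose_sharing_mode(False, False, True, True, False))
```

Writing the rules down this way is mostly useful as a review artifact: if your team disagrees with a branch, that disagreement is the conversation to have before picking a mode.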
The mental model
MIG and time-slicing are not two settings of the same feature. They are two different products that NVIDIA happens to ship in the same driver stack and the same Kubernetes device plugin.
MIG is the GPU analog of running multiple VMs on a hypervisor with hardware-enforced memory isolation. Two VMs on the same physical machine are independent, and the hardware proves it.
Time-slicing is the GPU analog of running multiple processes on a single Linux box with no cgroup quotas. Processes coexist by convention, not by enforcement. One bad actor ruins it for everyone.
You would not run a multi-tenant compute platform on a Linux box without quotas in 2026. You should not run a multi-tenant GPU platform on time-slicing in 2026 either.
The right shape for most production GPU fleets is multi-mode: MIG for the inference tier where isolation is load-bearing, dedicated full GPUs for the training tier where workloads need everything, time-slicing for the dev tier where opportunistic packing is the point. Each mode does what it is good at. Mixing them on the same node is rarely the right move; node pools with explicit labels are.
If your fleet is single-mode today (everything on time-slicing or everything on dedicated full GPUs), the migration to a multi-mode fleet is one of the highest-ROI infrastructure moves you can make. Cost goes down, isolation goes up, capacity planning becomes tractable.
The decision framing matters more than the config syntax. Once you internalize "MIG isolates, time-slicing does not," every downstream choice (node pools, profiles, scheduling, monitoring, cost attribution) follows.
GPU sharing strategies, MIG profile selection per workload type, the NVIDIA GPU Operator deep dive, multi-tenant scheduling with taints and tolerations, and the migration patterns from time-slicing to MIG are covered in depth in the Production GPU Infrastructure on Kubernetes course. The cost-attribution and spot economics side lives in GPU Cost Optimization. The LLM-specific serving patterns that sit on top of these GPUs are the spine of the LLM Inference on Kubernetes course. Related reading: Your LLM Cluster Is at 90% HBM and 60% Is KV Cache for what happens when MIG-sized partitions are not enough, Tuning vLLM gpu_memory_utilization Without Breaking Production for the per-instance memory tuning that pairs with MIG, and Your GPU Dashboard Says 100% Utilized. It's Lying. for the per-MIG-instance DCGM monitoring without which multi-tenant GPU sharing is a black box.