Why Identity Is the Hardest Problem in DevOps
Your CI pipeline broke last week because GitHub rotated its OIDC issuer URL. Your AWS role's trust policy still expected the old one. The fix took 30 minutes; the investigation took 6 hours, because nobody on the team understood why a single string change could break production deploys cluster-wide.
Identity is the layer of infrastructure that decides who is allowed to do what. It sits underneath everything else: authentication for humans, service accounts for workloads, federation between clouds, certificates for service-to-service traffic, tokens for API calls. When it works, nobody notices. When it breaks, almost everything breaks at once, and the failure is rarely diagnosable from the symptom.
This lesson is the framing for the rest of the course: why identity is uniquely hard, why "just use OAuth" is not the answer, and which categories of bug show up over and over in production. Once you have the framing, the remaining 47 lessons fill in the layers.
The problem
Identity is the hardest production problem in DevOps for five concrete reasons. Each of them has cost real teams real outages.
1. Identity is the dependency of every other dependency. Authentication is in the request path of every API call. RBAC is in the path of every kubectl command. mTLS certs are in the path of every service-to-service call. When the database is slow, your app degrades. When the IdP is slow, everything degrades, because nothing can authenticate. When the IdP is down, nothing is even degraded; everything is just broken, including the tools you would use to fix it.
2. Identity bugs are silent until they are catastrophic. A misconfigured IAM role works fine until the day someone tries to assume it. A validator that pins or caches a JWT signing key validates fine until the IdP rotates that key, then every token starts failing at once. A certificate's notAfter is in the future until the moment it is not. Most identity systems have no graceful degradation: they go from "fully working" to "fully broken" with no in-between.
3. The whole landscape is fragmented. OAuth 2.0, OpenID Connect, SAML 2.0, mTLS, Kerberos, LDAP, AWS IAM, Kubernetes RBAC, ServiceAccount tokens, IRSA, Workload Identity, SPIFFE/SPIRE, Vault, Okta. Each is a different vocabulary, a different set of standards, a different failure mode. A real production system uses a dozen of these together. Knowing one of them is not enough.
4. The standards are intentionally underspecified. OAuth 2.0 is a framework, not a protocol. Implementations differ in what they accept, how they handle edge cases, and which optional features they support. The famous JWT alg: none vulnerability existed because the spec allowed an algorithm to mean "no algorithm." Real-world interoperability depends on implementers making the same choices, which they often do not.
5. The threat model is adversarial. Identity is the highest-value target for attackers because compromising it gives them legitimate-looking access to everything. The 2020 SolarWinds attack, the 2022 Okta breach, the 2022 LastPass incident: all turned on identity. Defenders are racing against attackers who specifically study identity systems for weaknesses.
The motivating scenario at the top is reason 3 in concrete form. GitHub's OIDC issuer URL changed. AWS IAM role trust policies that pinned the old URL stopped accepting tokens. Every CI pipeline that federated through GitHub Actions started failing simultaneously. The fix is one line; understanding which line, and why, requires knowing OIDC discovery, IAM trust policies, GitHub Actions OIDC, and the audit logs to spot the pattern. This is what makes identity work feel disproportionately hard: the actual change is small, but the context needed to know what to change is enormous.
The unifying property of identity systems is that they are at the bottom of every dependency chain. When they work, nothing else has to think about them. When they break, nothing else can compensate. This asymmetry is why identity work earns "below the visibility line" status in most organizations and why identity engineers earn outsized salaries in the few organizations that recognize the work.
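Reason 4 above mentions the alg: none vulnerability. The following is a minimal sketch of that bug class using only the Python standard library, assuming an HS256-signed token and a shared secret. The function names (sign_hs256, forge_alg_none, naive_verify, strict_verify) are hypothetical and written for illustration; they are not any real JWT library's API.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the padding JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    return f"{header}.{payload}.{b64url(mac.digest())}"


def forge_alg_none(claims: dict) -> str:
    """Attacker's token: header claims 'no algorithm', signature is empty."""
    header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    return f"{header}.{payload}."


def strict_verify(token: str, secret: bytes) -> dict:
    """Pins the algorithm: anything other than HS256 is rejected."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError(f"unexpected alg: {header.get('alg')!r}")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))


def naive_verify(token: str, secret: bytes) -> dict:
    """Vulnerable: lets the token's own header choose the algorithm."""
    header_b64, payload_b64, _ = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header["alg"] == "none":
        # No signature check at all: the attacker controls the claims.
        return json.loads(b64url_decode(payload_b64))
    return strict_verify(token, secret)
```

The design point is that the verifier, not the token, decides which algorithm is acceptable. naive_verify accepts a forged token carrying any claims; strict_verify rejects it before looking at the payload.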
How it works
Identity is the answer to three questions, and most production bugs come from confusing which question is being asked at any given moment.
The three identity questions every system answers
Who are you? Proving identity. Done by passwords, MFA, certificates, JWT signatures, mTLS handshakes. Output: a verified identity (a username, a SA name, a SPIFFE ID).
What are you allowed to do? Mapping identity to permissions. Done by RBAC, ABAC, policies, IAM. Output: a yes or no for a specific action.
What did you actually do? Recording for accountability. Done by audit logs, SIEM, traces. Output: a queryable history of who did what when.
Three things to internalize:
Authentication answers nothing about authorization. An authenticated user is verified, not allowed. A common bug class: a service authenticates the request correctly, then assumes "since you are authenticated, you can do this," skipping the authorization check. Several Atlassian Confluence vulnerabilities disclosed in 2022-2023 had this shape: the auth layer worked, the authz layer was missing for specific endpoints.
Authorization decisions can be wrong even with correct authentication. RBAC bindings can grant too much (an "admin" role accidentally given to a contractor). Custom authz code can have logic bugs (the classic IDOR: "you can edit any document because we did not check ownership"). Policy engines can have misconfigured rules (an OPA Rego policy that returns allow = true by default).
Audit is often skipped during the design phase and bolted on later, badly. A system without audit cannot answer "did this happen?" after the fact. A system with audit but without queryability is roughly the same. A system with audit and queryability but stored in the same trust boundary as the actor (cluster-admin can delete the audit log) is also roughly the same.
The three questions are answered at different layers, by different components, with different failure modes. A complete production identity story requires getting all three right and getting their interactions right.
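The three questions can be sketched as three separate steps in a single request handler. This is an illustrative toy, assuming an in-memory token table and an owner-only edit rule; DocumentService and all of its fields are hypothetical names, not a real framework.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    owner: str
    body: str


@dataclass
class DocumentService:
    tokens: dict      # hypothetical token -> username table
    documents: dict   # doc_id -> Document
    audit_log: list = field(default_factory=list)

    def authenticate(self, token):
        """Who are you? Returns a username, or None if the token is unknown."""
        return self.tokens.get(token)

    def authorize(self, user, doc, action):
        """What are you allowed to do? Here: only the owner may edit."""
        return action == "edit" and doc.owner == user

    def audit(self, user, action, doc_id, allowed):
        """What did you actually do? Record every decision, allowed or not."""
        self.audit_log.append(
            {"user": user, "action": action, "doc": doc_id, "allowed": allowed}
        )

    def edit(self, token, doc_id, new_body):
        user = self.authenticate(token)
        if user is None:
            self.audit(None, "edit", doc_id, False)
            return 401
        doc = self.documents[doc_id]
        # Skipping this call is the auth-without-authz (IDOR) bug.
        allowed = self.authorize(user, doc, "edit")
        self.audit(user, "edit", doc_id, allowed)
        if not allowed:
            return 403
        doc.body = new_body
        return 200
```

With tokens for alice and bob and a document owned by alice, bob's edit returns 403 and leaves the body unchanged, while the audit log still records the attempt. In a real system the audit log would live outside the trust boundary of the actors it records, which this in-memory list does not model.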
In practice
Five categories of identity bug recur in production. If you have not seen them yet, you will.
1. The federation typo. A trust policy or OIDC config has a tiny mismatch (issuer URL, audience claim, subject claim) that prevents tokens from being accepted. The token is valid; the validator does not trust this issuer. Fixes are usually one-line. Investigations are usually hours.
2. The expired credential. A certificate, a token, a set of temporary STS credentials, or some other time-bound thing expired and the workload using it kept trying. Symptoms cascade: the workload fails, retries swarm the IdP, the IdP gets slow, other workloads start timing out.
3. The over-permissive grant. A role, policy, or RBAC binding grants more than intended. Often invisible until exploited. The spread can be exponential: a too-broad service account is used by a workload, which is then compromised, which then uses its inherited permissions to grant itself more.
4. The auth-without-authz hole. A new endpoint was added; the developer assumed framework-level auth was enough; no per-endpoint authz check was wired. Anyone authenticated can hit it. Common in API-first organizations where new endpoints ship weekly.
5. The shared identity. Two workloads share a service account, an IAM role, or a token. When one is compromised, the blast radius covers both. When you need to revoke, you cannot do it for one without breaking the other. Per-workload identity from day one prevents this; retrofitting it later is hard.
These are not new bugs. The OWASP Top 10 has had "broken authentication" and "broken access control" near the top for over a decade. The reason they persist is that identity is genuinely hard and the surface area keeps growing.
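Of the five categories, the expired credential is the easiest to catch ahead of time, because expiry is known in advance. A minimal sketch, assuming you already have an inventory mapping credential names to expiry timestamps (the inventory, the names, and the 14-day window are all hypothetical):

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(credentials, now, warn_within=timedelta(days=14)):
    """Return (name, status, remaining) rows for expired or near-expiry items.

    credentials: dict of name -> timezone-aware expiry datetime.
    Rows are sorted with the most urgent (smallest remaining time) first.
    """
    findings = []
    for name, not_after in credentials.items():
        remaining = not_after - now
        if remaining <= timedelta(0):
            findings.append((name, "EXPIRED", remaining))
        elif remaining <= warn_within:
            findings.append((name, "EXPIRING", remaining))
    return sorted(findings, key=lambda row: row[2])


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    inventory = {
        "api-gateway-cert": now + timedelta(days=10),
        "ci-deploy-token": now - timedelta(hours=1),
        "internal-root-ca": now + timedelta(days=400),
    }
    for name, status, remaining in expiring_soon(inventory, now):
        print(f"{status:9} {name} ({remaining})")
```

Running a check like this on a schedule turns a cliff-edge failure into a warning with lead time. Populating the inventory from real certificates and tokens is the harder part and is not shown here.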
Configuration examples
A real example of the federation-typo bug, in IAM trust policy form:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:sub": "repo:mycompany/myrepo:ref:refs/heads/main"
        }
      }
    }
  ]
}
Three load-bearing values:
- Federated: the IAM identity-provider ARN. Must match the OIDC issuer GitHub Actions uses for this account.
- :sub condition: the OIDC subject claim. The string format is rigid; repo:mycompany/myrepo:ref:refs/heads/main does not match repo:mycompany/myrepo:environment:production.
- Effect: Allow: nothing else allows this action by default; missing this denies the role assumption silently.
Any of the three being subtly wrong produces the same error message: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity. The diagnostic is reading the policy character by character.
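Because the error message does not say which value is wrong, it can help to compare a real token's claims against the policy mechanically rather than by eye. The sketch below is a debugging aid under stated assumptions: it handles only StringEquals conditions with single string values (not StringLike wildcards or lists), it decodes the token without verifying its signature, and decode_claims and diagnose are hypothetical helper names. It needs Python 3.9+ for str.removeprefix.

```python
import base64
import json


def decode_claims(jwt_token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (debugging only)."""
    payload_b64 = jwt_token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def diagnose(trust_policy: dict, claims: dict) -> list:
    """List the ways `claims` fail to satisfy `trust_policy` (may be empty)."""
    problems = []
    issuer_host = claims.get("iss", "").removeprefix("https://")
    for stmt in trust_policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            problems.append(f"Effect is {stmt.get('Effect')!r}, expected 'Allow'")
        federated = stmt.get("Principal", {}).get("Federated", "")
        if not federated.endswith(f":oidc-provider/{issuer_host}"):
            problems.append(
                f"Federated ARN {federated!r} does not match issuer {issuer_host!r}"
            )
        conditions = stmt.get("Condition", {}).get("StringEquals", {})
        for key, expected in conditions.items():
            # Keys look like "<issuer host>:<claim name>"; take the claim name.
            claim_name = key.rpartition(":")[2]
            actual = claims.get(claim_name)
            if actual != expected:
                problems.append(
                    f"{claim_name}: policy expects {expected!r}, token has {actual!r}"
                )
    return problems
```

Fed the trust policy above and a token issued for a production environment deployment, diagnose returns a single finding that names the sub claim and shows both strings side by side. An empty list means these checks found no mismatch, not that AWS will accept the token.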
A diagnostic that catches federation typos before deployment:
# 1. List the OIDC providers registered in this account and note the ARN
aws iam list-open-id-connect-providers

# 2. Check the OIDC config that GitHub publishes
curl -s https://token.actions.githubusercontent.com/.well-known/openid-configuration \
  | jq '.issuer, .jwks_uri'

# 3. Print the role's trust policy
aws iam get-role --role-name my-deploy-role \
  | jq '.Role.AssumeRolePolicyDocument'
Cross-reference the three. Mismatches are the bug.
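The cross-reference itself can be automated once the three outputs are saved. A minimal sketch, assuming you have extracted a list of provider ARN strings, the issuer URL from the discovery document, and the trust policy as a parsed dict; cross_reference is a hypothetical helper and covers only the provider and issuer checks, not condition keys.

```python
def cross_reference(provider_arns, discovery_issuer, trust_policy):
    """Compare registered providers, the published issuer, and the trust policy.

    provider_arns: list of ARN strings for the account's IAM OIDC providers.
    discovery_issuer: the "issuer" value from the discovery document.
    trust_policy: the role's assume-role policy document as a dict.
    Returns a list of human-readable findings; empty means no mismatch found.
    """
    issuer_host = discovery_issuer.removeprefix("https://").rstrip("/")
    expected_suffix = f":oidc-provider/{issuer_host}"
    findings = []

    if not any(arn.endswith(expected_suffix) for arn in provider_arns):
        findings.append(f"no IAM OIDC provider registered for {issuer_host}")

    for stmt in trust_policy.get("Statement", []):
        federated = stmt.get("Principal", {}).get("Federated")
        if federated is None:
            continue  # statement does not use web identity federation
        if federated not in provider_arns:
            findings.append(f"trust policy references unregistered provider {federated}")
        elif not federated.endswith(expected_suffix):
            findings.append(f"{federated} is registered but is not for {issuer_host}")
    return findings
```

Running this in CI against every federated role is one way to make the periodic verification routine rather than incident-driven.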
Common mistakes
- Treating identity as a one-time setup. IdP configurations rot. Trust policies that worked in 2023 stop working in 2026 because the IdP changed its issuer or claims. Periodic verification is mandatory.
- Granting "admin" because least-privilege is hard. The shortcut that creates the over-permissive grant. Worth the upfront work to scope properly.
- Skipping audit because "we will add it later." Identity audit is needed for incident response, not for compliance theater. Without it, you cannot answer "what did this account do" after a breach.
- Per-environment IdP configs duplicated by hand. Drift between dev, staging, and prod IdP configs is one of the most common sources of "works in dev, broken in prod" bugs. Manage IdP config as code (Terraform).
- Trusting any token that validates as well-formed. Validation is necessary but not sufficient. Audience claim, issuer claim, expiry, and signature must all be checked. Skipping any one is a known CVE class.
- Sharing identities across workloads. Per-workload identity is the right default. Shared identities make blast-radius management impossible.
- Treating identity as a security team problem. Identity bugs hit availability, performance, and operability, not just security. The whole platform team needs literacy in this.
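The "trusting any token that validates as well-formed" mistake can be made concrete with a claims check that runs after signature verification. This is a sketch under stated assumptions: the signature has already been verified by a real JWT library, claims is the decoded payload, and validate_claims and its parameters are hypothetical names.

```python
def validate_claims(claims, expected_issuer, expected_audience, now, leeway=0):
    """Check iss, aud, and exp on an already signature-verified payload.

    now and leeway are in seconds since the epoch, matching the exp claim.
    Returns a list of error strings; an empty list means all checks passed.
    """
    errors = []

    if claims.get("iss") != expected_issuer:
        errors.append(f"iss mismatch: {claims.get('iss')!r}")

    # The aud claim may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if expected_audience not in audiences:
        errors.append(f"aud mismatch: {aud!r}")

    # A token must be rejected at or after its expiry time.
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now >= exp + leeway:
        errors.append(f"expired or missing exp: {exp!r}")

    return errors
```

Returning every failure, rather than stopping at the first, keeps the log line useful when more than one claim is wrong. A production validator would typically also check nbf and log the findings without echoing the whole token.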
Why is identity considered the hardest problem in DevOps? Walk through three production bugs you have seen or could imagine, and what makes them specifically identity bugs rather than general bugs.