
Resource Management and Debugging

Your Kubernetes cluster is running Pods. Everything works perfectly in development. Then you deploy to production.

Your Pod crashes immediately. Or it stays Pending forever. Or it consumes all memory and gets evicted. You don't know why—you just see error states and no explanation.

This lesson teaches you to read what the cluster is trying to tell you. Kubernetes provides signals about Pod failures: status fields, events, logs, and resource constraints. Learning to interpret these signals is the difference between a 5-minute fix and hours of frustration.


Concept 1: Resource Requests and Limits

Before diving into debugging, you need to understand how Kubernetes allocates resources.

The Mental Model: Requests vs Limits

Think of resource management like renting an apartment:

  • Request: "I need at least 2 bedrooms"—the landlord won't accept your application if they have fewer. This is your GUARANTEED minimum.
  • Limit: "My apartment can have at most 3 bedrooms"—you don't need more than this. If you try to use 4, you get evicted.

In Kubernetes:

resources:
  requests:
    memory: "256Mi"   # Guaranteed minimum
    cpu: "100m"       # for scheduling decisions
  limits:
    memory: "512Mi"   # Maximum allowed
    cpu: "500m"       # can't exceed this

Key Principle: A Pod cannot be scheduled on a node unless that node has at least the REQUESTED amount of free resources. Limits prevent a Pod from monopolizing node resources.

Why This Matters

Requests enable fair scheduling. If you have 3 Pods:

  • Pod A requests 1 CPU
  • Pod B requests 1 CPU
  • Pod C requests 1 CPU

Kubernetes won't schedule all three on a 2-CPU node. Requests prevent overcommitment.

Limits enable isolation. If Pod A tries to consume more CPU than its limit allows, Kubernetes throttles it at that limit. Pod B doesn't starve because Pod A went rogue.
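
To see how a node's capacity is being consumed, inspect the node itself. A hedged example—the node name and the numbers are placeholders and will differ on your cluster:

kubectl describe node <node-name>

Output (abridged):

Allocatable:
  cpu:     2
  memory:  7950Mi
Allocated resources:
  Resource  Requests      Limits
  --------  --------      ------
  cpu       1500m (75%)   2 (100%)
  memory    1Gi (13%)     2Gi (25%)

The Requests column is what the scheduler compares against when placing new Pods.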

CPU and Memory Units

CPU:

  • 1000m = 1 CPU core
  • 100m = 0.1 CPU cores (100 millicores)
  • 0.5 = half a CPU core (also written as 500m)

Memory:

  • 1Mi = 1 mebibyte (1,048,576 bytes, roughly 1 million)
  • 1Gi = 1 gibibyte (1,073,741,824 bytes, roughly 1 billion)
  • 256Mi is typical for small services; 1Gi or more for memory-intensive services

Always use Mi and Gi (binary, powers of 1024) rather than M and G (decimal, powers of 1000) in Kubernetes manifests. The values differ by several percent.
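
Both CPU and memory accept several equivalent spellings. A minimal illustration:

resources:
  requests:
    cpu: "500m"      # identical to cpu: "0.5"
    memory: "512Mi"  # 536,870,912 bytes; a plain integer byte count also works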


Concept 2: Quality of Service (QoS) Classes

Kubernetes prioritizes which Pods to evict when a node runs out of resources. This priority is determined by the Pod's QoS class.

The Three QoS Classes

Guaranteed (Highest Priority)

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "100m"

When requests equal limits, the Pod is Guaranteed. Kubernetes evicts Guaranteed Pods LAST. Use this for critical workloads (databases, control planes).

Burstable (Medium Priority)

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

When requests < limits, the Pod is Burstable. Kubernetes evicts Burstable Pods second. Use this for normal workloads (most services, agents).

BestEffort (Lowest Priority)

resources: {}
# No requests or limits

When a Pod has no requests or limits, it's BestEffort. Kubernetes evicts these FIRST when memory pressure occurs. Only use this for batch jobs, not for Pods that need to stay running.

The Eviction Decision

Imagine your cluster is out of memory and needs to evict a Pod:

  1. Check BestEffort Pods first—evict one
  2. If memory pressure continues, evict Burstable Pods
  3. Only if nothing else works, evict Guaranteed Pods

This hierarchy ensures critical workloads stay running when resources are tight.
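
You don't have to work out the class by hand—Kubernetes records it in the Pod's status:

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'

Output:

Burstable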


Concept 3: Common Pod Failure States

Pods fail in predictable ways. Each failure state has a specific cause and fix pattern.

CrashLoopBackOff

What you see:

NAME    READY   STATUS             RESTARTS   AGE
myapp   0/1     CrashLoopBackOff   5          2m

What it means: The container started, crashed, and was restarted—only to crash again. The cycle repeats, and Kubernetes waits longer between each restart attempt (exponential back-off).

Root causes:

  1. Application error (bug in code)
  2. Missing environment variable
  3. Missing configuration file
  4. Port already in use
  5. Out of memory

Fix pattern:

  1. Check logs: kubectl logs <pod-name> (shows why it crashed; see the tip after this list)
  2. Check if limit was hit: kubectl describe pod <pod-name> (look for OOMKilled)
  3. Fix the underlying issue in your manifest or code
  4. Delete and recreate the Pod
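
The tip promised in step 1: once a container has restarted, plain kubectl logs shows the current attempt, which may be empty. The --previous flag retrieves the logs of the last terminated instance—usually where the crash message lives:

kubectl logs <pod-name> --previous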

ImagePullBackOff

What you see:

NAME    READY   STATUS             RESTARTS   AGE
myapp   0/1     ImagePullBackOff   0          1m

What it means: Kubernetes tried to pull the container image but failed.

Root causes:

  1. Image doesn't exist in registry
  2. Wrong image name or tag
  3. Registry credentials missing (private images—see the sketch after the fix pattern)
  4. Network unreachable (can't reach registry)

Fix pattern:

  1. Check the event: kubectl describe pod <pod-name> (look for "Failed to pull image")
  2. Verify image name: kubectl get pod <pod-name> -o yaml (find the exact image reference)
  3. Test locally: docker pull <image-name> (can you pull it on your laptop?)
  4. Fix the image reference in your manifest
  5. Apply the manifest again
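
The sketch promised for cause 3: private registries need credentials stored in a Secret and referenced from the Pod. The secret name regcred and the placeholder values are hypothetical:

kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>

Then reference it in the Pod spec:

spec:
  imagePullSecrets:
  - name: regcred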

Pending

What you see:

NAME    READY   STATUS    RESTARTS   AGE
myapp   0/1     Pending   0          5m

What it means: Kubernetes cannot place the Pod on any node—whether due to resource shortages, unmet scheduling constraints, or unbound volumes.

Root causes:

  1. Requested resources exceed cluster capacity
  2. Node affinity requirements not met
  3. Pod is waiting for PersistentVolume
  4. Node taint prevents Pod scheduling

Fix pattern:

  1. Check events: kubectl describe pod <pod-name> (shows why scheduling failed)
  2. Check node resources: kubectl top nodes (are nodes overcommitted?)
  3. Reduce Pod requests if they're too high
  4. Add more nodes to the cluster
  5. Check tolerations and affinity rules
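
For cause 4, a minimal sketch of a toleration that lets a Pod schedule onto a tainted node—the key, value, and effect here are hypothetical and must match the actual taint on your node:

spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"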

OOMKilled

What you see (in describe output):

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

What it means: The container consumed more memory than its limit, so Kubernetes forcefully terminated it.

Root causes:

  1. Application has a memory leak
  2. Limit is too low for the workload
  3. Processing unexpectedly large dataset

Fix pattern:

  1. Increase memory limit (if limit is genuinely too low)
  2. Profile the application (pprof for Go, tracemalloc or memory_profiler for Python)
  3. Fix the memory leak in code
  4. Change the application to process data in chunks instead of all at once
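
As an illustration of fix 4, a minimal Python sketch of chunked processing—the input filename is hypothetical:

total = 0
with open("input.txt") as f:
    # Read 1 MiB at a time instead of loading the whole file into memory
    for chunk in iter(lambda: f.read(1024 * 1024), ""):
        total += len(chunk)  # stand-in for real per-chunk work
print(f"processed {total} bytes with roughly constant memory")

Peak memory stays near one chunk regardless of input size.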

Concept 4: The Debugging Pattern

Kubernetes provides four signals for debugging. Learn to read them in order.

Signal 1: Pod Status

kubectl get pods

Output:

NAME            READY   STATUS             RESTARTS   AGE
nginx-good      1/1     Running            0          5m
nginx-crash     0/1     CrashLoopBackOff   3          2m
nginx-pending   0/1     Pending            0          1m

Status tells you WHAT is wrong (CrashLoopBackOff, Pending, etc.), but not WHY. Continue to Signal 2.

Signal 2: Events

kubectl describe pod <pod-name>

Output (partial):

Name:         nginx-crash
Namespace:    default
...
Events:
  Type     Reason   Age    Message
  ----     ------   ---    -------
  Normal   Created  2m20s  Created container nginx
  Normal   Started  2m19s  Started container nginx
  Warning  BackOff  2m10s  Back-off restarting failed container

Events show WHEN things happened and provide clues (like "restarting failed container"). For Pending Pods, events reveal scheduling reasons.
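
Events can also be queried on their own, which helps when describe output is long. Sorting by timestamp puts the newest clues last:

kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp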

Signal 3: Logs

kubectl logs <pod-name>

Output (if app crashed):

Traceback (most recent call last):
  File "app.py", line 5, in <module>
    connect_to_db()
  File "app.py", line 2, in connect_to_db
    raise Exception("Database not found")
Exception: Database not found

Logs show the APPLICATION'S error message. This is where you find the root cause (missing env var, code bug, etc.).

For Pending Pods, there are no logs—the container never started. Check events instead.

Signal 4: Interactive Access

kubectl exec -it <pod-name> -- /bin/bash

When the above three signals aren't enough, jump into the running Pod and investigate directly.

# Inside the Pod
$ env                   # Check environment variables
$ ls -la                # Check filesystem
$ ps aux                # Check running processes
$ curl localhost:8080   # Test internal services

This is your "last resort" debugging tool. Use when you need to poke around interactively.


Putting It Together: The Debugging Workflow

When a Pod fails:

  1. Get status: kubectl get pods (What's the state?)
  2. Describe: kubectl describe pod <name> (Why is it in that state? Are there events?)
  3. Check logs: kubectl logs <name> (What did the application say?)
  4. Investigate interactively: kubectl exec -it <name> -- /bin/bash (What's actually happening inside?)
  5. Fix: Modify manifest or code based on findings
  6. Apply: kubectl apply -f manifest.yaml
  7. Verify: kubectl get pods (Did it work?)

This pattern works for 95% of Kubernetes debugging.


Practice 1: Diagnose CrashLoopBackOff

Create a Pod that crashes due to a missing environment variable.

Manifest (save as crash-loop.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: crash-loop-app
spec:
  containers:
  - name: app
    image: python:3.11-slim
    command: ["python", "-c"]
    args:
    - |
      import os, time
      db_url = os.environ['DATABASE_URL']
      print(f"Connecting to {db_url}")
      time.sleep(3600)  # keep the container running once it starts successfully
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
  restartPolicy: Always

Deploy it:

kubectl apply -f crash-loop.yaml

Output:

pod/crash-loop-app created

Check status (after 30 seconds):

kubectl get pods

Output:

NAME             READY   STATUS             RESTARTS   AGE
crash-loop-app   0/1     CrashLoopBackOff   2          35s

Describe to see events:

kubectl describe pod crash-loop-app

Output (relevant section):

Events:
  Type     Reason     Age   Message
  ----     ------     ---   -------
  Normal   Scheduled  45s   Successfully assigned default/crash-loop-app
  Normal   Created    44s   Created container app
  Normal   Started    43s   Started container app
  Warning  BackOff    20s   Back-off restarting failed container

Check logs to see the actual error:

kubectl logs crash-loop-app

Output:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
    db_url = os.environ['DATABASE_URL']
KeyError: 'DATABASE_URL'

Diagnosis: The application expects DATABASE_URL environment variable but it's not set.

Fix: Add the environment variable to the manifest:

apiVersion: v1
kind: Pod
metadata:
  name: crash-loop-app
spec:
  containers:
  - name: app
    image: python:3.11-slim
    command: ["python", "-c"]
    args:
    - |
      import os, time
      db_url = os.environ['DATABASE_URL']
      print(f"Connecting to {db_url}")
      time.sleep(3600)  # keep the container running once it starts successfully
    env:
    - name: DATABASE_URL
      value: "postgres://localhost:5432/mydb"
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
  restartPolicy: Always

Apply the fix. Most Pod spec fields (including env) are immutable once the Pod exists, so delete the old Pod and recreate it:

kubectl delete pod crash-loop-app
kubectl apply -f crash-loop.yaml

Output:

pod/crash-loop-app deleted
pod/crash-loop-app created

Check status:

kubectl get pods

Output:

NAME             READY   STATUS    RESTARTS   AGE
crash-loop-app   1/1     Running   0          5s

Check logs to confirm it's working:

kubectl logs crash-loop-app

Output:

Connecting to postgres://localhost:5432/mydb

Notice the new Pod goes straight to Running, and the restart counter starts back at zero—this is a fresh Pod with the underlying issue fixed.

Clean up:

kubectl delete pod crash-loop-app

Practice 2: Diagnose Pending Pod Due to Insufficient Resources

Create a Pod that requests more resources than available.

Manifest (save as pending-pod.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
  - name: app
    image: python:3.11-slim
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: "100Gi"   # Way more than any node has
        cpu: "50"         # Way more cores than available
      limits:
        memory: "100Gi"
        cpu: "50"
  restartPolicy: Always

Deploy it:

kubectl apply -f pending-pod.yaml

Output:

pod/memory-hog created

Check status:

kubectl get pods

Output:

NAME         READY   STATUS    RESTARTS   AGE
memory-hog   0/1     Pending   0          10s

Describe to see why it's Pending:

kubectl describe pod memory-hog

Output (relevant section):

Events:
  Type     Reason            Age   Message
  ----     ------            ---   -------
  Warning  FailedScheduling  15s   0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.

Diagnosis: The Pod requests 100Gi of memory, but no node has that much available. Kubernetes cannot schedule it.

Fix: Reduce the resource requests to reasonable values:

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
  - name: app
    image: python:3.11-slim
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: "256Mi"   # Reasonable request
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "500m"
  restartPolicy: Always

Apply the fix. Resource requests and limits are also immutable on a running Pod, so delete and recreate:

kubectl delete pod memory-hog
kubectl apply -f pending-pod.yaml

Output:

pod/memory-hog deleted
pod/memory-hog created

Check status:

kubectl get pods

Output:

NAME         READY   STATUS    RESTARTS   AGE
memory-hog   1/1     Running   0          5s

The new Pod schedules immediately and transitions to Running.

Clean up:

kubectl delete pod memory-hog

Practice 3: Diagnose OOMKilled and Adjust Limits

Create a Pod that exceeds its memory limit.

Manifest (save as oom-pod.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: memory-leak-app
spec:
  containers:
  - name: app
    image: python:3.11-slim
    command: ["python", "-c"]
    args:
    - |
      import time
      memory = []
      while True:
          # Allocate 50MB every 100ms
          memory.append(bytearray(50 * 1024 * 1024))
          time.sleep(0.1)
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "256Mi"   # Limit memory to 256Mi
        cpu: "500m"
  restartPolicy: Always

Deploy it:

kubectl apply -f oom-pod.yaml

Output:

pod/memory-leak-app created

Check status immediately:

kubectl get pods

Output (within a few seconds):

NAME              READY   STATUS             RESTARTS   AGE
memory-leak-app   0/1     CrashLoopBackOff   1          5s

Describe to see the termination reason:

kubectl describe pod memory-leak-app

Output (relevant section):

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

Diagnosis: The application consumes memory faster than the 256Mi limit allows. Kubernetes kills it with OOMKilled.
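
If you only need the termination reason, jsonpath can pull it straight from the Pod status instead of scanning describe output:

kubectl get pod memory-leak-app -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Output:

OOMKilled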

Fix Options:

  1. Increase the limit (if the application actually needs more memory):

     resources:
       requests:
         memory: "512Mi"
         cpu: "100m"
       limits:
         memory: "1Gi"   # Increase limit
         cpu: "500m"

  2. Fix the memory leak (if there's a bug):

     args:
     - |
       import time
       memory = []
       while True:
           # Keep only the most recent allocations (~150MB, under the 256Mi limit)
           if len(memory) > 2:
               memory.pop(0)
           memory.append(bytearray(50 * 1024 * 1024))
           time.sleep(0.1)

For this example, let's increase the limit. Edit oom-pod.yaml with the option 1 values, then delete and recreate the Pod:

kubectl delete pod memory-leak-app
kubectl apply -f oom-pod.yaml

Check status:

kubectl get pods

Output:

NAME              READY   STATUS    RESTARTS   AGE
memory-leak-app   1/1     Running   0          5s

If the Pod stays running, the limit was the issue. If it still crashes at the higher limit—as this deliberately leaky example eventually will—the memory leak in the code needs fixing (option 2).

Clean up:

kubectl delete pod memory-leak-app

Resource Management Best Practices

1. Always set requests and limits for production Pods

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

This ensures fair scheduling and makes the Pod far less likely to be evicted under node pressure.

2. Make requests equal to limits for critical workloads (Guaranteed QoS)

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "250m"

Guaranteed Pods are evicted last, so this keeps your Pod running while BestEffort and Burstable Pods are reclaimed under memory pressure.

3. Start conservative and increase based on monitoring

Set requests low (100m CPU, 128Mi memory) for new services, then monitor actual usage:

kubectl top pods    # Shows actual CPU/memory usage

Increase requests based on observed usage + 20% headroom.

4. Use namespaces to isolate resource quotas

apiVersion: v1
kind: Namespace
metadata:
  name: ai-services
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-quota
  namespace: ai-services
spec:
  hard:
    requests.memory: "10Gi"   # All Pods in namespace combined
    requests.cpu: "5"
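
To see how much of the quota is consumed, describe it:

kubectl describe resourcequota ai-quota -n ai-services

Output (abridged):

Resource         Used  Hard
--------         ----  ----
requests.cpu     0     5
requests.memory  0     10Gi

Note that once a quota constrains requests, every new Pod in that namespace must declare requests, or the API server rejects it.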

Try With AI

You now have the mental models and debugging workflow. Let's collaborate with AI to troubleshoot a complex scenario.

Setup: Deploy a multi-container Pod with intentional resource and configuration issues. Use kubectl commands to inspect it, then iterate with AI to fix problems.

Your Assignment:

Create this manifest (save as complex-pod.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: multi-container-app
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 8080
    env:
    - name: ENVIRONMENT
      value: "production"
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
  - name: sidecar
    image: curlimages/curl:latest
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: "32Mi"
        cpu: "25m"
      limits:
        memory: "64Mi"
        cpu: "50m"
  restartPolicy: Always

Step 1: Deploy and diagnose

kubectl apply -f complex-pod.yaml
kubectl get pods
kubectl describe pod multi-container-app
kubectl logs multi-container-app -c web
kubectl logs multi-container-app -c sidecar

Step 2: Ask AI for analysis

Tell AI: "I've deployed a multi-container Pod with nginx and curl sidecar. Here are the kubectl outputs: [paste describe and logs]. What QoS class is this Pod? How would you monitor resource usage? What would happen if CPU requests were set to 50 instead of 50m?"

Step 3: Interactive exploration

Jump into the Pod and verify:

kubectl exec -it multi-container-app -c web -- /bin/sh
$ ps aux            # Check if nginx is running
$ netstat -tlnp     # Check port bindings
$ env               # Verify environment variables
$ exit

Step 4: Propose a modification

Based on AI's suggestions, modify the manifest to:

  • Change sidecar image to a production-ready one
  • Add a liveness probe to the web container
  • Adjust resource requests based on typical nginx usage

Ask AI: "Given that nginx typically uses 20-50m CPU and 50-100Mi memory in production, what requests and limits would you recommend? Should this be Guaranteed or Burstable?"

Step 5: Validate and explain

Apply your modified manifest:

kubectl apply -f complex-pod.yaml
kubectl get pods
kubectl top pod multi-container-app # View actual resource usage

Explain to AI:

  • Why your resource choices match the QoS class you selected
  • What signals you'd monitor to detect problems before they become critical
  • How you'd adjust resources if you observed Pod eviction in a high-load scenario

Clean up:

kubectl delete pod multi-container-app