Resource Management and Debugging
Your Kubernetes cluster is running Pods. Everything works perfectly in development. Then you deploy to production.
Your Pod crashes immediately. Or it stays Pending forever. Or it consumes all memory and gets evicted. You don't know why—you just see error states and no explanation.
This lesson teaches you to read what the cluster is trying to tell you. Kubernetes provides signals about Pod failures: status fields, events, logs, and resource constraints. Learning to interpret these signals is the difference between a 5-minute fix and hours of frustration.
Concept 1: Resource Requests and Limits
Before diving into debugging, you need to understand how Kubernetes allocates resources.
The Mental Model: Requests vs Limits
Think of resource management like renting an apartment:
- Request: "I need at least 2 bedrooms." The landlord won't offer you anything smaller. This is your guaranteed minimum.
- Limit: "I will occupy at most 3 bedrooms." This is a hard ceiling; if you try to take over a fourth, you get evicted.
In Kubernetes:
resources:
  requests:
    memory: "256Mi"   # Guaranteed minimum
    cpu: "100m"       # Used for scheduling decisions
  limits:
    memory: "512Mi"   # Maximum allowed
    cpu: "500m"       # The container can't exceed this
Key Principle: A Pod cannot be scheduled on a node unless that node has at least the REQUESTED amount of free resources. Limits prevent a Pod from monopolizing node resources.
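For orientation, here is a minimal but complete Pod manifest with the resources block in place. The name and image are illustrative; any container image would do:
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo        # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25      # illustrative image
      resources:
        requests:
          memory: "256Mi"
          cpu: "100m"
        limits:
          memory: "512Mi"
          cpu: "500m"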
Why This Matters
Requests enable fair scheduling. If you have 3 Pods:
- Pod A requests 1 CPU
- Pod B requests 1 CPU
- Pod C requests 1 CPU
Kubernetes won't schedule all three on a 2-CPU node. Request prevents overcommitment.
Limits enable isolation. If Pod A tries to use more CPU than its limit, Kubernetes throttles it (and if it exceeds its memory limit, the container is killed). Pod B doesn't starve because Pod A went rogue.
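To see how much capacity the scheduler is working with, inspect a node directly; the Allocatable section shows what the node can offer, and the "Allocated resources" section shows how much is already requested (the node name is a placeholder):
kubectl describe node <node-name>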
CPU and Memory Units
CPU:
- 1000m = 1 CPU core
- 100m = 0.1 CPU cores (100 millicores)
- 0.5 = half a CPU core (also written as 500m)
Memory:
- 1Mi = 1 mebibyte (technically 1,048,576 bytes, roughly 1 million)
- 1Gi = 1 gibibyte (roughly 1 billion bytes)
- 256Mi = typical for small services
- 1Gi = typical for memory-intensive services
Always use Mi and Gi (binary units) rather than M and G (decimal units) in Kubernetes manifests; 1M is 1,000,000 bytes while 1Mi is 1,048,576 bytes, so they are not interchangeable.
Concept 2: Quality of Service (QoS) Classes
Kubernetes prioritizes which Pods to evict when a node runs out of resources. This priority is determined by the Pod's QoS class.
The Three QoS Classes
Guaranteed (Highest Priority)
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "100m"
When requests equal limits for every container (for both CPU and memory), the Pod is Guaranteed. Kubernetes evicts Guaranteed Pods LAST. Use this for critical workloads (databases, control-plane components).
Burstable (Medium Priority)
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
When requests < limits, the Pod is Burstable. Kubernetes evicts Burstable Pods second. Use this for normal workloads (most services, agents).
BestEffort (Lowest Priority)
resources: {}
# No requests or limits
When a Pod has no requests or limits, it's BestEffort. Kubernetes evicts these FIRST when memory pressure occurs. Only use this for batch jobs, not for Pods that need to stay running.
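You don't have to work the class out by hand: Kubernetes records it on the Pod. Either of these shows it (the pod name is a placeholder):
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
kubectl describe pod <pod-name> | grep "QoS Class"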
The Eviction Decision
Imagine your cluster is out of memory and needs to evict a Pod:
- Check BestEffort Pods first—evict one
- If memory pressure continues, evict Burstable Pods
- Only if nothing else works, evict Guaranteed Pods
This hierarchy ensures critical workloads stay running when resources are tight.
Concept 3: Common Pod Failure States
Pods fail in predictable ways. Each failure state has a specific cause and fix pattern.
CrashLoopBackOff
What you see:
NAME READY STATUS RESTARTS AGE
myapp 0/1 CrashLoopBackOff 5 2m
What it means: The container started, crashed, was restarted, and crashed again. Kubernetes waits longer between each restart attempt (the "back-off"), and the RESTARTS count keeps climbing.
Root causes:
- Application error (bug in code)
- Missing environment variable
- Missing configuration file
- Port already in use
- Out of memory
Fix pattern:
- Check logs: kubectl logs <pod-name> shows why it crashed (see the --previous tip after this list)
- Check if a limit was hit: kubectl describe pod <pod-name> and look for OOMKilled
- Fix the underlying issue in your manifest or code
- Delete and recreate the Pod
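One detail worth knowing: while the container sits in back-off, a plain kubectl logs call may show nothing useful. The --previous flag prints the output of the last crashed instance instead:
kubectl logs <pod-name> --previous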
ImagePullBackOff
What you see:
NAME READY STATUS RESTARTS AGE
myapp 0/1 ImagePullBackOff 0 1m
What it means: Kubernetes tried to pull the container image but failed.
Root causes:
- Image doesn't exist in registry
- Wrong image name or tag
- Registry credentials missing (private images)
- Network unreachable (can't reach registry)
Fix pattern:
- Check the events: kubectl describe pod <pod-name> and look for "Failed to pull image"
- Verify the image name: kubectl get pod <pod-name> -o yaml shows the exact image reference
- Test locally: docker pull <image-name> (can you pull it on your laptop?)
- Fix the image reference in your manifest
- Apply the manifest again (for private registries, see the pull-secret sketch after this list)
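If the image lives in a private registry, the usual fix is an image pull secret referenced from the Pod spec. A sketch, with the registry URL and credentials as placeholders:
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password>
Then reference it in the manifest:
spec:
  imagePullSecrets:
    - name: regcred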
Pending
What you see:
NAME READY STATUS RESTARTS AGE
myapp 0/1 Pending 0 5m
What it means: Kubernetes has accepted the Pod, but the scheduler cannot place it on any node yet.
Root causes:
- Requested resources exceed cluster capacity
- Node affinity requirements not met
- Pod is waiting for PersistentVolume
- Node taint prevents Pod scheduling
Fix pattern:
- Check events: kubectl describe pod <pod-name> shows why scheduling failed
- Check node resources: kubectl top nodes shows whether nodes are overcommitted (see the node-capacity command after this list)
- Reduce the Pod's requests if they're too high
- Add more nodes to the cluster
- Check tolerations and affinity rules
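To compare what Pods are requesting against what each node can actually provide, the node describe output includes an "Allocated resources" summary (the exact section layout varies slightly across Kubernetes versions):
kubectl describe nodes | grep -A 8 "Allocated resources"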
OOMKilled
What you see (in describe output):
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
What it means: The container consumed more memory than its limit, so Kubernetes forcefully terminated it.
Root causes:
- Application has a memory leak
- Limit is too low for the workload
- Processing unexpectedly large dataset
Fix pattern:
- Increase the memory limit (if the limit is genuinely too low)
- Profile the application (for example, pprof for Go or tracemalloc for Python)
- Fix the memory leak in the code
- Change the application to process data in chunks instead of all at once
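To confirm OOMKilled without scanning the whole describe output, you can read the last termination reason straight from the Pod's status. The [0] index assumes a single-container Pod:
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'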
Concept 4: The Debugging Pattern
Kubernetes provides four signals for debugging. Learn to read them in order.
Signal 1: Pod Status
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
nginx-good 1/1 Running 0 5m
nginx-crash 0/1 CrashLoopBackOff 3 2m
nginx-pending 0/1 Pending 0 1m
Status tells you WHAT is wrong (CrashLoopBackOff, Pending, etc.). But not WHY. Continue to Signal 2.
Signal 2: Events
kubectl describe pod <pod-name>
Output (partial):
Name: nginx-crash
Namespace: default
...
Events:
  Type     Reason   Age    Message
  ----     ------   ---    -------
  Normal   Created  2m20s  Created container nginx
  Normal   Started  2m19s  Started container nginx
  Warning  BackOff  2m10s  Back-off restarting failed container
Events show WHEN things happened and provide clues (like "restarting failed container"). For Pending Pods, events reveal scheduling reasons.
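On a busy cluster the describe output can be noisy; you can also pull the events for a single Pod directly (the pod name is a placeholder):
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp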
Signal 3: Logs
kubectl logs <pod-name>
Output (if app crashed):
Traceback (most recent call last):
File "app.py", line 5, in <module>
connect_to_db()
File "app.py", line 2, in connect_to_db
raise Exception("Database not found")
Exception: Database not found
Logs show the APPLICATION'S error message. This is where you find the root cause (missing env var, code bug, etc.).
For Pending Pods, logs are empty (Pod never started). Check events instead.
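Two kubectl logs variations you will reach for constantly: -c selects a specific container in a multi-container Pod, and -f streams new log lines as they are written:
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> -f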
Signal 4: Interactive Access
kubectl exec -it <pod-name> -- /bin/bash
When the above three signals aren't enough, jump into the running Pod and investigate directly.
# Inside the Pod
$ env # Check environment variables
$ ls -la # Check filesystem
$ ps aux # Check running processes
$ curl localhost:8080 # Test internal services
This is your "last resort" debugging tool. Use when you need to poke around interactively.
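If the image is so minimal that it has no shell, kubectl exec won't get you far. On clusters that support ephemeral containers (enabled by default in recent Kubernetes versions), kubectl debug attaches a throwaway debugging container alongside the existing one; busybox is just one convenient image choice:
kubectl debug -it <pod-name> --image=busybox --target=<container-name>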
Putting It Together: The Debugging Workflow
When a Pod fails:
- Get status: kubectl get pods (What's the state?)
- Describe: kubectl describe pod <name> (Why is it in that state? Are there events?)
- Check logs: kubectl logs <name> (What did the application say?)
- Investigate interactively: kubectl exec -it <name> -- /bin/bash (What's actually happening inside?)
- Fix: Modify the manifest or code based on your findings
- Apply: kubectl apply -f manifest.yaml
- Verify: kubectl get pods (Did it work?)
This pattern covers the vast majority of day-to-day Kubernetes debugging.
Practice 1: Diagnose CrashLoopBackOff
Create a Pod that crashes due to a missing environment variable.
Manifest (save as crash-loop.yaml):
apiVersion: v1
kind: Pod
metadata:
  name: crash-loop-app
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["python", "-c"]
      args:
        - |
          import os, time
          db_url = os.environ['DATABASE_URL']
          print(f"Connecting to {db_url}")
          time.sleep(3600)  # keep the container alive once startup succeeds
      resources:
        requests:
          memory: "64Mi"
          cpu: "50m"
        limits:
          memory: "128Mi"
          cpu: "100m"
  restartPolicy: Always
Deploy it:
kubectl apply -f crash-loop.yaml
Output:
pod/crash-loop-app created
Check status (after 30 seconds):
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
crash-loop-app 0/1 CrashLoopBackOff 2 35s
Describe to see events:
kubectl describe pod crash-loop-app
Output (relevant section):
Events:
  Type     Reason     Age  Message
  ----     ------     ---  -------
  Normal   Scheduled  45s  Successfully assigned default/crash-loop-app
  Normal   Created    44s  Created container app
  Normal   Started    43s  Started container app
  Warning  BackOff    20s  Back-off restarting failed container
Check logs to see the actual error:
kubectl logs crash-loop-app
Output:
Traceback (most recent call last):
File "<string>", line 2, in <module>
db_url = os.environ['DATABASE_URL']
KeyError: 'DATABASE_URL'
Diagnosis: The application expects DATABASE_URL environment variable but it's not set.
Fix: Add the environment variable to the manifest:
apiVersion: v1
kind: Pod
metadata:
  name: crash-loop-app
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["python", "-c"]
      args:
        - |
          import os, time
          db_url = os.environ['DATABASE_URL']
          print(f"Connecting to {db_url}")
          time.sleep(3600)  # keep the container alive once startup succeeds
      env:
        - name: DATABASE_URL
          value: "postgres://localhost:5432/mydb"
      resources:
        requests:
          memory: "64Mi"
          cpu: "50m"
        limits:
          memory: "128Mi"
          cpu: "100m"
  restartPolicy: Always
Pod specs are largely immutable (you can't add environment variables to a running Pod), so delete the old Pod and apply the corrected manifest:
kubectl delete pod crash-loop-app
kubectl apply -f crash-loop.yaml
Output:
pod "crash-loop-app" deleted
pod/crash-loop-app created
Check status:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
crash-loop-app 1/1 Running 0 5s
Check logs to confirm it's working:
kubectl logs crash-loop-app
Output:
Connecting to postgres://localhost:5432/mydb
Notice the Pod now reports Running. The restart counter is back to 0 because this is a fresh Pod, and it stays at 0 because the underlying issue is fixed.
Clean up:
kubectl delete pod crash-loop-app
Practice 2: Diagnose Pending Pod Due to Insufficient Resources
Create a Pod that requests more resources than available.
Manifest (save as pending-pod.yaml):
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["sleep", "3600"]
      resources:
        requests:
          memory: "100Gi"   # Way more than any node has
          cpu: "50"         # Way more cores than available
        limits:
          memory: "100Gi"
          cpu: "50"
  restartPolicy: Always
Deploy it:
kubectl apply -f pending-pod.yaml
Output:
pod/memory-hog created
Check status:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
memory-hog 0/1 Pending 0 10s
Describe to see why it's Pending:
kubectl describe pod memory-hog
Output (relevant section):
Events:
  Type     Reason            Age  Message
  ----     ------            ---  -------
  Warning  FailedScheduling  15s  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
Diagnosis: The Pod requests 100Gi of memory and 50 CPUs, far more than any node can offer, so the scheduler cannot place it.
Fix: Reduce the resource requests to reasonable values:
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["sleep", "3600"]
      resources:
        requests:
          memory: "256Mi"   # Reasonable request
          cpu: "100m"
        limits:
          memory: "512Mi"
          cpu: "500m"
  restartPolicy: Always
Resource requests and limits on an existing Pod can't be changed with a plain apply, so delete it and re-apply the corrected manifest:
kubectl delete pod memory-hog
kubectl apply -f pending-pod.yaml
Output:
pod "memory-hog" deleted
pod/memory-hog created
Check status:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
memory-hog 1/1 Running 0 5s
Pod transitions from Pending to Running.
Clean up:
kubectl delete pod memory-hog
Practice 3: Diagnose OOMKilled and Adjust Limits
Create a Pod that exceeds its memory limit.
Manifest (save as oom-pod.yaml):
apiVersion: v1
kind: Pod
metadata:
  name: memory-leak-app
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["python", "-c"]
      args:
        - |
          import time
          memory = []
          while True:
              # Allocate 50MB every 100ms
              memory.append(bytearray(50 * 1024 * 1024))
              time.sleep(0.1)
      resources:
        requests:
          memory: "256Mi"
          cpu: "100m"
        limits:
          memory: "256Mi"   # Limit memory to 256Mi
          cpu: "500m"
  restartPolicy: Always
Deploy it:
kubectl apply -f oom-pod.yaml
Output:
pod/memory-leak-app created
Check status immediately:
kubectl get pods
Output (within a few seconds):
NAME READY STATUS RESTARTS AGE
memory-leak-app 0/1 CrashLoopBackOff 1 5s
Describe to see the termination reason:
kubectl describe pod memory-leak-app
Output (relevant section):
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Message:      The container was killed due to an out-of-memory condition
Diagnosis: The application consumes memory faster than the 256Mi limit allows. Kubernetes kills it with OOMKilled.
Fix Options:
- Increase the limit (if the application actually needs more memory):
resources:
  requests:
    memory: "512Mi"
    cpu: "100m"
  limits:
    memory: "1Gi"     # Increase limit
    cpu: "500m"
- Fix the memory leak (if there's a bug):
args:
  - |
    import time
    memory = []
    while True:
        # Keep only the last 2 allocations (~100MB), which fits under the 256Mi limit
        if len(memory) > 2:
            memory.pop(0)
        memory.append(bytearray(50 * 1024 * 1024))
        time.sleep(0.1)
For this example, let's increase the limit. Update oom-pod.yaml with the larger values, then delete the Pod and recreate it (resource fields on an existing Pod can't be changed with a plain apply):
kubectl delete pod memory-leak-app
kubectl apply -f oom-pod.yaml
Check status:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
memory-leak-app 1/1 Running 0 5s
If the Pod stays running (not crashing), the limit was the issue. If it still crashes with the higher limit, there's a memory leak in the code that needs fixing.
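If your cluster runs metrics-server, you can also watch the container's memory climb toward its limit between restarts:
kubectl top pod memory-leak-app --containers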
Clean up:
kubectl delete pod memory-leak-app
Resource Management Best Practices
1. Always set requests and limits for production Pods
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
This enables fair scheduling and makes your Pods far less likely to be evicted under node pressure.
2. Make requests equal to limits for critical workloads (Guaranteed QoS)
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "250m"
With Guaranteed QoS, your Pod is the last candidate for eviction when the node runs low on resources.
3. Start conservative and increase based on monitoring
Set requests low (100m CPU, 128Mi memory) for new services, then monitor actual usage:
kubectl top pods # Shows actual CPU/memory usage
Increase requests based on observed usage + 20% headroom.
4. Use namespaces to isolate resource quotas
apiVersion: v1
kind: Namespace
metadata:
  name: ai-services
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-quota
  namespace: ai-services
spec:
  hard:
    requests.memory: "10Gi"   # All Pods in the namespace combined
    requests.cpu: "5"
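A related object worth knowing about: a LimitRange in the same namespace applies default requests and limits to any container that doesn't declare its own, so BestEffort Pods can't slip in by accident. A sketch with illustrative names and values:
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-defaults
  namespace: ai-services
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        memory: "128Mi"
        cpu: "100m"
      default:               # applied when a container sets no limits
        memory: "256Mi"
        cpu: "250m"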
Try With AI
You now have the mental models and debugging workflow. Let's collaborate with AI to troubleshoot a complex scenario.
Setup: Deploy a multi-container Pod with intentional resource and configuration issues. Use kubectl commands to inspect it, then iterate with AI to fix problems.
Your Assignment:
Create this manifest (save as complex-pod.yaml):
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-app
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 8080
      env:
        - name: ENVIRONMENT
          value: "production"
      resources:
        requests:
          memory: "64Mi"
          cpu: "50m"
        limits:
          memory: "128Mi"
          cpu: "100m"
    - name: sidecar
      image: curlimages/curl:latest
      command: ["sleep", "3600"]
      resources:
        requests:
          memory: "32Mi"
          cpu: "25m"
        limits:
          memory: "64Mi"
          cpu: "50m"
  restartPolicy: Always
Step 1: Deploy and diagnose
kubectl apply -f complex-pod.yaml
kubectl get pods
kubectl describe pod multi-container-app
kubectl logs multi-container-app -c web
kubectl logs multi-container-app -c sidecar
Step 2: Ask AI for analysis
Tell AI: "I've deployed a multi-container Pod with nginx and curl sidecar. Here are the kubectl outputs: [paste describe and logs]. What QoS class is this Pod? How would you monitor resource usage? What would happen if CPU requests were set to 50 instead of 50m?"
Step 3: Interactive exploration
Jump into the Pod and verify:
kubectl exec -it multi-container-app -c web -- /bin/sh
$ ps aux              # Check if nginx is running
$ netstat -tlnp       # Check port bindings
$ env                 # Verify environment variables
$ exit
Note that minimal images may not ship tools like ps or netstat; if they're missing, fall back to kubectl debug or ask AI for alternatives that work with what the image provides.
Step 4: Propose a modification
Based on AI's suggestions, modify the manifest to:
- Change sidecar image to a production-ready one
- Add a liveness probe to the web container
- Adjust resource requests based on typical nginx usage
Ask AI: "Given that nginx typically uses 20-50m CPU and 50-100Mi memory in production, what requests and limits would you recommend? Should this be Guaranteed or Burstable?"
Step 5: Validate and explain
Apply your modified manifest:
kubectl apply -f complex-pod.yaml
kubectl get pods
kubectl top pod multi-container-app # View actual resource usage
Explain to AI:
- Why your resource choices match the QoS class you selected
- What signals you'd monitor to detect problems before they become critical
- How you'd adjust resources if you observed Pod eviction in a high-load scenario
Clean up:
kubectl delete pod multi-container-app