Service Discovery Deep Dive
You've learned how Services provide stable IP addresses to dynamic Pods. But what happens when Agent A in agents namespace tries calling Agent B's API in services namespace and gets "host not found"? The answer lies in Kubernetes DNS.
Service discovery is the mechanism that translates service names (like api.services.svc.cluster.local) into IP addresses. When DNS resolution fails, entire microservices architectures collapse. In this lesson, you'll build mental models of how Kubernetes DNS works, then debug common connectivity failures systematically.
The Service Discovery Problem: "Host Not Found"
Imagine this scenario:
Your agent in agents namespace calls the API service in services namespace:
kubectl run -it agent-debug --image=curlimages/curl --rm -- \
curl http://api:8080/health
Output:
curl: (6) Could not resolve host: api
Why? The agent can't resolve the short name. The Service exists (its full name is api.services.svc.cluster.local), but without understanding Kubernetes DNS architecture, you're debugging blind.
How Kubernetes DNS Works: CoreDNS Architecture
Kubernetes doesn't use external DNS for internal service discovery. Instead, every cluster runs CoreDNS—a DNS server that understands Kubernetes services and translates names to IP addresses.
The DNS Chain
When a Pod requests api.services:
1. The application asks the container's stub resolver, which reads /etc/resolv.conf (written by kubelet; see the check below)
2. The query is sent to the kube-dns Service ClusterIP (typically 10.96.0.10)
3. A CoreDNS Pod behind that Service answers from its watch of the cluster's Services
4. The IP address is returned to the Pod
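You can see the first two steps in any Pod's /etc/resolv.conf. A quick check from a throwaway Pod in the agents namespace (exact values vary by cluster; 10.96.0.10 is the common default):
kubectl run -it resolv-check -n agents --image=busybox:1.35 --rm --restart=Never -- \
cat /etc/resolv.conf
Typical output:
search agents.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5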
Key insight: CoreDNS is itself deployed as a Pod (or Pods) in the kube-system namespace, managed by a Deployment:
kubectl get pods -n kube-system -l k8s-app=kube-dns
Output:
NAME                       READY   STATUS    RESTARTS   AGE
coredns-7db6d8ff4c-9xk2m   1/1     Running   0          3d
coredns-7db6d8ff4c-qv8wz   1/1     Running   0          3d
CoreDNS runs as multiple replicas for redundancy. Each CoreDNS pod watches the Kubernetes API for Service changes and automatically updates DNS records.
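CoreDNS behavior (the cluster domain, upstream forwarding, caching) is configured through its Corefile. On kubeadm-based clusters this lives in a ConfigMap named coredns, though the name can vary by distribution:
kubectl get configmap coredns -n kube-system -o yaml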
DNS Naming Hierarchy: FQDN Explained
Every service in Kubernetes has a Fully Qualified Domain Name (FQDN). Understanding this hierarchy is critical for cross-namespace discovery.
The FQDN Structure
service.namespace.svc.cluster.local
Example breakdown for a service named api in services namespace:
api ← Service name
.services ← Namespace
.svc ← Service subdomain (always literal)
.cluster.local ← Cluster domain (configurable, usually "cluster.local")
Complete FQDN: api.services.svc.cluster.local
Short vs Long Names (Context Matters)
A Pod in the same namespace can use a short name:
# From a Pod in "services" namespace
curl http://api:8080/health
Output:
HTTP/1.1 200 OK
{"status": "healthy"}
A Pod in a different namespace must use the full FQDN:
# From a Pod in "agents" namespace
curl http://api.services.svc.cluster.local:8080/health
Output:
HTTP/1.1 200 OK
{"status": "healthy"}
Why the FQDN Matters
Short names work only because the Pod's resolver appends search domains from /etc/resolv.conf, starting with the Pod's own namespace. When you request api from namespace agents:
- The resolver tries api.agents.svc.cluster.local (not found: the service lives in services, not agents)
- It continues through the remaining search domains (svc.cluster.local, cluster.local) with no match
- Failure: "host not found"
The FQDN bypasses this search chain entirely.
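You can reproduce both the failure and the fix with throwaway Pods in the agents namespace (the Pod names t1 and t2 are arbitrary, and the exact error text varies by nslookup version):
kubectl run -it t1 -n agents --image=busybox:1.35 --rm --restart=Never -- \
nslookup api
# fails with NXDOMAIN after exhausting the search domains
kubectl run -it t2 -n agents --image=busybox:1.35 --rm --restart=Never -- \
nslookup api.services.svc.cluster.local
# succeeds, resolving to the Service's ClusterIP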
Debugging DNS: nslookup and dig
When service discovery fails, manual DNS debugging reveals the root cause. You'll use two tools: nslookup (simpler) and dig (more detailed).
Setting Up a Debug Pod
Create a debug Pod with DNS tools:
kubectl run -it debug-dns --image=busybox:1.35 --rm --restart=Never -- \
/bin/sh
Once inside the Pod:
/ #
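Note that busybox ships nslookup but not dig. For the dig examples later in this lesson, use an image with fuller DNS tooling; nicolaka/netshoot is a common choice:
kubectl run -it debug-dns --image=nicolaka/netshoot --rm --restart=Never -- \
/bin/bash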
Using nslookup: DNS Lookup Basics
Query a service name:
nslookup api.services.svc.cluster.local
Output (Success):
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: api.services.svc.cluster.local
Address: 10.97.45.123
The server 10.96.0.10 is the kube-dns service. The resolved address 10.97.45.123 is the Service's ClusterIP.
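You can cross-check that the resolved address really is the Service's ClusterIP by asking the API server directly:
kubectl get svc api -n services -o jsonpath='{.spec.clusterIP}'
Output:
10.97.45.123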
nslookup api.agents.svc.cluster.local
Output (Failure - Wrong Namespace):
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find api.agents.svc.cluster.local: NXDOMAIN
NXDOMAIN means "Non-Existent Domain"—the service doesn't exist in that namespace.
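If a lookup returns NXDOMAIN and you aren't sure which namespace the service actually lives in, a field selector searches every namespace:
kubectl get svc --all-namespaces --field-selector metadata.name=api
Output:
NAMESPACE   NAME   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)
services    api    ClusterIP   10.97.45.123   <none>        8080/TCP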
Using dig: Detailed DNS Information
dig provides deeper insights into DNS records:
dig api.services.svc.cluster.local
Output:
; <<>> DiG 9.16.1-Ubuntu <<>> api.services.svc.cluster.local
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;api.services.svc.cluster.local. IN A
;; ANSWER SECTION:
api.services.svc.cluster.local. 30 IN A 10.97.45.123
;; Query time: 2 msec
Key sections:
- QUESTION: What are we looking up? (A record for IP address)
- ANSWER: The IP address (10.97.45.123)
- Query time: DNS resolution took 2 milliseconds (fast = healthy)
SRV Records: Service Records for Load Balancing
CoreDNS publishes SRV records alongside A records. For a headless service (covered in the next section), each SRV record points at an individual Pod:
dig SRV api.services.svc.cluster.local
Output:
;; ANSWER SECTION:
api.services.svc.cluster.local. 30 IN SRV 0 100 8080 api-5f7d8c4f9z.api.services.svc.cluster.local.
api.services.svc.cluster.local. 30 IN SRV 0 100 8080 api-5f7d8c4f9a.api.services.svc.cluster.local.
;; ADDITIONAL SECTION:
api-5f7d8c4f9z.api.services.svc.cluster.local. 30 IN A 10.244.0.5
api-5f7d8c4f9a.api.services.svc.cluster.local. 30 IN A 10.244.0.6
SRV records include:
- Port number (8080)
- Pod-specific FQDNs (api-5f7d8c4f9z.api.services.svc.cluster.local)
- Pod IPs (10.244.0.5, 10.244.0.6)
This allows clients to connect directly to individual Pods instead of going through the Service's single virtual IP.
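The Kubernetes DNS spec also defines per-port SRV names of the form _portname._protocol.service.namespace.svc.cluster.local. Assuming the Service's port is named http and uses TCP, the query would be:
dig SRV _http._tcp.api.services.svc.cluster.local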
Headless Services: Direct Pod Discovery
Some applications need to connect directly to Pods, not through a load-balanced Service. Database replication, peer discovery, and stateful applications all require this. That's where headless services come in.
ClusterIP vs Headless
A regular Service (ClusterIP):
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: services
spec:
  selector:
    app: api
  ports:
    - port: 8080
      targetPort: 8080
Returns a single virtual IP:
kubectl get svc -n services api
Output:
NAME   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)
api    ClusterIP   10.97.45.123   <none>        8080/TCP
A headless Service (no ClusterIP):
apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: databases
spec:
  clusterIP: None   # This makes it headless
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
Returns Pod IPs directly:
kubectl get svc -n databases db
Output:
NAME   TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)
db     ClusterIP   None         <none>        5432/TCP
DNS Resolution: Headless Service
Query a headless service:
dig db.databases.svc.cluster.local
Output:
;; ANSWER SECTION:
db.databases.svc.cluster.local. 30 IN A 10.244.1.5
db.databases.svc.cluster.local. 30 IN A 10.244.1.6
db.databases.svc.cluster.local. 30 IN A 10.244.1.7
Instead of one virtual IP, DNS returns all Pod IPs that match the selector. Clients can connect to any Pod directly.
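StatefulSets build on this mechanism: behind a headless Service, each Pod also gets a stable per-Pod DNS name of the form podname.service.namespace.svc.cluster.local. Assuming a StatefulSet named db with serviceName: db, you could address a single replica directly:
nslookup db-0.db.databases.svc.cluster.local
Output:
Name: db-0.db.databases.svc.cluster.local
Address: 10.244.1.5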
Troubleshooting Endpoint Mismatches
When DNS resolves but connections fail, the problem is usually endpoint mismatch: the Service selector doesn't match the Pod labels.
Discovering Endpoints
List endpoints for a service:
kubectl get endpoints api -n services
Output (Healthy):
NAME   ENDPOINTS         AGE
api    10.244.0.5:8080   5m
The Service knows about one Pod at IP 10.244.0.5.
Output (No Endpoints):
NAME   ENDPOINTS   AGE
api    <none>      5m
The Service found zero Pods. This means the selector isn't matching any Pods.
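kubectl describe puts the selector and the current endpoints side by side, which makes this failure easy to spot:
kubectl describe svc api -n services
Output (trimmed to the relevant fields):
Name:       api
Namespace:  services
Selector:   app=api
Endpoints:  <none>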
Debugging Selector Mismatch
Check the Service's selector:
kubectl get svc api -n services -o jsonpath='{.spec.selector}'
Output:
{"app":"api"}
The Service looks for Pods with label app: api.
Check Pods in the namespace:
kubectl get pods -n services --show-labels
Output (Mismatch):
NAME             STATUS    LABELS
api-5f7d8c       Running   app=backend,version=v1
api-deployment   Running   app=api-service,version=v1
Neither Pod has app: api. The selector app: api matches zero Pods.
Output (Match):
NAME             STATUS    LABELS
api-5f7d8c       Running   app=api,version=v1
api-deployment   Running   app=api,version=v1
Both Pods have app: api. Now check endpoints again:
kubectl get endpoints api -n services
Output:
NAME   ENDPOINTS                         AGE
api    10.244.0.5:8080,10.244.0.6:8080   5m
Both Pods are now listed as endpoints.
Manual Label Verification
Use kubectl get pods with label selector syntax:
kubectl get pods -n services -l app=api
Output (No Results):
No resources found in services namespace.
The selector matches zero Pods.
Output (Match):
NAME             READY   STATUS    RESTARTS   AGE
api-5f7d8c       1/1     Running   0          5m
api-deployment   1/1     Running   0          3m
The selector matches Pods successfully. Now endpoints should be populated.
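To avoid retyping label selectors by hand, you can derive the -l argument from the Service itself. A minimal shell sketch, assuming the selector prints as JSON exactly as shown earlier:
sel=$(kubectl get svc api -n services -o jsonpath='{.spec.selector}' | tr -d '{}"' | tr ':' '=')
kubectl get pods -n services -l "$sel"
If this returns no Pods, the endpoints list will be empty no matter what DNS says.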
Cross-Namespace Service Discovery
Services in different namespaces are reachable, but their short names aren't: short-name resolution is scoped to the Pod's own namespace. To reach across namespaces, you must include the namespace, or most robustly the full FQDN.
Accessing Services Across Namespaces
From a Pod in agents namespace accessing api service in services namespace:
# Short name (fails - wrong namespace)
curl http://api:8080/health
# Error: Could not resolve host: api
# Full FQDN (succeeds)
curl http://api.services.svc.cluster.local:8080/health
# HTTP/1.1 200 OK
The FQDN is unambiguous: it resolves identically from any namespace. (In most clusters the namespace-qualified form api.services also works, because the Pod's search path includes svc.cluster.local, but the FQDN doesn't depend on search-domain configuration at all.)
Default Namespace Assumption
Pods assume their own namespace when resolving short names. The resolver walks the search domains in the Pod's /etc/resolv.conf:
1. servicename.ownnamespace.svc.cluster.local (succeeds if the service lives in the Pod's own namespace)
2. servicename.svc.cluster.local, then servicename.cluster.local (the remaining search domains, which never match a bare service name)
3. servicename as an absolute name (fails inside the cluster)
This is why short names only work in-namespace.
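A related trick: a trailing dot marks a DNS name as absolute, telling the resolver to skip the search list entirely. That's useful when you want to rule search-domain behavior out of a test:
nslookup api.services.svc.cluster.local.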
Connectivity Test: From Debugging to Validation
Let's walk through a complete debugging scenario, from failure to root cause to fix.
Scenario: Agent A Calls Agent B
Setup:
- Agent A Pod in agents namespace
- Agent B Service in services namespace
- Agent A can't reach Agent B
Step 1: Verify Service Exists
kubectl get svc -n services
Output:
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)
agent-b   ClusterIP   10.97.123.45   <none>        8080/TCP
Service exists. Step 2: Check endpoints.
Step 2: Verify Endpoints
kubectl get endpoints agent-b -n services
Output:
NAME      ENDPOINTS   AGE
agent-b   <none>      10m
No endpoints. The Service has no Pods. Step 3: Check selectors.
Step 3: Check Service Selector
kubectl get svc agent-b -n services -o jsonpath='{.spec.selector}'
Output:
{"app":"agent-b","version":"v1"}
Step 4: Check if Pods match.
Step 4: List Pods with Required Labels
kubectl get pods -n services -l app=agent-b,version=v1
Output:
No resources found in services namespace.
No Pods match. Step 5: Check what Pods exist.
Step 5: List All Pods in Namespace
kubectl get pods -n services --show-labels
Output:
NAME             READY   STATUS    LABELS
agent-b-5f7d8c   1/1     Running   app=agent-b,version=v2
The Pod has label version=v2 but the Service requires version=v1.
Root Cause: Label mismatch.
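If v2 is actually the version you want to serve, the faster fix is on the Service side: update its selector instead of redeploying the Pods. A sketch using a strategic merge patch:
kubectl patch svc agent-b -n services \
-p '{"spec":{"selector":{"app":"agent-b","version":"v2"}}}'
The walkthrough below takes the other route and brings the Pods in line with the Service.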
Fix: Update Deployment to match Service selector:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-b
  namespace: services
spec:
  selector:
    matchLabels:
      app: agent-b
      version: v1   # Match Service selector
  template:
    metadata:
      labels:
        app: agent-b
        version: v1   # Match Service selector
    spec:
      containers:
        - name: agent-b
          image: myregistry.azurecr.io/agent-b:latest
          ports:
            - containerPort: 8080
Apply and verify. (Note: a Deployment's spec.selector is immutable after creation, so if the existing Deployment was created with version: v2 in its selector, delete and recreate it rather than applying in place.)
kubectl apply -f agent-b-deployment.yaml
kubectl get endpoints agent-b -n services
Output:
NAME      ENDPOINTS          AGE
agent-b   10.244.0.10:8080   2m
Endpoints now populated. Test from Agent A:
kubectl run -it agent-a-test --image=curlimages/curl --rm -- \
curl http://agent-b.services.svc.cluster.local:8080/health
Output:
HTTP/1.1 200 OK
{"status": "healthy"}
Success.
Try With AI
Setup: You have three microservices in Kubernetes:
- auth-service in auth namespace (handles authentication)
- api-gateway in api namespace (routes requests)
- user-db in databases namespace (PostgreSQL database)
Both api-gateway and auth-service are running, but api-gateway cannot reach auth-service. Connections time out.
Part 1: Manual Diagnosis
Before asking AI, diagnose manually:
1. Verify auth-service exists:
   kubectl get svc -n auth auth-service
2. Check endpoints:
   kubectl get endpoints auth-service -n auth
3. Query DNS from a debug Pod:
   kubectl run -it debug --image=busybox:1.35 --rm --restart=Never -- \
   nslookup auth-service.auth.svc.cluster.local
4. Check the Service selector:
   kubectl get svc auth-service -n auth -o jsonpath='{.spec.selector}'
5. List Pods in the namespace and compare labels:
   kubectl get pods -n auth --show-labels
Part 2: Gathering Evidence
Before consulting AI, collect:
- Service definition (YAML)
- Endpoint status
- Pod labels
- DNS query results
- Error message from API Gateway logs
This evidence determines whether the issue is DNS, endpoints, or labels.
Part 3: Collaboration with AI
Once you've gathered evidence, ask AI:
"I have auth-service in the auth namespace. DNS resolves auth-service.auth.svc.cluster.local to 10.97.50.100, but api-gateway in the api namespace gets connection timeouts. Here's the Service definition [paste YAML], the endpoints output shows <none>, and the Pods in the namespace have labels [paste labels]. What's the likely root cause?"
AI will identify the mismatch between Service selector and Pod labels, or point out that endpoints are missing entirely.
Part 4: Validation
Test your fix:
kubectl run -it api-test --image=curlimages/curl --rm -- \
curl -v http://auth-service.auth.svc.cluster.local:8080/health
Connection should succeed (HTTP 200 or appropriate response).
Compare your manual diagnosis to AI's analysis. What did you miss? What did AI catch that you didn't?