
Advanced Patterns — Timeouts, Futures, and Error Handling

You've learned how to run multiple tasks concurrently. Now imagine this: you're building a system that fetches data from 10 APIs in parallel. One of them starts responding slowly. Then slower. Then it never responds at all.

Without timeouts, your entire system hangs indefinitely. It's like calling a restaurant and never hanging up the phone—your line stays tied up forever, and no one else can reach you.

In production systems, timeouts aren't optional—they're survival. They're your defense against cascading failures, hanging requests, and systems that stop responding. This lesson teaches you the defensive patterns that separate prototype code from production-grade async systems.


Core 1: Timeout Control with asyncio.timeout()

The Timeout Problem

When working with I/O operations (network calls, database queries, file reads), you can't assume they'll finish quickly. Networks are unreliable. Services go down. Queries hang.

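A minimal sketch of the problem; fetch_data and the URL are stand-ins for a real network call, with the server simulated as never responding:

import asyncio

async def fetch_data(url: str) -> dict:
    # Simulate a server that never responds: this await never finishes.
    await asyncio.Event().wait()
    return {"url": url, "status": "ok"}

async def main() -> None:
    # No timeout: if the server hangs, so does your program.
    data = await fetch_data("https://example.com/api")
    print(data)

# asyncio.run(main())  # this call would never return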

The operation has no upper bound on how long it might take. If the server stops responding, await just... waits.

The asyncio.timeout() Pattern

Python 3.11+ provides the asyncio.timeout() context manager to enforce time limits:

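A minimal sketch of the pattern, assuming an illustrative fetch_data coroutine that responds within the limit:

import asyncio

async def fetch_data(url: str) -> dict:
    await asyncio.sleep(2)  # simulate network latency
    return {"url": url, "status": "ok"}

async def main() -> None:
    # Cancels the enclosed operation and raises TimeoutError after 5 seconds.
    async with asyncio.timeout(5):
        data = await fetch_data("https://example.com/api")
    print(data)

asyncio.run(main())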

What happens with asyncio.timeout(5):

  1. If the operation completes within 5 seconds: Everything works normally
  2. If the operation takes longer than 5 seconds: TimeoutError is raised
  3. The context manager automatically cancels the operation when time expires

💬 AI Colearning Prompt:

Ask your AI: "What's the difference between asyncio.timeout() (Python 3.11+) and asyncio.wait_for()? When would I use each?"

Expected Output: AI explains that asyncio.timeout() is the modern context manager approach (preferred over wait_for() for readability), though both achieve the same goal.

Handling Timeout Gracefully

Simply raising an error isn't always helpful. Often you want fallback behavior:

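One way to sketch that, assuming an illustrative fetch_data coroutine and a cached fallback value:

import asyncio
from typing import Any

FALLBACK: dict[str, Any] = {"status": "cached", "items": []}

async def fetch_data(url: str) -> dict[str, Any]:
    await asyncio.sleep(10)  # simulate a slow or hung service
    return {"status": "fresh", "items": [1, 2, 3]}

async def fetch_with_fallback(url: str, timeout: float = 2.0) -> dict[str, Any]:
    try:
        async with asyncio.timeout(timeout):
            return await fetch_data(url)
    except TimeoutError:
        # Degrade gracefully instead of crashing the caller.
        print(f"Timed out after {timeout}s, returning cached data")
        return FALLBACK

async def main() -> None:
    print(await fetch_with_fallback("https://example.com/api"))

asyncio.run(main())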

🎓 Instructor Commentary:

Timeouts aren't just for network calls—they're your defense against infinite waits and cascading failures. A single slow API can propagate delays through your entire system. With timeouts, you contain failures locally and degrade gracefully.


Core 2: Understanding Futures

What Are Futures?

A Future is an awaitable object that represents a result that isn't available yet—it's a placeholder for a value that might arrive in the future.

You've already encountered Futures indirectly:

  • asyncio.create_task() returns a Task (which is a subclass of Future)
  • asyncio.gather() returns a Future that resolves to a list of results
  • Executors return Future objects

In modern Python code, you rarely create Futures manually. The async machinery creates them for you.

Basic Future Example

Here's how Futures work conceptually:

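A minimal sketch of a Future being created by hand, resolved by one task, and awaited by another:

import asyncio

async def set_after(fut: asyncio.Future, delay: float, value: str) -> None:
    # Simulate some work, then fill in the placeholder.
    await asyncio.sleep(delay)
    fut.set_result(value)

async def main() -> None:
    loop = asyncio.get_running_loop()
    fut: asyncio.Future[str] = loop.create_future()  # empty placeholder

    setter = asyncio.create_task(set_after(fut, 1.0, "hello"))

    print("done?", fut.done())   # False: no result yet
    result = await fut           # suspends until set_result() is called
    print("done?", fut.done())   # True
    print("result:", result)
    await setter

asyncio.run(main())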

Key Point: You rarely create Futures manually. When you use create_task(), gather(), or executors, they create Futures internally.

When Do You Actually Use Futures?

Scenario 1: Debugging and Inspection

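A sketch of inspecting a Task's state; because Task subclasses Future, done(), cancelled(), and result() all apply:

import asyncio

async def work() -> str:
    await asyncio.sleep(1)
    return "finished"

async def main() -> None:
    task = asyncio.create_task(work())

    print("done?", task.done())            # False: still running
    print("cancelled?", task.cancelled())  # False

    await task  # let it finish

    print("done?", task.done())            # True
    print("result:", task.result())        # safe only after done() is True

asyncio.run(main())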

Scenario 2: Bridge Sync to Async (Executor Results)

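A sketch of bridging a blocking function into async code; run_in_executor hands back an awaitable Future (blocking_io is illustrative):

import asyncio
import time

def blocking_io(path: str) -> str:
    # Pretend this is a slow, synchronous library call.
    time.sleep(1)
    return f"contents of {path}"

async def main() -> None:
    loop = asyncio.get_running_loop()
    # Runs the blocking call in the default thread pool and returns a Future.
    future = loop.run_in_executor(None, blocking_io, "data.txt")
    result = await future
    print(result)

asyncio.run(main())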

🚀 CoLearning Challenge:

Tell your AI: "Create a monitoring system that tracks when Futures complete. It should check task.done() and print status updates. Explain what Future.result() returns and when you can call it safely."

Expected Output: Code that checks Future state, demonstrates task.done(), result(), and handles edge cases.


Core 3: Exception Handling in Async Code

The Challenge: Exceptions with Await

When you await a coroutine, exceptions can occur while you're suspended. You need to catch them properly:

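A sketch of catching exceptions around an awaited call; fetch_user and its failure modes are illustrative:

import asyncio
import json

async def fetch_user(user_id: int) -> dict:
    await asyncio.sleep(0.1)  # simulate network I/O
    if user_id < 0:
        raise ConnectionError("server unreachable")
    payload = '{"id": 1, "name": "Ada"}'  # pretend this came off the wire
    return json.loads(payload)

async def main() -> None:
    try:
        user = await fetch_user(-1)
        print(user)
    except ConnectionError as exc:
        print(f"Network error: {exc}")
    except json.JSONDecodeError as exc:
        print(f"Bad response payload: {exc}")

asyncio.run(main())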

Key Points:

  1. try/except works normally with await
  2. Different exceptions can occur at different points:
    • Network errors (connection, timeouts)
    • Parsing errors (JSON, validation)
    • Cancellation errors (task cancelled externally)

CancelledError: When Tasks Are Cancelled

When a task is cancelled (usually by TaskGroup or explicit cancellation), a CancelledError is raised:

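A sketch of a task observing its own cancellation, cleaning up, and re-raising:

import asyncio

async def worker() -> None:
    try:
        await asyncio.sleep(10)  # stands in for long-running work
    except asyncio.CancelledError:
        print("worker: cancelled, cleaning up")
        raise  # re-raise so the cancellation actually completes

async def main() -> None:
    task = asyncio.create_task(worker())
    await asyncio.sleep(0.1)  # let the worker start
    task.cancel()             # request cancellation
    try:
        await task
    except asyncio.CancelledError:
        print("main: worker was cancelled")

asyncio.run(main())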

When CancelledError Occurs:

  1. Another task explicitly calls task.cancel()
  2. A TaskGroup encounters an exception and cancels all other tasks
  3. The event loop is shutting down

Teaching Tip:

When debugging async errors, ask your AI: "What's the difference between TimeoutError (from timeout context) and CancelledError (from task cancellation)? How should I handle each differently?"

Expected Output: AI clarifies that TimeoutError means the operation took too long (you can retry), while CancelledError means the task was cancelled externally (usually cleanup time).


Core 4: Common Async Pitfalls and Debugging

Never-Awaited Coroutines

One of the most common async bugs is forgetting the await keyword:

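A sketch of the bug; calling the coroutine function only creates a coroutine object, nothing runs:

import asyncio

async def fetch_data() -> dict:
    await asyncio.sleep(1)
    return {"status": "ok"}

async def main() -> None:
    data = fetch_data()  # BUG: missing await -- the coroutine never runs
    print(data)          # prints a coroutine object, not a dict

asyncio.run(main())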

The Error Message:

RuntimeWarning: coroutine 'fetch_data' was never awaited

The Fix

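The same sketch with the missing await added:

import asyncio

async def fetch_data() -> dict:
    await asyncio.sleep(1)
    return {"status": "ok"}

async def main() -> None:
    data = await fetch_data()  # await actually runs the coroutine
    print(data)                # {'status': 'ok'}

asyncio.run(main())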

Common Mistake: Blocking the Event Loop

Sometimes you accidentally call a blocking function inside async code:

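A sketch of the mistake next to the fix; the heartbeat task makes the stall visible:

import asyncio
import time

async def bad_pause() -> None:
    time.sleep(2)  # BUG: blocks the whole event loop for 2 seconds

async def good_pause() -> None:
    await asyncio.sleep(2)  # suspends this task; other tasks keep running

async def heartbeat() -> None:
    for _ in range(4):
        print("tick")
        await asyncio.sleep(0.5)

async def main() -> None:
    # Swap good_pause() for bad_pause() and the heartbeat stops ticking.
    await asyncio.gather(good_pause(), heartbeat())

asyncio.run(main())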

Key Rule: In async code, always use await asyncio.sleep(), not time.sleep().

Debugging with Your AI Companion

When you get an async error, your AI can help you understand it:

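A sketch that triggers the warning discussed below: asyncio.sleep() is called but never awaited.

import asyncio

async def main() -> None:
    asyncio.sleep(1)  # BUG: missing await -- the pause never happens
    print("done")

asyncio.run(main())
# RuntimeWarning: coroutine 'sleep' was never awaited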

When you see the warning, ask your AI:

"I got this RuntimeWarning: coroutine 'sleep' was never awaited. What does this mean, and how do I fix it?"

The AI will:

  1. Explain the error (coroutine not executed)
  2. Show the fix (add await)
  3. Explain why it matters (task won't complete, resources leak)

Core 5: Resilience Patterns

Retry Logic with Exponential Backoff

Real-world systems are unreliable. Networks fail. Services go down. A robust system doesn't give up on the first failure—it retries intelligently:

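A minimal sketch of the pattern; flaky_call, the attempt cap, and the base delay are illustrative:

import asyncio
import random

async def flaky_call() -> dict:
    # Simulate a service that fails most of the time.
    await asyncio.sleep(0.1)
    if random.random() < 0.7:
        raise ConnectionError("service unavailable")
    return {"status": "ok"}

async def fetch_with_retry(max_attempts: int = 4, base_delay: float = 1.0) -> dict:
    for attempt in range(1, max_attempts + 1):
        try:
            return await flaky_call()
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller decide
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({exc}), retrying in {delay}s")
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")

async def main() -> None:
    print(await fetch_with_retry())

asyncio.run(main())

In production you would usually add a small random jitter to each delay so many clients don't retry in lockstep.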

How Exponential Backoff Works:

  Attempt | Delay | Total Wait
  1       | 1s    | 1s
  2       | 2s    | 3s
  3       | 4s    | 7s
  4       | 8s    | 15s

Instead of hammering a failing service, you give it time to recover.

Partial Failure Handling

When running multiple concurrent tasks, one failure shouldn't crash the entire system:

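A sketch using gather(return_exceptions=True), so a failing service comes back as an exception object instead of crashing the whole batch:

import asyncio

async def fetch_service(name: str, fail: bool = False) -> str:
    await asyncio.sleep(0.2)  # simulate network I/O
    if fail:
        raise ConnectionError(f"{name} is down")
    return f"{name}: ok"

async def main() -> None:
    results = await asyncio.gather(
        fetch_service("users"),
        fetch_service("orders", fail=True),
        fetch_service("billing"),
        return_exceptions=True,  # failures are returned, not raised
    )
    for result in results:
        if isinstance(result, Exception):
            print(f"failed: {result}")
        else:
            print(result)

asyncio.run(main())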

🚀 CoLearning Challenge:

Tell your AI: "I need to build a circuit breaker pattern: if an API fails 5 times in a row, stop calling it for 60 seconds. After 60 seconds, try again (one call). If it succeeds, resume normal operation. If it fails, wait another 60 seconds. Implement this as a class with AI's help. Explain the tradeoffs: What if the API recovers before 60s?"

Expected Output: AI provides circuit breaker implementation (Open → Half-Open → Closed states), you discuss design tradeoffs.


Code Example Validation Steps

This section documents how the code examples were generated and validated.

Specification-to-Code Flow

For all code examples in this lesson:

Specification: Python 3.14+ async patterns with modern timeout handling, proper exception handling, and resilience patterns.

AI Prompts Used (representative):

"Generate Python 3.14 async code that uses asyncio.timeout() context manager
to fetch an API with a 5-second timeout. Handle TimeoutError gracefully with
a fallback value. Include full type hints."

"Create a resilient retry function using exponential backoff. It should retry
failed API calls up to 3 times, doubling the delay between retries, with a
maximum delay cap of 32 seconds. Add jitter to prevent thundering herd."

Validation Steps Performed:

  1. ✓ All code uses asyncio.timeout() (Python 3.11+) rather than the older asyncio.wait_for()
  2. ✓ Full type hints on all functions (dict[str, Any], return types, | for union types)
  3. ✓ Code runs on Python 3.14+ (tested locally)
  4. ✓ Proper exception handling with specific exception types
  5. ✓ No hardcoded secrets or credentials
  6. ✓ All examples are runnable (import statements, complete code blocks)
  7. ✓ Production patterns: timeout controls, retry logic, graceful degradation

Challenge 3: The Async Context Manager Workshop

This challenge helps you master resilient async patterns through hands-on experimentation and AI collaboration.

Initial Exploration

Your Challenge: Experience timeouts and error handling without AI guidance.

Deliverable: Create /tmp/timeout_discovery.py containing:

  1. A function that is sometimes slow (takes 5+ seconds), simulated with asyncio.sleep()
  2. Code that calls it with asyncio.timeout(2) — should raise TimeoutError
  3. Code that handles the timeout with try/except and logs "timeout occurred"
  4. Test different timeout values and observe behavior

Expected Observation:

  • Generous timeout (10s): function completes normally
  • Timeout=2s: raises TimeoutError after 2 seconds
  • Handling timeout: exception caught, can continue execution

Self-Validation:

  • What's the difference between TimeoutError and CancelledError?
  • If you have 3 concurrent tasks and 1 times out, what happens to the others?
  • How would you retry a timed-out operation?

Understanding Timeout and Retry Patterns

💬 AI Colearning Prompt: "I built an async API client that sometimes hangs forever waiting for responses. I added a timeout, but now I get TimeoutError and my whole program crashes. Teach me how to handle timeouts gracefully. Show me: 1) How to timeout a single request, 2) How to retry on timeout, 3) How to continue fetching other APIs if one times out. Code examples please."

What You'll Learn: Timeout mechanics (asyncio.timeout as context manager), retry pattern with exponential backoff, and partial failure handling.

Clarifying Question: Deepen your understanding:

"You showed me catching TimeoutError inside a gather() call. But what's the difference between TimeoutError from asyncio.timeout() vs CancelledError from task cancellation? When would I see each one?"

Expected Outcome: AI clarifies timeout behavior and task lifecycle. You understand that timeouts and cancellations are different mechanisms with different implications.


Improving Resilience Patterns

Activity: Work with AI to improve timeout implementations and add retry logic.

First, ask AI to generate a basic timeout implementation:

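A sketch of what that baseline might look like; the service names and delays are illustrative, and a timeout simply yields None with no retry:

import asyncio

async def call_api(name: str, delay: float) -> dict | None:
    try:
        async with asyncio.timeout(2):  # per-request timeout
            await asyncio.sleep(delay)  # simulate the API call
            return {"service": name, "status": "ok"}
    except TimeoutError:
        print(f"{name}: timed out")
        return None  # no retry -- the data is simply lost

async def main() -> None:
    results = await asyncio.gather(
        call_api("fast", 0.5),
        call_api("medium", 1.5),
        call_api("slow", 3.0),  # exceeds the 2-second timeout
    )
    print(results)

asyncio.run(main())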

Your Task:

  1. Run this. One or more calls will time out (a 3s sleep against a 2s timeout)
  2. Identify the issue: no retry logic, timeouts are fatal
  3. Teach AI:

"Your code times out and returns None. But what if I retry once? What if I retry 3 times with exponential backoff (wait 1s, then 2s, then 4s between attempts)? Show me the retry pattern. How would I implement exponential backoff?"

Your Edge Case Discovery: Ask AI:

"What happens if I set a global timeout for all 3 API calls combined (instead of per-request)? Like 'fetch all 3 within 5 seconds total, but don't care how they divide the time'? That's different from per-request timeout. Show me both patterns and explain when to use each."

Expected Outcome: You discover retry strategy (exponential backoff), global vs per-request timeouts, and circuit breaker concepts. You teach AI the resilience patterns production systems need.


Building a Resilient Data Fetcher

Capstone Activity: Build a resilient multi-source data fetcher.

Specification:

  • Fetch from 6 external services (simulated with asyncio.sleep)
  • 3 services: normal (0.5s), 1 service: slow (4s), 2 services: flaky (random timeout)
  • Per-request timeout: 2 seconds
  • Retry logic: up to 3 attempts with exponential backoff (1s, 2s, 4s between retries)
  • Global timeout: entire operation must complete within 15 seconds
  • Return: {service_name: (status, data/error_msg, retry_count)}
  • Type hints throughout

Deliverable: Save to /tmp/resilient_fetcher.py

Testing Your Work:

python /tmp/resilient_fetcher.py
# Expected output:
# Service 1: success (data, 1 attempt)
# Service 2: success (data, 1 attempt)
# Service 3: success (data, 2 attempts - retried once)
# Service 4: timeout (after 3 attempts)
# Service 5: success (data, 1 attempt)
# Service 6: timeout (after 2 attempts, global timeout kicked in)
# Total time: ~12-15 seconds

Validation Checklist:

  • Code runs without crashing
  • Slow services are retried (retry count > 1)
  • Global timeout prevents infinite waits (completes within 15s)
  • Failed services don't prevent others from completing
  • Exponential backoff visible in timing (gaps between retries increase)
  • Type hints complete
  • Follows production pattern (asyncio.run at top, try/except with proper cleanup)

Time Estimate: 32-38 minutes (5 min discover, 8 min teach/learn, 9 min edge cases, 10-17 min build artifact)

Key Takeaway: You understand how production systems handle cascading failures—timeouts prevent hangs, retries handle transient errors, and circuit breakers prevent overwhelming struggling services.


Try With AI

Why do production systems need timeouts, retries, AND circuit breakers when a single timeout seems sufficient?

🔍 Explore Timeout Patterns:

"Show me asyncio.wait_for() with a 2-second timeout wrapping a slow API call. What exception gets raised? Compare this to asyncio.wait() with timeout parameter. When do I use each?"

🎯 Practice Retry Logic:

"Implement exponential backoff retry (1s, 2s, 4s delays) for a flaky service that fails 70% of the time. Use asyncio.sleep() for delays. Show how retry count and total elapsed time differ."

🧪 Test Circuit Breaker:

"Create a circuit breaker that opens after 3 consecutive failures, stays open for 10s, then allows 1 test request. Show state transitions: closed → open → half-open → closed. What prevents cascading failures?"

🚀 Apply to Resilient Gateway:

"Design an API gateway with per-service timeouts (2s), 3-attempt retries with exponential backoff, and circuit breakers. Show how this handles: slow services, flaky services, and completely down services."