Lesson 4: CPU-Bound Work — GIL and InterpreterPoolExecutor
Opening Hook
Here's a puzzle: In Lesson 1, you learned that asyncio lets you run multiple tasks concurrently. So why not use asyncio for CPU-heavy calculations?
Try this thought experiment. You have a function that does heavy math (factorials, cryptography, data analysis). You create 4 async tasks that call this function. With asyncio, you'd expect them to run concurrently, right?
Wrong.
While the tasks are technically concurrent (the event loop switches between them), they run slower—not faster—than sequential execution. What's going on?
The culprit: the Global Interpreter Lock (GIL). And this lesson teaches you how to escape it using Python 3.14's new InterpreterPoolExecutor.
What Is the GIL, Really? (Brief Intro)
Python's Global Interpreter Lock (GIL) is a mechanism that allows only one thread to execute Python bytecode at a time. This was a design choice made to simplify memory management in CPython (the standard Python interpreter). The GIL prevents true parallelism for CPU-bound work—even with multiple threads, only one thread can run Python code at any moment. Threading helps with I/O-bound work (one thread waits while others run), but for CPU-bound tasks where every thread is doing calculations, the GIL becomes a bottleneck.
Deep exploration of GIL internals (how it works, why it exists, free-threaded mode) is covered in Chapter 29. For now, understand this simple fact: If you want true parallelism for CPU-bound work in Python, you need separate interpreters, not threads.
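Before running the benchmarks below, it can help to confirm what your interpreter supports. This is a minimal sketch (not part of the lesson's benchmark code) that checks for Python 3.14's InterpreterPoolExecutor and, where available, whether the GIL is enabled:

import sys
import concurrent.futures

# InterpreterPoolExecutor ships with Python 3.14+
print("Python 3.14+:", sys.version_info >= (3, 14))
print("InterpreterPoolExecutor available:",
      hasattr(concurrent.futures, "InterpreterPoolExecutor"))

# Python 3.13+ exposes a private helper reporting GIL status;
# it returns False only on free-threaded builds (see Chapter 29)
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled:", sys._is_gil_enabled())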
💬 AI Colearning Prompt
"Ask your AI: Why does Python have a GIL? What problem was it solving originally, and why haven't Python developers removed it?"
Why Threading Fails for CPU-Bound Work
Let's make this concrete with a benchmark.
Code Example 1: CPU-Bound Function
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from concurrent.futures import InterpreterPoolExecutor  # New in Python 3.14

def cpu_intensive_task(n: int) -> int:
    """Compute factorial recursively (CPU-bound)."""
    if n <= 1:
        return 1
    return n * cpu_intensive_task(n - 1)

# Simulate heavy calculation
def heavy_calculation(iterations: int) -> int:
    """Run a CPU-intensive loop (no I/O)."""
    result = 0
    for i in range(iterations):
        result += i ** 2  # Math-heavy operation
    return result
These functions spend 100% of their time doing math, with no waiting for I/O. Perfect for testing parallelism.
🎓 Instructor Commentary
In AI-native development, you don't memorize the GIL limitation—you recognize the pattern: "My task is CPU-heavy, so threading won't help." That recognition is worth more than any theory.
Code Example 2: Threading Benchmark (Shows the Problem)
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark_threading(num_workers: int = 4, iterations: int = 50_000_000) -> None:
    """Benchmark CPU-bound work with threading."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Submit 4 tasks to 4 threads
        futures = [
            executor.submit(heavy_calculation, iterations)
            for _ in range(num_workers)
        ]
        results = [f.result() for f in futures]
    elapsed = time.perf_counter() - start
    print(f"Threading (4 workers): {elapsed:.2f}s")
    print(f"Result: {results}")

# Single-threaded baseline
def benchmark_sequential(iterations: int = 50_000_000) -> None:
    """Benchmark sequential execution (no parallelism)."""
    start = time.perf_counter()
    results = [heavy_calculation(iterations) for _ in range(4)]
    elapsed = time.perf_counter() - start
    print(f"Sequential (1 thread): {elapsed:.2f}s")
    print(f"Result: {results}")

if __name__ == "__main__":
    print("=== CPU-Bound Work Benchmarks ===\n")
    benchmark_sequential()
    benchmark_threading()
Sample Output (on 4-core machine):
=== CPU-Bound Work Benchmarks ===
Sequential (1 thread): 4.53s
Threading (4 workers): 6.12s
Notice: Threading is SLOWER, not faster. Why? Because the GIL forces the 4 threads to compete for access to the single interpreter. Context switching overhead makes it worse than sequential execution.
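For contrast, the same ThreadPoolExecutor gives a real speedup when each task spends its time waiting instead of computing, because a thread releases the GIL while it waits. A minimal sketch, using time.sleep as a stand-in for network or disk waits:

import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_task(seconds: float = 1.0) -> None:
    """Stand-in for an I/O wait: time.sleep releases the GIL."""
    time.sleep(seconds)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fake_io_task) for _ in range(4)]
    for f in futures:
        f.result()
print(f"4 one-second waits finished in ~{time.perf_counter() - start:.2f}s")  # ~1s, not ~4s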
🚀 CoLearning Challenge
Ask your AI Co-Teacher:
"Why does threading make CPU-bound work slower instead of faster? Explain how the GIL causes this contention and what context switching adds."
Expected Outcome: You'll understand that the GIL makes threading counterproductive for CPU work—the overhead of thread switching exceeds any benefit.
InterpreterPoolExecutor: The Solution (Python 3.14+)
Here's Python 3.14's elegant solution: separate interpreters, separate GILs.
Instead of one interpreter shared among threads (competing for the GIL), InterpreterPoolExecutor creates a pool of independent Python interpreters. Each interpreter has its own GIL. No sharing = no contention = true parallelism.
Core Concept: Separate Interpreters = Separate GILs
Traditional Threading (1 interpreter, 1 GIL):
┌─────────────────────────────┐
│ One Python Interpreter │
│ Thread 1 │ Thread 2 │ GIL │
│ (waiting for GIL) │
│ (only 1 can run at a time) │
└─────────────────────────────┘
InterpreterPoolExecutor (4 interpreters, 4 GILs):
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Interpreter1 │ │ Interpreter2 │ │ Interpreter3 │ │ Interpreter4 │
│ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker 4 │
│ (GIL 1) │ │ (GIL 2) │ │ (GIL 3) │ │ (GIL 4) │
│ Running │ │ Running │ │ Running │ │ Running │
│ (all in true parallel on 4 cores) │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
Code Example 3: InterpreterPoolExecutor Benchmark (Shows the Solution)
import time
from concurrent.futures import InterpreterPoolExecutor  # Python 3.14+

def benchmark_interpreter_pool(num_workers: int = 4, iterations: int = 50_000_000) -> None:
    """Benchmark CPU-bound work with InterpreterPoolExecutor."""
    start = time.perf_counter()
    with InterpreterPoolExecutor(max_workers=num_workers) as executor:
        # Submit 4 tasks to 4 separate interpreters
        futures = [
            executor.submit(heavy_calculation, iterations)
            for _ in range(num_workers)
        ]
        results = [f.result() for f in futures]
    elapsed = time.perf_counter() - start
    print(f"InterpreterPoolExecutor (4 workers): {elapsed:.2f}s")
    print(f"Result: {results}")
    print(f"Speedup: {4.53 / elapsed:.2f}x")  # 4.53s = sequential baseline from Code Example 2

if __name__ == "__main__":
    print("=== CPU-Bound Work Benchmarks ===\n")
    print("Sequential (1 thread): 4.53s")
    benchmark_interpreter_pool()
Sample Output (on 4-core machine):
=== CPU-Bound Work Benchmarks ===
Sequential (1 thread): 4.53s
InterpreterPoolExecutor (4 workers): 1.15s
Speedup: 3.94x
Nearly 4x speedup on 4 cores! That's what true parallelism looks like.
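The 4.53s baseline above is just a sample number from one machine. If you want the speedup computed from your own hardware, here is a minimal self-contained sketch (it re-defines heavy_calculation from Code Example 1 so it runs on its own); expect the numbers to vary with core count and system load:

import time
from concurrent.futures import InterpreterPoolExecutor

def heavy_calculation(iterations: int) -> int:
    """Same CPU-bound loop as Code Example 1."""
    result = 0
    for i in range(iterations):
        result += i ** 2
    return result

def compare_speedup(iterations: int = 50_000_000, num_workers: int = 4) -> None:
    """Measure a sequential baseline, then the interpreter pool, and report speedup."""
    start = time.perf_counter()
    for _ in range(num_workers):
        heavy_calculation(iterations)
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    with InterpreterPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(heavy_calculation, iterations) for _ in range(num_workers)]
        for f in futures:
            f.result()
    parallel = time.perf_counter() - start

    print(f"Sequential: {sequential:.2f}s | Pool: {parallel:.2f}s | Speedup: {sequential / parallel:.2f}x")

if __name__ == "__main__":
    compare_speedup()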
✨ Teaching Tip
Use Claude Code to explore the overhead: "Create a benchmark comparing InterpreterPoolExecutor with 1, 2, 4, and 8 workers on your machine. What's the maximum speedup you observe?"
Bridging CPU Work into Async Code
Now here's the critical pattern: How do you use InterpreterPoolExecutor inside an async program?
The answer: loop.run_in_executor()—a bridge between sync functions and async code.
Code Example 4: Async Executor Integration with run_in_executor()
import asyncio
import time
from concurrent.futures import InterpreterPoolExecutor  # Python 3.14+

# Sync function (CPU-bound)
def cpu_intensive_work(data: str) -> str:
    """CPU-heavy processing of a string payload (no await)."""
    checksum = 0
    for i in range(10_000_000):
        checksum = (checksum * 31 + i + len(data)) % 1_000_000_007  # Math-heavy loop
    return f"{data}:{checksum}"

# Async function using executor
async def process_with_executor(
    executor: InterpreterPoolExecutor,
    data: str,
) -> str:
    """Run CPU work in executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    # This awaits the result without blocking the loop
    result = await loop.run_in_executor(executor, cpu_intensive_work, data)
    return result

async def main() -> None:
    """Main async program combining I/O and CPU work."""
    with InterpreterPoolExecutor(max_workers=4) as executor:
        start = time.perf_counter()
        # Run multiple CPU tasks concurrently
        tasks = [
            process_with_executor(executor, f"data_{i}")
            for i in range(4)
        ]
        results = await asyncio.gather(*tasks)
        elapsed = time.perf_counter() - start
        print(f"Async + InterpreterPoolExecutor: {elapsed:.2f}s")
        print(f"Results: {len(results)} tasks completed")

if __name__ == "__main__":
    asyncio.run(main())
Key Pattern:
- Create the executor outside the async context
- Pass it to async functions
- Use await loop.run_in_executor(executor, function, args)
- The event loop keeps switching between tasks while the CPU work happens in the background
- Results return to the async context seamlessly
💬 AI Colearning Prompt
"Explain: What does
loop.run_in_executor()do? Why do we needawaithere if the executor handles everything?"
ProcessPoolExecutor: An Alternative (With Tradeoffs)
InterpreterPoolExecutor is new in Python 3.14, so you might encounter ProcessPoolExecutor (the older approach) in existing codebases.
Key differences:
| Feature | InterpreterPoolExecutor | ProcessPoolExecutor |
|---|---|---|
| Workers | Separate interpreters in one process (lightweight) | Separate OS processes (heavyweight) |
| Memory | Same process, lower overhead | Separate processes, higher overhead |
| Startup | Faster (new interpreter in the same process) | Slower (new OS process) |
| Data passing | Arguments and results pickled, but no cross-process IPC | Arguments and results pickled and sent between processes |
| Best for | CPU-bound Python work with many short tasks | Long-running tasks that need full process isolation |
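The serialization row is worth seeing in practice: ProcessPoolExecutor can only ship picklable callables and arguments to its workers, so module-level functions work but lambdas do not. A small sketch of that constraint (the error typically surfaces when you ask for the result):

from concurrent.futures import ProcessPoolExecutor

def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        # A module-level function pickles fine
        print(executor.submit(square, 7).result())  # 49

        # A lambda cannot be pickled, so this task fails
        try:
            executor.submit(lambda x: x * x, 7).result()
        except Exception as exc:  # typically a pickling error
            print(f"Failed as expected: {type(exc).__name__}")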
Code Example 5: ProcessPoolExecutor Comparison
import time
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import InterpreterPoolExecutor  # Python 3.14+

def benchmark_process_pool(iterations: int = 50_000_000) -> None:
    """Benchmark ProcessPoolExecutor."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(heavy_calculation, iterations)
            for _ in range(4)
        ]
        results = [f.result() for f in futures]
    elapsed = time.perf_counter() - start
    print(f"ProcessPoolExecutor (4 workers): {elapsed:.2f}s")

def benchmark_interpreter_pool(iterations: int = 50_000_000) -> None:
    """Benchmark InterpreterPoolExecutor."""
    start = time.perf_counter()
    with InterpreterPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(heavy_calculation, iterations)
            for _ in range(4)
        ]
        results = [f.result() for f in futures]
    elapsed = time.perf_counter() - start
    print(f"InterpreterPoolExecutor (4 workers): {elapsed:.2f}s")

if __name__ == "__main__":
    print("=== Executor Comparison ===\n")
    benchmark_process_pool()
    benchmark_interpreter_pool()
Typical Output:
=== Executor Comparison ===
ProcessPoolExecutor (4 workers): 2.34s (more startup overhead)
InterpreterPoolExecutor (4 workers): 1.15s (lighter weight)
🎓 Instructor Commentary
The GIL isn't a bug—it's a design tradeoff. Python 3.14 gives you tools to work around it when you need true parallelism. For most code, you'll prefer InterpreterPoolExecutor over ProcessPoolExecutor because it's lighter and faster.
Decision Tree: When to Use What
Here's the practical decision guide:
Code Example 6: Decision Tree (Conceptual Guide)
"""
Decision guide for choosing concurrency patterns.
Reference this when you ask: "What tool should I use?"
"""
def choose_concurrency_tool(task_type: str, data_size: str) -> str:
"""
Recommend concurrency approach based on task characteristics.
Args:
task_type: "io_bound" or "cpu_bound"
data_size: "small", "medium", or "large"
Returns:
Recommended executor/pattern
"""
decision_tree = {
"io_bound": {
# I/O-bound: waiting for network, files, databases
"small": "asyncio.TaskGroup()", # Simple, fast
"medium": "asyncio.TaskGroup() + semaphore", # Control concurrency
"large": "asyncio.TaskGroup() + semaphore + batching", # Prevent overload
},
"cpu_bound": {
# CPU-bound: heavy calculations
"small": "ProcessPoolExecutor or InterpreterPoolExecutor",
"medium": "InterpreterPoolExecutor (lighter weight)",
"large": "InterpreterPoolExecutor (scales to cores)",
}
}
return decision_tree[task_type][data_size]
# Example usage:
# print(choose_concurrency_tool("cpu_bound", "small"))
# Output: "ProcessPoolExecutor or InterpreterPoolExecutor"
# Decision rules:
# 1. Pure I/O-bound (API calls, file reads)? → Use asyncio (TaskGroup)
# 2. CPU-bound (calculations, data processing)? → Use InterpreterPoolExecutor
# 3. Both I/O and CPU mixed? → Use asyncio for I/O + InterpreterPoolExecutor for CPU
# 4. Long-running isolated task? → ProcessPoolExecutor
# 5. Quick Python calculations? → InterpreterPoolExecutor (lower overhead)
Putting It All Together: Hybrid Pattern
The real power emerges when you combine both patterns:
import asyncio
from concurrent.futures import InterpreterPoolExecutor  # Python 3.14+
import httpx  # async HTTP client

async def fetch_data(url: str, client: httpx.AsyncClient) -> str:
    """I/O-bound: fetch from network."""
    response = await client.get(url)
    return response.text

def process_data(raw_data: str) -> str:
    """CPU-bound: heavy processing (runs in a worker interpreter)."""
    # Simulate expensive processing
    return raw_data.upper() * 1000

async def hybrid_workflow(urls: list[str]) -> None:
    """Combined I/O concurrency + CPU parallelism."""
    with InterpreterPoolExecutor(max_workers=4) as executor:
        loop = asyncio.get_running_loop()
        async with httpx.AsyncClient() as client:
            # I/O: fetch all concurrently
            async with asyncio.TaskGroup() as tg:
                fetch_tasks = [
                    tg.create_task(fetch_data(url, client))
                    for url in urls
                ]
        # CPU: process the fetched results in parallel
        # (all tasks are finished once the TaskGroup block exits)
        process_tasks = [
            loop.run_in_executor(executor, process_data, task.result())
            for task in fetch_tasks
        ]
        results = await asyncio.gather(*process_tasks)
        print(f"Processed {len(results)} items")
# Timeline visualization:
# Fetch: API1 ▓▓▓▓▓▓▓▓▓
# API2 ▓▓▓▓▓▓▓▓▓ (concurrent)
# API3 ▓▓▓▓▓▓▓▓▓
# Process: CPU1 ▓▓▓▓▓▓▓▓▓
# CPU2 ▓▓▓▓▓▓▓▓▓ (parallel on cores)
# CPU3 ▓▓▓▓▓▓▓▓▓
#
# Total time: max(fetch_time) + max(process_time) [overlap!]
# NOT: sum of all times
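One possible way to run the workflow (the URLs here are placeholders, not endpoints from this lesson):

if __name__ == "__main__":
    example_urls = [
        "https://example.com/",
        "https://example.org/",
        "https://example.net/",
    ]
    asyncio.run(hybrid_workflow(example_urls))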
CoLearning Synthesis
🚀 CoLearning Challenge
Ask your AI Co-Teacher:
"Design a system that fetches 10 JSON files from APIs and analyzes each with CPU-intensive parsing. How would you structure this using asyncio + InterpreterPoolExecutor? Draw a timeline showing where I/O and CPU work overlap."
Expected Outcome: You'll understand how hybrid patterns achieve both I/O concurrency and CPU parallelism, solving real-world AI workloads (API calls + inference).
Try With AI
Your AI companion tool (Claude Code, Gemini CLI, or ChatGPT web) is your co-teacher for this lesson. Work through these prompts progressively:
Prompt 1: Understanding the GIL
Ask your AI:
"What is the Global Interpreter Lock (GIL) in Python? Why does it exist, and why does it prevent threading from parallelizing CPU work? Keep it to 3-4 sentences."
Expected Output: Brief explanation (not deep technical details) that the GIL allows only one thread to run Python bytecode at once, preventing CPU parallelism with threading. Reference to memory safety or design decisions. Forward reference to Chapter 29 for deep dive.
Prompt 2: Using InterpreterPoolExecutor
Tell your AI:
"Generate code that: 1) Defines a CPU-intensive function that computes the sum of squares for 50 million iterations, 2) Benchmarks it with InterpreterPoolExecutor using 4 workers, 3) Shows timing and speedup calculation. Use modern Python 3.14+ patterns with type hints."
Expected Output: Working code with InterpreterPoolExecutor, time.perf_counter(), proper context managers, and calculation of speedup (should be ~3-4x on 4-core machine). Verify the code runs on your machine.
Prompt 3: Comparing Executors
Ask your AI:
"Compare InterpreterPoolExecutor vs ProcessPoolExecutor for CPU-bound work. What are the key differences in startup overhead, memory usage, and when you'd choose each? Create a decision table."
Expected Output: Table or structured comparison showing:
- InterpreterPoolExecutor: lightweight, faster, shared namespace, good for Python calculations
- ProcessPoolExecutor: isolated processes, higher overhead, serialization cost, good for long-running independent tasks
Prompt 4: Designing Hybrid Systems
Design with your AI:
"I need to build a system that: fetches 5 URLs concurrently using httpx, processes each response with expensive parsing (CPU), and stores results in a list. How would I architect this with asyncio + InterpreterPoolExecutor? Sketch the structure and explain where I/O and CPU overlap."
Expected Output:
- Async function to fetch data with httpx.AsyncClient
- Sync function for CPU-intensive parsing
- Main async function using TaskGroup() for concurrent fetches
- loop.run_in_executor() for parallel processing
- Explanation of why this is faster than sequential execution
Validation: Run your AI-assisted code. Verify that:
- All URLs are fetched concurrently (not sequentially)
- Processing happens in parallel
- Total time is approximately max(fetch_time) + max(process_time), not sum of all
Safety & Ethics Note
- AI-generated code may contain security vulnerabilities. Always review for: proper error handling, no exposed credentials, input validation.
- Test performance claims locally. Speedups depend on your machine's cores, workload characteristics, and system load.
- GIL is a real constraint, but Chapter 29 explores experimental solutions like free-threaded Python mode (Python 3.13+). Stay informed as Python evolves.
- Respect resource limits. Don't create 1000 worker interpreters—bind max_workers to os.cpu_count() for optimal parallelism.
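A small sketch of that sizing rule; the 16-task workload is arbitrary, and heavy_calculation is assumed to be the function from Code Example 1 (re-defined here so the snippet stands alone):

import os
from concurrent.futures import InterpreterPoolExecutor

def heavy_calculation(iterations: int) -> int:
    """CPU-bound loop, as in Code Example 1."""
    return sum(i ** 2 for i in range(iterations))

if __name__ == "__main__":
    # Size the pool to the machine, not to the number of tasks
    max_workers = os.cpu_count() or 4  # os.cpu_count() can return None

    with InterpreterPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(heavy_calculation, 1_000_000) for _ in range(16)]
        results = [f.result() for f in futures]  # 16 tasks share max_workers interpreters
    print(f"Completed {len(results)} tasks with {max_workers} workers")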
Next Step
Once you've explored these prompts and validated your understanding, move to Lesson 5: Hybrid Workloads, where you'll build complete real-world systems combining all the patterns you've learned.