Lesson 11: Capstone - Data Processing Pipeline
The Challenge: From Data to Insight
You've now mastered three fundamental Python data structures: lists (ordered, mutable sequences), tuples (ordered, immutable sequences), and dictionaries (fast key-value lookups). But mastery means more than understanding each structure in isolation. The real test is this: Can you combine all three to solve a realistic, end-to-end data processing problem?
This is what data engineers, analysts, and backend developers do every day. They receive raw data—messy, unstructured, often in text format—and transform it into meaningful insights. A student dataset. A transaction log. A sensor reading stream. The pattern is always the same:
- Ingest the data (get it into a usable format)
- Parse it (structure it as objects you can work with)
- Filter it (select only what matters)
- Aggregate it (calculate summaries, patterns, counts)
- Output it (present results to humans or systems)
In this lesson, you'll build a Data Processing Pipeline that demonstrates this entire workflow. You'll parse student records, filter by major and GPA, aggregate statistics by program, and output a professional summary report.
The Learning Goal: Prove you can think architecturally about data structures and build a complete application that combines all three collections intelligently. This is not a toy exercise—this is the foundation of real-world data work.
The Pipeline Architecture: Planning Before Code
💬 AI Colearning Prompt
"Design the data structure for this pipeline. We're processing student records (name, major, GPA). Should each record be a dict or a tuple? Why? What structure should we use for aggregating counts by major?"
Before you write a single line of code, understand the design. The best developers sketch their data structures first—on paper, in a notebook, or discussing with their AI partner.
Here's the structure we'll use:
Step 1: Raw Data (String Format)
raw_data: str = """
name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.2
Carol,Computer Science,3.9
David,Physics,3.1
Eve,Computer Science,3.5
Frank,Mathematics,3.6
Grace,Physics,3.8
"""
This simulates reading a CSV file (we'll learn actual file I/O in Chapter 22). For now, it's just a multi-line string.
Step 2: Parsed Data (List of Dicts)
students: list[dict[str, str | float]] = [
{"name": "Alice", "major": "Computer Science", "gpa": 3.8},
{"name": "Bob", "major": "Mathematics", "gpa": 3.2},
# ... more records
]
Each student is a dict (key-value mapping for field access by name), stored in a list (ordered collection of all records).
Step 3: Filtered Data (List Comprehension)
cs_students: list[dict[str, str | float]] = [
student for student in students
if student["major"] == "Computer Science" and student["gpa"] >= 3.5
]
We use a list comprehension with if conditions to select only records matching our criteria.
Step 4: Aggregated Results (Dict of Counts/Stats)
major_stats: dict[str, dict[str, float | int]] = {
"Computer Science": {
"count": 3,
"average_gpa": 3.733,
},
"Mathematics": {
"count": 2,
"average_gpa": 3.4,
},
# ...
}
We use a dict mapping major names to summary stats (another dict containing counts and averages).
🎓 Instructor Commentary
In AI-native development, you don't just code—you design. Notice how we chose list for "ordered records" (we care about having all students), dict for "key-value lookup by student field", and dict again for "meaningful keys in aggregations". Structure choice is communication. When future you (or a teammate) reads this code, the structures tell the story.
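To feel the difference, compare how the same record reads under each structure. A quick sketch (the tuple version is shown only for contrast):
# The same student as a tuple and as a dict (tuple version shown only for contrast)
student_tuple: tuple[str, str, float] = ("Alice", "Computer Science", 3.8)
student_dict: dict[str, str | float] = {"name": "Alice", "major": "Computer Science", "gpa": 3.8}
print(student_tuple[2])     # 3.8 -- what does index 2 mean? You have to remember.
print(student_dict["gpa"])  # 3.8 -- the key documents itself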
Phase 1: Parse Raw Data into List of Dicts
Let's start with the foundation. You have a CSV-like string, and your job is to convert it into a list of dicts, where each dict represents one student record.
Specification:
- Input: Multi-line string with headers on first line, data on remaining lines
- Output: list[dict[str, str | float]] where keys are column names
- Each dict = one record
- Handle missing values gracefully (skip malformed rows)
Code Example: Data Parsing
📘 Note: In Chapter 20, you'll learn how to organize this parsing logic into reusable functions. For now, we're writing the code inline to focus on the data structure transformations—how lists and dicts work together to structure raw text.
# Raw CSV-like data (simulates reading a file)
raw_data: str = """name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.2
Carol,Computer Science,3.9
David,Physics,3.1
Eve,Computer Science,3.5
Frank,Mathematics,3.6
Grace,Physics,3.8"""
# Step 1: Split the raw string into lines
lines: list[str] = raw_data.strip().split('\n')
# Step 2: Extract headers from first line
headers: list[str] = lines[0].split(',')
print(f"Headers: {headers}") # ['name', 'major', 'gpa']
# Step 3: Parse each data line into a dict
students: list[dict[str, str | float]] = []
for line in lines[1:]: # Skip first line (headers)
if not line.strip(): # Skip empty lines
continue
    values: list[str] = line.split(',')
    if len(values) != len(headers):  # Skip malformed rows (per the spec above)
        continue
    # Create dict with header -> value pairs
    record: dict[str, str | float] = {}
    for i, header in enumerate(headers):
        header_clean: str = header.strip()
        value_raw: str = values[i].strip()
# Convert GPA to float if it's the GPA column
if header_clean.lower() == 'gpa':
record[header_clean] = float(value_raw)
else:
record[header_clean] = value_raw
students.append(record)
print(f"Parsed {len(students)} students")
print(f"First student: {students[0]}")
# Output: {'name': 'Alice', 'major': 'Computer Science', 'gpa': 3.8}
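Once the loop above makes sense, note that the built-in zip can pair headers with values in one step. This is an optional, more compact variant of the same parsing (it assumes every row has the same number of fields as the header):
# Alternative parsing: pair each header with its value using zip
students_alt: list[dict[str, str | float]] = []
clean_headers: list[str] = [h.strip() for h in headers]
for line in lines[1:]:
    if not line.strip():
        continue
    values_alt: list[str] = [v.strip() for v in line.split(',')]
    record_alt: dict[str, str | float] = dict(zip(clean_headers, values_alt))
    record_alt["gpa"] = float(record_alt["gpa"])  # convert the GPA column to float
    students_alt.append(record_alt)
print(students_alt[0])  # {'name': 'Alice', 'major': 'Computer Science', 'gpa': 3.8}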
✨ Teaching Tip
When debugging this parsing step, ask your AI: "Why is my list empty?" or "Show me what each dict contains after parsing". AI can help you visualize the structure and spot issues. Use print(students[0]) to inspect the first record.
Phase 2: Filter Data with Comprehensions
Now that you have structured data, filter it. Let's find all Computer Science students with a GPA of 3.5 or higher.
Specification:
- Input: list[dict[str, str | float]] of all students
- Criteria: major == "Computer Science" AND gpa >= 3.5
- Output: list[dict] containing only matching records
- Use a comprehension (not an explicit loop)
Code Example: Filtering with List Comprehension
# Filter: Computer Science students with GPA >= 3.5
cs_high_achievers: list[dict[str, str | float]] = [
student for student in students
if student["major"] == "Computer Science" and student["gpa"] >= 3.5
]
print(f"Found {len(cs_high_achievers)} CS students with GPA >= 3.5")
for student in cs_high_achievers:
print(f" - {student['name']}: {student['gpa']}")
Notice the two conditions in the if clause:
student["major"] == "Computer Science"(exact match)student["gpa"] >= 3.5(numeric comparison)
Both must be true for the student to be included.
💬 AI Colearning Prompt
"Show me how to write a list comprehension that filters students from multiple majors (Computer Science OR Mathematics). How would the condition change?"
🚀 CoLearning Challenge
Ask your AI Co-Teacher:
"Given the student data, write a comprehension that finds all students with GPA between 3.5 and 3.9 (inclusive). Then explain what each part of the comprehension does."
Expected Outcome: You'll understand how to combine multiple conditions in comprehensions and apply range-based filtering to numerical data.
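If you want to compare your answer afterward, one possible solution uses Python's chained comparison, which expresses the inclusive range in a single readable condition (a sketch, not the only valid form):
# One possible answer: a chained comparison expresses the inclusive range
mid_range: list[dict[str, str | float]] = [
    s for s in students
    if 3.5 <= s["gpa"] <= 3.9
]
print([s["name"] for s in mid_range])  # ['Alice', 'Carol', 'Eve', 'Frank', 'Grace']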
Phase 3: Aggregate Data with Dictionaries
Filtering is useful, but aggregation is powerful. Now calculate statistics by major:
- How many students in each major?
- What's the average GPA per major?
Specification:
- Input: list[dict[str, str | float]] of all students
- Output: dict[str, dict[str, float | int]] where:
- Outer key = major name
- Inner dict = {"count": N, "average_gpa": X.XX}
- Use a dict to accumulate counts and sums
Code Example: Aggregation with Dict
📘 Note: This aggregation pattern—grouping data and calculating statistics—is fundamental to data analysis. In Chapter 20, you'll learn to package this logic into reusable functions. For now, focus on understanding the dict-based accumulation pattern.
# Initialize empty dict to store statistics by major
stats: dict[str, dict[str, float | int]] = {}
# Step 1: Accumulate counts and totals
for student in students:
major: str = student["major"]
gpa: float = student["gpa"]
# Initialize major dict if not seen before
if major not in stats:
stats[major] = {
"count": 0,
"total_gpa": 0.0,
"average_gpa": 0.0
}
# Accumulate
stats[major]["count"] += 1
stats[major]["total_gpa"] += gpa
# Step 2: Calculate averages
for major in stats:
total_gpa: float = stats[major]["total_gpa"]
count: int = stats[major]["count"]
stats[major]["average_gpa"] = round(total_gpa / count, 2)
# Remove temporary field (we don't need total_gpa in final output)
del stats[major]["total_gpa"]
# Display results
print("Statistics by Major:")
for major, data in stats.items():
print(f"{major}: {data['count']} students, avg GPA {data['average_gpa']}")
# Output:
# Computer Science: 3 students, avg GPA 3.73
# Mathematics: 2 students, avg GPA 3.4
# Physics: 2 students, avg GPA 3.45
Notice the pattern:
- Check if key exists: if major not in stats
- Initialize if needed: stats[major] = {...}
- Accumulate: stats[major]["count"] += 1
- Calculate final value: average_gpa = total_gpa / count
🎓 Instructor Commentary
This aggregation pattern appears everywhere: calculating totals, counting occurrences, tracking minimums/maximums. You're learning a skill that applies to data analysis, reporting, analytics dashboards, and more. The dict-based accumulator is fundamental. Syntax is cheap—understanding this pattern is gold.
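For reference, the standard library's collections.defaultdict can remove the "initialize if needed" step: missing keys start at a default value automatically. Here is a sketch of the same aggregation (an optional variant, not a replacement for understanding the explicit version):
from collections import defaultdict

# Same accumulator pattern; defaultdict supplies 0 for unseen majors
counts: defaultdict[str, int] = defaultdict(int)
gpa_totals: defaultdict[str, float] = defaultdict(float)
for student in students:
    counts[student["major"]] += 1
    gpa_totals[student["major"]] += student["gpa"]
for major in counts:
    print(f"{major}: {counts[major]} students, avg GPA {gpa_totals[major] / counts[major]:.2f}")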
🚀 CoLearning Challenge
Ask your AI:
"I need to find the student with the highest GPA in each major. How would I modify this aggregation to also track the top student's name in each major?"
Expected Outcome: You'll extend the aggregation pattern to track multiple values per group.
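One shape the answer can take (a sketch to compare against your own attempt; the field names top_name and top_gpa are hypothetical choices):
# Sketch: track the highest-GPA student per major
top_by_major: dict[str, dict[str, str | float]] = {}
for student in students:
    major = student["major"]
    if major not in top_by_major or student["gpa"] > top_by_major[major]["top_gpa"]:
        top_by_major[major] = {"top_name": student["name"], "top_gpa": student["gpa"]}
for major, data in top_by_major.items():
    print(f"{major}: {data['top_name']} ({data['top_gpa']})")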
Phase 4: Output Formatted Results
Raw dicts are great for computation, but humans need readable output. Format your results as a professional summary report.
Specification:
- Input: dict[str, dict[str, float | int]] of aggregated statistics
- Output: Formatted string suitable for printing or saving
- Include: Major name, student count, average GPA
- Format clearly with spacing and alignment
Code Example: Formatted Output
# Build formatted report as a list of lines
title: str = "Student Statistics by Major"
lines: list[str] = [
title,
"=" * 50,
""
]
# Sort majors alphabetically for consistent output
sorted_majors: list[str] = sorted(stats.keys())
for major in sorted_majors:
count: int = stats[major]["count"]
avg_gpa: float = stats[major]["average_gpa"]
# Format with alignment
lines.append(f"{major:25s} | Count: {count:2d} | Avg GPA: {avg_gpa:.2f}")
lines.append("")
lines.append("=" * 50)
# Combine all lines into a single string with newlines
report: str = '\n'.join(lines)
print(report)
# Output looks like:
# Student Statistics by Major
# ==================================================
#
# Computer Science          | Count:  3 | Avg GPA: 3.73
# Mathematics               | Count:  2 | Avg GPA: 3.40
# Physics                   | Count:  2 | Avg GPA: 3.45
#
# ==================================================
Notice the formatting techniques:
- {major:25s} left-aligns the major name in a 25-character field
- {count:2d} right-aligns the integer in a 2-character field
- {avg_gpa:.2f} formats the float with 2 decimal places
- '\n'.join(lines) combines the list of strings with newlines
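Alignment can also be controlled explicitly with the <, >, and ^ flags; here is a tiny standalone demo you can run to see each one:
# Explicit alignment flags in f-strings: < left, > right, ^ center
label: str = "Physics"
print(f"[{label:<12}]")  # [Physics     ]
print(f"[{label:>12}]")  # [     Physics]
print(f"[{label:^12}]")  # [  Physics   ]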
✨ Teaching Tip
When your output doesn't look quite right, show it to your AI: "Here's my output. The columns don't line up. How can I fix the formatting?" AI can suggest better alignment and explain f-string formatting codes.
Putting It All Together: The Complete Pipeline
Now integrate all phases into one cohesive application. This is the complete, runnable code combining everything you've learned:
# ============================================================
# PHASE 1: PARSE RAW DATA
# ============================================================
raw_student_data: str = """name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.2
Carol,Computer Science,3.9
David,Physics,3.1
Eve,Computer Science,3.5
Frank,Mathematics,3.6
Grace,Physics,3.8"""
# Split into lines and extract headers
lines: list[str] = raw_student_data.strip().split('\n')
headers: list[str] = lines[0].split(',')
# Parse each line into a dict
students: list[dict[str, str | float]] = []
for line in lines[1:]:
if not line.strip():
continue
    values: list[str] = line.split(',')
    if len(values) != len(headers):  # Skip malformed rows
        continue
    record: dict[str, str | float] = {}
    for i, header in enumerate(headers):
        header_clean: str = header.strip()
        value_raw: str = values[i].strip()
if header_clean.lower() == 'gpa':
record[header_clean] = float(value_raw)
else:
record[header_clean] = value_raw
students.append(record)
print(f"✓ Parsed {len(students)} student records\n")
# ============================================================
# PHASE 2: FILTER DATA
# ============================================================
cs_students: list[dict[str, str | float]] = [
s for s in students
if s["major"] == "Computer Science"
]
print(f"✓ Found {len(cs_students)} Computer Science students\n")
# ============================================================
# PHASE 3: AGGREGATE STATISTICS
# ============================================================
stats: dict[str, dict[str, float | int]] = {}
# Accumulate counts and totals
for student in students:
major: str = student["major"]
gpa: float = student["gpa"]
if major not in stats:
stats[major] = {"count": 0, "total_gpa": 0.0, "average_gpa": 0.0}
stats[major]["count"] += 1
stats[major]["total_gpa"] += gpa
# Calculate averages
for major in stats:
total_gpa: float = stats[major]["total_gpa"]
count: int = stats[major]["count"]
stats[major]["average_gpa"] = round(total_gpa / count, 2)
del stats[major]["total_gpa"]
print("✓ Calculated statistics by major\n")
# ============================================================
# PHASE 4: FORMAT AND OUTPUT REPORT
# ============================================================
title: str = "Student Statistics by Major"
lines: list[str] = [title, "=" * 50, ""]
sorted_majors: list[str] = sorted(stats.keys())
for major in sorted_majors:
count: int = stats[major]["count"]
avg_gpa: float = stats[major]["average_gpa"]
lines.append(f"{major:25s} | Count: {count:2d} | Avg GPA: {avg_gpa:.2f}")
lines.append("")
lines.append("=" * 50)
report: str = '\n'.join(lines)
print(report)
Validation Checklist
- Data parses correctly (right number of students, correct field values)
- Filtering works (CS students match expected records)
- Aggregation is accurate (counts and averages are correct)
- Output is readable (aligned columns, no errors)
- Code runs without exceptions
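You can turn several of these checks into quick assert statements at the bottom of your script. A minimal sketch (the expected numbers come from the sample data above; adjust them if you change the input):
# Sanity checks against the sample data
assert len(students) == 7, "expected 7 parsed records"
assert all("name" in s and "major" in s and "gpa" in s for s in students)
assert stats["Mathematics"]["count"] == 2
assert stats["Mathematics"]["average_gpa"] == 3.4
print("✓ All sanity checks passed")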
Common Pitfalls and How to Debug Them
Pitfall 1: KeyError When Accessing Dict Values
Error: KeyError: 'major'
Cause: The dict doesn't have the expected key (field name misspelled, or data parsing failed).
Debug approach:
- Print the first dict to see what keys actually exist: print(students[0].keys())
- Check if your header parsing is correct
- Ask AI: "Why is my key 'major' not in the dict after parsing?"
Pitfall 2: Type Errors in Aggregation
Error: TypeError: '>' not supported between instances of 'str' and 'float'
Cause: You're trying to compare GPA but it's stored as a string instead of float.
Debug approach:
- Print a student record: print(students[0]['gpa'], type(students[0]['gpa']))
- Check your parsing code: is it converting to float?
- Ask AI: "How do I convert string '3.8' to float in Python?"
Pitfall 3: Comprehension Syntax Error
Error: SyntaxError: invalid syntax
Cause: Missing colon, wrong if placement, or unbalanced brackets.
Debug approach:
- Break the comprehension into a loop to verify logic:

# Instead of: [x for x in data if x > 3.5]
# Try this first:
result = []
for x in data:
    if x > 3.5:
        result.append(x)

- Once the loop works, convert back to the comprehension
- Ask AI: "Convert this loop to a list comprehension and explain each part"
✨ Teaching Tip
When debugging, never just ask AI to fix your code. Instead, ask AI to explain what you're seeing: "I got this error. What does it mean?" Then work with AI to diagnose. This builds your debugging skills—the most valuable skill in professional development.
Extensions: Making It Real
Your basic pipeline works. Now make it more sophisticated. Choose one or more extensions:
Extension 1: Multi-Criteria Filtering
Filter students who are Computer Science OR Mathematics majors with GPA above 3.4:
stem_students: list[dict[str, str | float]] = [
s for s in students
if (s["major"] in ["Computer Science", "Mathematics"])
and s["gpa"] > 3.4
]
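A small style note: when testing membership against several values, a set literal is the idiomatic container (sets are the topic of Chapter 19), and the filter behaves the same:
# Same filter, with a set literal for the membership test
stem_students_v2: list[dict[str, str | float]] = [
    s for s in students
    if s["major"] in {"Computer Science", "Mathematics"} and s["gpa"] > 3.4
]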
Extension 2: Sort Results
Sort students by GPA (highest first) before output:
📘 Note: The lambda syntax below is a shorthand for defining small, inline operations. You'll learn lambda functions in Chapter 20. For now, just understand: key=lambda s: s["gpa"] means "sort by the 'gpa' field of each student dict."
top_students: list[dict[str, str | float]] = sorted(
students,
key=lambda s: s["gpa"],
reverse=True
)
# Display sorted results
print("Top Students by GPA:")
for student in top_students:
print(f"{student['name']}: {student['gpa']}")
Extension 3: Find Outliers
Find students whose GPA is significantly different from their major's average:
# Find students with GPA more than 0.3 above their major's average
threshold: float = 0.3
outliers: list[dict[str, str | float]] = []
for student in students:
major: str = student["major"]
avg: float = stats[major]["average_gpa"]
difference: float = student["gpa"] - avg
if difference > threshold:
outliers.append(student)
print(f"Found {len(outliers)} high-performing outliers:")
for student in outliers:
print(f" {student['name']} ({student['major']}): {student['gpa']}")
Capstone Validation: Am I Done?
Check yourself against these criteria:
Core Functionality (Required):
- Code parses raw CSV string into list[dict] ✓
- Filtering works with at least one comprehension ✓
- Aggregation calculates correct counts and averages ✓
- Output is formatted and readable ✓
- No runtime errors when processing data ✓
Code Quality (Expected):
- Type hints present on variables and data structures ✓
- Variable names are descriptive (not x, data1, etc.) ✓
- Comments explain non-obvious logic ✓
- Code follows consistent indentation ✓
Understanding (Critical):
- I can explain why each data structure (list, dict) was chosen ✓
- I can justify the comprehension logic ✓
- I could modify this for a different data format (products, employees, transactions) ✓
- I asked AI when I was stuck and learned from the explanation ✓
Stretch Goals (Optional):
- Implemented at least one extension ✓
- Data handles edge cases (empty records, missing fields) ✓
- Code is organized and easy to read ✓
Try With AI
Use your preferred AI companion (Claude Code, Gemini CLI, or ChatGPT web).
Prompt 1: Recall Architecture (Remember)
"I'm building a data pipeline to process student records. Should I store each record as a dict or a tuple? Should I use a list or a dict to aggregate statistics? Explain your reasoning for each choice."
Expected Outcome: You'll verify your understanding of structure selection. AI reinforces why list[dict] works for records and why dict is natural for aggregations.
Prompt 2: Understand the Pattern (Understand)
"Explain how list comprehensions with if conditions work. Show me a concrete example that filters students by major and GPA, then explain each part of the comprehension syntax."
Expected Outcome: You'll deepen your understanding of comprehension syntax and be able to read/write more complex filtering logic independently.
Prompt 3: Apply to New Data (Apply)
"Here's my student data pipeline working. Now I need to process employee records instead (name, department, salary). How would I modify my parsing, filtering, and aggregation functions for this new data? Show me the modified code with the same structure."
Expected Outcome: You'll prove you can transfer the pipeline pattern to different domains. This demonstrates true competency—not just following steps, but understanding the underlying pattern.
Prompt 4: Debug and Extend (Analyze)
"I'm getting a KeyError when filtering by department. [Paste your code]. Why is this happening? Help me debug it. Then, show me how to add a feature that calculates average salary per department."
Expected Outcome: You'll practice debugging with AI as a partner, moving from error to understanding. You'll also extend the pipeline with new aggregations—real-world application building.
Safety & Ethics Note: When AI suggests code, validate that it:
- Correctly handles the data you're working with
- Doesn't skip error handling for edge cases (empty lists, missing keys, type mismatches)
- Uses type hints appropriately
- Matches your project's style and structure
Ask AI: "Why did you choose this approach? Are there tradeoffs I should consider?" This builds critical thinking alongside coding skills.
Capstone Success
You've now completed the full journey from raw data to insights. You've:
- Designed data structures strategically
- Parsed text into structured Python objects
- Filtered data with comprehensions
- Aggregated results using dict-based accumulators
- Output professional summaries
This is real work. Data engineers, backend developers, analytics engineers do this every day. You've demonstrated the core competency: architectural thinking combined with execution.
Congratulations on completing Chapter 18. You're ready for Chapter 19 (Sets and Frozen Sets) and beyond. The collection structures you've mastered form the foundation for everything that comes next—from functions that operate on collections to objects that contain collections as attributes.
What's next: In Chapter 20, you'll learn how to encapsulate this pipeline logic into reusable functions. In Chapter 21, you'll handle exceptions robustly when data is malformed. In Chapter 22, you'll read/write data from actual files. But the core pattern—ingest, transform, aggregate, output—remains your north star.
Keep building. Keep asking your AI partner. Keep validating. You're thinking like a developer now.