Lesson 11: Capstone - Data Processing Pipeline
The Challenge: From Data to Insight
You've now mastered three fundamental Python data structures: lists (ordered, mutable sequences), tuples (ordered, immutable sequences), and dictionaries (fast key-value lookups). But mastery means more than understanding each structure in isolation. The real test is this: Can you combine all three to solve a realistic, end-to-end data processing problem?
This is what data engineers, analysts, and backend developers do every day. They receive raw data—messy, unstructured, often in text format—and transform it into meaningful insights. A student dataset. A transaction log. A sensor reading stream. The pattern is always the same:
- Ingest the data (get it into a usable format)
- Parse it (structure it as objects you can work with)
- Filter it (select only what matters)
- Aggregate it (calculate summaries, patterns, counts)
- Output it (present results to humans or systems)
In this lesson, you'll build a Data Processing Pipeline that demonstrates this entire workflow. You'll parse student records, filter by major and GPA, aggregate statistics by program, and output a professional summary report.
The Learning Goal: Prove you can think architecturally about data structures and build a complete application that combines all three collections intelligently. This is not a toy exercise—this is the foundation of real-world data work.
The Pipeline Architecture: Planning Before Code
💬 AI Colearning Prompt
"Design the data structure for this pipeline. We're processing student records (name, major, GPA). Should each record be a dict or a tuple? Why? What structure should we use for aggregating counts by major?"
Before you write a single line of code, understand the design. The best developers sketch their data structures first—on paper, in a notebook, or discussing with their AI partner.
Here's the structure we'll use:
Step 1: Raw Data (String Format)
raw_data: str = """
name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.2
Carol,Computer Science,3.9
David,Physics,3.1
Eve,Computer Science,3.5
Frank,Mathematics,3.6
Grace,Physics,3.8
"""
This simulates reading a CSV file (we'll learn actual file I/O in Chapter 22). For now, it's just a multi-line string.
Step 2: Parsed Data (List of Dicts)
students: list[dict[str, str | float]] = [
{"name": "Alice", "major": "Computer Science", "gpa": 3.8},
{"name": "Bob", "major": "Mathematics", "gpa": 3.2},
# ... more records
]
Each student is a dict (key-value mapping for field access by name), stored in a list (ordered collection of all records).
Step 3: Filtered Data (List Comprehension)
cs_students: list[dict[str, str | float]] = [
student for student in students
if student["major"] == "Computer Science" and student["gpa"] >= 3.5
]
We use a list comprehension with if conditions to select only records matching our criteria.
Step 4: Aggregated Results (Dict of Counts/Stats)
major_stats: dict[str, dict[str, float | int]] = {
"Computer Science": {
"count": 3,
"average_gpa": 3.733,
},
"Mathematics": {
"count": 2,
"average_gpa": 3.4,
},
# ...
}
We use a dict mapping major names to summary stats (another dict containing counts and averages).
🎓 Instructor Commentary
In AI-native development, you don't just code—you design. Notice how we chose list for "ordered records" (we care about having all students), dict for "key-value lookup by student field", and dict again for "meaningful keys in aggregations". Structure choice is communication. When future you (or a teammate) reads this code, the structures tell the story.
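To feel the difference, compare how the same record reads under each structure. A quick sketch (the tuple version is shown only for contrast):
# The same student as a tuple and as a dict (tuple version shown only for contrast)
student_tuple: tuple[str, str, float] = ("Alice", "Computer Science", 3.8)
student_dict: dict[str, str | float] = {"name": "Alice", "major": "Computer Science", "gpa": 3.8}
print(student_tuple[2])     # 3.8 -- what does index 2 mean? You have to remember.
print(student_dict["gpa"])  # 3.8 -- the key documents itself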
Phase 1: Parse Raw Data into List of Dicts
Let's start with the foundation. You have a CSV-like string, and your job is to convert it into a list of dicts, where each dict represents one student record.
Specification:
- Input: Multi-line string with headers on first line, data on remaining lines
- Output: list[dict[str, str | float]] where keys are column names
- Each dict = one record
- Handle missing values gracefully (skip malformed rows)
Code Example: Data Parsing
📘 Note: In Chapter 20, you'll learn how to organize this parsing logic into reusable functions. For now, we're writing the code inline to focus on the data structure transformations—how lists and dicts work together to structure raw text.
# Raw CSV-like data (simulates reading a file)
raw_data: str = """name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.2
Carol,Computer Science,3.9
David,Physics,3.1
Eve,Computer Science,3.5
Frank,Mathematics,3.6
Grace,Physics,3.8"""
# Step 1: Split the raw string into lines
lines: list[str] = raw_data.strip().split('\n')
# Step 2: Extract headers from first line
headers: list[str] = lines[0].split(',')
print(f"Headers: {headers}") # ['name', 'major', 'gpa']
# Step 3: Parse each data line into a dict
students: list[dict[str, str | float]] = []
for line in lines[1:]: # Skip first line (headers)
if not line.strip(): # Skip empty lines
continue
    values: list[str] = line.split(',')
    if len(values) != len(headers):  # Skip malformed rows (per the spec above)
        continue
    # Create dict with header -> value pairs
    record: dict[str, str | float] = {}
    for i, header in enumerate(headers):
        header_clean: str = header.strip()
        value_raw: str = values[i].strip()
# Convert GPA to float if it's the GPA column
if header_clean.lower() == 'gpa':
record[header_clean] = float(value_raw)
else:
record[header_clean] = value_raw
students.append(record)
print(f"Parsed {len(students)} students")
print(f"First student: {students[0]}")
# Output: {'name': 'Alice', 'major': 'Computer Science', 'gpa': 3.8}
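Once the loop above makes sense, note that the built-in zip can pair headers with values in one step. This is an optional, more compact variant of the same parsing (it assumes every row has the same number of fields as the header):
# Alternative parsing: pair each header with its value using zip
students_alt: list[dict[str, str | float]] = []
clean_headers: list[str] = [h.strip() for h in headers]
for line in lines[1:]:
    if not line.strip():
        continue
    values_alt: list[str] = [v.strip() for v in line.split(',')]
    record_alt: dict[str, str | float] = dict(zip(clean_headers, values_alt))
    record_alt["gpa"] = float(record_alt["gpa"])  # convert the GPA column to float
    students_alt.append(record_alt)
print(students_alt[0])  # {'name': 'Alice', 'major': 'Computer Science', 'gpa': 3.8}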
✨ Teaching Tip
When debugging this parsing step, ask your AI: "Why is my list empty?" or "Show me what each dict contains after parsing". AI can help you visualize the structure and spot issues. Use print(students[0]) to inspect the first record.
Phase 2: Filter Data with Comprehensions
Now that you have structured data, filter it. Let's find all Computer Science students with a GPA of 3.5 or higher.
Specification:
- Input: list[dict[str, str | float]] of all students
- Criteria: major == "Computer Science" AND gpa >= 3.5
- Output: list[dict] containing only matching records
- Use a comprehension (not an explicit loop)
Code Example: Filtering with List Comprehension
# Filter: Computer Science students with GPA >= 3.5
cs_high_achievers: list[dict[str, str | float]] = [
student for student in students
if student["major"] == "Computer Science" and student["gpa"] >= 3.5
]
print(f"Found {len(cs_high_achievers)} CS students with GPA >= 3.5")
for student in cs_high_achievers:
print(f" - {student['name']}: {student['gpa']}")
Notice the two conditions in the if clause:
student["major"] == "Computer Science"(exact match)student["gpa"] >= 3.5(numeric comparison)
Both must be true for the student to be included.
💬 AI Colearning Prompt
"Show me how to write a list comprehension that filters students from multiple majors (Computer Science OR Mathematics). How would the condition change?"
🚀 CoLearning Challenge
Ask your AI Co-Teacher:
"Given the student data, write a comprehension that finds all students with GPA between 3.5 and 3.9 (inclusive). Then explain what each part of the comprehension does."
Expected Outcome: You'll understand how to combine multiple conditions in comprehensions and apply range-based filtering to numerical data.
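If you want to compare your answer afterward, one possible solution uses Python's chained comparison, which expresses the inclusive range in a single readable condition (a sketch, not the only valid form):
# One possible answer: a chained comparison expresses the inclusive range
mid_range: list[dict[str, str | float]] = [
    s for s in students
    if 3.5 <= s["gpa"] <= 3.9
]
print([s["name"] for s in mid_range])  # ['Alice', 'Carol', 'Eve', 'Frank', 'Grace']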
Phase 3: Aggregate Data with Dictionaries
Filtering is useful, but aggregation is powerful. Now calculate statistics by major:
- How many students in each major?
- What's the average GPA per major?
Specification:
- Input: list[dict[str, str | float]] of all students
- Output: dict[str, dict[str, float | int]] where:
- Outer key = major name
- Inner dict = {"count": N, "average_gpa": X.XX}
- Use a dict to accumulate counts and sums
Code Example: Aggregation with Dict
📘 Note: This aggregation pattern—grouping data and calculating statistics—is fundamental to data analysis. In Chapter 20, you'll learn to package this logic into reusable functions. For now, focus on understanding the dict-based accumulation pattern.
# Initialize empty dict to store statistics by major
stats: dict[str, dict[str, float | int]] = {}
# Step 1: Accumulate counts and totals
for student in students:
major: str = student["major"]
gpa: float = student["gpa"]
# Initialize major dict if not seen before
if major not in stats:
stats[major] = {
"count": 0,
"total_gpa": 0.0,
"average_gpa": 0.0
}
# Accumulate
stats[major]["count"] += 1
stats[major]["total_gpa"] += gpa
# Step 2: Calculate averages
for major in stats:
total_gpa: float = stats[major]["total_gpa"]
count: int = stats[major]["count"]
stats[major]["average_gpa"] = round(total_gpa / count, 2)
# Remove temporary field (we don't need total_gpa in final output)
del stats[major]["total_gpa"]
# Display results
print("Statistics by Major:")
for major, data in stats.items():
print(f"{major}: {data['count']} students, avg GPA {data['average_gpa']}")
# Output:
# Computer Science: 3 students, avg GPA 3.73
# Mathematics: 2 students, avg GPA 3.4
# Physics: 2 students, avg GPA 3.45
Notice the pattern:
- Check if key exists: if major not in stats
- Initialize if needed: stats[major] = {...}
- Accumulate: stats[major]["count"] += 1
- Calculate final value: average_gpa = total_gpa / count
🎓 Instructor Commentary
This aggregation pattern appears everywhere: calculating totals, counting occurrences, tracking minimums/maximums. You're learning a skill that applies to data analysis, reporting, analytics dashboards, and more. The dict-based accumulator is fundamental. Syntax is cheap—understanding this pattern is gold.
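For reference, the standard library's collections.defaultdict can remove the "initialize if needed" step: missing keys start at a default value automatically. Here is a sketch of the same aggregation (an optional variant, not a replacement for understanding the explicit version):
from collections import defaultdict

# Same accumulator pattern; defaultdict supplies 0 for unseen majors
counts: defaultdict[str, int] = defaultdict(int)
gpa_totals: defaultdict[str, float] = defaultdict(float)
for student in students:
    counts[student["major"]] += 1
    gpa_totals[student["major"]] += student["gpa"]
for major in counts:
    print(f"{major}: {counts[major]} students, avg GPA {gpa_totals[major] / counts[major]:.2f}")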
🚀 CoLearning Challenge
Ask your AI:
"I need to find the student with the highest GPA in each major. How would I modify this aggregation to also track the top student's name in each major?"
Expected Outcome: You'll extend the aggregation pattern to track multiple values per group.
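One shape the answer can take (a sketch to compare against your own attempt; the field names top_name and top_gpa are hypothetical choices):
# Sketch: track the highest-GPA student per major
top_by_major: dict[str, dict[str, str | float]] = {}
for student in students:
    major = student["major"]
    if major not in top_by_major or student["gpa"] > top_by_major[major]["top_gpa"]:
        top_by_major[major] = {"top_name": student["name"], "top_gpa": student["gpa"]}
for major, data in top_by_major.items():
    print(f"{major}: {data['top_name']} ({data['top_gpa']})")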
Phase 4: Output Formatted Results
Raw dicts are great for computation, but humans need readable output. Format your results as a professional summary report.
Specification:
- Input: dict[str, dict[str, float | int]] of aggregated statistics
- Output: Formatted string suitable for printing or saving
- Include: Major name, student count, average GPA
- Format clearly with spacing and alignment
Code Example: Formatted Output
# Build formatted report as a list of lines
title: str = "Student Statistics by Major"
lines: list[str] = [
title,
"=" * 50,
""
]
# Sort majors alphabetically for consistent output
sorted_majors: list[str] = sorted(stats.keys())
for major in sorted_majors:
count: int = stats[major]["count"]
avg_gpa: float = stats[major]["average_gpa"]
# Format with alignment
lines.append(f"{major:25s} | Count: {count:2d} | Avg GPA: {avg_gpa:.2f}")
lines.append("")
lines.append("=" * 50)
# Combine all lines into a single string with newlines
report: str = '\n'.join(lines)
print(report)
# Output looks like:
# Student Statistics by Major
# ==================================================
#
# Computer Science          | Count:  3 | Avg GPA: 3.73
# Mathematics               | Count:  2 | Avg GPA: 3.40
# Physics                   | Count:  2 | Avg GPA: 3.45
#
# ==================================================
Notice the formatting techniques:
- {major:25s} left-aligns the major name in a 25-character field
- {count:2d} right-aligns the integer in a 2-character field
- {avg_gpa:.2f} formats the float with 2 decimal places
- '\n'.join(lines) combines the list of strings with newlines
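Alignment can also be controlled explicitly with the <, >, and ^ flags; here is a tiny standalone demo you can run to see each one:
# Explicit alignment flags in f-strings: < left, > right, ^ center
label: str = "Physics"
print(f"[{label:<12}]")  # [Physics     ]
print(f"[{label:>12}]")  # [     Physics]
print(f"[{label:^12}]")  # [  Physics   ]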
✨ Teaching Tip
When your output doesn't look quite right, show it to your AI: "Here's my output. The columns don't line up. How can I fix the formatting?" AI can suggest better alignment and explain f-string formatting codes.
Putting It All Together: The Complete Pipeline
Now integrate all phases into one cohesive application. This is the complete, runnable code combining everything you've learned:
# ============================================================
# PHASE 1: PARSE RAW DATA
# ============================================================
raw_student_data: str = """name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.2
Carol,Computer Science,3.9
David,Physics,3.1
Eve,Computer Science,3.5
Frank,Mathematics,3.6
Grace,Physics,3.8"""
# Split into lines and extract headers
lines: list[str] = raw_student_data.strip().split('\n')
headers: list[str] = lines[0].split(',')
# Parse each line into a dict
students: list[dict[str, str | float]] = []
for line in lines[1:]:
if not line.strip():
continue
    values: list[str] = line.split(',')
    if len(values) != len(headers):  # Skip malformed rows
        continue
    record: dict[str, str | float] = {}
    for i, header in enumerate(headers):
        header_clean: str = header.strip()
        value_raw: str = values[i].strip()
if header_clean.lower() == 'gpa':
record[header_clean] = float(value_raw)
else:
record[header_clean] = value_raw
students.append(record)
print(f"✓ Parsed {len(students)} student records\n")
# ============================================================
# PHASE 2: FILTER DATA
# ============================================================
cs_students: list[dict[str, str | float]] = [
s for s in students
if s["major"] == "Computer Science"
]
print(f"✓ Found {len(cs_students)} Computer Science students\n")
# ============================================================
# PHASE 3: AGGREGATE STATISTICS
# ============================================================
stats: dict[str, dict[str, float | int]] = {}
# Accumulate counts and totals
for student in students:
major: str = student["major"]
gpa: float = student["gpa"]
if major not in stats:
stats[major] = {"count": 0, "total_gpa": 0.0, "average_gpa": 0.0}
stats[major]["count"] += 1
stats[major]["total_gpa"] += gpa
# Calculate averages
for major in stats:
total_gpa: float = stats[major]["total_gpa"]
count: int = stats[major]["count"]
stats[major]["average_gpa"] = round(total_gpa / count, 2)
del stats[major]["total_gpa"]
print("✓ Calculated statistics by major\n")
# ============================================================
# PHASE 4: FORMAT AND OUTPUT REPORT
# ============================================================
title: str = "Student Statistics by Major"
lines: list[str] = [title, "=" * 50, ""]
sorted_majors: list[str] = sorted(stats.keys())
for major in sorted_majors:
count: int = stats[major]["count"]
avg_gpa: float = stats[major]["average_gpa"]
lines.append(f"{major:25s} | Count: {count:2d} | Avg GPA: {avg_gpa:.2f}")
lines.append("")
lines.append("=" * 50)
report: str = '\n'.join(lines)
print(report)
Validation Checklist
- Data parses correctly (right number of students, correct field values)
- Filtering works (CS students match expected records)
- Aggregation is accurate (counts and averages are correct)
- Output is readable (aligned columns, no errors)
- Code runs without exceptions
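You can turn several of these checks into quick assert statements at the bottom of your script. A minimal sketch (the expected numbers come from the sample data above; adjust them if you change the input):
# Sanity checks against the sample data
assert len(students) == 7, "expected 7 parsed records"
assert all("name" in s and "major" in s and "gpa" in s for s in students)
assert stats["Mathematics"]["count"] == 2
assert stats["Mathematics"]["average_gpa"] == 3.4
print("✓ All sanity checks passed")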
Common Pitfalls and How to Debug Them
Pitfall 1: KeyError When Accessing Dict Values
Error: KeyError: 'major'
Cause: The dict doesn't have the expected key (field name misspelled, or data parsing failed).
Debug approach:
- Print the first dict to see what keys actually exist: print(students[0].keys())
- Check if your header parsing is correct
- Ask AI: "Why is my key 'major' not in the dict after parsing?"
Pitfall 2: Type Errors in Aggregation
Error: TypeError: '>' not supported between instances of 'str' and 'float'
Cause: You're trying to compare GPA but it's stored as a string instead of float.
Debug approach:
- Print a student record: print(students[0]['gpa'], type(students[0]['gpa']))
- Check your parsing code: is it converting to float?
- Ask AI: "How do I convert string '3.8' to float in Python?"
Pitfall 3: Comprehension Syntax Error
Error: SyntaxError: invalid syntax
Cause: Missing colon, wrong if placement, or unbalanced brackets.
Debug approach:
- Break the comprehension into a loop to verify logic:

# Instead of: [x for x in data if x > 3.5]
# Try this first:
result = []
for x in data:
    if x > 3.5:
        result.append(x)

- Once the loop works, convert back to the comprehension
- Ask AI: "Convert this loop to a list comprehension and explain each part"
✨ Teaching Tip
When debugging, never just ask AI to fix your code. Instead, ask AI to explain what you're seeing: "I got this error. What does it mean?" Then work with AI to diagnose. This builds your debugging skills—the most valuable skill in professional development.
Extensions: Making It Real
Your basic pipeline works. Now make it more sophisticated. Choose one or more extensions:
Extension 1: Multi-Criteria Filtering
Filter students who are Computer Science OR Mathematics majors with GPA above 3.4:
stem_students: list[dict[str, str | float]] = [
s for s in students
if (s["major"] in ["Computer Science", "Mathematics"])
and s["gpa"] > 3.4
]
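A small style note: when testing membership against several values, a set literal is the idiomatic container (sets are the topic of Chapter 19), and the filter behaves the same:
# Same filter, with a set literal for the membership test
stem_students_v2: list[dict[str, str | float]] = [
    s for s in students
    if s["major"] in {"Computer Science", "Mathematics"} and s["gpa"] > 3.4
]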
Extension 2: Sort Results
Sort students by GPA (highest first) before output:
📘 Note: The lambda syntax below is a shorthand for defining small, inline operations. You'll learn lambda functions in Chapter 20. For now, just understand: key=lambda s: s["gpa"] means "sort by the 'gpa' field of each student dict."
top_students: list[dict[str, str | float]] = sorted(
students,
key=lambda s: s["gpa"],
reverse=True
)
# Display sorted results
print("Top Students by GPA:")
for student in top_students:
print(f"{student['name']}: {student['gpa']}")
Extension 3: Find Outliers
Find students whose GPA is significantly different from their major's average:
# Find students with GPA more than 0.3 above their major's average
threshold: float = 0.3
outliers: list[dict[str, str | float]] = []
for student in students:
major: str = student["major"]
avg: float = stats[major]["average_gpa"]
difference: float = student["gpa"] - avg
if difference > threshold:
outliers.append(student)
print(f"Found {len(outliers)} high-performing outliers:")
for student in outliers:
print(f" {student['name']} ({student['major']}): {student['gpa']}")
Capstone Validation: Am I Done?
Check yourself against these criteria:
Core Functionality (Required):
- Code parses raw CSV string into list[dict] ✓
- Filtering works with at least one comprehension ✓
- Aggregation calculates correct counts and averages ✓
- Output is formatted and readable ✓
- No runtime errors when processing data ✓
Code Quality (Expected):
- Type hints present on variables and data structures ✓
- Variable names are descriptive (not x, data1, etc.) ✓
- Comments explain non-obvious logic ✓
- Code follows consistent indentation ✓
Understanding (Critical):
- I can explain why each data structure (list, dict) was chosen ✓
- I can justify the comprehension logic ✓
- I could modify this for a different data format (products, employees, transactions) ✓
- I asked AI when I was stuck and learned from the explanation ✓
Stretch Goals (Optional):
- Implemented at least one extension ✓
- Data handles edge cases (empty records, missing fields) ✓
- Code is organized and easy to read ✓
Try With AI
Use your preferred AI companion (Claude Code, Gemini CLI, or ChatGPT web).
Prompt 1: Recall Architecture (Remember)
"I'm building a data pipeline to process student records. Should I store each record as a dict or a tuple? Should I use a list or a dict to aggregate statistics? Explain your reasoning for each choice."
Expected Outcome: You'll verify your understanding of structure selection. AI reinforces why list[dict] works for records and why dict is natural for aggregations.
Prompt 2: Understand the Pattern (Understand)
"Explain how list comprehensions with if conditions work. Show me a concrete example that filters students by major and GPA, then explain each part of the comprehension syntax."
Expected Outcome: You'll deepen your understanding of comprehension syntax and be able to read/write more complex filtering logic independently.
Prompt 3: Apply to New Data (Apply)
"Here's my student data pipeline working. Now I need to process employee records instead (name, department, salary). How would I modify my parsing, filtering, and aggregation functions for this new data? Show me the modified code with the same structure."
Expected Outcome: You'll prove you can transfer the pipeline pattern to different domains. This demonstrates true competency—not just following steps, but understanding the underlying pattern.
Prompt 4: Debug and Extend (Analyze)
"I'm getting a KeyError when filtering by department. [Paste your code]. Why is this happening? Help me debug it. Then, show me how to add a feature that calculates average salary per department."
Expected Outcome: You'll practice debugging with AI as a partner, moving from error to understanding. You'll also extend the pipeline with new aggregations—real-world application building.
Safety & Ethics Note: When AI suggests code, validate that it:
- Correctly handles the data you're working with
- Doesn't skip error handling for edge cases (empty lists, missing keys, type mismatches)
- Uses type hints appropriately
- Matches your project's style and structure
Ask AI: "Why did you choose this approach? Are there tradeoffs I should consider?" This builds critical thinking alongside coding skills.
Capstone Success
You've now completed the full journey from raw data to insights. You've:
- Designed data structures strategically
- Parsed text into structured Python objects
- Filtered data with comprehensions
- Aggregated results using dict-based accumulators
- Output professional summaries
This is real work. Data engineers, backend developers, analytics engineers do this every day. You've demonstrated the core competency: architectural thinking combined with execution.
Congratulations on completing Chapter 18. You're ready for Chapter 19 (Sets and Frozen Sets) and beyond. The collection structures you've mastered form the foundation for everything that comes next—from functions that operate on collections to objects that contain collections as attributes.
What's next: In Chapter 20, you'll learn how to encapsulate this pipeline logic into reusable functions. In Chapter 21, you'll handle exceptions robustly when data is malformed. In Chapter 22, you'll read/write data from actual files. But the core pattern—ingest, transform, aggregate, output—remains your north star.
Keep building. Keep asking your AI partner. Keep validating. You're thinking like a developer now.