Advanced Dataclass Features – Fields, Metadata, Post-Init, and Validation

In Lesson 3, you learned how @dataclass eliminates boilerplate by auto-generating __init__(), __repr__(), and __eq__(). But real-world data models need more control: mutable defaults without gotchas, validation on creation, computed fields, metadata for serialization. That's where advanced dataclass features come in.

In this lesson, you'll master the tools that let dataclasses handle production complexity while staying clean and readable. We'll explore field() for customization, __post_init__() for validation, InitVar for temporary data, and practical JSON serialization. By the end, you'll build dataclasses that enforce their own correctness and integrate seamlessly with APIs.

The Challenge: Default Values and Mutable Objects

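Before diving into solutions, let's see why mutable defaults are dangerous. Suppose you write this:

from dataclasses import dataclass

@dataclass
class TodoList:
    items: list[str] = []  # ❌ ValueError at class definition

The @dataclass decorator rejects this definition: it raises ValueError (mutable default <class 'list'> for field items is not allowed: use default_factory). The check exists because in a plain class the same declaration creates one list shared by every instance:

class PlainTodoList:
    items: list[str] = []  # Class attribute: one list for all instances

list1 = PlainTodoList()
list2 = PlainTodoList()

list1.items.append("Task 1")
print(list2.items)  # ['Task 1'] (shared state, not what we wanted)

This mutable default gotcha is the most common dataclass mistake. The solution: default_factory.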
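As a quick check of how @dataclass itself treats a list default, the sketch below wraps such a definition in try/except and prints the error. The class name BrokenTodoList is illustrative only.

```python
from dataclasses import dataclass

try:
    @dataclass
    class BrokenTodoList:
        # A list literal as a default is rejected while the class is being built.
        items: list[str] = []
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

The message points at default_factory, the fix covered next.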

💬 AI Colearning Prompt

"Explain why mutable default arguments in Python are dangerous. Why does default=[] cause shared state between instances?"

The Solution: field() and default_factory

The field() function gives you fine control over each dataclass field. Here's the production-ready pattern:

from dataclasses import dataclass, field

@dataclass
class TodoList:
    name: str
    items: list[str] = field(default_factory=list)  # ✅ Each instance gets its own list
    tags: dict[str, str] = field(default_factory=dict)  # ✅ Each instance gets its own dict
    priority: int = 5  # Immutable defaults work fine

Now each instance gets its own mutable containers:

list1 = TodoList(name="Work")
list2 = TodoList(name="Personal")

list1.items.append("Review PR")
print(list2.items) # [] — correct, no shared state

Why this matters: In production APIs and databases, shared mutable state causes subtle bugs that only appear when you create multiple instances. Using default_factory is non-negotiable for production dataclasses.
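default_factory accepts any zero-argument callable, not only list or dict. A minimal sketch, assuming a hypothetical Board model and a hypothetical default_columns() helper, shows a pre-populated default that is still created fresh for each instance:

```python
from dataclasses import dataclass, field


def default_columns() -> list[str]:
    """Hypothetical factory: returns a fresh list on every call."""
    return ["todo", "done"]


@dataclass
class Board:
    name: str
    columns: list[str] = field(default_factory=default_columns)


first = Board(name="Work")
second = Board(name="Home")
first.columns.append("blocked")

print(first.columns)                    # ['todo', 'done', 'blocked']
print(second.columns)                   # ['todo', 'done']
print(first.columns is second.columns)  # False
```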

🎓 Instructor Commentary

In AI-native development, you don't memorize Python's mutable default gotcha—you recognize "mutable type as default?" and immediately reach for default_factory. The pattern becomes automatic.

Code Example 1: Using default_factory for Mutable Defaults

Let's see field() in action. This is the foundation for all advanced dataclass features.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Product:
    """Production-ready product data model."""
    name: str
    price: float
    tags: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
    quantity: int = 0

    def __post_init__(self) -> None:
        """Validate on creation (explained next)."""
        if self.price < 0:
            raise ValueError(f"Price cannot be negative: {self.price}")
        if self.quantity < 0:
            raise ValueError(f"Quantity cannot be negative: {self.quantity}")

# Usage
product1 = Product(name="Laptop", price=999.99)
product2 = Product(name="Mouse", price=29.99)

# Each gets its own lists/dicts (no sharing)
product1.tags.append("electronics")
product2.tags.append("hardware")

print(product1.tags) # ['electronics']
print(product2.tags) # ['hardware']

Validation Step: Run this with type checking:

python script.py
mypy --strict script.py # Should pass type check

Specification Reference: This example demonstrates Spec Example 1 from the plan: "Using default_factory for mutable defaults (list, dict)"

AI Prompts Used:

  • "Create a dataclass for a Product with name, price, and mutable fields (tags, metadata). Use field() with default_factory for mutable types."
  • Validate: Run the code, confirm each instance has its own list/dict

Field Customization: More Control Over Individual Fields

Beyond default_factory, field() offers other parameters for controlling how fields behave:

from dataclasses import dataclass, field, fields

@dataclass
class APIResponse:
    """API response with customized fields."""
    user_id: int
    username: str

    # Don't include in __init__ (computed later)
    display_name: str = field(init=False, default="")

    # Don't include in __repr__ (sensitive data)
    api_key: str = field(repr=False, default="", doc="Secret API key (hidden from repr)")

    # Don't compare (for equality checks)
    request_id: str = field(compare=False, default="", doc="Request ID for tracing (not part of equality)")

    # Metadata for validation/serialization (Python 3.14+ supports doc parameter)
    email: str = field(
        metadata={"validation": "email", "required": True},
        default="",
        doc="User email address with validation metadata"
    )

Why each parameter matters:

  • init=False: Field won't appear in __init__() signature (good for computed fields)
  • repr=False: Field excluded from string representation (good for secrets)
  • compare=False: Field excluded from equality comparisons
  • metadata: Arbitrary data attached to field (for validators, serialization hints, documentation)
  • doc: New in Python 3.14. Field documentation string (accessible via introspection); on earlier versions, passing doc to field() raises TypeError
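The effect of these parameters is easiest to see side by side. The following is a minimal sketch using a hypothetical Session model (it omits doc so that it also runs on Python versions before 3.14):

```python
from dataclasses import dataclass, field, fields


@dataclass
class Session:
    """Hypothetical model used only to illustrate field() options."""
    user_id: int
    token: str = field(repr=False, default="")        # hidden from repr
    trace_id: str = field(compare=False, default="")  # ignored by ==
    label: str = field(init=False, default="")        # not an __init__ parameter
    email: str = field(default="", metadata={"validation": "email"})

    def __post_init__(self) -> None:
        self.label = f"session-{self.user_id}"


a = Session(user_id=7, token="s3cret", trace_id="req-1")
b = Session(user_id=7, token="s3cret", trace_id="req-2")

print(a)       # Session(user_id=7, trace_id='req-1', label='session-7', email='')
print(a == b)  # True: trace_id differs but is excluded from comparison

# Metadata is read back through fields(); it has no effect by itself.
email_meta = {f.name: f.metadata for f in fields(Session)}["email"]
print(email_meta["validation"])  # email
```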

🚀 CoLearning Challenge

Ask your AI Co-Teacher:

"Create a User dataclass with name, email, created_at. Add metadata to the email field for validation. Explain what metadata is and how you'd use it for validation."

Expected Outcome: You'll understand that metadata is arbitrary data you attach to fields for use in custom validation functions.

Code Example 2: Field with Metadata and init/repr Control

Here's a realistic example showing how these parameters work together:

from dataclasses import dataclass, field, fields
import re
from datetime import datetime
from typing import Any

@dataclass
class User:
    """User with custom field behavior."""
    user_id: int
    name: str
    email: str = field(metadata={"pattern": r"^[^@]+@[^@]+\.[^@]+$", "required": True})

    # Computed field (not in __init__)
    email_verified: bool = field(init=False, default=False)

    # Sensitive field (not in __repr__)
    password_hash: str = field(repr=False, default="")

    # Internal field (not compared in __eq__)
    created_at: datetime = field(compare=False, default_factory=datetime.now)

    # Custom metadata for validation
    age: int = field(
        metadata={"min": 0, "max": 150, "type": "age"},
        default=0
    )

# Access metadata at runtime
def validate_field(dataclass_instance: Any, field_name: str, value: Any) -> bool:
    """Validate a field using its metadata."""
    for f in fields(dataclass_instance):
        if f.name == field_name:
            meta = f.metadata
            if "pattern" in meta and isinstance(value, str):
                if not re.match(meta["pattern"], value):
                    return False
            if isinstance(value, (int, float)):
                # Check both bounds; returning after "min" alone would skip "max"
                if "min" in meta and value < meta["min"]:
                    return False
                if "max" in meta and value > meta["max"]:
                    return False
    return True

# Usage
user = User(
    user_id=1,
    name="Alice",
    email="alice@example.com",
    password_hash="hashed_password_here",
    age=30
)

print(user)  # password_hash not shown ✓

# Same values everywhere except created_at
same_user = User(
    user_id=1,
    name="Alice",
    email="alice@example.com",
    password_hash="hashed_password_here",
    age=30,
    created_at=datetime(2020, 1, 1)
)
print(user == same_user)
# True (created_at not compared because compare=False)

# Validate using metadata
assert validate_field(user, "email", "alice@example.com")
assert not validate_field(user, "email", "not_an_email")
assert validate_field(user, "age", 25)
assert not validate_field(user, "age", 200)  # Exceeds max

Specification Reference: Spec Example 2: "Field with metadata (for serialization, validation)"

Validation: Run code, verify field behavior (password_hash not in repr, created_at not compared, etc.)

Validation After Creation: __post_init__()

The __post_init__() method is called at the end of the generated __init__(), after every field has been assigned. It's perfect for validation and computed fields that depend on other fields.

from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Order:
    """Order with validation in __post_init__()."""
    order_id: int
    customer_name: str
    amount: float

    def __post_init__(self) -> None:
        """Validate order on creation."""
        if self.amount <= 0:
            raise ValueError(f"Amount must be positive, got {self.amount}")

        if not self.customer_name or not self.customer_name.strip():
            raise ValueError("Customer name cannot be empty")

# Valid order
order = Order(order_id=1, customer_name="Alice", amount=99.99)

# Invalid order: raises immediately
try:
    bad_order = Order(order_id=2, customer_name="", amount=-50)
except ValueError as e:
    print(f"Order creation failed: {e}")

Why __post_init__() is essential:

  1. Validation happens at creation time (fail fast)
  2. Invalid instances cannot be constructed (later attribute assignments are not re-validated)
  3. Cleaner than manual validation after instantiation
  4. Computed fields can depend on constructor parameters
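One boundary of the fail-fast guarantee is worth seeing in code: __post_init__() runs only during construction. A minimal sketch, assuming a hypothetical Payment model:

```python
from dataclasses import dataclass


@dataclass
class Payment:
    """Hypothetical model: validated once, when the instance is created."""
    amount: float

    def __post_init__(self) -> None:
        if self.amount <= 0:
            raise ValueError(f"Amount must be positive, got {self.amount}")


# Fail fast: the invalid instance never exists.
try:
    Payment(amount=-5)
    creation_failed = False
except ValueError:
    creation_failed = True

# Caveat: assignment after creation does not call __post_init__ again.
payment = Payment(amount=10.0)
payment.amount = -5
print(creation_failed, payment.amount)  # True -5
```

If later mutation is a concern, one option is @dataclass(frozen=True), which makes attribute assignment raise FrozenInstanceError.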

✨ Teaching Tip

Use Claude Code to explore edge cases: "What happens if I try to create an Order with amount=0? How would I handle that differently than amount=-50?"

Code Example 3: __post_init__() for Validation and Computed Fields

Here's a practical example combining validation with computed attributes:

from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Subscription:
    """Subscription with validation and computed expiry date."""
    user_id: int
    plan_name: str
    billing_cycle_days: int

    # Computed fields (calculated in __post_init__)
    created_at: datetime = field(init=False)
    expires_at: datetime = field(init=False)
    is_active: bool = field(init=False)

    def __post_init__(self) -> None:
        """Validate and compute fields."""
        # Validation: plan_name must be one of valid plans
        valid_plans = {"starter", "pro", "enterprise"}
        if self.plan_name not in valid_plans:
            raise ValueError(f"Invalid plan: {self.plan_name}. Must be one of {valid_plans}")

        # Validation: billing_cycle_days must be positive
        if self.billing_cycle_days <= 0:
            raise ValueError(f"Billing cycle must be positive, got {self.billing_cycle_days}")

        # Compute fields
        self.created_at = datetime.now()
        self.expires_at = self.created_at + timedelta(days=self.billing_cycle_days)
        self.is_active = self.expires_at > datetime.now()

# Usage
sub = Subscription(user_id=1, plan_name="pro", billing_cycle_days=30)
print(f"Subscription active: {sub.is_active}")
print(f"Expires: {sub.expires_at}")

# Invalid plan
try:
    bad_sub = Subscription(user_id=2, plan_name="premium", billing_cycle_days=30)
except ValueError as e:
    print(f"Subscription error: {e}")

Specification Reference: Spec Example 3: "__post_init__() for validation and computed fields"

Validation: Run code, verify validation works, check computed fields are set correctly

InitVar: Temporary Data for Initialization

Sometimes you need to pass data to __post_init__() for processing, but don't want to store it as an instance field. That's where InitVar comes in:

from dataclasses import dataclass, field, InitVar

@dataclass
class Account:
    """Account with password hashing (password not stored, hash is)."""
    username: str
    password_hash: str = field(init=False, repr=False, default="")

    # InitVar: passed to __post_init__ but not stored
    password: InitVar[str] = ""

    def __post_init__(self, password: str) -> None:
        """Hash password on creation."""
        if not password:
            raise ValueError("Password required")

        # Simple hash (use bcrypt in real code!)
        self.password_hash = f"hashed_{password}"

# Usage: pass password, but it's not stored
account = Account(username="alice", password="secret123")
print(account)  # Neither password nor password_hash is shown (repr=False)
# Account(username='alice')

# The password parameter was used in __post_init__ but is not stored on the instance
print("password" in vars(account))  # False
# Note: hasattr(account, "password") is True, because the InitVar default ""
# remains a class attribute

Key insight: InitVar fields appear in __init__() signature but NOT as instance fields. They're for data needed during initialization but not afterwards.
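A quick way to confirm this is to inspect fields() and asdict(), both of which skip InitVar pseudo-fields. The sketch below assumes a hypothetical Credentials model; the reversed string stands in for a real hash.

```python
from dataclasses import InitVar, asdict, dataclass, field, fields


@dataclass
class Credentials:
    """Hypothetical model: 'secret' is consumed in __post_init__ only."""
    username: str
    secret: InitVar[str]  # no default, so it is a required argument
    digest: str = field(init=False, default="")

    def __post_init__(self, secret: str) -> None:
        # Placeholder transformation, not a real hash.
        self.digest = secret[::-1]


creds = Credentials("alice", "abc123")

field_names = [f.name for f in fields(creds)]
print(field_names)                # ['username', 'digest']
print(asdict(creds))              # {'username': 'alice', 'digest': '321cba'}
print(hasattr(creds, "secret"))   # False (no default, so no class attribute)
```

When an InitVar is given a default value, that default stays behind as a class attribute, so checking vars(instance) or fields() is more reliable than hasattr().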

Code Example 4: InitVar for Post-Init Processing Without Storage

Here's a more complex example showing InitVar's power:

from dataclasses import dataclass, field, InitVar
import json

@dataclass
class Product:
    """Product with price validation and optional discount processing."""
    sku: str
    base_price: float
    name: str

    # Computed field
    final_price: float = field(init=False, default=0.0)

    # InitVar: discount percentage, used in __post_init__ but not stored
    discount_percent: InitVar[int] = 0

    def __post_init__(self, discount_percent: int) -> None:
        """Calculate final price after discount."""
        if self.base_price <= 0:
            raise ValueError(f"Base price must be positive, got {self.base_price}")

        if not 0 <= discount_percent <= 100:
            raise ValueError(f"Discount must be 0-100%, got {discount_percent}")

        discount_amount = self.base_price * (discount_percent / 100.0)
        self.final_price = self.base_price - discount_amount

# Usage
product = Product(sku="SKU-001", base_price=100.0, name="Laptop", discount_percent=10)
print(f"Base: ${product.base_price}, Discount: 10%, Final: ${product.final_price}")

# discount_percent was used in __post_init__ but isn't stored on the instance
# (hasattr() would still find the class-level default 0, so check vars() instead)
print("discount_percent" in vars(product))  # False ✓

Specification Reference: Spec Example 4: "InitVar for post-init processing without storage"

Validation: Run code, verify discount_percent is not a stored field, verify final_price is computed correctly

Serialization: Converting Dataclasses to JSON and Dicts

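Real-world applications need to convert dataclasses to JSON (for APIs) and back. The dataclasses module provides asdict() and astuple() for the outbound direction; rebuilding an instance from a dict is done by hand. (The Address | None annotation below needs Python 3.10+.)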

from dataclasses import dataclass, field, asdict, astuple
import json

@dataclass
class Address:
    """Simple address dataclass."""
    street: str
    city: str
    zip_code: str

@dataclass
class Person:
    """Person with nested address."""
    name: str
    age: int
    address: Address | None = None

# Create instance
person = Person(
    name="Alice",
    age=30,
    address=Address(street="123 Main St", city="San Francisco", zip_code="94105")
)

# Convert to dict (handles nested objects!)
person_dict = asdict(person)
print(person_dict)
# {
# 'name': 'Alice',
# 'age': 30,
# 'address': {'street': '123 Main St', 'city': 'San Francisco', 'zip_code': '94105'}
# }

# Convert to JSON string
person_json = json.dumps(person_dict)
print(person_json)

# Convert back from dict
restored_person = Person(
    name=person_dict['name'],
    age=person_dict['age'],
    address=Address(**person_dict['address']) if person_dict['address'] else None
)
print(restored_person)

Specification Reference: Spec Example 5: "Dataclass with JSON serialization (to_dict/from_dict)"
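astuple() is imported above but not exercised. As a minimal sketch, assuming hypothetical Point and Segment models, the following shows its nested output and a full round trip through JSON text:

```python
import json
from dataclasses import asdict, astuple, dataclass


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Segment:
    """Hypothetical nested model for illustration."""
    label: str
    start: Point
    end: Point


segment = Segment("a", Point(0, 0), Point(3, 4))

# astuple() recurses into nested dataclasses, like asdict().
as_tuple = astuple(segment)
print(as_tuple)  # ('a', (0, 0), (3, 4))

# Round trip through JSON text: nested dicts must be rebuilt by hand.
payload = json.loads(json.dumps(asdict(segment)))
restored = Segment(
    label=payload["label"],
    start=Point(**payload["start"]),
    end=Point(**payload["end"]),
)
print(restored == segment)  # True
```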

🎓 Instructor Commentary

You don't memorize JSON serialization techniques—you understand the pattern: "convert nested dataclass to dict, then to JSON". AI handles the details; you understand the flow.

Code Example 6: Real-World API Model with All Advanced Features

Here's a production-ready example combining everything: validation, computed fields, field customization, and serialization:

from dataclasses import dataclass, field, asdict, InitVar, fields
from datetime import datetime, timedelta
from typing import Any
import re
import json

@dataclass
class APIUser:
    """Real-world user model for API responses."""

    # Required fields
    user_id: int
    email: str = field(
        metadata={"pattern": r"^[^@]+@[^@]+\.[^@]+$"},
        doc="User's email address (validated against email regex pattern)"
    )

    # Optional fields with defaults
    username: str = field(default="", doc="Display username (defaults to empty string if not provided)")

    # Mutable default (MUST use default_factory)
    roles: list[str] = field(default_factory=lambda: ["user"])

    # Field metadata for validation (Python 3.14+ supports doc parameter)
    age: int = field(
        default=0,
        metadata={"min": 0, "max": 150},
        doc="User age in years (must be 0-150)"
    )

    # Computed/internal fields (not in __init__)
    created_at: datetime = field(init=False, repr=False)
    is_verified: bool = field(init=False, default=False)

    # Sensitive field (not in __repr__)
    password_hash: str = field(repr=False, default="")

    # InitVar for validation data
    password: InitVar[str] = ""

    def __post_init__(self, password: str) -> None:
        """Validate and compute fields."""
        # Validate email format
        if not re.match(r"^[^@]+@[^@]+\.[^@]+$", self.email):
            raise ValueError(f"Invalid email format: {self.email}")

        # Validate age range
        if not (0 <= self.age <= 150):
            raise ValueError(f"Age out of range: {self.age}")

        # Hash password
        if password:
            if len(password) < 8:
                raise ValueError("Password must be at least 8 characters")
            # Placeholder only: this string still contains the plaintext.
            # Use a real hashing library (e.g. bcrypt) in production.
            self.password_hash = f"bcrypt_hash({password})"

        # Set computed fields
        self.created_at = datetime.now()
        self.is_verified = False

def to_dict(user: APIUser) -> dict[str, Any]:
    """Convert user to dict for JSON serialization."""
    data = asdict(user)
    # Convert datetime to ISO string
    data['created_at'] = user.created_at.isoformat()
    return data

def from_dict(data: dict[str, Any]) -> APIUser:
    """Create user from dict (e.g., from API request)."""
    data = dict(data)  # Copy so the caller's dict is not mutated

    # Extract password for InitVar
    password = data.pop('password', "")

    # init=False fields are not accepted by __init__; set them afterwards
    created_at = data.pop('created_at', None)
    is_verified = data.pop('is_verified', False)

    # Create user (password goes to __post_init__)
    user = APIUser(**data, password=password)
    if created_at is not None:
        user.created_at = datetime.fromisoformat(created_at)
    user.is_verified = is_verified
    return user

# Usage
user = APIUser(
    user_id=1,
    email="alice@example.com",
    username="alice_wonderland",
    age=28,
    password="securepassword123"
)

print(f"User created: {user}") # password_hash not shown due to repr=False

# Convert to JSON
user_dict = to_dict(user)
user_json = json.dumps(user_dict, indent=2)
print(user_json)

# Convert back from JSON
restored_user = from_dict(user_dict)
print(f"Restored: {restored_user}")

# Validation catches errors
try:
    bad_user = APIUser(
        user_id=2,
        email="not_an_email",
        password="short"
    )
except ValueError as e:
    print(f"Validation error: {e}")

Specification Reference: Spec Example 6: "Real-world API model (combining all features)"

Validation Steps:

  1. Run the code successfully
  2. Check JSON serialization handles nested datetime
  3. Verify password_hash is not shown in repr
  4. Confirm validation catches invalid email and short password
  5. Verify asdict() includes all fields except InitVar
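Steps 2, 3, and 5 can be turned into quick checks. The sketch below uses a small hypothetical Member model as a stand-in so that it runs by itself; the same checks apply to any model that mixes a datetime field, a repr=False field, and an InitVar.

```python
import json
from dataclasses import InitVar, asdict, dataclass, field
from datetime import datetime


@dataclass
class Member:
    """Hypothetical stand-in for a larger API model."""
    member_id: int
    joined_at: datetime = field(default_factory=datetime.now)
    password_hash: str = field(default="", repr=False)
    password: InitVar[str] = ""

    def __post_init__(self, password: str) -> None:
        if password:
            # Placeholder only; real code would call a password-hashing library.
            self.password_hash = f"hashed:{len(password)}"


member = Member(member_id=1, joined_at=datetime(2024, 1, 2, 3, 4, 5), password="correct horse")

data = asdict(member)
# json.dumps cannot encode datetime by itself, so convert it first.
data["joined_at"] = data["joined_at"].isoformat()
encoded = json.dumps(data)

print("password" in data)                # False: InitVar is not a field
print("password_hash" in repr(member))   # False: repr=False hides it
print(json.loads(encoded)["joined_at"])  # 2024-01-02T03:04:05
```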

Common Mistakes to Avoid

You now understand the tools. Here are the pitfalls to watch for:

Mistake 1: Forgetting default_factory for Mutable Defaults

# ❌ WRONG - @dataclass raises ValueError (a plain class would share one list)
@dataclass
class Config:
    items: list[str] = []

# ✅ RIGHT - each instance gets its own list
@dataclass
class Config:
    items: list[str] = field(default_factory=list)

Mistake 2: Complex Logic in __post_init__()

__post_init__() should validate and compute simple fields. Complex logic belongs in methods:

# ❌ Too much in __post_init__()
def __post_init__(self):
    # Calculate complex metrics
    self.roi = (self.revenue - self.costs) / self.initial_investment
    self.percentile = self.calculate_percentile()

# ✅ Keep __post_init__() simple
def __post_init__(self):
    if self.revenue < 0:
        raise ValueError("Revenue cannot be negative")

def calculate_roi(self) -> float:
    """ROI calculation as separate method."""
    return (self.revenue - self.costs) / self.initial_investment

Mistake 3: Not Validating Field Metadata

Metadata is inert—it doesn't auto-validate. You must write validation logic:

# ❌ Metadata alone doesn't validate
@dataclass
class User:
    age: int = field(metadata={"min": 0, "max": 150})  # Just metadata, no validation!

# ✅ Write validation in __post_init__()
@dataclass
class User:
    age: int = field(metadata={"min": 0, "max": 150})

    def __post_init__(self) -> None:
        # Look up the field's metadata via fields() (from dataclasses import fields)
        meta = next(f for f in fields(self) if f.name == "age").metadata
        if not (meta["min"] <= self.age <= meta["max"]):
            raise ValueError(f"Age out of range: {self.age}")
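The same idea generalizes: rather than looking up one field by name, __post_init__() can loop over fields(self) and apply whatever bounds the metadata declares. This is a minimal sketch with a hypothetical Profile model; the "min" and "max" keys are a convention chosen here, not something the dataclasses module interprets.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Profile:
    """Hypothetical model; the "min"/"max" metadata keys are our own convention."""
    age: int = field(default=0, metadata={"min": 0, "max": 150})
    rating: float = field(default=0.0, metadata={"min": 0.0, "max": 5.0})
    nickname: str = ""  # no metadata, so no range check

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            low = f.metadata.get("min")
            high = f.metadata.get("max")
            if low is not None and value < low:
                raise ValueError(f"{f.name} below minimum {low}: {value}")
            if high is not None and value > high:
                raise ValueError(f"{f.name} above maximum {high}: {value}")


ok = Profile(age=30, rating=4.5)

try:
    Profile(age=200)
    error = ""
except ValueError as exc:
    error = str(exc)

print(error)  # age above maximum 150: 200
```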

Mistake 4: Comparing Instances When You Shouldn't

By default, __eq__() compares all fields. Use compare=False for fields that shouldn't affect equality:

# ❌ created_at affects equality (usually not desired)
@dataclass
class User:
    user_id: int
    name: str
    created_at: datetime

# ✅ created_at doesn't affect equality
@dataclass
class User:
    user_id: int
    name: str
    created_at: datetime = field(compare=False)
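To make the difference concrete, here is a small sketch comparing the two definitions. StrictUser and RelaxedUser are hypothetical names used so that both variants can coexist in one script.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class StrictUser:
    user_id: int
    created_at: datetime  # participates in __eq__


@dataclass
class RelaxedUser:
    user_id: int
    created_at: datetime = field(compare=False)  # skipped by __eq__


early, late = datetime(2020, 1, 1), datetime(2024, 1, 1)

strict_equal = StrictUser(1, early) == StrictUser(1, late)
relaxed_equal = RelaxedUser(1, early) == RelaxedUser(1, late)
print(strict_equal, relaxed_equal)  # False True
```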

Try With AI

In this lesson, you mastered advanced dataclass features: field customization, validation through __post_init__(), InitVar for temporary data, and serialization patterns. Now let's synthesize and extend your understanding.

Use Claude Code or your preferred AI companion to work through these prompts. For each, describe what you want before asking for code:

Prompt 1 (Recall): "What's the difference between default, default_factory, and InitVar? When would you use each?"

Expected Outcome: You explain that default works for immutable types, default_factory creates new mutable objects per instance, and InitVar passes data to __post_init__() without storing.

Prompt 2 (Understand): "How does __post_init__() work? Show me step-by-step what happens when I create a dataclass instance."

Expected Outcome: You trace: dataclass __init__() runs → __post_init__() runs → validation happens → computed fields set.

Prompt 3 (Apply): "Create a Product dataclass with:

  • Required: sku, name, base_price
  • Optional: discount_percent (InitVar, used in __post_init__)
  • Computed: final_price (calculated in __post_init__)
  • Validation: prices > 0, discount 0-100%

Include __post_init__() with validation. Test creating valid and invalid products."

Expected Outcome: Working dataclass with InitVar, validation, and computed field. Attempts to create invalid instances fail with clear error messages.

Prompt 4 (Create): "Design a BlogPost dataclass for an API response. Include:

  • Required: post_id, title, author, content
  • Optional: tags, published_at
  • Metadata on title field (e.g., max length)
  • __post_init__() validation
  • Methods to convert to dict and from dict for JSON serialization

Show how you'd handle nested comments (Comment dataclass)."

Expected Outcome: Production-ready dataclass design with metadata, validation, serialization methods, and handling of nested objects.

Safety/Ethics Note: Dataclass validation prevents invalid states at creation time, which is safer than discovering invalid data later. Always validate in __post_init__() for production code. When handling user input, validate thoroughly—don't trust user-provided emails, URLs, or sensitive data.


You've now mastered both basic dataclasses (Lesson 3) and advanced features (Lesson 4). In Lesson 5, you'll synthesize everything by comparing metaclasses, dataclasses, and traditional classes to choose the right tool for different problems.