Structured Extraction
After classifying items, you can extract typed attributes into your own Pydantic schemas using LLMs. This lets you turn unstructured German construction text into structured data — for example, extracting door dimensions, fire ratings, and materials from BoQ descriptions.
Setup
Structured extraction requires the LLM extras to be installed.
Basic Usage
Define a Pydantic schema, then extract:
from pydantic import BaseModel, Field
from pygaeb import GAEBParser, LLMClassifier, StructuredExtractor

doc = GAEBParser.parse("tender.X83")

# First, classify items so extraction knows what to look for
classifier = LLMClassifier(model="anthropic/claude-sonnet-4-6")
await classifier.enrich(doc)

# Define your schema
class DoorSpec(BaseModel):
    door_type: str = Field("", description="single, double, sliding")
    width_mm: int | None = Field(None, description="Width in mm")
    fire_rating: str | None = Field(None, description="T30, T60, T90")
    glazing: bool = Field(False, description="Has glass panels")
    material: str = Field("", description="wood, steel, aluminium")

# Extract from all items classified as "Door"
extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")
doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")

for item, spec in doors:
    print(f"{item.oz}: {spec.door_type}, {spec.width_mm}mm, fire={spec.fire_rating}")
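The snippets on this page call `await` directly, which works in a notebook or an asyncio REPL. In a regular script, the calls need to live inside a coroutine that is passed to `asyncio.run`. The sketch below shows only that wrapping pattern; `fake_extract` is a hypothetical stand-in for `extractor.extract(...)` so the example runs without pyGAEB or an API key.

```python
import asyncio


async def fake_extract(element_type: str) -> list[tuple[str, dict]]:
    """Hypothetical stand-in for `await extractor.extract(...)`."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [("01.02.0010", {"element_type": element_type, "width_mm": 1010})]


async def main() -> list[tuple[str, dict]]:
    # In real code, the parse / classify / extract calls would go here.
    return await fake_extract("Door")


if __name__ == "__main__":
    for oz, spec in asyncio.run(main()):
        print(oz, spec)
```

Keeping all awaited calls inside one `main()` coroutine means the event loop is started exactly once per script run.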
Filtering
Control which items are extracted using filter parameters:
# By element type (most common)
doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")

# By trade (broad)
pipes = await extractor.extract(doc, schema=PipeSpec, trade="MEP-Plumbing")

# By sub-type (narrow)
fire_doors = await extractor.extract(doc, schema=DoorSpec, sub_type="Fire Door")

# Combine filters
exterior = await extractor.extract(
    doc, schema=WallSpec, trade="Structural", element_type="Wall", sub_type="Exterior Wall"
)
Extract from an Explicit Item List
Use extract_items() when you want full control over which items to extract from, bypassing classification-based filtering:
# Pick specific items
my_items = [item for item in doc.iter_items() if "Tür" in item.short_text]

results = await extractor.extract_items(
    my_items, schema=DoorSpec,
    trade_context="Finishes",
    element_context="Door",
)

for item, spec in results:
    print(item.oz, spec.door_type, spec.fire_rating)
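A plain substring test such as `"Tür" in item.short_text` is case-sensitive, so it would miss compounds like "Stahltür" and the ASCII spelling "Tuer". One possible refinement is a case-insensitive regular expression. The sketch below uses a hypothetical `Item` dataclass as a stand-in for pyGAEB items (only `oz` and `short_text` are modelled) and made-up sample texts.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    """Stand-in for a pyGAEB item; only `oz` and `short_text` are modelled."""
    oz: str
    short_text: str


# Matches "Tür", "Türen", "Stahltür" and the ASCII spelling "Tuer", in any case.
DOOR_PATTERN = re.compile(r"t(ü|ue)r", re.IGNORECASE)

items = [
    Item("01.0010", "Stahltür T30, 1010 x 2135 mm"),
    Item("01.0020", "Innentueren, Holz"),
    Item("01.0030", "Fenster Kunststoff, 3-fach verglast"),
]

# Keep only the items whose short text mentions a door.
door_items = [item for item in items if DOOR_PATTERN.search(item.short_text)]
print([item.oz for item in door_items])
```

A pattern this loose can also match unrelated words such as "Türkis", so it is worth spot-checking the selected items before sending them to the LLM.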
Cost Estimation
Estimate the LLM cost before running extraction:
estimate = await extractor.estimate_cost(doc, schema=DoorSpec, element_type="Door")
print(f"Items to extract: {estimate['items_to_extract']}")
print(f"Cached (free): {estimate['cached_items']}")
print(f"Estimated tokens: {estimate['estimated_input_tokens']} in / {estimate['estimated_output_tokens']} out")
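The estimate reports token counts, not a price. To turn it into a budget figure you can multiply by your provider's per-token rates. The helper below is a sketch and not part of pyGAEB: `estimate_to_currency` is a hypothetical name, the sample numbers are made up, and the prices are placeholders that must be replaced with your provider's current price list.

```python
def estimate_to_currency(
    estimate: dict,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Convert token counts into a cost figure.

    Prices are per one million tokens, in whatever currency the provider bills.
    """
    input_cost = estimate["estimated_input_tokens"] / 1_000_000 * input_price_per_mtok
    output_cost = estimate["estimated_output_tokens"] / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Example estimate using the keys shown above (numbers are made up).
sample_estimate = {
    "items_to_extract": 40,
    "cached_items": 10,
    "estimated_input_tokens": 200_000,
    "estimated_output_tokens": 50_000,
}

# Placeholder prices, not real provider rates.
cost = estimate_to_currency(sample_estimate, input_price_per_mtok=2.0, output_price_per_mtok=10.0)
print(f"Estimated cost: {cost:.2f}")
```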
Progress Tracking
Monitor extraction progress with a callback:
def on_progress(completed: int, total: int, current_label: str):
    print(f"[{completed}/{total}] Extracting {current_label}...")

doors = await extractor.extract(
    doc, schema=DoorSpec, element_type="Door", on_progress=on_progress,
)
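For large documents, printing one line per item can flood the console. A possible variation is a callback factory that reports only every N items and at completion. This is a sketch built on the `(completed, total, current_label)` signature shown above; `make_throttled_progress` and its `sink` parameter are hypothetical names, and the loop merely simulates the extractor invoking the callback.

```python
from typing import Callable


def make_throttled_progress(
    every: int = 10,
    sink: Callable[[str], None] = print,
) -> Callable[[int, int, str], None]:
    """Build a callback that reports every `every` items and on the last item."""

    def on_progress(completed: int, total: int, current_label: str) -> None:
        if completed % every == 0 or completed == total:
            percent = 100 * completed / total if total else 100.0
            sink(f"[{completed}/{total}] {percent:5.1f}%  {current_label}")

    return on_progress


# Simulate the extractor calling the callback for 25 items.
callback = make_throttled_progress(every=10)
for i in range(1, 26):
    callback(i, 25, f"item {i}")
```

The returned function can be passed as `on_progress=` in place of the simple version; the `sink` argument makes it easy to route messages to a logger instead of `print`.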
Re-extraction and Attachment
By default, the extractor reuses cached results and stores each result on item.extractions. Override either behavior with:
# Force re-extraction (bypass cache)
doors = await extractor.extract(
    doc, schema=DoorSpec, element_type="Door", force_reextract=True,
)

# Don't store results on items (return only)
doors = await extractor.extract(
    doc, schema=DoorSpec, element_type="Door", attach=False,
)
Synchronous API
Built-in Schemas
pyGAEB includes starter schemas for common element types:
DoorSpec
| Field | Type | Description |
|---|---|---|
| door_type | str | single, double, sliding |
| width_mm | int \| None | Width in mm |
| height_mm | int \| None | Height in mm |
| fire_rating | str \| None | Fire class (T30, T60, T90) |
| acoustic_rating_db | int \| None | Sound insulation in dB |
| surface_finish | str \| None | HPL, painted, veneer, etc. |
| frame_material | str \| None | Steel, wood, aluminium |
| glazing | bool | Has glass panels |
| material | str | Primary material |
WindowSpec
| Field | Type | Description |
|---|---|---|
| window_type | str | fixed, casement, tilt-turn |
| width_mm | int \| None | Width in mm |
| height_mm | int \| None | Height in mm |
| u_value | float \| None | U-value in W/(m²K) |
| glazing_type | str \| None | double, triple |
| frame_material | str \| None | PVC, aluminium, timber |
| fire_rating | str \| None | Fire class if applicable |
| sound_insulation_db | int \| None | Sound insulation in dB |
| opening_direction | str \| None | left, right, top |
WallSpec and PipeSpec
Similar structured schemas for walls (thickness, material, load-bearing, insulation) and pipes (diameter, material, medium, pressure rating). See the Extractor Reference for full details.
Custom Schemas
Any Pydantic BaseModel works. Tips for best results:
- Use Field(description=...): the LLM reads field descriptions to understand what to extract
- Use None defaults for optional fields: the LLM will leave them as None when not found
- Keep schemas focused: 5–15 fields per schema works best
- Use German-friendly descriptions: GAEB text is typically in German
class ConcreteSpec(BaseModel):
    """Extracted attributes for concrete work items."""
    strength_class: str = Field("", description="Betongüte, e.g. C25/30, C30/37")
    exposure_class: str | None = Field(None, description="Expositionsklasse, e.g. XC1, XD2")
    thickness_mm: int | None = Field(None, description="Dicke/Stärke in mm")
    reinforcement: bool = Field(False, description="Bewehrung vorhanden")
    formwork: str | None = Field(None, description="Schalungsart")
Extraction Results
Each result is returned as an (item, spec) pair and, unless attach=False is passed, also stored on the item:
for item, spec in doors:
    # spec is a DoorSpec instance
    print(spec.model_dump())

    # Also stored on the item for later access
    result = item.extractions["DoorSpec"]
    print(result.schema_name)   # "DoorSpec"
    print(result.data)          # dict of extracted values
    print(result.completeness)  # 0.0–1.0 (fraction of non-default fields)
    print(result.cached)        # bool — was this from cache?
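The `completeness` value is described as the fraction of non-default fields. The sketch below shows one way such a score could be computed on plain dicts. It illustrates the idea only and is an assumption about, not a copy of, pyGAEB's implementation; the `completeness` function and the sample values are hypothetical.

```python
def completeness(data: dict, defaults: dict) -> float:
    """Fraction of fields whose extracted value differs from the schema default."""
    if not defaults:
        return 0.0
    filled = sum(
        1 for name, default in defaults.items()
        if data.get(name, default) != default
    )
    return filled / len(defaults)


# Defaults mirror the DoorSpec defined in Basic Usage.
door_defaults = {
    "door_type": "", "width_mm": None, "fire_rating": None,
    "glazing": False, "material": "",
}
# Made-up extraction result: three of five fields were filled in.
extracted = {
    "door_type": "single", "width_mm": 1010, "fire_rating": "T30",
    "glazing": False, "material": "",
}

score = completeness(extracted, door_defaults)
print(f"{score:.2f}")
```

Under this reading, a door that truly has no glazing scores the same as one whose text never mentions glazing, since both leave `glazing` at `False`. If that distinction matters, a `bool | None` field with a `None` default keeps the two cases apart.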
Caching
Extraction results are cached the same way as classification. See the Caching Guide for details on in-memory, SQLite, and custom backends.