Structured Extraction

After classifying items, you can use an LLM to extract typed attributes into your own Pydantic schemas. This turns unstructured German construction text into structured data, for example door dimensions, fire ratings, and materials taken from bill of quantities (BoQ) descriptions.

Setup

Structured extraction requires the LLM extras:

pip install "pyGAEB[llm]"

Basic Usage

Define a Pydantic schema, then extract:

from pydantic import BaseModel, Field
from pygaeb import GAEBParser, LLMClassifier, StructuredExtractor

doc = GAEBParser.parse("tender.X83")

# First, classify items so extraction knows what to look for
classifier = LLMClassifier(model="anthropic/claude-sonnet-4-6")
await classifier.enrich(doc)

# Define your schema
class DoorSpec(BaseModel):
    door_type: str = Field("", description="single, double, sliding")
    width_mm: int | None = Field(None, description="Width in mm")
    fire_rating: str | None = Field(None, description="T30, T60, T90")
    glazing: bool = Field(False, description="Has glass panels")
    material: str = Field("", description="wood, steel, aluminium")

# Extract from all items classified as "Door"
extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")
doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")

for item, spec in doors:
    print(f"{item.oz}: {spec.door_type}, {spec.width_mm}mm, fire={spec.fire_rating}")
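The snippet above uses top-level await, which works in notebooks and async REPLs. In a plain script the awaited calls need to live inside a coroutine that is started with asyncio.run(). The sketch below shows only that wrapping pattern; fake_extract is a hypothetical stand-in for the real extractor.extract() call, so it runs without pyGAEB or an API key.

```python
import asyncio


async def fake_extract(texts: list[str]) -> list[tuple[str, dict]]:
    """Hypothetical stand-in for `await extractor.extract(...)`."""
    await asyncio.sleep(0)  # yield to the event loop once, like an I/O call would
    return [(text, {"door_type": "single"}) for text in texts]


async def main() -> list[tuple[str, dict]]:
    # In real code, this is where classifier.enrich(doc) and
    # extractor.extract(doc, schema=DoorSpec, element_type="Door")
    # would be awaited, in that order.
    return await fake_extract(["Innentür T30", "Stahltür"])


if __name__ == "__main__":
    for text, spec in asyncio.run(main()):
        print(text, spec)
```

Keeping classification and extraction inside one coroutine means both steps share a single event loop.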

Filtering

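Control which items are extracted using filter parameters. PipeSpec and WallSpec in the examples below are the built-in schemas described under Built-in Schemas: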

# By element type (most common)
doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")

# By trade (broad)
pipes = await extractor.extract(doc, schema=PipeSpec, trade="MEP-Plumbing")

# By sub-type (narrow)
fire_doors = await extractor.extract(doc, schema=DoorSpec, sub_type="Fire Door")

# Combine filters
exterior = await extractor.extract(
    doc, schema=WallSpec, trade="Structural", element_type="Wall", sub_type="Exterior Wall"
)
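One way to picture combined filters is as a logical AND: an item is selected only if it matches every filter that was passed. The sketch below illustrates that reading with plain Python. The Classified class, its attribute names, and the matches helper are hypothetical stand-ins, and the AND semantics are an assumption rather than confirmed library behavior.

```python
from dataclasses import dataclass


@dataclass
class Classified:
    """Hypothetical stand-in for a classified BoQ item."""
    oz: str
    trade: str
    element_type: str
    sub_type: str


def matches(item, trade=None, element_type=None, sub_type=None) -> bool:
    """Return True when the item satisfies every filter that was given."""
    wanted = {"trade": trade, "element_type": element_type, "sub_type": sub_type}
    return all(
        getattr(item, name) == value
        for name, value in wanted.items()
        if value is not None  # filters left as None are ignored
    )


items = [
    Classified("01.01.0010", "Structural", "Wall", "Exterior Wall"),
    Classified("01.01.0020", "Structural", "Wall", "Interior Wall"),
    Classified("02.01.0010", "Finishes", "Door", "Fire Door"),
]

# A single filter selects broadly; adding filters narrows the selection.
walls = [i.oz for i in items if matches(i, element_type="Wall")]
exterior = [
    i.oz for i in items
    if matches(i, trade="Structural", element_type="Wall", sub_type="Exterior Wall")
]
print(walls)
print(exterior)
```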

Extract from an Explicit Item List

Use extract_items() when you want full control over which items to extract from, bypassing classification-based filtering:

# Pick specific items
my_items = [item for item in doc.iter_items() if "Tür" in item.short_text]

results = await extractor.extract_items(
    my_items, schema=DoorSpec,
    trade_context="Finishes",
    element_context="Door",
)
for item, spec in results:
    print(item.oz, spec.door_type, spec.fire_rating)
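The keyword test above is case-sensitive, so "Tür" does not match compound nouns such as "Brandschutztür", where the same letters appear in lowercase. A minimal sketch of a case-insensitive pre-filter follows. The Item class and the mentions_any helper are hypothetical; only the short_text and oz attribute names come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in exposing only the fields used here."""
    oz: str
    short_text: str


def mentions_any(text: str, keywords: list[str]) -> bool:
    """Case-insensitive substring match against any of the keywords."""
    folded = text.casefold()
    return any(keyword.casefold() in folded for keyword in keywords)


items = [
    Item("01.0010", "Brandschutztür T30, einflügelig"),
    Item("01.0020", "Tür, Holz, 885 x 2010 mm"),
    Item("01.0030", "Fenster, Kunststoff"),
]

case_sensitive = [i.oz for i in items if "Tür" in i.short_text]
case_insensitive = [i.oz for i in items if mentions_any(i.short_text, ["Tür"])]
print(case_sensitive)
print(case_insensitive)
```

Substring matching can still over-select (for instance "Türzarge", a door frame), so a keyword pre-filter is best treated as a coarse first pass before extraction.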

Cost Estimation

Estimate the LLM cost before running extraction:

estimate = await extractor.estimate_cost(doc, schema=DoorSpec, element_type="Door")
print(f"Items to extract: {estimate['items_to_extract']}")
print(f"Cached (free): {estimate['cached_items']}")
print(f"Estimated tokens: {estimate['estimated_input_tokens']} in / {estimate['estimated_output_tokens']} out")
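The estimate reports token counts, not money. Converting it to a currency figure needs the per-token prices of whichever model is in use. The sketch below assumes the dictionary keys shown above; the helper name, the sample estimate, and both prices are illustrative placeholders, not real rates.

```python
def estimated_cost_usd(
    estimate: dict,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Convert a token estimate into an approximate USD figure."""
    input_cost = estimate["estimated_input_tokens"] / 1_000_000 * usd_per_million_input
    output_cost = estimate["estimated_output_tokens"] / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Hand-written sample with the same keys as the estimate printed above.
sample_estimate = {
    "items_to_extract": 120,
    "cached_items": 30,
    "estimated_input_tokens": 60_000,
    "estimated_output_tokens": 12_000,
}

# Placeholder prices; substitute the provider's current rates.
cost = estimated_cost_usd(
    sample_estimate, usd_per_million_input=3.0, usd_per_million_output=15.0
)
budget = 1.00
print(f"approx. ${cost:.2f} (within budget: {cost <= budget})")
```

A check like this can gate the real extraction call, so a large tender does not trigger an unexpectedly expensive run.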

Progress Tracking

Monitor extraction progress with a callback:

def on_progress(completed: int, total: int, current_label: str):
    print(f"[{completed}/{total}] Extracting {current_label}...")

doors = await extractor.extract(
    doc, schema=DoorSpec, element_type="Door", on_progress=on_progress,
)
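For large documents, printing one line per item gets noisy. The sketch below builds a callback with the same (completed, total, current_label) signature that only reports every few items and at the end. The make_progress_logger factory and its every and sink parameters are hypothetical, and the sketch assumes the callback is invoked once per finished item with an increasing completed count.

```python
def make_progress_logger(every: int = 10, sink=print):
    """Build an on_progress callback that reports every `every` items."""

    def on_progress(completed: int, total: int, current_label: str) -> None:
        # Always report the final item so the log ends at 100%.
        if completed % every == 0 or completed == total:
            percent = 100 * completed / total if total else 100.0
            sink(f"[{completed}/{total}] {percent:.0f}% (last: {current_label})")

    return on_progress


# Simulate five progress calls and collect the output in a list.
lines: list[str] = []
callback = make_progress_logger(every=2, sink=lines.append)
for n in range(1, 6):
    callback(n, 5, f"item {n}")

for line in lines:
    print(line)
```

The returned function can be passed as on_progress=make_progress_logger(every=25) in place of the plain callback.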

Re-extraction and Attachment

By default, extract() reuses cached results and stores each result on item.extractions. Override either behavior with force_reextract or attach:

# Force re-extraction (bypass cache)
doors = await extractor.extract(
    doc, schema=DoorSpec, element_type="Door", force_reextract=True,
)

# Don't store results on items (return only)
doors = await extractor.extract(
    doc, schema=DoorSpec, element_type="Door", attach=False,
)

Synchronous API

For code that does not use async/await, call extract_sync() the same way as extract():

doors = extractor.extract_sync(doc, schema=DoorSpec, element_type="Door")

Built-in Schemas

pyGAEB includes starter schemas for common element types:

from pygaeb.extractor.builtin_schemas import DoorSpec, WindowSpec, WallSpec, PipeSpec

DoorSpec

| Field | Type | Description |
|---|---|---|
| door_type | str | single, double, sliding |
| width_mm | int \| None | Width in mm |
| height_mm | int \| None | Height in mm |
| fire_rating | str \| None | Fire class (T30, T60, T90) |
| acoustic_rating_db | int \| None | Sound insulation in dB |
| surface_finish | str \| None | HPL, painted, veneer, etc. |
| frame_material | str \| None | Steel, wood, aluminium |
| glazing | bool | Has glass panels |
| material | str | Primary material |

WindowSpec

| Field | Type | Description |
|---|---|---|
| window_type | str | fixed, casement, tilt-turn |
| width_mm | int \| None | Width in mm |
| height_mm | int \| None | Height in mm |
| u_value | float \| None | U-value in W/(m²K) |
| glazing_type | str \| None | double, triple |
| frame_material | str \| None | PVC, aluminium, timber |
| fire_rating | str \| None | Fire class if applicable |
| sound_insulation_db | int \| None | Sound insulation in dB |
| opening_direction | str \| None | left, right, top |

WallSpec and PipeSpec

Similar structured schemas for walls (thickness, material, load-bearing, insulation) and pipes (diameter, material, medium, pressure rating). See the Extractor Reference for full details.

Custom Schemas

Any Pydantic BaseModel works. Tips for best results:

  1. Use Field(description=...): the LLM reads field descriptions to understand what to extract
  2. Use None defaults for optional fields: the LLM will leave them as None when the value is not found
  3. Keep schemas focused: 5–15 fields per schema works best
  4. Use German-friendly descriptions: GAEB text is typically in German

For example, a schema for concrete work:

from pydantic import BaseModel, Field

class ConcreteSpec(BaseModel):
    """Extracted attributes for concrete work items."""
    strength_class: str = Field("", description="Betongüte, e.g. C25/30, C30/37")
    exposure_class: str | None = Field(None, description="Expositionsklasse, e.g. XC1, XD2")
    thickness_mm: int | None = Field(None, description="Dicke/Stärke in mm")
    reinforcement: bool = Field(False, description="Bewehrung vorhanden")
    formwork: str | None = Field(None, description="Schalungsart")

Extraction Results

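Each result is returned as an (item, spec) pair and, unless attach=False is passed, also stored on item.extractions under the schema's class name: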

for item, spec in doors:
    # spec is a DoorSpec instance
    print(spec.model_dump())

    # Also stored on the item for later access
    result = item.extractions["DoorSpec"]
    print(result.schema_name)     # "DoorSpec"
    print(result.data)            # dict of extracted values
    print(result.completeness)    # 0.0–1.0 (fraction of non-default fields)
    print(result.cached)          # bool — was this from cache?
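The completeness value is described above as the fraction of non-default fields. The sketch below shows one way such a score can be computed. It uses a dataclass as a dependency-free stand-in for the Pydantic model, and both DoorSpecSketch and the completeness function are hypothetical; the library's own calculation may differ in detail.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DoorSpecSketch:
    """Dataclass stand-in mirroring the DoorSpec fields from Basic Usage."""
    door_type: str = ""
    width_mm: Optional[int] = None
    fire_rating: Optional[str] = None
    glazing: bool = False
    material: str = ""


def completeness(spec) -> float:
    """Fraction of fields whose value differs from the declared default."""
    all_fields = fields(spec)
    if not all_fields:
        return 0.0
    filled = sum(1 for f in all_fields if getattr(spec, f.name) != f.default)
    return filled / len(all_fields)


spec = DoorSpecSketch(door_type="single", width_mm=885, fire_rating="T30")
print(completeness(spec))
```

Under this definition, a value that was extracted but equals the default (for example glazing=False) counts as unfilled. That is a reason to prefer None defaults for fields where "not found" and "found, but false" should be distinguishable.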

Caching

Extraction results are cached the same way as classification results. See the Caching Guide for details on in-memory, SQLite, and custom backends.