Extractor¶

Structured data extraction using LLMs and user-defined Pydantic schemas.

StructuredExtractor¶

StructuredExtractor ¶

StructuredExtractor(model=None, fallbacks=None, cache=None, concurrency=None)

Extract structured attributes from BoQ items into user-defined Pydantic schemas.

Workflow

1. Base classification tags items (Door, Wall, Pipe, etc.).
2. The user defines a Pydantic schema for a category.
3. StructuredExtractor uses an LLM to extract typed attributes from the item text.

Usage

Default: in-memory cache (no disk)¶

extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")

Persistent: SQLite cache (opt-in)¶

from pygaeb.cache import SQLiteCache
extractor = StructuredExtractor(cache=SQLiteCache("~/.pygaeb/cache"))

By classification filter¶

results = await extractor.extract(doc, schema=DoorSpec, element_type="Door")

By explicit item list¶

results = await extractor.extract_items(my_items, schema=DoorSpec)

for item, spec in results:
    print(item.oz, spec.fire_rating, spec.width_mm)

extract `async` ¶

extract(doc, schema, trade=None, element_type=None, sub_type=None, on_progress=None, force_reextract=False, attach=True)

Extract structured data from items matching a classification filter.

Works for both procurement and trade documents.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `doc` | `GAEBDocument` | Parsed GAEB document (items should be classified first). | required |
| `schema` | `type[T]` | User-defined Pydantic model to extract into. | required |
| `trade` | `str \| None` | Filter by classification trade (Level 1). | `None` |
| `element_type` | `str \| None` | Filter by classification element_type (Level 2). | `None` |
| `sub_type` | `str \| None` | Filter by classification sub_type (Level 3). | `None` |
| `on_progress` | `ProgressCallback \| None` | Callback(completed, total, current_label). | `None` |
| `force_reextract` | `bool` | Bypass cache and re-extract all items. | `False` |
| `attach` | `bool` | If True, store results on item.extractions[schema_name]. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `list[tuple[Any, T]]` | List of (item, schema_instance) tuples. |
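The `on_progress` parameter is described as a callback taking `(completed, total, current_label)`. A minimal sketch of such a callback is shown here; the names `format_progress` and `print_progress` and the sample labels are illustrative assumptions, and the loop only simulates the calls an extractor would make.

```python
def format_progress(completed: int, total: int, current_label: str) -> str:
    """Format one progress update as a single line."""
    percent = 100 * completed / total if total else 100.0
    return f"[{completed}/{total}] {percent:5.1f}% {current_label}"


def print_progress(completed: int, total: int, current_label: str) -> None:
    """Callback matching the (completed, total, current_label) shape."""
    print(format_progress(completed, total, current_label))


# Simulated calls; in real use the extractor would invoke the callback itself.
labels = ["01.0010", "01.0020", "01.0030"]  # hypothetical item labels
for i, label in enumerate(labels, start=1):
    print_progress(i, len(labels), label)
```

A callback of this shape could then be passed as `on_progress=print_progress`.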

extract_items `async` ¶

extract_items(items, schema, trade_context='', element_context='', on_progress=None, force_reextract=False, attach=True)

Extract structured data from an explicit list of items.

Use this when you want full control over which items to extract from, bypassing classification-based filtering.

extract_sync ¶

extract_sync(doc, schema, trade=None, element_type=None, sub_type=None, on_progress=None, force_reextract=False, attach=True)

Synchronous convenience wrapper around extract; it accepts the same parameters.
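One common way to expose an async method to synchronous callers is `asyncio.run`. The sketch below illustrates that pattern with a stand-in coroutine; `fake_extract` and `fake_extract_sync` are hypothetical names, and the library's own wrapper may be implemented differently.

```python
import asyncio


async def fake_extract(items: list[str]) -> list[tuple[str, dict]]:
    """Stand-in for an async extraction call (hypothetical)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [(item, {"length": len(item)}) for item in items]


def fake_extract_sync(items: list[str]) -> list[tuple[str, dict]]:
    """Run the coroutine to completion from synchronous code."""
    # asyncio.run raises RuntimeError if an event loop is already running
    # in this thread, so this pattern suits scripts rather than async apps.
    return asyncio.run(fake_extract(items))


results = fake_extract_sync(["Door T30", "Wall"])
```

Code that already runs inside an event loop would normally `await` the async method directly instead.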

estimate_cost `async` ¶

estimate_cost(doc, schema, trade=None, element_type=None, sub_type=None)

Estimate cost of extracting structured data from matching items.

Built-in Schemas¶

DoorSpec ¶

Bases: BaseModel

Structured specification for door items.

WindowSpec ¶

Bases: BaseModel

Structured specification for window items.

WallSpec ¶

Bases: BaseModel

Structured specification for wall items.

PipeSpec ¶

Bases: BaseModel

Structured specification for pipe items.

Schema Utilities¶

schema_utils ¶

Schema utilities: hashing, field introspection, completeness scoring.

compute_schema_hash ¶

compute_schema_hash(schema)

Deterministic hash of a Pydantic model's JSON schema.

The hash changes whenever the user modifies their schema, which triggers a cache miss.
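To illustrate what "deterministic" means here, the sketch below hashes a JSON-schema dict after serializing it canonically. The choice of SHA-256, the canonical JSON form, and the name `sketch_schema_hash` are assumptions for illustration, not the library's confirmed implementation. With Pydantic v2, a dict like this would typically come from `model_json_schema()`; plain dicts are used here to keep the example dependency-free.

```python
import hashlib
import json


def sketch_schema_hash(json_schema: dict) -> str:
    """Hash a JSON-schema dict deterministically (illustrative only)."""
    # sort_keys and fixed separators make the serialization canonical,
    # so key order in the input dict does not affect the digest.
    canonical = json.dumps(json_schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1 = {"title": "DoorSpec", "properties": {"width_mm": {"type": "integer"}}}
v1_reordered = {"properties": {"width_mm": {"type": "integer"}}, "title": "DoorSpec"}
v2 = {"title": "DoorSpec", "properties": {"width_mm": {"type": "number"}}}

same = sketch_schema_hash(v1) == sketch_schema_hash(v1_reordered)
changed = sketch_schema_hash(v1) != sketch_schema_hash(v2)
```

Here `same` holds because only key order differs, while `changed` holds because a field type was edited.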

compute_extraction_cache_key ¶

compute_extraction_cache_key(item_text_hash, schema_hash)

Cache key combining item text hash + schema hash.
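A minimal sketch of how two hashes might be combined into one key follows. The separator, the use of SHA-256, and the name `sketch_cache_key` are assumptions; the actual key format used by `compute_extraction_cache_key` is not specified here.

```python
import hashlib


def sketch_cache_key(item_text_hash: str, schema_hash: str) -> str:
    """Combine an item-text hash and a schema hash into one key (illustrative)."""
    # A separator avoids ambiguity between ("ab", "c") and ("a", "bc").
    combined = f"{item_text_hash}:{schema_hash}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


key_a = sketch_cache_key("item123", "schemaA")
key_a_again = sketch_cache_key("item123", "schemaA")
key_b = sketch_cache_key("item123", "schemaB")
```

Because the schema hash is part of the key, the same item text maps to a different key once the schema changes.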

get_schema_name ¶

get_schema_name(schema)

Human-readable name of the schema class.

get_field_descriptions ¶

get_field_descriptions(schema)

Extract field name → description mapping from Pydantic model.

Uses Field(description=...) if provided, otherwise the field type annotation.

compute_completeness ¶

compute_completeness(instance)

Score the fraction of optional fields that were populated.

A field counts as populated when its value is neither None nor the field default. Returns a float between 0.0 and 1.0.
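The scoring rule can be sketched as follows. To stay dependency-free, this example uses a stdlib dataclass as a stand-in for a Pydantic model; `DoorSketch`, its fields other than `fire_rating` and `width_mm`, and `sketch_completeness` are hypothetical, and the handling of an empty schema is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DoorSketch:
    """Dataclass stand-in for a Pydantic schema (hypothetical fields)."""
    fire_rating: Optional[str] = None
    width_mm: Optional[int] = None
    height_mm: Optional[int] = None
    glazed: bool = False


def sketch_completeness(instance) -> float:
    """Fraction of fields whose value is neither None nor the default."""
    all_fields = fields(instance)
    if not all_fields:
        return 1.0  # assumption: an empty schema counts as complete
    populated = 0
    for f in all_fields:
        value = getattr(instance, f.name)
        if value is not None and value != f.default:
            populated += 1
    return populated / len(all_fields)


score = sketch_completeness(DoorSketch(fire_rating="T30", width_mm=1010))
```

Under this rule a value that equals its default, such as `glazed=False`, is indistinguishable from "not extracted", which is one reason optional fields often default to None.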

schema_field_summary ¶

schema_field_summary(schema)

One-line summary of schema fields for logging.

Extraction Cache¶

ExtractionCache ¶

ExtractionCache(backend=None)

Caches structured extraction results keyed by (item_hash + schema_hash).

Wraps any CacheBackend (default: InMemoryCache — no disk I/O).

get ¶

get(cache_key)

Retrieve cached extraction. Returns (data_dict, completeness) or None.
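The return contract, a `(data_dict, completeness)` tuple on a hit and `None` on a miss, can be sketched with a dict-backed stand-in. `SketchExtractionCache` and its `put` method are hypothetical names introduced for illustration; they are not documented methods of `ExtractionCache`.

```python
from typing import Any, Optional


class SketchExtractionCache:
    """Dict-backed stand-in for an extraction cache (hypothetical)."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[dict[str, Any], float]] = {}

    def put(self, cache_key: str, data: dict[str, Any], completeness: float) -> None:
        # Store a copy so later mutation by the caller does not alter the cache.
        self._store[cache_key] = (dict(data), completeness)

    def get(self, cache_key: str) -> Optional[tuple[dict[str, Any], float]]:
        # A miss returns None rather than raising.
        return self._store.get(cache_key)


cache = SketchExtractionCache()
cache.put("key-1", {"fire_rating": "T30"}, 0.5)

hit = cache.get("key-1")
miss = cache.get("unknown")
if hit is not None:
    data, completeness = hit  # unpack only after checking for a miss
```

Callers should check for `None` before unpacking, since unpacking a miss would raise a `TypeError`.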

stats ¶

stats()

Aggregate counts by schema name and hash.
