Extractor

Structured data extraction using LLMs and user-defined Pydantic schemas.

StructuredExtractor

StructuredExtractor(model=None, fallbacks=None, cache=None, concurrency=None)

Extract structured attributes from BoQ (bill of quantities) items into user-defined Pydantic schemas.

Workflow

  1. Base classification tags items (Door, Wall, Pipe, etc.).
  2. The user defines a Pydantic schema for a category.
  3. StructuredExtractor uses an LLM to extract typed attributes from the item text.

Usage

    # Default: in-memory cache (no disk)
    extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")

    # Persistent: SQLite cache (opt-in)
    from pygaeb.cache import SQLiteCache
    extractor = StructuredExtractor(cache=SQLiteCache("~/.pygaeb/cache"))

    # By classification filter
    results = await extractor.extract(doc, schema=DoorSpec, element_type="Door")

    # By explicit item list
    results = await extractor.extract_items(my_items, schema=DoorSpec)

    for item, spec in results:
        print(item.oz, spec.fire_rating, spec.width_mm)
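
The usage above passes a schema class such as DoorSpec. As a rough illustration of what a user-defined schema might look like, here is a minimal sketch. The class name MyDoorSpec and the height_mm field are hypothetical; fire_rating and width_mm are taken from the usage example. The fields of the built-in DoorSpec are not listed on this page and may differ.

```python
from typing import Optional

from pydantic import BaseModel, Field


class MyDoorSpec(BaseModel):
    """Hypothetical user-defined schema; field names are illustrative only."""

    # Every field defaults to None so that items lacking a value still validate.
    fire_rating: Optional[str] = Field(default=None, description="Fire rating, e.g. T30")
    width_mm: Optional[int] = Field(default=None, description="Clear width in millimetres")
    height_mm: Optional[int] = Field(default=None, description="Clear height in millimetres")


# Instances behave like ordinary typed objects.
spec = MyDoorSpec(fire_rating="T30", width_mm=1010)
print(spec.fire_rating, spec.width_mm, spec.height_mm)
```

Giving each field a description is likely worthwhile, since the schema utilities documented further down read Field(description=...) when it is present.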

extract async

extract(doc, schema, trade=None, element_type=None, sub_type=None, on_progress=None, force_reextract=False, attach=True)

Extract structured data from items matching a classification filter.

Works for both procurement and trade documents.

Parameters:

Name             Type                     Description                                               Default
doc              GAEBDocument             Parsed GAEB document (items should be classified first).  required
schema           type[T]                  User-defined Pydantic model to extract into.              required
trade            str | None               Filter by classification trade (Level 1).                 None
element_type     str | None               Filter by classification element_type (Level 2).          None
sub_type         str | None               Filter by classification sub_type (Level 3).              None
on_progress      ProgressCallback | None  Callback(completed, total, current_label).                None
force_reextract  bool                     Bypass cache and re-extract all items.                    False
attach           bool                     If True, store results on item.extractions[schema_name].  True

Returns:

Type                 Description
list[tuple[Any, T]]  List of (item, schema_instance) tuples.
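
The on_progress parameter is documented as Callback(completed, total, current_label). A minimal sketch of such a callback follows, assuming a plain synchronous callable is accepted. The function name print_progress, the calls list, and the item labels are made up for illustration; the loop only simulates the calls an extractor would make.

```python
from typing import List, Tuple

# Keeps a record of every update so the caller can inspect it afterwards.
calls: List[Tuple[int, int, str]] = []


def print_progress(completed: int, total: int, current_label: str) -> None:
    """Record and print one progress update."""
    calls.append((completed, total, current_label))
    percent = 100 * completed // total if total else 100
    print(f"[{percent:3d}%] {completed}/{total} {current_label}")


# Stand-in for the calls the extractor would make; the labels are invented.
for done, label in enumerate(["01.0010", "01.0020"], start=1):
    print_progress(done, 2, label)
```

Under that assumption, the callback would be passed as on_progress=print_progress.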

extract_items async

extract_items(items, schema, trade_context='', element_context='', on_progress=None, force_reextract=False, attach=True)

Extract structured data from an explicit list of items.

Use this when you want full control over which items to extract from, bypassing classification-based filtering.

extract_sync

extract_sync(doc, schema, trade=None, element_type=None, sub_type=None, on_progress=None, force_reextract=False, attach=True)

Synchronous convenience wrapper around extract. It takes the same arguments.
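
Synchronous wrappers around coroutines are commonly built on asyncio.run. The sketch below shows that general pattern with stand-in functions (extract_async and extract_blocking are hypothetical names); it is not confirmed to be how extract_sync is implemented.

```python
import asyncio
from typing import List


async def extract_async(texts: List[str]) -> List[str]:
    """Stand-in for an async extraction call (hypothetical)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [t.upper() for t in texts]


def extract_blocking(texts: List[str]) -> List[str]:
    """Run the coroutine to completion on a fresh event loop."""
    return asyncio.run(extract_async(texts))


result = extract_blocking(["door", "wall"])
print(result)
```

A wrapper built this way cannot be called from code that is already running inside an event loop, because asyncio.run raises RuntimeError there. In that situation, awaiting the async method directly is the usual alternative.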

estimate_cost async

estimate_cost(doc, schema, trade=None, element_type=None, sub_type=None)

Estimate the cost of extracting structured data from matching items.

Built-in Schemas

DoorSpec

Bases: BaseModel

Structured specification for door items.

WindowSpec

Bases: BaseModel

Structured specification for window items.

WallSpec

Bases: BaseModel

Structured specification for wall items.

PipeSpec

Bases: BaseModel

Structured specification for pipe items.

Schema Utilities

schema_utils

Schema utilities: hashing, field introspection, completeness scoring.

compute_schema_hash

compute_schema_hash(schema)

Deterministic hash of a Pydantic model's JSON schema.

The hash changes when the user modifies their schema, which triggers a cache miss.
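
One common way to get a deterministic hash of a JSON schema is to serialise it with sorted keys and hash the result. The sketch below illustrates that idea on plain dicts. The function name schema_hash and the choice of SHA-256 are assumptions; the actual algorithm used by compute_schema_hash is not stated here.

```python
import hashlib
import json
from typing import Any, Dict


def schema_hash(json_schema: Dict[str, Any]) -> str:
    """Hash a JSON schema dict; key order does not affect the result."""
    canonical = json.dumps(json_schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1 = {"title": "DoorSpec", "properties": {"width_mm": {"type": "integer"}}}
v1_reordered = {"properties": {"width_mm": {"type": "integer"}}, "title": "DoorSpec"}
v2 = {"title": "DoorSpec", "properties": {"width_mm": {"type": "number"}}}

same = schema_hash(v1) == schema_hash(v1_reordered)  # reordering keys: same hash
changed = schema_hash(v1) != schema_hash(v2)         # editing a field type: new hash
print(same, changed)
```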

compute_extraction_cache_key

compute_extraction_cache_key(item_text_hash, schema_hash)

Cache key combining item text hash + schema hash.
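
To illustrate why the key combines both hashes, here is a small sketch in which the same item text yields a different key once the schema hash changes. The helper names, the ":" separator, and the use of SHA-256 are assumptions, not the library's documented behaviour.

```python
import hashlib


def text_hash(text: str) -> str:
    """Hash the item text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(item_text_hash: str, schema_hash: str) -> str:
    """Join both hashes with a separator and hash again for a fixed-length key."""
    combined = f"{item_text_hash}:{schema_hash}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


item_a = text_hash("Fire door T30, 1010 x 2135 mm")
key_1 = cache_key(item_a, "schema-v1")
key_2 = cache_key(item_a, "schema-v2")
print(key_1 != key_2)
```

With a key of this shape, editing either the item text or the schema produces a cache miss, while unchanged pairs keep hitting the cache.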

get_schema_name

get_schema_name(schema)

Human-readable name of the schema class.

get_field_descriptions

get_field_descriptions(schema)

Extract a field name → description mapping from a Pydantic model.

Uses Field(description=...) if provided, otherwise the field type annotation.

compute_completeness

compute_completeness(instance)

Score the fraction of optional fields that were populated.

Fields with non-default, non-None values count as populated. Returns a float between 0.0 and 1.0.
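
The scoring rule can be sketched with a standard-library dataclass standing in for a Pydantic instance. DoorLike and completeness are hypothetical names, and returning 1.0 when there are no optional fields is an assumption; the library's handling of that case is not documented here.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class DoorLike:
    """Dataclass stand-in for a Pydantic schema instance (hypothetical)."""

    fire_rating: Optional[str] = None
    width_mm: Optional[int] = None
    height_mm: Optional[int] = None
    glazed: bool = False


def completeness(instance) -> float:
    """Fraction of defaulted fields holding a non-None, non-default value."""
    optional = [f for f in fields(instance) if f.default is not MISSING]
    if not optional:
        return 1.0  # assumption: nothing optional means nothing is missing
    populated = sum(
        1
        for f in optional
        if getattr(instance, f.name) is not None
        and getattr(instance, f.name) != f.default
    )
    return populated / len(optional)


# Two of the four optional fields are filled in.
score = completeness(DoorLike(fire_rating="T30", width_mm=1010))
print(score)
```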

schema_field_summary

schema_field_summary(schema)

One-line summary of schema fields for logging.

Extraction Cache

ExtractionCache

ExtractionCache(backend=None)

Caches structured extraction results keyed by (item_hash + schema_hash).

Wraps any CacheBackend (default: InMemoryCache — no disk I/O).

get

get(cache_key)

Retrieve cached extraction. Returns (data_dict, completeness) or None.
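
Because get returns either a (data_dict, completeness) tuple or None, callers need to check for None before unpacking. The sketch below uses a dict-backed stand-in (DictCache and its put method are hypothetical) that mirrors only the documented return shape.

```python
from typing import Any, Dict, Optional, Tuple

CacheEntry = Tuple[Dict[str, Any], float]


class DictCache:
    """Dict-backed stand-in mirroring the documented get() return shape."""

    def __init__(self) -> None:
        self._store: Dict[str, CacheEntry] = {}

    def put(self, cache_key: str, data: Dict[str, Any], completeness: float) -> None:
        # put() is a hypothetical helper for this sketch only.
        self._store[cache_key] = (data, completeness)

    def get(self, cache_key: str) -> Optional[CacheEntry]:
        return self._store.get(cache_key)


cache = DictCache()
cache.put("abc", {"fire_rating": "T30"}, 0.5)

hit = cache.get("abc")
miss = cache.get("missing")

if hit is not None:
    data, completeness = hit  # safe to unpack only after the None check
    print(data, completeness)
```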

stats

stats()

Aggregate counts by schema name and hash.