Extractor
Structured data extraction using LLMs and user-defined Pydantic schemas.
StructuredExtractor
Extract structured attributes from BoQ items into user-defined Pydantic schemas.
Workflow
- Base classification tags each item with a category (Door, Wall, Pipe, etc.).
- The user defines a Pydantic schema for a category.
- StructuredExtractor uses an LLM to extract typed attributes from the item text.
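The schema-definition step of the workflow might look like the sketch below. The field names `fire_rating` and `width_mm` appear in the usage example on this page; the class name `MyDoorSpec`, the `material` field, the types, and the descriptions are illustrative assumptions and are not the library's built-in `DoorSpec`.

```python
from typing import Optional

from pydantic import BaseModel, Field


class MyDoorSpec(BaseModel):
    """Hypothetical user-defined schema for items classified as doors."""

    # Every field is optional so an item that lacks a value still validates.
    fire_rating: Optional[str] = Field(default=None, description="Fire rating, e.g. T30")
    width_mm: Optional[int] = Field(default=None, description="Clear width in millimetres")
    material: Optional[str] = Field(default=None, description="Leaf material (assumed field)")


# Instances validate like any other Pydantic model.
spec = MyDoorSpec(fire_rating="T30", width_mm=1010)
print(spec.fire_rating, spec.width_mm, spec.material)
```

Giving each field a description is likely worthwhile here, since the schema utilities documented on this page read `Field(description=...)` when it is present.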
Usage
Default: in-memory cache (no disk)
extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")
Persistent: SQLite cache (opt-in)
from pygaeb.cache import SQLiteCache
extractor = StructuredExtractor(cache=SQLiteCache("~/.pygaeb/cache"))
By classification filter
results = await extractor.extract(doc, schema=DoorSpec, element_type="Door")
By explicit item list
results = await extractor.extract_items(my_items, schema=DoorSpec)
for item, spec in results:
    print(item.oz, spec.fire_rating, spec.width_mm)
extract (async)
extract(doc, schema, trade=None, element_type=None, sub_type=None, on_progress=None, force_reextract=False, attach=True)
Extract structured data from items matching a classification filter.
Works for both procurement and trade documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc | GAEBDocument | Parsed GAEB document (items should be classified first). | required |
| schema | type[T] | User-defined Pydantic model to extract into. | required |
| trade | str \| None | Filter by classification trade (Level 1). | None |
| element_type | str \| None | Filter by classification element_type (Level 2). | None |
| sub_type | str \| None | Filter by classification sub_type (Level 3). | None |
| on_progress | ProgressCallback \| None | Callback(completed, total, current_label). | None |
| force_reextract | bool | Bypass cache and re-extract all items. | False |
| attach | bool | If True, store results on item.extractions[schema_name]. | True |
Returns:
| Type | Description |
|---|---|
| list[tuple[Any, T]] | List of (item, schema_instance) tuples. |
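The `on_progress` parameter is described above as a callback taking `(completed, total, current_label)`. A minimal sketch of such a callback follows. The `ProgressCallback` alias is an assumed shape inferred from that description, and the loop that drives the callback is a stand-in so the example runs without the library; in real use the function would be passed as `on_progress=...`.

```python
from typing import Callable, List, Tuple

# Assumed shape of ProgressCallback, inferred from the parameter description.
ProgressCallback = Callable[[int, int, str], None]


def make_recorder(log: List[Tuple[int, int, str]]) -> ProgressCallback:
    """Return a callback that prints each update and also records it in `log`."""

    def on_progress(completed: int, total: int, current_label: str) -> None:
        log.append((completed, total, current_label))
        print(f"[{completed}/{total}] {current_label}")

    return on_progress


# Stand-in driver loop with hypothetical item labels.
updates: List[Tuple[int, int, str]] = []
callback = make_recorder(updates)
labels = ["01.0010", "01.0020", "01.0030"]
for done, label in enumerate(labels, start=1):
    callback(done, len(labels), label)
```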
extract_items (async)
extract_items(items, schema, trade_context='', element_context='', on_progress=None, force_reextract=False, attach=True)
Extract structured data from an explicit list of items.
Use this when you want full control over which items to extract from, bypassing classification-based filtering.
extract_sync
extract_sync(doc, schema, trade=None, element_type=None, sub_type=None, on_progress=None, force_reextract=False, attach=True)
Synchronous convenience wrapper.
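For readers unfamiliar with the pattern, one common way to wrap a coroutine in a blocking call is `asyncio.run`. The sketch below illustrates that general technique with hypothetical stand-in functions (`extract_async`, `extract_blocking`); it does not claim to show how `extract_sync` is implemented.

```python
import asyncio
from typing import List


async def extract_async(texts: List[str]) -> List[str]:
    """Stand-in for an async extraction call (hypothetical)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [t.upper() for t in texts]


def extract_blocking(texts: List[str]) -> List[str]:
    """Run the coroutine to completion on a fresh event loop."""
    # asyncio.run() raises RuntimeError when a loop is already running in
    # this thread, so a wrapper like this suits scripts, not async code.
    return asyncio.run(extract_async(texts))


result = extract_blocking(["door", "wall"])
print(result)
```

A wrapper of this kind is convenient in scripts; inside an already running event loop (for example some notebook environments), awaiting the async method directly is usually the safer choice.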
estimate_cost (async)
Estimate cost of extracting structured data from matching items.
Built-in Schemas
DoorSpec
Bases: BaseModel
Structured specification for door items.
WindowSpec
Bases: BaseModel
Structured specification for window items.
WallSpec
Bases: BaseModel
Structured specification for wall items.
PipeSpec
Bases: BaseModel
Structured specification for pipe items.
Schema Utilities
schema_utils
Schema utilities: hashing, field introspection, completeness scoring.
compute_schema_hash
Deterministic hash of a Pydantic model's JSON schema.
The hash changes whenever the user modifies their schema, which triggers a cache miss.
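A deterministic schema hash can be sketched as follows. This assumes the JSON schema is available as a plain dict and uses canonical JSON plus SHA-256; the helper name `schema_hash`, the serialization settings, and the hash algorithm are assumptions, not necessarily what `compute_schema_hash` does.

```python
import hashlib
import json
from typing import Any, Dict


def schema_hash(json_schema: Dict[str, Any]) -> str:
    """Hash a JSON-schema dict so that key order does not affect the result."""
    # sort_keys gives a canonical ordering; compact separators avoid
    # whitespace differences between serializations.
    canonical = json.dumps(json_schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Stand-ins for what a Pydantic model's JSON schema might contain.
v1 = {"title": "DoorSpec", "properties": {"width_mm": {"type": "integer"}}}
v1_reordered = {"properties": {"width_mm": {"type": "integer"}}, "title": "DoorSpec"}
v2 = {"title": "DoorSpec", "properties": {"width_mm": {"type": "number"}}}

same = schema_hash(v1) == schema_hash(v1_reordered)  # key order is irrelevant
changed = schema_hash(v1) != schema_hash(v2)  # a type change alters the hash
print(same, changed)
```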
compute_extraction_cache_key
Cache key combining item text hash + schema hash.
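The idea of combining the two hashes can be illustrated with a short sketch. The helper names, the choice of SHA-256, and the `:` separator are assumptions made for illustration; the library's actual key format may differ.

```python
import hashlib


def text_hash(text: str) -> str:
    """SHA-256 of the item text (assumed choice of hash)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(item_text: str, schema_hash: str) -> str:
    """Join both hashes so that a change to either one yields a new key."""
    return f"{text_hash(item_text)}:{schema_hash}"


# Hypothetical item texts and schema hashes.
key_a = cache_key("Steel door T30, 1010 x 2135 mm", "schema-v1")
key_b = cache_key("Steel door T30, 1010 x 2135 mm", "schema-v2")
key_c = cache_key("Steel door T90, 1010 x 2135 mm", "schema-v1")
print(key_a != key_b, key_a != key_c)
```

With a key built this way, editing either the item text or the schema leads to a cache miss, while repeated runs on unchanged input reuse the cached result.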
get_field_descriptions
Extract a field name → description mapping from a Pydantic model.
Uses Field(description=...) when provided; otherwise falls back to the field's type annotation.
compute_completeness
Score the share of optional fields that were populated.
A field counts as populated when its value is neither None nor the field's default. Returns a float between 0.0 and 1.0.
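The scoring rule can be sketched with plain dicts standing in for a model's values and defaults. The function name `completeness`, the dict-based inputs, and the return value of 1.0 for a schema with no optional fields are assumptions of this sketch, not documented behaviour of `compute_completeness`.

```python
from typing import Any, Dict


def completeness(values: Dict[str, Any], defaults: Dict[str, Any]) -> float:
    """Fraction of optional fields holding a non-None, non-default value."""
    if not defaults:
        return 1.0  # assumption: nothing optional means nothing is missing
    populated = sum(
        1
        for name, default in defaults.items()
        if values.get(name) is not None and values.get(name) != default
    )
    return populated / len(defaults)


# Hypothetical optional fields with their defaults, and one extracted result.
defaults = {"fire_rating": None, "width_mm": None, "material": "unknown", "glazed": None}
values = {"fire_rating": "T30", "width_mm": 1010, "material": "unknown", "glazed": None}

# fire_rating and width_mm count; material equals its default; glazed is None.
score = completeness(values, defaults)
print(score)
```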