Document Diff¶

Compare two GAEB procurement documents to identify item-level, structural, and metadata changes with significance-classified results.

Quick Start¶

from pygaeb import GAEBParser, BoQDiff

doc_a = GAEBParser.parse("tender_v1.X83")
doc_b = GAEBParser.parse("tender_v2.X83")

result = BoQDiff.compare(doc_a, doc_b)

print(result.summary.total_changes)     # 12
print(result.summary.financial_impact)  # Decimal("45230.00")
print(result.summary.max_significance)  # Significance.CRITICAL

How Matching Works¶

Items are matched by OZ (ordinal number) within each lot. This is the standard identity key in GAEB files.

For each lot in both documents, items are indexed by (lot_rno, oz).
Items sharing the same key are considered the same position.
Unmatched items fall back to a global OZ match — this catches items that moved between lots.
Items still unmatched after both passes are reported as added or removed.

The match_ratio (0.0–1.0) in the summary tells you how many items were successfully paired.

Significance Levels¶

Every field change is classified by its impact in a construction context:

Level	Fields	Meaning
CRITICAL	`unit_price`, `total_price`, `qty`	Directly affects contract value
HIGH	`unit`, `item_type`, `discount_pct`	Changes the nature of the position
MEDIUM	`short_text`, `vat`, `markup_type`	Descriptive or tax-related
LOW	`bim_guid`, `change_order_number`	Reference/metadata only

Item-Level Changes¶

# Items added in doc_b
for item in result.items.added:
    print(f"+ {item.oz}: {item.short_text} ({item.total_price})")

# Items removed from doc_a
for item in result.items.removed:
    print(f"- {item.oz}: {item.short_text} ({item.total_price})")

# Items modified (present in both, with field changes)
for mod in result.items.modified:
    print(f"~ {mod.oz}: {mod.max_significance.value}")
    for change in mod.changes:
        print(f"    {change.field}: {change.old_value} → {change.new_value}")
        if change.is_numeric:
            print(f"    delta: {change.absolute_delta} ({change.percent_delta:.1f}%)")

Filtering by Significance¶

Focus on what matters most:

from pygaeb import Significance

# Only items with at least one CRITICAL change
critical = result.items.filter_modified(Significance.CRITICAL)

# Per-item filtering
for mod in result.items.modified:
    high_plus = mod.filter_changes(Significance.HIGH)
    if high_plus:
        print(f"{mod.oz}: {len(high_plus)} important changes")

Structural Changes¶

Detects changes to the category/section hierarchy:

# New sections
for sec in result.structure.sections_added:
    print(f"New: {sec.label} ({sec.item_count} items)")

# Removed sections
for sec in result.structure.sections_removed:
    print(f"Removed: {sec.label} ({sec.item_count} items)")

# Renamed sections
for sec in result.structure.sections_renamed:
    print(f"Renamed: {sec.old_label} → {sec.new_label}")

# Items that moved to a different category
for moved in result.structure.items_moved:
    print(f"Moved: {moved.oz} from {moved.old_category_rno} to {moved.new_category_rno}")

Metadata Changes¶

Document-level changes (currency, project info, exchange phase):

for meta in result.metadata:
    print(f"{meta.field}: {meta.old_value} → {meta.new_value}")

Diff Modes¶

Control how strictly the engine validates document compatibility:

from pygaeb import DiffMode

# DEFAULT: adds warnings for mismatched projects (recommended)
result = BoQDiff.compare(doc_a, doc_b)

# STRICT: raises ValueError if documents are from different projects
result = BoQDiff.compare(doc_a, doc_b, mode=DiffMode.STRICT)

# FORCE: suppresses all compatibility warnings
result = BoQDiff.compare(doc_a, doc_b, mode=DiffMode.FORCE)

Warnings¶

The engine generates human-readable warnings for potentially misleading comparisons:

for warning in result.warnings:
    print(f"⚠ {warning}")

Possible warnings:

Documents may be from different projects
Currency changed (financial comparisons may be misleading)
Low match ratio (documents may be unrelated)
GAEB versions differ

Summary Properties¶

s = result.summary

s.has_changes          # bool — any changes at all?
s.total_changes        # int — added + removed + modified
s.items_added          # int
s.items_removed        # int
s.items_modified       # int
s.items_unchanged      # int
s.match_ratio          # float (0.0–1.0)
s.is_likely_same_project  # bool
s.financial_impact     # Decimal or None
s.max_significance     # Significance enum

Serialization¶

DiffResult is a Pydantic model — serialize to JSON for storage or API responses:

json_str = result.model_dump_json(indent=2)
restored = DiffResult.model_validate_json(json_str)

Limitations¶

Only procurement documents (X80–X89) are supported. Trade, cost, and quantity documents raise TypeError.
Matching is OZ-based. If items are renumbered between revisions without any shared OZ, they will appear as added/removed rather than modified.
The diff engine is deterministic — no LLM is used. Semantic matching of renumbered items is planned as an optional enrichment layer.