Parsing¶
pyGAEB auto-detects the version, format, and encoding of any GAEB file, producing a unified GAEBDocument regardless of the input version.
Basic Parsing¶
A single GAEBParser.parse() call handles:
- Format detection: determines if the file is DA XML (2.x or 3.x) or GAEB 90
- Version detection: reads the XML namespace and the <Version> element
- Encoding repair: fixes BOM, Windows-1252 masquerading as UTF-8, mojibake
- XML recovery: handles bare ampersands, unclosed tags, and other real-world damage
- Structural parsing: builds the full BoQ hierarchy
- Validation: runs numeric, item, and phase validation
Parse from Memory¶
For web uploads, S3 objects, or database blobs:
doc = GAEBParser.parse_bytes(raw_bytes, filename="tender.X83")
doc = GAEBParser.parse_string(xml_text, filename="input.X84")
The filename parameter provides the extension hint used for phase detection (.X83 = tender, .X84 = bid, etc.).
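As a rough illustration of how an extension hint could drive phase detection, the sketch below pulls the phase token out of a filename. The helper name `phase_hint_from_filename` and the regular expression are assumptions made for this example; they are not part of pyGAEB's API.

```python
import re
from pathlib import PurePath
from typing import Optional

# Hypothetical pattern: a D or X prefix followed by two digits, e.g. ".X83" or ".d84".
_PHASE_SUFFIX = re.compile(r"^\.([DX])(\d{2})$", re.IGNORECASE)


def phase_hint_from_filename(filename: str) -> Optional[str]:
    """Return an upper-case phase token such as "X83", or None if there is no hint."""
    match = _PHASE_SUFFIX.match(PurePath(filename).suffix)
    if match is None:
        return None
    return f"{match.group(1).upper()}{match.group(2)}"
```

A name without a recognisable suffix (for example `upload.xml`) yields `None`, which is why passing a meaningful filename alongside in-memory data matters.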
Version Detection¶
pyGAEB detects the version from multiple signals in priority order:
- XML namespace: most reliable (e.g., http://www.gaeb.de/GAEB_DA_XML/DA86/3.3)
- <Version> element: explicit version tag in <GAEBInfo>
- File extension: fallback for phase detection (.X83, .D83, etc.)
doc = GAEBParser.parse("tender.X83")
print(doc.source_version) # SourceVersion.DA_XML_33
print(doc.exchange_phase) # ExchangePhase.X83
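The first two signals in the priority order can be approximated with the standard library alone. This is a minimal sketch, not pyGAEB's detection code: `detect_version` is a hypothetical helper, and it assumes the namespace URI ends in the version number, as in the example URI above.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def detect_version(xml_text: str) -> Optional[str]:
    """Return a version string such as "3.3", or None if neither signal is present."""
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified tags as "{namespace-uri}LocalName".
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""

    # Signal 1: the trailing path segment of the namespace URI.
    tail = namespace.rstrip("/").rsplit("/", 1)[-1]
    if tail and tail[0].isdigit():
        return tail

    # Signal 2: the first <Version> element found anywhere in the document.
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == "Version" and element.text:
            return element.text.strip()
    return None
```

When both signals are present the namespace wins, mirroring the stated priority.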
Parser Tracks¶
| Track | Versions | Element Language |
|---|---|---|
| Track A | DA XML 2.0, 2.1 | German (Leistungsverzeichnis, Position, Menge...) |
| Track B | DA XML 3.0–3.3 | English (BoQ, Item, Qty...) |
| Track C | GAEB 90 | Fixed-width records (planned v1.1) |
All tracks produce the same GAEBDocument model.
Encoding Repair¶
Real-world GAEB files frequently have encoding issues. pyGAEB handles them transparently:
- BOM stripping: removes UTF-8/UTF-16 byte order marks
- Charset detection: uses charset-normalizer to detect Windows-1252, Latin-1, etc.
- Mojibake repair: uses ftfy to fix double-encoded characters (common with German umlauts)
The encoding repair runs before XML parsing, so even files with corrupted encoding declarations are handled.
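To make the three repair steps concrete, here is a standard-library approximation. It is only a sketch: the text above says pyGAEB relies on charset-normalizer and ftfy, which handle far more cases, and the function names `decode_gaeb_bytes` and `undo_double_encoding` are invented for this example.

```python
import codecs

_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_gaeb_bytes(raw: bytes) -> str:
    """Strip a leading BOM if present, then decode as UTF-8 with a Windows-1252 fallback."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A byte such as 0xFC ("ü" in Windows-1252) is never valid UTF-8.
        return raw.decode("cp1252", errors="replace")


def undo_double_encoding(text: str) -> str:
    """Reverse UTF-8 text that was mis-decoded as Windows-1252, e.g. "MÃ¼ller"."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this particular kind; leave unchanged
```

Correctly decoded text such as "Müller" fails the round trip and is returned unchanged, so the second helper is safe to apply speculatively.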
XML Recovery¶
If standard lxml parsing fails, pyGAEB falls back to a recovery pipeline:
- BeautifulSoup tolerant parser: handles bare &, unclosed tags, and malformed attributes
- Warnings are added to doc.validation_results, so nothing is silently lost
doc = GAEBParser.parse("damaged.X83")
for issue in doc.validation_results:
    if "recovery" in issue.message.lower():
        print(f"Recovered: {issue.message}")
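The strict-then-tolerant pattern can be sketched without lxml or BeautifulSoup. The example below uses the standard library's ElementTree as a stand-in and repairs only bare ampersands; `parse_with_recovery` is a hypothetical name, and unclosed tags or malformed attributes are outside what this sketch handles.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not start a named, decimal, or hexadecimal character reference.
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def parse_with_recovery(xml_text: str):
    """Try a strict parse first; on failure, escape bare ampersands and retry.

    Returns (root_element, notes), where notes records whether recovery ran.
    """
    notes = []
    try:
        return ET.fromstring(xml_text), notes
    except ET.ParseError as exc:
        notes.append(f"Recovery: strict parse failed ({exc})")
    repaired = _BARE_AMP.sub("&amp;", xml_text)
    return ET.fromstring(repaired), notes
```

Recording a note whenever the fallback runs follows the same idea as the warnings in doc.validation_results: the caller can always tell that the input was repaired.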
Exchange Phases¶
GAEB defines workflows through exchange phases. Procurement phases cover the tendering process, trade phases cover material ordering between contractors and suppliers, and cost phases cover construction cost estimation and calculation.
Procurement Phases (X80–X89)¶
| Phase | Purpose | Typical Extension |
|---|---|---|
| X80 | Request for proposal | .X80 |
| X81 | Specification | .X81 |
| X82 | Cost estimate | .X82 |
| X83 | Tender / Bill of quantities | .X83 |
| X84 | Bid / Offer | .X84 |
| X85 | Alternative bid | .X85 |
| X86 | Award | .X86 |
| X89 | Invoice | .X89 |
Trade Phases (X93–X97)¶
| Phase | Purpose | Typical Extension |
|---|---|---|
| X93 | Trade Price Inquiry | .X93 |
| X94 | Trade Price Offer | .X94 |
| X96 | Trade Order | .X96 |
| X97 | Trade Order Confirmation | .X97 |
Trade phases use a different XML structure (<Order>/<OrderItem> instead of <Award>/<BoQ>/<Item>), but pyGAEB handles this transparently. See the Trade Phases Guide for details.
Cost & Calculation Phases (X50–X52)¶
| Phase | Purpose | Typical Extension |
|---|---|---|
| X50 | Construction Cost Catalog | .X50 |
| X51 | Cost Determination | .X51 |
| X52 | Calculation Approaches | .X52 |
X50 and X51 use a different XML structure (<ElementalCosting>/<ECBody>/<CostElement> instead of <Award>/<BoQ>/<Item>), introducing DocumentKind.COST. X52 extends the standard procurement structure with per-item calculation data. See the Cost & Calculation Phases Guide for details.
Quantity Determination Phase (X31)¶
| Phase | Purpose | Typical Extension |
|---|---|---|
| X31 | Quantity Take-Off / Measurements | .X31 |
X31 uses a different XML structure (<QtyDeterm>/<BoQ>/<QtyItem> with REB 23.003 measurement rows instead of procurement items), introducing DocumentKind.QUANTITY. Items carry no descriptions or prices — only OZ numbers and measurement data. See the Quantity Determination Guide for details.
DA XML 2.x uses D-prefixed phases (D83, D84, etc.). The parsed phase keeps its D prefix, and normalized() maps it to the X-prefixed canonical form:
doc = GAEBParser.parse("old.D83")
print(doc.exchange_phase) # ExchangePhase.D83
print(doc.exchange_phase.normalized()) # ExchangePhase.X83
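The normalization itself amounts to swapping the prefix. The stand-in enum below shows one way this could work; `Phase` is a deliberately small illustration with four members and is not pyGAEB's ExchangePhase.

```python
from enum import Enum


class Phase(Enum):
    """Stand-in enum with a few phases; not pyGAEB's ExchangePhase."""

    D83 = "D83"
    D84 = "D84"
    X83 = "X83"
    X84 = "X84"

    def normalized(self) -> "Phase":
        """Return the X-prefixed counterpart of a D-prefixed phase."""
        if self.value.startswith("D"):
            return Phase("X" + self.value[1:])
        return self
```

Code that branches on the workflow step can then compare against X-prefixed members only, while the original member remains available for reporting.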
Retaining Raw XML¶
By default the lxml tree is discarded after parsing to conserve memory. Pass keep_xml=True to retain it, which enables custom tag access and XPath queries:
doc = GAEBParser.parse("tender.X83", keep_xml=True)
# XPath against the full tree (namespace prefix "g:" is auto-mapped)
codes = doc.xpath("//g:VendorCostCode/text()")
# Access the raw lxml element on any item
for item in doc.iter_items():
    el = item.source_element  # lxml _Element or None
See the Custom & Vendor Tags Guide for full details.
Unified Document Model¶
Regardless of input version, you always get the same GAEBDocument. The document discriminates between procurement, trade, cost, and quantity workflows:
doc.source_version # SourceVersion enum
doc.exchange_phase # ExchangePhase enum
doc.document_kind # DocumentKind.PROCUREMENT, TRADE, COST, or QUANTITY
doc.gaeb_info # GAEBInfo (software metadata)
doc.validation_results # list[ValidationResult]
doc.grand_total # Decimal (sum of affecting items)
doc.item_count # int
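The phase ranges given in this guide suggest a simple mapping from phase to document kind. The sketch below is an assumption-laden illustration: `kind_for_phase` is a hypothetical helper, the kinds are plain strings rather than DocumentKind members, and treating X52 as procurement is inferred from the statement that X52 extends the standard procurement structure.

```python
def kind_for_phase(phase: str) -> str:
    """Classify a phase token such as "X83" using the ranges given in this guide."""
    number = int(phase[1:])  # drop the D or X prefix
    if number == 31:
        return "QUANTITY"
    if number in (50, 51):
        return "COST"
    if 93 <= number <= 97:
        return "TRADE"
    if number == 52 or 80 <= number <= 89:  # X52 as procurement is an assumption
        return "PROCUREMENT"
    raise ValueError(f"unknown exchange phase: {phase!r}")
```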
Procurement documents (X80–X89)¶
Project metadata from <PrjInfo> is merged into AwardInfo:
doc.award.project_name # str — project name
doc.award.prj_id # str — project GUID
doc.award.lbl_prj # str — project label
doc.award.description # str — project description
doc.award.currency_label # str — e.g., "Euro"
doc.award.up_frac_dig # int (2 or 3) — unit price decimal places
doc.award.bid_comm_perm # bool — bidder comments permitted
doc.award.alter_bid_perm # bool — alternative bids permitted
Financial summaries are parsed from <Totals> elements on BoQInfo, categories, and lots. These carry the authoritative net/gross totals, VAT rates, VAT breakdowns, and discount data — present in X84 (bid), X86 (award), and X89 (invoice) files:
totals = doc.award.boq.boq_info.totals
if totals:
    totals.total          # Decimal: sum before discounts
    totals.total_net      # Decimal: net after discounts
    totals.total_gross    # Decimal: gross (net + VAT)
    totals.vat            # Decimal: VAT rate %
    totals.vat_amount     # Decimal: total VAT amount
    totals.discount_pcnt  # Decimal: discount percentage
    totals.vat_parts      # list[VATPart]: per-rate breakdown
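The relationship between these fields can be checked with a little Decimal arithmetic. The sketch below assumes the discount is a percentage of the pre-discount total, that VAT is applied to the net amount, and that values are rounded half-up to two places; all three are assumptions for illustration, and the values stored in the file remain the authoritative ones. `derive_totals` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def derive_totals(total: Decimal, discount_pcnt: Decimal, vat: Decimal) -> dict:
    """Derive net, VAT amount, and gross from a pre-discount total."""
    total_net = (total * (Decimal(100) - discount_pcnt) / Decimal(100)).quantize(
        CENT, rounding=ROUND_HALF_UP
    )
    vat_amount = (total_net * vat / Decimal(100)).quantize(CENT, rounding=ROUND_HALF_UP)
    return {
        "total_net": total_net,
        "vat_amount": vat_amount,
        "total_gross": total_net + vat_amount,
    }
```

For a total of 1000.00 with a 2 % discount and 19 % VAT, this gives a net of 980.00, a VAT amount of 186.20, and a gross of 1166.20. Comparing such derived figures against the parsed totals is one way to spot inconsistent files.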
Item-level VAT: each item can carry its own VAT rate.
Trade documents (X93–X97)¶
Cost documents (X50, X51)¶
doc.elemental_costing # ElementalCosting (cost hierarchy + BIM properties)
doc.elemental_costing.body.iter_cost_elements() # recursive element iteration
Quantity determination documents (X31)¶
doc.qty_determination # QtyDetermination (measurement data + catalogs)
doc.qty_determination.boq.ref_boq_name # referenced procurement BoQ
Universal iteration¶
Works for all document kinds:
See the Models Reference for full details on every field.
Advanced Parsing Options¶
File size limit¶
By default, pyGAEB rejects files larger than 100 MB to prevent memory exhaustion. You can adjust or disable this:
# Override per call (in bytes)
doc = GAEBParser.parse("huge.X83", max_file_size=500 * 1024 * 1024) # 500 MB
# Disable the check
doc = GAEBParser.parse("huge.X83", max_file_size=0)
# Change the global default
from pygaeb import configure
configure(max_file_size_mb=200)
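A guard of this kind is straightforward to express. The following is a minimal sketch of the behaviour described above, with a hypothetical `check_file_size` helper and a ValueError chosen for illustration; the exception type pyGAEB actually raises is not stated here.

```python
import os

DEFAULT_MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB, matching the default stated above


def check_file_size(path: str, max_file_size: int = DEFAULT_MAX_FILE_SIZE) -> None:
    """Raise ValueError if the file exceeds the limit; a limit of 0 disables the check."""
    if max_file_size == 0:
        return
    size = os.path.getsize(path)
    if size > max_file_size:
        raise ValueError(f"{path}: {size} bytes exceeds limit of {max_file_size} bytes")
```

Checking the size before reading means an oversized upload is rejected without loading it into memory.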
Per-call validators¶
Pass extra validation rules to a single parse call without registering them globally:
from pygaeb.models.item import ValidationResult
from pygaeb.models.enums import ValidationSeverity
def require_unit(doc):
    # Return one warning per item that has no unit.
    issues = []
    for item in doc.iter_items():
        if not item.unit:
            issues.append(
                ValidationResult(
                    severity=ValidationSeverity.WARNING,
                    message=f"{item.oz}: missing unit",
                )
            )
    return issues

doc = GAEBParser.parse("tender.X83", extra_validators=[require_unit])
See the Extensibility Guide for global validator registration.
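Because a validator is just a callable that takes a document and returns a list of issues, it can be exercised against simple stand-ins without parsing a file. The sketch below assumes that the returned issues are appended to the document's validation results; `Issue`, `Doc`, and `run_extra_validators` are hypothetical stand-ins, not pyGAEB classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Issue:
    severity: str
    message: str


@dataclass
class Doc:
    units: Dict[str, str]  # maps an item's OZ number to its unit string
    validation_results: List[Issue] = field(default_factory=list)


def run_extra_validators(doc: Doc, validators: Iterable[Callable[[Doc], List[Issue]]]) -> None:
    """Call each validator and append whatever issues it returns."""
    for validator in validators:
        doc.validation_results.extend(validator(doc))


def require_unit(doc: Doc) -> List[Issue]:
    """Flag every item whose unit is empty."""
    return [Issue("WARNING", f"{oz}: missing unit") for oz, unit in doc.units.items() if not unit]
```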
Post-parse hook¶
Use post_parse_hook to inspect or mutate each item right after parsing:
def extract_vendor_codes(item, el):
    # el is the item's source XML element, or None if it is unavailable.
    if el is None:
        return
    ns = {"g": "http://www.gaeb.de/GAEB_DA_XML/DA86/3.3"}
    codes = el.findall(".//g:VendorCostCode", ns)
    if codes:
        item.raw_data = item.raw_data or {}
        item.raw_data["vendor_codes"] = [c.text for c in codes]

doc = GAEBParser.parse("file.X83", post_parse_hook=extract_vendor_codes)
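The hook above hard-codes the 3.3 namespace, so it would find nothing in a file that uses another namespace. One possible refinement is to derive the namespace from the element itself. The sketch below uses ElementTree for self-containment (lxml elements expose tags in the same "{uri}local" form); both helper names are hypothetical, and it assumes the vendor tag shares the item's namespace.

```python
import xml.etree.ElementTree as ET


def namespace_of(element) -> str:
    """Return the namespace URI from a "{uri}local" tag, or "" if the tag is unqualified."""
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


def find_vendor_codes(element) -> list:
    """Collect VendorCostCode texts using whatever namespace the element itself carries."""
    uri = namespace_of(element)
    path = ".//g:VendorCostCode" if uri else ".//VendorCostCode"
    namespaces = {"g": uri} if uri else None
    return [code.text for code in element.findall(path, namespaces)]
```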
Collecting unknown XML elements¶
Set collect_raw_data=True to automatically populate item.raw_data with child elements not consumed by the built-in parser:
doc = GAEBParser.parse("file.X83", collect_raw_data=True)
for item in doc.iter_items():
    if item.raw_data:
        print(f"{item.oz}: {item.raw_data}")
See the Extensibility Guide for more extension points.
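Conceptually, collecting unconsumed elements means keeping whatever the parser does not recognise. The sketch below illustrates the idea with ElementTree; the `KNOWN_TAGS` set, the `collect_unknown_children` helper, and the flat name-to-text dictionary are all assumptions for this example, and the actual shape of item.raw_data may differ.

```python
import xml.etree.ElementTree as ET

KNOWN_TAGS = {"Qty", "QU", "UP"}  # hypothetical set of tags a parser already consumes


def collect_unknown_children(element) -> dict:
    """Map each unrecognised direct child's local name to its text."""
    raw = {}
    for child in element:
        local_name = child.tag.rsplit("}", 1)[-1]  # strip any "{namespace}" prefix
        if local_name not in KNOWN_TAGS:
            raw[local_name] = (child.text or "").strip()
    return raw
```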