Custom & Vendor Tags
GAEB files in the wild often contain vendor-specific tags, software extensions, or custom elements that go beyond the official DA XML schema. pyGAEB lets you access any XML element — including unknown ones — via raw element retention and XPath queries.
Enabling Raw XML Access
By default, pyGAEB discards the raw XML tree after parsing to save memory. To retain it, pass keep_xml=True to the parser, for example GAEBParser.parse("tender.X83", keep_xml=True).
This works with all parse methods:
doc = GAEBParser.parse_bytes(raw, filename="tender.X83", keep_xml=True)
doc = GAEBParser.parse_string(xml_text, keep_xml=True)
Accessing Custom Tags on Items
When keep_xml=True, every Item, OrderItem, BoQCtgy, AwardInfo, TradeOrder, and GAEBInfo object has a source_element attribute containing the original lxml _Element:
for item in doc.iter_items():
    el = item.source_element
    # Access a vendor-specific tag (namespaced)
    ns = "{http://www.gaeb.de/GAEB_DA_XML/DA86/3.3}"
    cost_code = el.find(f"{ns}VendorCostCode")
    if cost_code is not None:
        print(f"{item.short_text}: {cost_code.text}")
This works for both procurement (Item) and trade (OrderItem) documents.
Namespace prefix
GAEB DA XML elements are namespaced. When using el.find(), you must include the full namespace URI in braces. Rather than hard-coding the URI, build the prefix from doc.raw_namespace: ns = f"{{{doc.raw_namespace}}}".
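The brace-wrapped form is the Clark notation that ElementTree-style APIs use for qualified names. The following is a minimal sketch of how such a prefix is built and used. It relies on the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without pyGAEB; the sample document, the NS_URI constant (playing the role of doc.raw_namespace), and the clark_prefix helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for doc.raw_namespace (assumption: it holds the bare URI, no braces).
NS_URI = "http://www.gaeb.de/GAEB_DA_XML/DA86/3.3"

# Hypothetical sample document with one vendor-specific tag.
SAMPLE = f"""\
<GAEB xmlns="{NS_URI}">
  <Item RNoPart="0010">
    <ShortText>Mauerwerk KS 240mm</ShortText>
    <VendorCostCode>RC-001</VendorCostCode>
  </Item>
</GAEB>
"""


def clark_prefix(namespace_uri):
    """Return '{uri}' for a namespace URI, or '' when there is no namespace."""
    return f"{{{namespace_uri}}}" if namespace_uri else ""


ns = clark_prefix(NS_URI)
root = ET.fromstring(SAMPLE)

# Both the parent and the child tag need the namespace prefix.
item = root.find(f"{ns}Item")
cost_code = item.find(f"{ns}VendorCostCode")
print(cost_code.text)  # RC-001
```

Returning an empty prefix for a missing namespace keeps the same find() calls usable on files that declare no namespace.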
XPath Queries
For querying across the entire document tree, use doc.xpath(). The document namespace is available as the g: prefix:
# Find all VendorCostCode values
codes = doc.xpath("//g:VendorCostCode/text()")
print(codes) # ["RC-001", "RC-002", ...]
# Find items with a specific attribute
items = doc.xpath("//g:Item[@RNoPart='001']")
# Get totals from BoQ sections
totals = doc.xpath("//g:Totals/g:TotalGross/text()")
Namespace handling
If the file has no namespace (rare), XPath expressions work without the g: prefix, for example doc.xpath("//VendorCostCode/text()").
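A prefix in an XPath expression is only an alias that is mapped to a namespace URI when the query runs. The sketch below illustrates that mapping with the standard library's xml.etree.ElementTree, which supports a subset of XPath. It is an illustration of the concept, not a description of how doc.xpath() is implemented; the find_texts helper and both sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://www.gaeb.de/GAEB_DA_XML/DA86/3.3"

# Two hypothetical inputs: one namespaced, one without any namespace.
NAMESPACED = (
    f'<GAEB xmlns="{NS_URI}">'
    "<Item><VendorCostCode>RC-001</VendorCostCode></Item>"
    "</GAEB>"
)
PLAIN = "<GAEB><Item><VendorCostCode>RC-002</VendorCostCode></Item></GAEB>"


def find_texts(xml_text, local_name):
    """Collect the text of every element called local_name, namespaced or not."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        # Tag looks like '{uri}GAEB': recover the URI and bind it to the alias 'g'.
        uri = root.tag[1:].split("}")[0]
        matches = root.findall(f".//g:{local_name}", {"g": uri})
    else:
        # No namespace declared, so the bare element name matches.
        matches = root.findall(f".//{local_name}")
    return [m.text for m in matches]


print(find_texts(NAMESPACED, "VendorCostCode"))  # ['RC-001']
print(find_texts(PLAIN, "VendorCostCode"))       # ['RC-002']
```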
DocumentAPI Helpers
The DocumentAPI class provides convenience methods for common custom tag operations:
from pygaeb import DocumentAPI
api = DocumentAPI(doc)
# XPath on the whole document
results = api.xpath("//g:VendorCostCode/text()")
# Get a custom tag's text from a specific item
for item in api.iter_items():
    ns = f"{{{doc.raw_namespace}}}"
    val = api.custom_tag(item, f"{ns}VendorCostCode")
    if val:
        print(f"{item.oz}: {val}")
custom_tag() returns None if the tag doesn't exist or if keep_xml was not enabled — it never raises.
Accessing Other Model Elements
Raw elements are available on all major model types:
doc = GAEBParser.parse("file.X83", keep_xml=True)
# GAEBInfo
doc.gaeb_info.source_element # <GAEBInfo> element
# AwardInfo (procurement)
doc.award.source_element # <Award> element
# Categories (procurement)
for _, _, ctgy in doc.award.boq.iter_hierarchy():
    if ctgy and ctgy.source_element is not None:
        # Access category-level custom tags
        pass
For trade documents, the TradeOrder also retains its raw element:
doc = GAEBParser.parse("order.X96", keep_xml=True)
# TradeOrder
doc.order.source_element # <Order> element
# OrderItems
for item in doc.order.items:
    el = item.source_element # <OrderItem> element
Real-World Example
A file exported by German AVA software might include custom cost-tracking tags:
<Item RNoPart="0010">
  <ShortText>Mauerwerk KS 240mm</ShortText>
  <Qty>100</Qty>
  <QU>m2</QU>
  <UP>45.50</UP>
  <IT>4550.00</IT>
  <!-- Vendor extensions -->
  <VendorCostCode>RC-001</VendorCostCode>
  <CustomNote>Priority item</CustomNote>
</Item>
Extract these without modifying the parser:
doc = GAEBParser.parse("vendor_file.X83", keep_xml=True)
ns = f"{{{doc.raw_namespace}}}"
for item in doc.iter_items():
    el = item.source_element
    code = el.find(f"{ns}VendorCostCode")
    note = el.find(f"{ns}CustomNote")
    print(f"{item.short_text}: code={code.text if code is not None else 'N/A'}, "
          f"note={note.text if note is not None else 'N/A'}")
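The repeated "is not None" checks can be folded into a small helper. The following is a sketch only: tag_text and its default parameter are hypothetical names, not part of pyGAEB, and the only assumption about source_element is that it offers an ElementTree-style find(). The standard library's ElementTree stands in for lxml so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET


def tag_text(element, tag, default="N/A"):
    """Return the stripped text of a direct child, or default if absent or empty."""
    if element is None:  # e.g. source_element when keep_xml was not enabled
        return default
    child = element.find(tag)
    if child is None or child.text is None:
        return default
    return child.text.strip() or default


# Hypothetical element resembling the vendor file above, minus CustomNote.
item_el = ET.fromstring(
    "<Item RNoPart='0010'>"
    "<ShortText>Mauerwerk KS 240mm</ShortText>"
    "<VendorCostCode>RC-001</VendorCostCode>"
    "</Item>"
)

print(tag_text(item_el, "VendorCostCode"))  # RC-001
print(tag_text(item_el, "CustomNote"))      # N/A
```

With pyGAEB, the call would presumably take the form tag_text(item.source_element, f"{ns}VendorCostCode"). The helper tests against None explicitly because ElementTree-style elements without children can evaluate as false in a plain truthiness check.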
Memory Considerations
When keep_xml=False (the default), the lxml tree is garbage-collected after parsing. With keep_xml=True, the entire XML tree remains in memory alongside the Pydantic models. For large files (10,000+ items), this roughly doubles memory usage.
| Mode | Memory | source_element | xpath() |
|---|---|---|---|
| keep_xml=False | Normal | None | Raises RuntimeError |
| keep_xml=True | ~2x | lxml _Element | Full XPath support |
Releasing XML memory with discard_xml()
After you're done with XPath queries and source_element access, call discard_xml() to release the lxml tree and all element references:
doc = GAEBParser.parse("large_file.X83", keep_xml=True)
# Do your custom tag work
codes = doc.xpath("//g:VendorCostCode/text()")
for item in doc.iter_items():
    el = item.source_element
    # ... extract vendor data ...
# Release XML memory — doc still works for iteration, writing, etc.
doc.discard_xml()
# doc.xpath() would now raise RuntimeError
# item.source_element is now None
This is especially useful in pipelines processing many files — parse with keep_xml=True, extract what you need, then discard_xml() before moving to the next file.
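The parse, extract, discard cycle can be wrapped in one function so that the tree is released even when a query fails. This is a sketch under stated assumptions: collect_xpath is a hypothetical helper, and parse is any callable of the shape parse(path, keep_xml=True) whose result offers the xpath() and discard_xml() methods described above.

```python
def collect_xpath(paths, parse, expression):
    """Run one XPath expression per file, releasing each XML tree before the next parse.

    paths      -- iterable of file paths
    parse      -- callable such as parse(path, keep_xml=True) returning a document
                  with xpath() and discard_xml() (assumption, see lead-in)
    expression -- XPath expression passed to doc.xpath()
    """
    results = {}
    for path in paths:
        doc = parse(path, keep_xml=True)
        try:
            # Copy the result so nothing keeps referring to the tree afterwards.
            results[path] = list(doc.xpath(expression))
        finally:
            # Release the retained tree even if the query raised.
            doc.discard_xml()
    return results
```

Assuming GAEBParser.parse can be passed as a plain callable, usage would look like collect_xpath(files, GAEBParser.parse, "//g:VendorCostCode/text()"). Taking the parser as a parameter also makes the loop easy to test with a stub document.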
Tip
post_parse_hook and collect_raw_data automatically handle this for you — they enable keep_xml internally, run hooks, then discard. See the Extensibility Guide.
For most use cases, the memory overhead is negligible. If you're processing many large files in a pipeline, parse with keep_xml=False for standard data and only re-parse specific files with keep_xml=True when custom tag access is needed.