如何使用 Python 压缩 XML 中的空白字符
compressed_xml_output.xml
<root><parent><child><subchild> This is some text with irregular spacing. </subchild><subchild>Another piece of text with\n newlines and tabs.</subchild></child><child><subchild>Text with\n multiple\n lines.</subchild><subchild> Leading and trailing spaces </subchild></child></parent><parent><child><subchild>Mixed whitespace types.</subchild><subchild> </subchild></child></parent></root>以下 Python 函数将使用 lxml 快速 XML 库压缩 XML 字符串中不必要的空白字符。它不会压缩文本节点中的空白字符,只会压缩 XML 结构本身中的空白字符。
compress_xml_whitespace.py
import io
from lxml import etree
def compress_xml_whitespace(input_bytes: io.BytesIO) -> io.BytesIO:
"""
Compresses the whitespace in an XML file.
Args:
input_bytes (io.BytesIO): The input BytesIO object containing the XML content.
Returns:
io.BytesIO: The output BytesIO object containing the compressed XML content.
"""
input_bytes.seek(0) # Reset the position to the beginning of the BytesIO object
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(input_bytes, parser)
# Convert the XML tree to a string without pretty printing (no newlines or indentation)
compressed_xml = etree.tostring(tree, pretty_print=False, encoding='utf-8')
# Write the compressed XML to the BytesIO output
output_bytesio = io.BytesIO(compressed_xml)
# Reset the pointer of the output BytesIO to the beginning
output_bytesio.seek(0)
return output_bytesio演示
compress_xml_whitespace_demo.py
test_xml = """<root>
<parent>
<child>
<subchild> This is some text with irregular spacing. </subchild>
<subchild>Another piece of text with
newlines and tabs.</subchild>
</child>
<child>
<subchild>Text with
multiple
lines.</subchild>
<subchild> Leading and trailing spaces </subchild>
</child>
</parent>
<parent>
<child>
<subchild>Mixed whitespace types.</subchild>
<subchild> </subchild>
</child>
</parent>
</root>"""
test_xml_bytesio = io.BytesIO(test_xml.encode('utf-8'))
compress_xml_whitespace(test_xml_bytesio).read().decode('utf-8')输出:
output.txt
<root><parent><child><subchild> This is some text with irregular spacing. </subchild><subchild>Another piece of text with\n newlines and tabs.</subchild></child><child><subchild>Text with\n multiple\n lines.</subchild><subchild> Leading and trailing spaces </subchild></child></parent><parent><child><subchild>Mixed whitespace types.</subchild><subchild> </subchild></child></parent></root>If this post helped you, please consider buying me a coffee or donating via PayPal to support research & publishing of new posts on TechOverflow