How to make lxml's iterparse ignore invalid XML characters?
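Question
I have an XML file that contains invalid characters. lxml's XMLParser raises an exception on these characters, but when I create the XMLParser with the recover=True option, it skips the bad characters and works fine.
My question is: how do I set a similar flag for lxml's iterparse function?
To reproduce:
Broken XML (/tmp/z.xml):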
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<B>Bad characters:</B>
</item>
</items>
Note: the string "Bad characters:" is followed by two ASCII characters #31 (0x1F), which I could not copy and paste here.
Parse error with XMLParser:
import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser()  # default parser: recover is off
tree = etree.parse(fd, parser)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2576, in lxml.etree.parse (src/lxml/lxml.etree.c:22796)
File "parser.pxi", line 1488, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60390)
File "parser.pxi", line 1518, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:60687)
File "parser.pxi", line 1401, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:59658)
File "parser.pxi", line 991, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:57303)
File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)
File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)
File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
To ignore the bad characters, I set recover=True, and it works:
import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser(recover=True)
tree = etree.parse(fd, parser)
etree.tostring(tree)
# OUTPUT:
<items>
<item>
<B>Bad characters:</B>
</item>
</items>
With iterparse I get the same error again. How can I make it ignore the bad characters?
import lxml.etree as etree
fd = open('/tmp/z.xml')
it = etree.iterparse(fd, events=("start", "end"))
for e in it: print(e)
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
Recommended answer
iterparse also accepts the recover argument:
it = etree.iterparse(fd, events=("start", "end"), recover=True)
(Documentation: lxml iterparse)
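For a self-contained check that does not depend on /tmp/z.xml, the sketch below rebuilds the broken document in memory and runs iterparse over it twice, once with the default settings and once with recover=True. The names BROKEN_XML and end_tags are hypothetical helpers introduced only for this illustration, and the exact text that survives recovery may depend on the installed libxml2 version.

```python
import io

import lxml.etree as etree

# Assumed to match /tmp/z.xml from the question: the two \x1f bytes after
# "Bad characters:" are the invalid characters (ASCII 31).
BROKEN_XML = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<items>\n<item>\n<B>Bad characters:\x1f\x1f</B>\n</item>\n</items>\n"
)


def end_tags(data, **parser_options):
    """Return the tag of every "end" event that iterparse yields for data."""
    source = io.BytesIO(data)
    events = etree.iterparse(source, events=("end",), **parser_options)
    return [elem.tag for _event, elem in events]


# Default settings: the invalid characters abort parsing with XMLSyntaxError.
try:
    end_tags(BROKEN_XML)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# With recover=True the parser skips over the error and keeps emitting events.
tags = end_tags(BROKEN_XML, recover=True)

print(strict_failed, tags)
```

Because recover=True silently drops whatever the parser cannot handle, it may be worth logging or counting such documents if the lost characters matter for your data.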
