Reading XML Files with Python - CFANZ Programming Community

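XML (Extensible Markup Language) is a common format for storing and transmitting data on the internet. It organizes data as a tree, with user-defined data stored in each node. A sample XML file looks like this: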

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

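The document opens with the fixed XML declaration and version number, followed by nested tags that form a tree. In this example, tags with the same indentation sit at the same level of the tree.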

Each tag has the following basic features:

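Tag name: in the example above, data, country and so on are tag names.
Attributes: for example the name attribute of the country tag. Attributes are written as key=value, and one tag can carry several of them.
Content: the value between the opening and closing tag. In the example above, the content of the first rank tag is 1.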
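The three features map directly onto what a parser exposes. The following is a minimal sketch using the standard library's xml.etree.ElementTree module (introduced later in this article); the two one-line snippets are copied from the sample document and parsed on their own purely for illustration.

```python
import xml.etree.ElementTree as ET

# A single tag from the sample document, parsed in isolation.
neighbor = ET.fromstring('<neighbor name="Austria" direction="E"/>')

print(neighbor.tag)     # the tag name
print(neighbor.attrib)  # the attributes, as a dict of key/value strings
print(neighbor.text)    # the content; None here because the tag is empty

# A tag that has content but no attributes.
rank = ET.fromstring("<rank>1</rank>")
print(rank.tag, rank.attrib, rank.text)
```

Note that content always comes back as a string, so the rank above is '1', not the integer 1.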

Tag names, attributes and content can all be defined by the user, which makes XML very flexible. In Python, several standard-library modules support XML processing:

xml.etree.ElementTree
xml.dom
xml.dom.minidom
xml.dom.pulldom
xml.parsers.expat

Of these, the first module is the most lightweight and concise, and is recommended for simple XML documents. Basic usage is as follows:

>>> from xml.etree.ElementTree import parse
>>> tree = parse('input.xml')
# Get the root element
>>> root = tree.getroot()
# Every element has two attributes: tag and attrib
# tag holds the tag name
>>> root.tag
'data'
# attrib holds the element's XML attributes as a dict
>>> root.attrib
{}
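When the XML is already in memory rather than in a file such as input.xml, the same module can parse it from a string. This is a small sketch, assuming a trimmed copy of the sample document held in a variable named XML_TEXT (a name chosen here for illustration).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document, kept inline so the sketch
# runs without an input.xml file on disk.
XML_TEXT = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
</data>
"""

# fromstring() returns the root element directly, so no getroot() call is needed.
root = ET.fromstring(XML_TEXT)

print(root.tag)     # tag name of the root element
print(root.attrib)  # empty dict: <data> carries no attributes
print(len(root))    # number of direct children
```

parse() is the natural choice for files, while fromstring() suits text received from a network response or built in a test.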

Iterating over the root element visits its direct children:

>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
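Because iterating an element only yields its direct children, reaching the next level down takes a nested loop. The sketch below assumes a one-country excerpt of the sample document and also shows positional indexing, which elements support much like lists.

```python
import xml.etree.ElementTree as ET

# One-country excerpt of the sample document, for illustration.
XML_TEXT = """<data>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)

# The outer loop visits each country, the inner loop visits that country's children.
for country in root:
    for field in country:
        print(country.get("name"), field.tag, field.text)

# Children can also be reached by position: first country, second child.
second_child = root[0][1]
print(second_child.tag, second_child.text)
```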

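In practice, the more common task is to read the content of specific tags. The iter method visits elements by tag name, at any depth below the current element: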

>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.get('name'))
...
Austria
Switzerland
Malaysia
Costa Rica
Colombia
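To make the traversal order of iter concrete, here is a short sketch on a one-country excerpt of the sample document. It walks the whole tree first, then collects the attribute pairs of every neighbor tag; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# One-country excerpt of the sample document, for illustration.
XML_TEXT = """<data>
    <country name="Panama">
        <rank>68</rank>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)

# With no argument, iter() walks the element itself and all of its
# descendants in document order.
all_tags = [elem.tag for elem in root.iter()]
print(all_tags)

# With a tag name, only matching elements are yielded, however deep they sit.
pairs = [(n.get("name"), n.get("direction")) for n in root.iter("neighbor")]
print(pairs)

# get() returns None for an attribute that is absent, or a supplied default.
first = next(root.iter("neighbor"))
print(first.get("population"), first.get("population", "unknown"))
```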

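The get method returns the value of a given attribute. The findall method locates elements by tag name or XPath expression; given a plain tag name, it matches only direct children of the current element: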

>>> for country in root.findall("country"):
...     year = country.find('year')
...     print(year.text)
...
2008
2011
2011

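In the code above, find returns the first matching child of the current element, and the text attribute holds that element's content. With these few methods, extracting the content of specific tags is straightforward.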
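A common next step is to turn each country into a plain Python record. The sketch below is one way to do it, assuming a two-country excerpt of the sample in which the second gdppc tag has been deliberately removed to show what happens when a child is missing; the records list and its keys are illustrative.

```python
import xml.etree.ElementTree as ET

# Two-country excerpt; Singapore's <gdppc> is removed on purpose for illustration.
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <year>2008</year>
        <gdppc>141100</gdppc>
    </country>
    <country name="Singapore">
        <year>2011</year>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)

records = []
for country in root.findall("country"):
    year = country.find("year")
    records.append({
        "name": country.get("name"),
        # find() returns None when nothing matches, so check before using .text;
        # text is always a string and needs an explicit conversion.
        "year": int(year.text) if year is not None else None,
        # findtext() combines find() and .text, and returns None for a missing child.
        "gdppc": country.findtext("gdppc"),
    })

for record in records:
    print(record)
```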

The module also supports a subset of XPath syntax for locating specific elements; see the official API documentation for details.
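As a rough illustration of that XPath support, the sketch below runs three queries against a trimmed copy of the sample document. Only a few path forms are shown here (descendant search, attribute predicates, and parent selection); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document, for illustration.
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <year>2008</year>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <year>2011</year>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)

# .// searches all descendants; [@attr='value'] filters on an attribute value.
east = [n.get("name") for n in root.findall(".//neighbor[@direction='E']")]
print(east)

# A path can descend several levels starting from the current element.
year = root.find("country[@name='Singapore']/year")
print(year.text)

# .. selects the parent: here, the country that has Malaysia as a neighbor.
parents = [c.get("name") for c in root.findall(".//neighbor[@name='Malaysia']/..")]
print(parents)
```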
