Parsing HTML and XML documents with the BeautifulSoup library - CFANZ programming community

Installing BeautifulSoup:

~/Desktop$ sudo pip install beautifulsoup4

Test:

from bs4 import BeautifulSoup

if __name__ == "__main__":
    # The first argument is the HTML document text; the second names the parser to use
    soup = BeautifulSoup('<p>data</p>', 'html.parser')
    print(soup.prettify())

Output:

<p>
 data
</p>

This output shows that the installation succeeded.

The Beautiful Soup library, also known as bs4, is a library for parsing, traversing, and maintaining the "tag tree".

Beautiful Soup parsers:

Parser	Usage	Requirement
bs4's HTML parser (Python's built-in html.parser)	BeautifulSoup(mk, 'html.parser')	pip install beautifulsoup4
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
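
To make the parser table a little more concrete, here is a minimal sketch that parses the same unclosed fragment with the built-in `html.parser` and, if it happens to be installed, with `lxml`. The `parse_with` helper and the sample markup are made up for this illustration. `lxml` is treated as optional because Beautiful Soup raises `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup: the <p> tag is never closed.
MARKUP = "<p>data"


def parse_with(parser_name, markup=MARKUP):
    """Return the parsed document as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser_name))
    except FeatureNotFound:
        return None


# html.parser ships with Python, so this call should always succeed.
builtin_result = parse_with("html.parser")
print("html.parser:", builtin_result)

# lxml is optional; None here simply means it is not installed.
lxml_result = parse_with("lxml")
print("lxml:", lxml_result)
```

Different parsers may repair the same broken markup differently, so it is worth printing the results side by side before settling on one.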

Basic elements of the BeautifulSoup class

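Element	Description
Tag	A tag, the most basic unit of information; its start and end are marked with <> and </>.
Name	The name of the tag; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes	The attributes of the tag, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string in <p>…</p>. Format: <tag>.string
Comment	The comment part of the string inside a tag; a special kind of NavigableString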
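
The element types in the table can be checked offline with a small sketch like the one below. The markup is invented for the illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, made up for this example.
MARKUP = (
    '<b><!--This is a comment--></b>'
    '<p class="title" id="intro">This is not a comment</p>'
)

soup = BeautifulSoup(MARKUP, "html.parser")

tag = soup.p                      # first <p> tag in the document
print(isinstance(tag, Tag))       # Tag object
print(tag.name)                   # tag name as a str
print(tag.attrs)                  # attributes as a dict; class is stored as a list

text = tag.string                 # NavigableString inside <p>
print(text, type(text))

comment = soup.b.string           # the string inside <b> is a Comment
print(comment, type(comment))
print(isinstance(comment, NavigableString))  # Comment subclasses NavigableString
```

Because `.string` returns the comment text without the `<!-- -->` markers, checking the type is the reliable way to tell a comment from ordinary text.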

Example:

import requests
from bs4 import BeautifulSoup


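def handle_url(url):
    try:
        # Fetch the URL that was passed in, not a hard-coded address
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        # requests falls back to ISO-8859-1 when no charset is declared,
        # so guess the encoding from the content instead
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # There are many a tags, but soup.a returns only the first one
        print(soup.a)
        # <class 'bs4.element.Tag'>
        print(type(soup.a))
        # The tag name: a
        print(soup.a.name)
        # <class 'str'>
        print(type(soup.a.name))
        # Dictionary of the tag's attributes (key-value pairs)
        print(soup.a.attrs)
        # <class 'dict'>
        print(type(soup.a.attrs))
        # Value of the a tag's href attribute
        print(soup.a.attrs['href'])
        # <class 'str'>
        print(type(soup.a.attrs['href']))
        # The content of the tag
        print(soup.a.string)
        # The parent element of the a tag
        print(soup.a.parent)
    except Exception as exc:
        # Report the error instead of silently hiding it
        print("fail fail fail:", exc)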


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)

Traversing HTML elements with Beautiful Soup

HTML has a tree structure, so there are three kinds of traversal:
Downward traversal:

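Attribute	Description
.contents	List of child nodes; all direct children of <tag> are stored in a list
.children	Iterator over the child nodes, used to loop over the direct children
.descendants	Iterator over all descendant nodes, used to loop over every descendant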
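
The difference between the three attributes is easier to see on a small, fixed document. The sketch below uses invented markup so that the counts are stable and do not depend on a live page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, made up for this example.
MARKUP = "<div><p>one</p><p>two <b>bold</b></p></div>"

soup = BeautifulSoup(MARKUP, "html.parser")
div = soup.div

# .contents is a plain list of the direct children.
child_list = div.contents
print(len(child_list), child_list)

# .children iterates over the same direct children lazily.
child_names = [child.name for child in div.children]
print(child_names)

# .descendants walks every node below div, text nodes included.
descendant_nodes = list(div.descendants)
print(len(descendant_nodes))

# Keep only the tags among the descendants.
descendant_tag_names = [n.name for n in descendant_nodes if isinstance(n, Tag)]
print(descendant_tag_names)
```

Note that text nodes count as nodes too, which is why the descendant count is larger than the number of tags.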

import requests
from bs4 import BeautifulSoup


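def handle_url(url):
    try:
        # Fetch the URL that was passed in, not a hard-coded address
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        print(soup.head)
        # Child nodes of the head tag
        print(soup.head.contents)
        # The type is list
        print(type(soup.head.contents))
        # Number of child nodes of head (5 for this page when the post was written)
        print(len(soup.head.contents))
        # The fifth child node of head; assumes head has at least five children
        print(soup.head.contents[4])
        # Loop over the child nodes with children
        for child in soup.head.children:
            print(child)
        # Loop over all descendant nodes with descendants
        for child in soup.head.descendants:
            print(child)
    except Exception as exc:
        # Report the error instead of silently hiding it
        print("fail fail fail:", exc)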


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)

Upward traversal:

Attribute	Description
.parent	The parent tag of the node
.parents	Iterator over the node's ancestors, used to loop over the ancestor nodes
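
As a rough offline illustration of `.parent` and `.parents`, the sketch below walks upward from a `title` tag in a small invented document. It also shows where the walk ends: at the `BeautifulSoup` object itself, whose name is `[document]` and whose own parent is `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this example.
MARKUP = "<html><head><title>demo</title></head><body><p>text</p></body></html>"

soup = BeautifulSoup(MARKUP, "html.parser")

# Direct parent of the title tag.
title_parent_name = soup.title.parent.name
print(title_parent_name)

# All ancestors of title, nearest first; the last is the BeautifulSoup object.
ancestor_names = [ancestor.name for ancestor in soup.title.parents]
print(ancestor_names)

# The parent of the html tag is the BeautifulSoup object, not the tag itself.
html_parent_is_soup = soup.html.parent is soup
print(html_parent_is_soup)

# The BeautifulSoup object sits at the top of the tree and has no parent.
soup_parent = soup.parent
print(soup_parent)
```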

import requests
from bs4 import BeautifulSoup


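def handle_url(url):
    try:
        # Fetch the URL that was passed in, not a hard-coded address
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # The parent of the html tag is the BeautifulSoup object, so this prints the whole document
        print(soup.html.parent)
        # soup itself is a special kind of tag node; its parent is None
        print(soup.parent)
        # The parent node of the title tag
        print(soup.title.parent)
        # Loop over the ancestors of the title tag; the last one is the
        # BeautifulSoup object, whose name is '[document]'
        for parent in soup.title.parents:
            print(parent.name)
    except Exception as exc:
        # Report the error instead of silently hiding it
        print("fail fail fail:", exc)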


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)

Sibling traversal: the nodes must share the same parent node

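Attribute	Description
.next_sibling	Returns the next sibling node in HTML document order
.previous_sibling	Returns the previous sibling node in HTML document order
.next_siblings	Iterator; returns all following sibling nodes in HTML document order
.previous_siblings	Iterator; returns all preceding sibling nodes in HTML document order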
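
One detail worth seeing in practice is that a sibling is not always a tag: whitespace between tags is a text node and is returned as a sibling as well. The sketch below uses invented markup with a single newline between two tags to show this, and filters with `isinstance` when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup: note the newline between </a> and <b>.
MARKUP = "<div><a>1</a>\n<b>2</b><i>3</i></div>"

soup = BeautifulSoup(MARKUP, "html.parser")

# The node right after <a> is the newline text node, not <b>.
after_a = soup.a.next_sibling
print(repr(after_a), isinstance(after_a, NavigableString))

# Following siblings of <a>, keeping only tags.
next_tag_names = [s.name for s in soup.a.next_siblings if isinstance(s, Tag)]
print(next_tag_names)

# Preceding siblings of <i> are yielded nearest first.
previous_tag_names = [s.name for s in soup.i.previous_siblings if isinstance(s, Tag)]
print(previous_tag_names)

# The last child has no next sibling.
after_i = soup.i.next_sibling
print(after_i)
```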

import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        # Fetch the URL that was passed in, not a hard-coded address
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # The sibling node just before title
        print(soup.title.previous_sibling)
        # The sibling node just after link
        print(soup.link.next_sibling)
        # Loop over all following siblings of the meta tag
        for sibling in soup.meta.next_siblings:
            print(sibling)
        # Loop over all preceding siblings of the title tag
        for sibling in soup.title.previous_siblings:
            print(sibling)

    except Exception as exc:
        # Report the error instead of silently hiding it
        print("fail fail fail:", exc)


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)