Python Web Scraping and Information Extraction Series (Part 2) - CFANZ Programming Community

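Beautiful Soup: "Beautiful Soup, so rich and green"

Beautiful Soup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.
With it you can extract information from a web page without writing regular expressions.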

Installing Beautiful Soup

Install it with pip. The package name on PyPI is beautifulsoup4; it is imported as bs4.

pip install beautifulsoup4

Usage:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')

Fetch a page and parse its HTML with Beautiful Soup

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
print(soup.prettify())
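The snippet above assumes the request always succeeds. A minimal sketch of a more defensive version follows; `fetch_html` and `parse_page` are hypothetical helper names that do not appear in the original, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise on 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, bad URLs and HTTP error statuses.
        return None


def parse_page(url):
    """Fetch a page and parse it; return None when the fetch fails."""
    html = fetch_html(url)
    if html is None:
        return None
    return BeautifulSoup(html, "html.parser")
```

Calling `parse_page("https://python123.io/ws/demo.html")` should give the same kind of `soup` object as above when the page is reachable, and `None` otherwise.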

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
print(soup.title)  # print the <title> tag
print(soup.a.parent.parent.name)  # print the name of the grandparent of the first <a> tag
# output
# <title>This is a python demo page</title>
# body

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
tag = soup.a
print(tag.attrs)  # the tag's attributes, as a dict
print(tag.attrs['class'])
print(tag.attrs['href'])
# output
# {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
# ['py1']
# http://www.icourse163.org/course/BIT-268001
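Indexing `tag.attrs` raises `KeyError` when an attribute is absent. The sketch below shows `Tag.get`, which returns `None` instead. The inline HTML is a hypothetical one-link fragment modeled on the demo page, so nothing is fetched.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the first link of the demo page.
html = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" '
    'class="py1" id="link1">Basic Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

name = tag.name             # the tag's name: 'a'
href = tag.get("href")      # same value as tag.attrs["href"]
classes = tag.get("class")  # a list, because class is a multi-valued attribute
missing = tag.get("title")  # None, where tag.attrs["title"] would raise KeyError
text = tag.string           # the text inside the tag

print(name, href, classes, missing, text)
```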

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(newsoup.b.string, "type:", type(newsoup.b.string))
# This is a comment type: <class 'bs4.element.Comment'>
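Because `.string` returns the comment text with its `<!-- -->` markers stripped, a comment can be mistaken for ordinary text. One way to tell them apart, sketched here, is an `isinstance` check against `bs4.element.Comment`; the helper name `is_comment` is an assumption made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)


def is_comment(node):
    """True if the node is an HTML comment (hypothetical helper)."""
    return isinstance(node, Comment)


b_is_comment = is_comment(newsoup.b.string)
p_is_comment = is_comment(newsoup.p.string)

# Comment is a subclass of NavigableString, so it has to be excluded explicitly
# when collecting only the text a reader would see.
visible = [
    str(node)
    for node in newsoup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(b_is_comment, p_is_comment, visible)
```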

for child in soup.body.children:  # downward traversal
    if child == '\n':  # skip the newline strings between tags
        continue
    print(child)  # each direct child node
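The loop above skips newline strings by comparing against `'\n'`. A sketch of an alternative is to keep only `Tag` nodes, shown next to `.contents` and `.descendants` for comparison. The inline HTML is a shortened, hypothetical stand-in for the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical stand-in for the demo page.
html = (
    "<html><head><title>This is a python demo page</title></head>\n"
    "<body>\n"
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Courses: '
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>\n'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# .contents is a list of direct children, including the "\n" strings between tags.
all_children = soup.body.contents

# .children iterates over the same nodes; keep only the tags.
child_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]

# .descendants walks every node below <body>, in document order.
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(len(all_children), child_tags, descendant_tags)
```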

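Upward traversal of the tag tree

for parent in soup.a.parents:  # walk up the tag tree from the first <a> tag
    # .parents never yields None; the last item is the BeautifulSoup object
    # itself, whose name is "[document]"
    print(parent.name)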
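To make the upward walk concrete, here is a small sketch on a hypothetical inline document, collecting the ancestor names of the first `<a>` tag into a list.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with one link nested inside <p> and <body>.
html = (
    "<html><head><title>demo</title></head>"
    '<body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Ancestors from nearest to farthest; the BeautifulSoup object comes last.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The top-level BeautifulSoup object has no parent of its own.
top_has_parent = soup.parent is not None
```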

Sibling traversal of the tag tree

for sibling in soup.a.next_siblings:  # later siblings; a sibling may be a string rather than a tag
    print(sibling)
for sibling in soup.a.previous_siblings:  # earlier siblings
    print(sibling)
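Since siblings can be plain strings, code that only wants tags has to filter them. A minimal sketch on a hypothetical one-paragraph fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: two links separated by text inside one <p>.
html = (
    '<p>Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Siblings share the same parent; here they are a mix of strings and tags.
after = [str(node) for node in first.next_siblings]
before = [str(node) for node in first.previous_siblings]

# Keep only the tag siblings, or jump straight to the next matching tag.
next_tag_ids = [node["id"] for node in first.next_siblings if isinstance(node, Tag)]
next_link = first.find_next_sibling("a")

print(before, after, next_tag_ids, next_link["id"])
```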

HTML formatting and encoding with the bs4 library

How can HTML content be made easier to read?

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
print(soup.a.prettify())  # pretty-print just the first <a> tag

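Marking up information

Marked-up information forms an organized structure and gains extra dimensions.
Marked-up information can be used for communication, storage, or display.
The markup structure is as valuable as the information itself.
Marked-up information is easier for programs to understand and use.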

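Three forms of information markup, compared
XML (eXtensible Markup Language) is an early general-purpose information markup language. It is highly extensible but verbose. Tags consist of a name and attributes, in these forms: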

<name>...</name>
<name />
<!--   -->

JSON (JavaScript Object Notation) is well suited to processing by programs and is more concise than XML. It uses typed key-value pairs, in these forms:

"key":"value"
"key":["value1","value2"]
"key":{"subkey":"subvalue"}

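YAML (YAML Ain't Markup Language) has the highest proportion of actual text to markup and is easy to read. It uses untyped key-value pairs, in these forms:

key: value
key: # Comment
  - value1
  - value2
key:
  subkey: subvalue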
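To compare the formats in practice, the sketch below encodes one hypothetical record in XML and in JSON and reads both back with the Python standard library. The record and its field names are made up for illustration. YAML is left out because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record, marked up two ways.
xml_text = (
    "<person>"
    "<name>Alice</name>"
    "<course>Basic Python</course>"
    "<course>Advanced Python</course>"
    "</person>"
)
json_text = '{"name": "Alice", "courses": ["Basic Python", "Advanced Python"]}'

# XML: structure comes from nested tags, so repeated <course> tags form the list.
root = ET.fromstring(xml_text)
xml_record = {
    "name": root.findtext("name"),
    "courses": [course.text for course in root.findall("course")],
}

# JSON: typed key-value pairs map directly onto dicts, lists and strings.
json_record = json.loads(json_text)

same = xml_record == json_record
print(same, len(xml_text), len(json_text))
```

Both parse to the same dictionary; the JSON text is shorter because it has no closing tags.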