Python Web Scraping and Information Extraction Series (Part 2)


Beautiful Soup: "Beautiful Soup, so rich and green"

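Beautiful Soup is a flexible, convenient library for parsing web pages. It is efficient and supports multiple parsers.
With it, you can extract information from web pages without writing regular expressions.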
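
As a rough illustration of the point about regular expressions, the sketch below collects every link target from a small HTML string. The HTML snippet and its example.com URLs are made up for this example; find_all is the library's standard search method.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = (
    '<p>See <a href="http://example.com/a">A</a> '
    'and <a href="http://example.com/b">B</a>.</p>'
)

soup = BeautifulSoup(html, "html.parser")

# find_all("a") returns every <a> tag; indexing a tag by attribute name
# returns that attribute's value.
links = [a["href"] for a in soup.find_all("a")]
print(links)
```

No pattern matching on the raw text is needed: the parser builds the tag tree and the attributes are read from it directly.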

Installing Beautiful Soup

Install it with pip (the package is published on PyPI as beautifulsoup4 and imported as bs4):

pip install beautifulsoup4

Basic usage:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>' ,'html.parser')

 

Fetch a web page and parse its HTML with Beautiful Soup

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
print(soup.prettify())
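
The fetch above assumes the request succeeds and that the response encoding is detected correctly. A slightly more defensive variant might look like the sketch below. The helper name fetch_soup and the 10-second timeout are assumptions made for this example, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch url and return a parsed BeautifulSoup tree (hypothetical helper)."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()              # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding  # use the encoding guessed from the body
    return BeautifulSoup(r.text, "html.parser")
```

Calling fetch_soup("https://python123.io/ws/demo.html") should then give the same kind of soup object as above, provided the page is reachable.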


import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
print(soup.title)  # print the <title> tag
print(soup.a.parent.parent.name)  # name of the grandparent of the first <a> tag
# Output:
# <title>This is a python demo page</title>
# body
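
The same navigation can be tried without a network connection. The sketch below uses a small stand-in document (hypothetical, not the real demo page) to show how soup.<tagname> returns the first matching tag and how .name, .string, and .parent relate.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in document; the real demo page differs.
html = """<html><head><title>Sample page</title></head>
<body>
<p class="course">Courses:
<a href="http://example.com/one" class="py1" id="link1">Basic Python</a> and
<a href="http://example.com/two" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string          # text inside <title>
first_link = soup.a                     # only the FIRST <a> tag
parent_name = soup.a.parent.name        # tag that directly contains it
grandparent_name = soup.a.parent.parent.name

print(title_text, first_link["id"], parent_name, grandparent_name)
```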

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
tag = soup.a
print(tag.attrs)  # all attributes of the tag, as a dict
print(tag.attrs['class'])
print(tag.attrs['href'])
# Output:
# {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
# ['py1']
# http://www.icourse163.org/course/BIT-268001
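
Note in the output above that class comes back as a list while href is a plain string. The sketch below, on a made-up tag, shows this and also how tag.get() avoids a KeyError when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical tag, used only for illustration.
soup = BeautifulSoup(
    '<a href="http://example.com/one" class="py1 external" id="link1">Basic Python</a>',
    "html.parser",
)
tag = soup.a

href = tag["href"]           # same as tag.attrs["href"]
classes = tag["class"]       # class is multi-valued, so it is a list
missing = tag.get("title")   # None instead of KeyError for absent attributes

print(href, classes, missing)
```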

 

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(newsoup.b.string, "type:", type(newsoup.b.string))
# Output:
# This is a comment type: <class 'bs4.element.Comment'>
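
Because .string returns a comment's content without the <!-- --> markers, a comment can be mistaken for ordinary text. One possible way to tell them apart is an isinstance check, sketched below. The helper name visible_text is an assumption for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_text(tag):
    """Return the tag's string, or None if it is missing or a comment (hypothetical helper)."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_text(newsoup.b))
print(visible_text(newsoup.p))
```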

 


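Downward traversal of the tag tree

for child in soup.body.children:  # downward traversal
    if child == '\n':  # skip the newline strings between tags
        continue
    print(child)  # each direct child node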
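
The loop above visits direct children only, and those include text nodes as well as tags. The sketch below, on a made-up fragment, contrasts .children with .descendants and keeps only Tag nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, used only for illustration.
html = "<body>\n<p><b>one</b></p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Direct children that are tags (the "\n" strings are filtered out).
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# All tags below <body>, at any depth, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

Filtering with isinstance(node, Tag) is a little more general than comparing against '\n', since it also skips other whitespace-only strings.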

Upward traversal of the tag tree

for parent in soup.a.parents:  # upward traversal of the tag tree
    if parent is None:
        print(parent)
    else:
        print(parent.name)
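
To see what the upward loop actually yields, the sketch below collects the ancestor names for a made-up fragment. The last ancestor is the BeautifulSoup object itself, whose name is "[document]", so in practice the parent is None branch is not reached.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration.
soup = BeautifulSoup(
    "<html><body><p><a>link</a></p></body></html>", "html.parser"
)

# .parents walks from the immediate parent up to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```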

Sibling traversal of the tag tree

for sibling in soup.a.next_siblings:  # following siblings; a sibling may be a string rather than a tag
    print(sibling)
for sibling in soup.a.previous_siblings:  # preceding siblings
    print(sibling)
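
Since siblings can be strings as well as tags, it often helps to separate the two. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical paragraph, used only for illustration.
soup = BeautifulSoup(
    '<p>Courses: <a id="link1">A</a> and <a id="link2">B</a>.</p>',
    "html.parser",
)

first = soup.a
next_tag_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
next_strings = [str(s) for s in first.next_siblings if isinstance(s, NavigableString)]
previous_strings = [str(s) for s in first.previous_siblings]

print(next_tag_ids, next_strings, previous_strings)
```

Siblings are only the nodes that share the same parent, here the single <p> tag.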

HTML formatting and encoding with the bs4 library

How can HTML content be made easier to read?

 

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
print(soup.a.prettify())  # print the <a> tag in an indented, readable form
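
On the encoding side, Beautiful Soup works with Unicode strings internally, so non-ASCII text such as Chinese passes through unchanged and can be encoded to bytes on output. The sketch below uses a made-up tag to show the compact form, the prettified form, and UTF-8 bytes.

```python
from bs4 import BeautifulSoup

# Hypothetical tag containing non-ASCII text, used only for illustration.
soup = BeautifulSoup(
    '<p>中文<a href="http://example.com/one">Basic Python</a></p>',
    "html.parser",
)

compact = str(soup.p)              # single-line markup
pretty = soup.p.prettify()         # one tag or string per line, indented
as_bytes = soup.p.encode("utf-8")  # bytes, for writing to a file or socket

print(pretty)
```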

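Information markup

  1.        Marked-up information forms an organizational structure, which adds dimensions to the information.
  2.        Marked-up information can be used for communication, storage, or display.
  3.        The markup structure is as valuable as the information itself.
  4.        Marked-up information is easier for programs to understand and use.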

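Three forms of information markup, compared
XML (eXtensible Markup Language) is one of the earliest general-purpose information markup languages. It is highly extensible but verbose. Tags consist of a name and attributes, and take these forms: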

<name>...</name>
<name />
<!-- -->

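As a small illustration of the XML forms listed above, the sketch below parses a made-up record with Python's standard xml.etree.ElementTree module. The element names and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: a paired tag with an attribute, an empty tag, and a comment.
xml_text = '<person id="1"><name>Alice</name><active /><!-- a comment --></person>'

root = ET.fromstring(xml_text)

tag_name = root.tag                      # name of the outer tag
attributes = root.attrib                 # attributes as a dict of strings
name_text = root.find("name").text       # text between <name> and </name>
active_text = root.find("active").text   # empty element: no text

print(tag_name, attributes, name_text, active_text)
```
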
JSON (JavaScript Object Notation) is well suited to processing by programs and is more concise than XML. It uses typed key-value pairs, in these forms:

 

"key":"value"
"key":["value1","value2"]
"key":{"subkey":"subvalue"}

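The three JSON forms above map directly onto Python types. A minimal sketch with the standard json module, using a made-up record:

```python
import json

# Hypothetical record showing a string value, a list value, and a nested object.
text = '{"name": "Alice", "tags": ["a", "b"], "address": {"city": "Beijing"}}'

data = json.loads(text)

# Values keep their types: str, list, and dict respectively.
value_types = {key: type(value).__name__ for key, value in data.items()}
print(value_types)
print(data["address"]["city"])
```
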
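YAML (YAML Ain't Markup Language) has the highest proportion of actual text to markup and is easy to read. It uses untyped key-value pairs, in these forms:

key: value
key: # Comment
  - value1
  - value2
key:
  subkey: subvalue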

 

 
