[Web Crawler Sword Manual (爬虫剑谱)] Volume 3: Addenda, Chapter 3: A Summary of Using the BeautifulSoup Module in the bs4 Library

Python百事通, 2022-01-26

A quick-start summary of the BeautifulSoup module in the bs4 library, written after using it in real scraping work.

Part One: The BeautifulSoup module

1. Two ways to instantiate a BeautifulSoup object

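(1) Converting a local HTML document into a BeautifulSoup object

from bs4 import BeautifulSoup

# Open the local file and parse it; the with block closes the file afterwards
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
print(soup)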

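The contents of test1.html are not shown in this article, so the following is only a sketch for trying the step above. It writes a hypothetical stand-in file to a temporary directory and parses it. The markup in SAMPLE_HTML is an assumption, and the standard library's html.parser is swapped in for lxml so the snippet runs without extra dependencies.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for test1.html; the real file is not shown in the article.
SAMPLE_HTML = """<html><body>
<div class="first"><p>intro</p></div>
<ul>
  <li id="first">one</li>
  <li class="first">two</li>
</ul>
</body></html>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'test1.html'
    path.write_text(SAMPLE_HTML, encoding='utf-8')
    # The whole document is parsed before the file is closed.
    with open(path, encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'html.parser')

print(soup.find('li').get_text())
```

If lxml is installed, passing 'lxml' instead of 'html.parser' should work the same way for this small document.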

(2) Converting fetched page source into a BeautifulSoup object

from bs4 import BeautifulSoup
import requests

url = 'https://www.baidu.com'
# Send a browser-like User-Agent header with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')  # response.text is the decoded page source
print(soup)

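requests decodes .text using the encoding it infers from the response headers, so pages with non-ASCII text can sometimes come out garbled. One possible workaround, sketched here under assumptions, is to hand BeautifulSoup the raw bytes and let it work out the encoding. The helper name soup_from_bytes is hypothetical, the bytes below merely stand in for response.content so that no network access is needed, and html.parser is used instead of lxml.

```python
from bs4 import BeautifulSoup


def soup_from_bytes(raw: bytes) -> BeautifulSoup:
    """Parse raw bytes; BeautifulSoup detects the encoding on its own."""
    return BeautifulSoup(raw, 'html.parser')


# Stand-in for response.content from requests.get(...); not fetched from a real site.
raw = (
    '<html><head><meta charset="utf-8">'
    '<title>百度一下</title></head><body></body></html>'
).encode('utf-8')

soup = soup_from_bytes(raw)
print(soup.title.string)
```

In a real scraper the call would look like soup_from_bytes(response.content), with response being the object returned by requests.get.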

2. Locating tags with BeautifulSoup

(1) Locating by tag name

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:  # read the HTML document
    soup = BeautifulSoup(fp, 'lxml')  # instantiate the object
print(soup.p)  # the first <p> tag in the document

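Accessing a tag name as an attribute only ever returns the first matching tag. A small sketch of that behavior, using an inline HTML string that is made up for illustration and html.parser in place of lxml:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<div><p>first</p><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p          # only the first <p>
same_p = soup.find('p')   # equivalent lookup via find()
no_table = soup.table     # None, because there is no <table>

print(first_p.get_text(), first_p == same_p, no_table)
```

Because a missing tag yields None, chaining something like soup.table.get_text() would raise AttributeError, so a None check is worth adding in real scrapers.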

(2) Locating by attribute

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
print(soup.find(class_='first'))      # the first tag whose class is "first"
print(soup.find_all(class_='first'))  # every tag whose class is "first"

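The difference between find and find_all matters most when nothing matches. The following sketch uses made-up markup and html.parser instead of lxml to show what each call returns:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = (
    '<div class="first">a</div>'
    '<p class="first">b</p>'
    '<div class="second">c</div>'
)
soup = BeautifulSoup(html, 'html.parser')

first_match = soup.find(class_='first')      # a single Tag: the first match
all_matches = soup.find_all(class_='first')  # a list-like ResultSet of Tags
missing_one = soup.find(class_='nope')       # None when nothing matches
missing_all = soup.find_all(class_='nope')   # empty ResultSet when nothing matches

print(first_match.name, len(all_matches), missing_one, len(missing_all))
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.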

(3) Locating by tag name plus attribute

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

print(soup.find('div', class_='first'))      # the first <div class="first">
print(soup.find_all('div', class_='first'))  # every <div class="first">

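Attributes other than class can be matched too, and names containing a hyphen need the attrs dictionary because they are not valid Python keywords. A hedged sketch with made-up markup (the data-role attribute is an assumption for illustration) and html.parser in place of lxml:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = (
    '<div class="first" data-role="box">a</div>'
    '<p class="first">b</p>'
    '<div class="second">c</div>'
)
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find_all('div', class_='first')          # tag name + class_ keyword
by_attrs = soup.find_all('div', attrs={'class': 'first'})  # same filter via attrs dict
by_data = soup.find_all(attrs={'data-role': 'box'})        # hyphenated attribute name

print(len(by_keyword), by_keyword == by_attrs, len(by_data))
```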

(4) Locating with CSS selectors

I. id selector

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

print(soup.select('#first'))  # tags whose id is "first"


II. class selector

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

print(soup.select('.first'))  # tags whose class is "first"


III. Tag selector

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

print(soup.select('li'))  # every <li> tag

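The three basic selectors can be compared side by side. This sketch uses made-up markup and html.parser instead of lxml, and also shows select_one, which returns a single tag rather than a list:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = (
    '<ul>'
    '<li id="first">one</li>'
    '<li class="first">two</li>'
    '<li>three</li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.select('#first')         # list of tags with id="first"
by_class = soup.select('.first')      # list of tags with class="first"
by_tag = soup.select('li')            # list of every <li>
single = soup.select_one('#first')    # the first match as a Tag
nothing = soup.select_one('#absent')  # None when nothing matches

print(len(by_id), len(by_class), len(by_tag), single.get_text(), nothing)
```

select always returns a list, even for an id that should be unique, so an index such as [0] or select_one is needed to get at the tag itself.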

IV. Hierarchy selectors

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

# '>' matches direct children only; a space matches descendants at any depth
print(soup.select('div>ul>#first'))
print(soup.select('div>ul>li'))
print(soup.select('div li'))

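To make the child versus descendant distinction concrete, here is a sketch with made-up markup in which one li sits deeper than the others. html.parser is used in place of lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup: two <li> under div>ul, one more nested under div>section>ol.
html = """
<div>
  <ul>
    <li id="first">one</li>
    <li>two</li>
  </ul>
  <section>
    <ol><li>three</li></ol>
  </section>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

children = soup.select('div>ul>li')        # only <li> directly inside div>ul
descendants = soup.select('div li')        # every <li> anywhere inside the div
direct_li = soup.select('div>li')          # no <li> is a direct child of the div
by_id_path = soup.select('div>ul>#first')  # child path ending in an id

print(len(children), len(descendants), len(direct_li), by_id_path[0].get_text())
```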

3. Extracting text content and attribute values from tags

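(1) Extracting text content from a tag

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

# [1] picks the second tag whose class is "first"
print(soup.select('.first')[1].string)  # None if the tag has more than one child
print(soup.select('.first')[1].text)    # all nested text joined together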

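.string and .text only agree when a tag holds a single piece of text. The sketch below uses made-up markup and html.parser instead of lxml to show where they differ:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = (
    '<p class="first">plain</p>'
    '<p class="first">mixed <b>bold</b> text</p>'
)
soup = BeautifulSoup(html, 'html.parser')
tags = soup.select('.first')

simple_string = tags[0].string  # single text child, so .string returns it
simple_text = tags[0].text
mixed_string = tags[1].string   # several children, so .string is None
mixed_text = tags[1].text       # .text joins all nested text

print(repr(simple_string), repr(simple_text), repr(mixed_string), repr(mixed_text))
```

For scraped pages, .text (or the equivalent get_text()) is usually the safer choice, since .string silently becomes None as soon as the tag contains nested markup.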

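(2) Extracting an attribute from a tag

from bs4 import BeautifulSoup

with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

# class can hold several values, so the result is a list such as ['first']
print(soup.find(class_='first')['class'])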

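Attribute access works like a dictionary lookup on the tag. A hedged sketch with a made-up link tag, using html.parser in place of lxml, shows the list returned for class, the plain string returned for other attributes, and the safer .get() for attributes that may be absent:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<a class="first link" href="/page" id="home">Home</a>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('a')

classes = tag['class']    # multi-valued attribute, returned as a list
href = tag['href']        # ordinary attribute, returned as a string
title = tag.get('title')  # None instead of KeyError when the attribute is missing
all_attrs = tag.attrs     # every attribute as a dict

print(classes, href, title, all_attrs)
```

Indexing with tag['title'] would raise KeyError here, which is why .get() tends to be the more forgiving option when scraping inconsistent pages.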
