A quick-start summary of the BeautifulSoup module in the bs4 library, written after hands-on practice
Part 1. The BeautifulSoup module
1. Two ways to instantiate a BeautifulSoup object
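(1) Convert a local HTML document into a BeautifulSoup object
from bs4 import BeautifulSoup
# Open the local file; the with block closes it once parsing is done
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')  # parse with the lxml parser
print(soup)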
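The constructor also accepts a plain string, which is handy for trying things out without a file on disk. The following is a minimal sketch: the markup is a made-up stand-in for test1.html, and it uses the standard-library 'html.parser' so that it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for test1.html
html = "<html><body><p class='first'>hello</p></body></html>"

# BeautifulSoup accepts a str as well as an open file object.
# 'html.parser' ships with Python; 'lxml' can be swapped in if it is installed.
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p  # first <p> tag in the document
print(first_p.name, first_p.get_text())
```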
(2) Convert fetched page source into a BeautifulSoup object
from bs4 import BeautifulSoup
import requests
url = 'https://www.baidu.com'
# Send a browser-like User-Agent so the request looks like it comes from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
# .text is the response body decoded to a str, so name the variable accordingly
html = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
print(soup)
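In practice a request can hang or come back with an error status, so it may be worth wrapping the fetch step. The helper below is only a sketch: the name fetch_soup, the 10-second timeout, and the shortened User-Agent are illustrative assumptions, not part of requests or bs4.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and parse it (hypothetical helper for illustration)."""
    headers = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder User-Agent
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    # Passing the raw bytes lets BeautifulSoup work out the encoding itself
    return BeautifulSoup(response.content, "html.parser")
```

Calling fetch_soup('https://www.baidu.com') would then return a soup object, or raise a requests exception if the request fails.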
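2. Locating tags with BeautifulSoup
(1) Locating by tag name
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:  # read the HTML document
    soup = BeautifulSoup(fp, 'lxml')  # instantiate the object
# soup.<tag name> returns only the first matching tag
print(soup.p)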
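Attribute-style access such as soup.p returns only the first match, and returns None when the tag does not exist. A small sketch with made-up markup shows the difference from find_all:

```python
from bs4 import BeautifulSoup

html = "<div><p>first</p><p>second</p></div>"  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")

first = soup.p              # only the first <p>
every = soup.find_all("p")  # all <p> tags, in document order
missing = soup.table        # no <table> in the markup, so this is None

texts = [p.get_text() for p in every]
print(first.get_text(), texts, missing)
```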
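(2) Locating by attribute
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# class is a Python keyword, so bs4 spells the argument class_
print(soup.find(class_='first'))      # first matching tag
print(soup.find_all(class_='first'))  # list of all matching tags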
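The two methods differ in what they return when nothing matches, which matters for error handling. A minimal sketch, using invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tags share class "first"
html = '<div class="first">a</div><p class="first">b</p><p class="second">c</p>'
soup = BeautifulSoup(html, "html.parser")

one = soup.find(class_="first")         # a single Tag (here the <div>)
many = soup.find_all(class_="first")    # list-like ResultSet of Tags
none = soup.find(class_="third")        # no match: None
empty = soup.find_all(class_="third")   # no match: empty list

print(one.name, [t.name for t in many], none, len(empty))
```

Because find can return None, checking the result before indexing or reading attributes avoids a TypeError or AttributeError.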
(3) Locating by tag name plus attribute
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# Match only <div> tags whose class includes 'first'
print(soup.find('div', class_='first'))
print(soup.find_all('div', class_='first'))
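Combining a tag name with an attribute narrows the match to one kind of element. The sketch below uses made-up markup and also shows the attrs dictionary form, which does the same job as the class_ keyword:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <div> and a <p> share the same class
html = '<div class="first">div text</div><p class="first">p text</p>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("p", class_="first")          # keyword form
by_attrs = soup.find_all("p", attrs={"class": "first"})  # dictionary form

keyword_texts = [t.get_text() for t in by_keyword]
attrs_texts = [t.get_text() for t in by_attrs]
print(keyword_texts, attrs_texts)
```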
(4) Locating with selectors
I. id selector
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# select() takes a CSS selector and always returns a list
print(soup.select('#first'))
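Since select always returns a list, even an id lookup needs an index to reach the tag. A short sketch with invented markup contrasts it with select_one, which returns the first match directly:

```python
from bs4 import BeautifulSoup

html = '<ul><li id="first">one</li><li>two</li></ul>'  # hypothetical markup
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("#first")    # list containing one Tag
item = soup.select_one("#first")   # the Tag itself, or None if nothing matches

print(len(matches), matches[0].get_text(), item.get_text())
```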
II. class selector
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# '.first' matches every tag whose class includes 'first'
print(soup.select('.first'))
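A class selector matches on any one of a tag's classes, and selectors can be chained to require several classes at once. A small sketch, assuming made-up markup with multi-class tags:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one, two, and one class per <li>
html = '<li class="first item">a</li><li class="item">b</li><li class="first">c</li>'
soup = BeautifulSoup(html, "html.parser")

firsts = [t.get_text() for t in soup.select(".first")]      # class includes "first"
both = [t.get_text() for t in soup.select(".first.item")]   # has both classes

print(firsts, both)
```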
III. Tag selector
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# A bare tag name matches every tag of that type
print(soup.select('li'))
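IV. Hierarchy selectors
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# '>' steps down exactly one level (direct child)
print(soup.select('div>ul>#first'))
print(soup.select('div>ul>li'))
# A space matches descendants at any depth
print(soup.select('div li'))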
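The difference between '>' and a space only shows up when the markup nests more deeply than the selector. The sketch below uses invented markup in which one li sits outside the div > ul path:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third <li> is nested inside <section><ol>
html = """
<div>
  <ul>
    <li id="first">one</li>
    <li>two</li>
  </ul>
  <section><ol><li>nested</li></ol></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

children = [t.get_text() for t in soup.select("div > ul > li")]  # direct path only
descendants = [t.get_text() for t in soup.select("div li")]      # any depth

print(children, descendants)
```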
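3. Extracting text and attribute values from tags
(1) Extracting text from a tag
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# [1] picks the second match, so at least two tags must have class 'first'
# .string is None when the tag has more than one child node
print(soup.select('.first')[1].string)
# .text joins the text of the tag and all of its descendants
print(soup.select('.first')[1].text)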
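The two attributes give different results as soon as a tag contains nested tags. A minimal sketch, with made-up markup in which the second li holds both text and a b tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain-text <li>, one with a nested <b>
html = '<ul><li class="first">plain</li><li class="first">mixed <b>bold</b></li></ul>'
soup = BeautifulSoup(html, "html.parser")

plain, mixed = soup.select(".first")

plain_string = plain.string   # single text child, so the text itself
mixed_string = mixed.string   # two children (text and <b>), so None
mixed_text = mixed.text       # all nested text joined together
mixed_clean = mixed.get_text(" ", strip=True)  # strip each piece, join with a space

print(plain_string, mixed_string, mixed_text, mixed_clean)
```

When the structure of the tag is not known in advance, .text or get_text is usually the safer choice.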
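(2) Extracting attributes from a tag
from bs4 import BeautifulSoup
with open('test1.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# Index a tag like a dict; class can hold several values, so it comes back as a list
print(soup.find(class_='first')['class'])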
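Attribute access behaves like a dictionary lookup, so a missing attribute raises KeyError unless get is used. The sketch below assumes a made-up a tag to show the common cases:

```python
from bs4 import BeautifulSoup

html = '<a class="first link" href="/home" id="nav">Home</a>'  # hypothetical markup
soup = BeautifulSoup(html, "html.parser")

tag = soup.find(class_="first")

classes = tag["class"]    # multi-valued attribute: a list of class names
href = tag["href"]        # single-valued attribute: a plain str
title = tag.get("title")  # missing attribute: None instead of KeyError
all_attrs = tag.attrs     # every attribute as a dict

print(classes, href, title, sorted(all_attrs))
```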