Parsing web pages with BeautifulSoup
Extracting data
Traversing the document tree
Get the title of the first article
soup.article.a.div.h4.text
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Dotted access descends to the first matching tag at each level
print(soup.article.a.div.h4.text)
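Since the example above needs a live page, here is a minimal, self-contained sketch of the same dotted traversal on an inline HTML fragment. The markup is hypothetical, and the built-in html.parser is used so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the page structure the notes assume
html = """
<article><a href="/post/1"><div><h4>First Post</h4></div></a></article>
<article><a href="/post/2"><div><h4>Second Post</h4></div></a></article>
"""

soup = BeautifulSoup(html, 'html.parser')
# Dotted access walks to the FIRST matching tag at each level,
# so only the first article's title is returned
print(soup.article.a.div.h4.text)  # First Post
```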
Searching the document tree
Common methods:
find() — returns the first match
find_all() — returns all matches
Example: use find_all() to collect every article tag,
then iterate over them, using find() to pull out each article's title.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Collect every article tag, then grab the first h4 inside each one
articles = soup.find_all('article')
for article in articles:
    print(article.find('h4').text)
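The same find_all()/find() pattern can be tried offline on a hypothetical inline fragment (html.parser is used here so lxml is not required):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two articles
html = """
<article><h4>First Post</h4></article>
<article><h4>Second Post</h4></article>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of every matching tag;
# find() returns only the first match inside each article
titles = [article.find('h4').text for article in soup.find_all('article')]
print(titles)  # ['First Post', 'Second Post']
```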
CSS selectors
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# "article h4" selects every h4 nested inside an article tag
titles = soup.select("article h4")
for title in titles:
    print(title.text)
Getting article links
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Select every <a> inside an <article> and read its href attribute
links = soup.select("article a")
for link in links:
    print(link['href'])
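Selector-based link extraction can also be sketched offline on a hypothetical fragment (again using the built-in html.parser):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two linked articles
html = """
<article><a href="/post/1">First</a></article>
<article><a href="/post/2">Second</a></article>
"""
soup = BeautifulSoup(html, 'html.parser')

# "article a" matches every <a> nested inside an <article>;
# tag['href'] reads the attribute and raises KeyError if it is missing,
# while tag.get('href') would return None instead
links = [a['href'] for a in soup.select('article a')]
print(links)  # ['/post/1', '/post/2']
```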