Parsing web pages with BeautifulSoup
Extracting data
Traversing the document tree
Get the title of the first article
soup.article.a.div.h4.text
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Dotted access descends to the first matching tag at each level
print(soup.article.a.div.h4.text)
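Since the example above needs a live page, here is a minimal, self-contained sketch of the same dotted traversal on an inline HTML fragment. The markup is hypothetical, and the built-in html.parser is used so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the page structure the notes assume
html = """
<article><a href="/post/1"><div><h4>First Post</h4></div></a></article>
<article><a href="/post/2"><div><h4>Second Post</h4></div></a></article>
"""

soup = BeautifulSoup(html, 'html.parser')
# Dotted access walks to the FIRST matching tag at each level,
# so only the first article's title is returned
print(soup.article.a.div.h4.text)  # First Post
```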
Searching the document tree
Common methods:
find() — returns the first match
find_all() — returns all matches
Example: use find_all() to collect every article tag,
then iterate over them, using find() to pull out each article's title.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Collect every article tag, then grab the first h4 inside each one
articles = soup.find_all('article')
for article in articles:
    print(article.find('h4').text)
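The same find_all()/find() pattern can be tried offline on a hypothetical inline fragment (html.parser is used here so lxml is not required):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two articles
html = """
<article><h4>First Post</h4></article>
<article><h4>Second Post</h4></article>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of every matching tag;
# find() returns only the first match inside each article
titles = [article.find('h4').text for article in soup.find_all('article')]
print(titles)  # ['First Post', 'Second Post']
```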
CSS selectors
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# "article h4" selects every h4 nested inside an article tag
titles = soup.select("article h4")
for title in titles:
    print(title.text)
Getting article links
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
# The target URL was missing from the original notes; fill in the page to scrape
url = '...'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Select every <a> inside an <article> and read its href attribute
links = soup.select("article a")
for link in links:
    print(link['href'])
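Selector-based link extraction can also be sketched offline on a hypothetical fragment (again using the built-in html.parser):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two linked articles
html = """
<article><a href="/post/1">First</a></article>
<article><a href="/post/2">Second</a></article>
"""
soup = BeautifulSoup(html, 'html.parser')

# "article a" matches every <a> nested inside an <article>;
# tag['href'] reads the attribute and raises KeyError if it is missing,
# while tag.get('href') would return None instead
links = [a['href'] for a in soup.select('article a')]
print(links)  # ['/post/1', '/post/2']
```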