Python: Scraping a Novel (bs4 Method)

大南瓜鸭 · 2022-02-27 · 99 reads

1. Source Code

import requests
from bs4 import BeautifulSoup

# Spoof the User-Agent so the request looks like a normal browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'
}
url = 'enter your own URL here (omitted for copyright reasons)'
page_text = requests.get(url=url, headers=headers).content

# 1. Instantiate a BeautifulSoup object and load the page source into it
soup = BeautifulSoup(page_text, 'lxml')

# 2. Use the BeautifulSoup object's attributes and methods to locate tags and extract data
# Parse the chapter titles and the detail-page URLs
li_list = soup.select('.sidamingzhu-list-mulu>ul>li')

fp = open('./sanguo.txt', 'w', encoding='utf-8')
for li in li_list:
    # Chapter title
    title = li.a.string
    # Chapter URL
    detail_url = li.a['href']
    # Request the detail page and parse the chapter content
    detail_page_text = requests.get(url=detail_url, headers=headers).content
    detail_soup = BeautifulSoup(detail_page_text, 'lxml')
    div_tag = detail_soup.find('div', class_='grap')
    # Extract the chapter text
    content = div_tag.text
    fp.write(title + ':' + content)
    print(title, 'scraped successfully!')
fp.close()
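The source above opens `sanguo.txt` with a bare `open()` call, which relies on remembering to close the file by hand. A `with` block is the more idiomatic pattern: it flushes and closes the file automatically, even if a request fails mid-loop. A minimal sketch, using placeholder chapter data instead of real HTTP fetches:

```python
# Placeholder data standing in for (title, content) pairs the real loop
# would fetch over HTTP -- these strings are illustrative only.
chapters = [('Chapter 1', 'text one'), ('Chapter 2', 'text two')]

with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
    for title, content in chapters:
        fp.write(title + ':' + content + '\n')
# fp is guaranteed to be closed here, even if an exception
# had been raised inside the loop
```

This also avoids silently losing the last chapters in an unflushed buffer if the script exits early.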

2. Example Screenshot

(screenshot of the scraper's output omitted)

3. The bs4 Data-Parsing Process

01. Chapter Titles

Parsing the data in code:

# Parse the chapter titles and the detail-page URLs
li_list = soup.select('.sidamingzhu-list-mulu>ul>li')


# Pull each chapter's title and URL out of the <li> tags
for li in li_list:
    # Chapter title
    title = li.a.string
    # Chapter URL
    detail_url = li.a['href']
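To see what the `select` call returns without hitting the network, here is a self-contained snippet run against an inline HTML string that mimics the catalogue page's structure. The `sidamingzhu-list-mulu` class name is taken from the selector in the article; the chapter names and hrefs are made up, and `html.parser` stands in for `lxml` so no extra dependency is needed:

```python
from bs4 import BeautifulSoup

# Tiny inline snippet mimicking the catalogue page (illustrative data only)
html = """
<div class="sidamingzhu-list-mulu">
  <ul>
    <li><a href="/chap/1.html">Chapter 1</a></li>
    <li><a href="/chap/2.html">Chapter 2</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# The same CSS selector as the article: class, then child <ul>, then child <li>
li_list = soup.select('.sidamingzhu-list-mulu>ul>li')

titles = [str(li.a.string) for li in li_list]
urls = [li.a['href'] for li in li_list]
print(titles)  # → ['Chapter 1', 'Chapter 2']
print(urls)    # → ['/chap/1.html', '/chap/2.html']
```

Note that `li.a` is shorthand for "the first `<a>` descendant of this `<li>`", and `.string` returns that tag's single text child.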

 

02. Chapter Content

Parsing the data in code:

# Request the detail page
detail_page_text = requests.get(url=detail_url, headers=headers).content

# Instantiate a BeautifulSoup object for the detail page
detail_soup = BeautifulSoup(detail_page_text, 'lxml')

# Locate the tag holding the chapter content
div_tag = detail_soup.find('div', class_='grap')

# Extract the chapter text and write it to the file
content = div_tag.text
fp.write(title + ':' + content)
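The `find` / `.text` step can likewise be tried offline. Below, an inline HTML string stands in for a chapter detail page; the `grap` class name is the one the article targets, while the paragraph text is made up. Unlike `.string`, which only works when a tag has a single text child, `.text` concatenates every descendant string, which is why it suits a `<div>` full of `<p>` tags:

```python
from bs4 import BeautifulSoup

# Inline snippet mimicking a chapter detail page (illustrative data only)
detail_html = '<div class="grap"><p>First paragraph.</p><p>Second paragraph.</p></div>'
detail_soup = BeautifulSoup(detail_html, 'html.parser')

# Locate the content <div> by its class, as in the article
div_tag = detail_soup.find('div', class_='grap')

# .text joins the strings of all descendant tags
content = div_tag.text
print(content)  # → First paragraph.Second paragraph.
```

On the real site the paragraphs may be separated by newlines or need `div_tag.get_text(separator='\n')` for readable output; that depends on the page's markup.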