Scraping News Titles, Dates, and Links from a School Website with a Python Crawler

Tools used

PyCharm 2018.2.3
Anaconda Navigator
Python 3.6
Chrome extension: chrome_Xpath_v2.0.2
Weiyun download link: https://share.weiyun.com/5iE161Y

Preparation

1. Anaconda Navigator

Open Anaconda Navigator and create a Python 3.6 environment.

Install the required Python libraries in Anaconda Navigator.
The script needs gevent, xlwt, lxml (which provides etree), and requests.
Install them as follows.

Installation succeeded.

To install from the command line instead,
choose Open Terminal.
Install a module with: pip3 install <module-name>
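After installation, a quick way to confirm that the environment has everything the script needs is to check each module from Python. The sketch below is an illustration only: the helper name `missing_modules` is made up here, and it relies on the standard library's `importlib.util.find_spec`, which looks a module up without importing it. Note that `etree` is imported from the `lxml` package, so `lxml` is the name to check (and to install with pip).

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Top-level modules imported by the crawler script.
    required = ["gevent", "xlwt", "lxml", "requests"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Run it inside the Python 3.6 environment created above; any name it reports can then be installed with pip3.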


2. Install chrome_Xpath_v2.0.2

Open Chrome and click "Customize and control Google Chrome".
Select Extensions.

Installation complete.

3. Use chrome_Xpath_v2.0.2

Press F12 to open the Chrome developer tools.

Choose: Copy XPath
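A path copied this way is absolute, starting at /html, and it tends to break when the page layout shifts. In code it is often easier to select each repeated element once and then use short relative paths from it. The sketch below illustrates that idea on a made-up, well-formed fragment; the markup, the helper name `extract_items`, and the sample values are assumptions, not the real page. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed input; for real HTML, lxml's etree.HTML accepts comparable relative paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a news list; not copied from the real site.
SAMPLE = """
<ul>
  <li><a href="../../info/1001/2001.htm">
    <div><i>27</i><em>2019-09</em></div>
    <div><div><h3>First headline</h3></div></div>
  </a></li>
  <li><a href="../../info/1001/2002.htm">
    <div><i>26</i><em>2019-09</em></div>
    <div><div><h3>Second headline</h3></div></div>
  </a></li>
</ul>
"""


def extract_items(markup):
    """Return (title, month, day, href) tuples, one per <li><a> entry."""
    root = ET.fromstring(markup.strip())
    items = []
    for a in root.findall("./li/a"):  # one <a> per news entry
        title = a.find("div[2]/div/h3").text  # paths are relative to the <a>
        day = a.find("div[1]/i").text
        month = a.find("div[1]/em").text
        items.append((title, month, day, a.get("href")))
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item)
```

Anchoring on the link element means the title, the date, and the href are all reached by going down from one node, with no need to climb back up with `..`.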

Environment setup

Open PyCharm.

Code example

# -*- coding: utf-8 -*-
# @Author: VVcat
# @Time: 2019/9/27 18:54
# @File: Main.py
# @IDE: PyCharm
# @Email: 
# @Version: 1.0

import gevent
import requests
import xlwt
from lxml import etree
from urllib.parse import urljoin


def school():
    xls = xlwt.Workbook(encoding='utf-8')  # Create a workbook; the argument is its text encoding

    # Create the sheet.
    # cell_overwrite_ok=True allows a cell to be written more than once; only the last write is kept
    sheet = xls.add_sheet("school", cell_overwrite_ok=True)
    row = 0

    for index in range(1, 330):  # The news list has 329 pages
        if index == 1:
            page_url = "http://www.zjitc.net/xwzx/xyxw.htm"  # URL of the first page
        else:
            page_url = "http://www.zjitc.net/xwzx/xyxw/" + str(index - 1) + ".htm"  # URLs of the later pages
        req = requests.get(page_url, timeout=10)  # Fetch the page HTML; give up after 10 seconds
        resp = req.content.decode("utf-8")  # Decode the response body as UTF-8
        html = etree.HTML(resp)  # Parse into an element tree that supports XPath; lxml repairs malformed HTML
        uls = html.xpath("/html/body/div[3]/div[8]/div[2]/div/ul/li/a/div[2]/div")  # Locate every news entry; xpath() returns a list
        for ul in uls:  # Iterate over the entries
            title = ul.xpath("h3")  # Title element
            href = str(title[0].xpath("../../../@href")[0])  # Relative link on the enclosing <a>
            link = urljoin(page_url, href)  # Resolve the relative link against the page URL
            day = ul.xpath("../../div[1]/i")  # Day of the month
            month = ul.xpath("../../div[1]/em")  # Month
            sheet.write(row, 0, title[0].text)  # Write the title
            sheet.write(row, 1, month[0].text + day[0].text + "日")  # Write the date; "日" marks the day
            sheet.write(row, 2, link)  # Write the link
            row += 1
        xls.save("school.xls")  # Save the workbook to school.xls (rewritten after every page)

if __name__ == '__main__':
    job = gevent.spawn(school)  # Run school() in a gevent greenlet (coroutine)
    job.join()  # Wait for the greenlet to finish
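
The URL handling in the script can be isolated into two small helpers, which makes it easy to check without touching the network. This is a sketch under assumptions: the function names `list_page_url` and `absolute_link` are made up for illustration, the page numbering simply mirrors the loop above (page 1 at xyxw.htm, page n at xyxw/(n-1).htm), and the sample href is hypothetical. `urljoin` from the standard library resolves a relative href against the page it was found on, so it works however many `../` segments the link contains.

```python
from urllib.parse import urljoin

BASE = "http://www.zjitc.net/xwzx/xyxw"


def list_page_url(index):
    """URL of list page `index` (1-based), following the script's numbering."""
    if index == 1:
        return BASE + ".htm"
    return BASE + "/" + str(index - 1) + ".htm"


def absolute_link(page_url, href):
    """Resolve a possibly relative href against the page it appeared on."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = list_page_url(2)
    print(page)
    print(absolute_link(page, "../../info/1001/2001.htm"))  # hypothetical href
```

Resolving against the page URL avoids depending on a fixed "../.." prefix, which may differ between the first list page and the later ones.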