Zhengye Technology: Jurisdiction and Indicator Management System Strengthens Decision Support - CFANZ编程社区

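I. Scenario

We often come across websites where information has to be queried by province, city, and district.

1. Province and city lookup

On sites like this, there are so many provinces and cities nationwide that querying them all by hand is a lot of work.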

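2. Interface signatures

When we try to query the back-end interface directly from Python, we sometimes find that the requests are signed.

And we do not know the signature algorithm.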
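To make the obstacle concrete, here is a minimal sketch of how a signed interface might behave. Everything in it is an assumption for illustration: the article does not say which algorithm the site uses, so HMAC-SHA256 over the sorted query string, the SECRET_KEY value, and the parameter names and IDs are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical secret: on a real site it is hidden from the crawler.
SECRET_KEY = b"unknown-to-the-crawler"


def sign_params(params: dict, secret: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 signature over the sorted query string."""
    query = urlencode(sorted(params.items()))
    return hmac.new(secret, query.encode("utf-8"), hashlib.sha256).hexdigest()


def server_accepts(params: dict, signature: str) -> bool:
    """Mimic the server-side check: recompute the signature and compare."""
    return hmac.compare_digest(sign_params(params), signature)


params = {"province": "14", "city": "1401"}  # made-up IDs
good = sign_params(params)
forged = sign_params(params, secret=b"wrong-guess")

print(server_accepts(params, good))    # True
print(server_accepts(params, forged))  # False
```

Without the secret and the exact signing rule, a hand-built request is rejected, which is why driving the real front end is attractive.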

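3. Selenium automated crawler

This is where an automated crawler built with Python and Selenium comes in.

It analyzes the front-end page elements and simulates real user clicks to request the interface data.

It then extracts the response data by analyzing the page's DOM elements.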

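II. Environment

Python: 3.12.5
Edge browser driver: from the official Edge browser driver site
Selenium Python package
Charles packet-capture software (why it is needed is explained below); Charles installation is covered in a separate blog post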

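III. Steps

1. Download the Edge browser driver

First, go to the official Edge browser driver site.

Choose the beta or stable channel, matching the version of Edge installed on your machine, and download the 64-bit or 32-bit build that fits your operating system.

Unzip it somewhere on your computer for later use.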

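2. Install Selenium

pip install selenium

3. Install the Edge-Selenium tools

This package is only needed with Selenium 3. Selenium 4, whose Service API the code below uses, supports Edge out of the box, so this step can be skipped there.

pip install msedge-selenium-tools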
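As a quick sanity check after installation, the following sketch reports which Selenium major version is installed and whether the extra Edge tools package is relevant. The helper name selenium_major_version is made up for this example; it relies only on the standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def selenium_major_version():
    """Return the installed Selenium major version, or None if it is missing."""
    try:
        return int(version("selenium").split(".")[0])
    except PackageNotFoundError:
        return None


major = selenium_major_version()
if major is None:
    print("Selenium is not installed; run: pip install selenium")
elif major >= 4:
    print("Selenium 4+ found; msedge-selenium-tools is not needed")
else:
    print("Selenium 3 found; Edge (Chromium) needs msedge-selenium-tools")
```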

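4. Analyze the front-end page with F12

The province data can be found in the page. At this point, the city data is not yet shown in the interface.

However, browsing the site's JS resources turns up a file named area.js, which holds the base data for every region in the country.

Further analysis shows that a city's parentid is the id of its province.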

5. Import the area.js data into Excel

Import the area.js data into Excel by converting the JSON records into rows and columns.
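One way to do that conversion is to turn the array in area.js into CSV, which Excel opens directly. This is only a sketch: the article does not show the file, so the variable name area, the field names id, name, and parentid, and the sample rows are assumptions, and the parsing assumes the array literal is valid JSON.

```python
import csv
import io
import json

# Made-up stand-in for area.js; the real variable name and fields may differ.
AREA_JS = (
    'var area = [{"id": 14, "name": "Fujian", "parentid": 0}, '
    '{"id": 1401, "name": "Fuzhou", "parentid": 14}];'
)


def parse_area_js(js_text: str) -> list:
    """Extract the array literal between the first '[' and the last ']'."""
    start = js_text.index("[")
    end = js_text.rindex("]") + 1
    return json.loads(js_text[start:end])


def to_csv(rows: list, fields=("id", "name", "parentid")) -> str:
    """Render the records as CSV text, one record per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_area_js(AREA_JS)
print(to_csv(rows))
```

When saving the CSV to disk for Excel, opening the file with encoding="utf-8-sig" helps Excel recognize Chinese region names as UTF-8.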

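Filtering the data in Excel further confirms our guess.

parentid is also the value of the province drop-down control.

Filtering on Fujian's province value, 14, returns Fujian's 9 cities.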
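The same filter can be expressed in a few lines of Python. The rows below are made-up samples in the assumed {id, name, parentid} shape; only the province value 14 for Fujian comes from the article, while the city IDs and the second province are invented.

```python
# Made-up sample rows; the real area.js fields and IDs may differ.
AREAS = [
    {"id": 14, "name": "Fujian", "parentid": 0},
    {"id": 1401, "name": "Fuzhou", "parentid": 14},
    {"id": 1402, "name": "Xiamen", "parentid": 14},
    {"id": 15, "name": "Jiangxi", "parentid": 0},
    {"id": 1501, "name": "Nanchang", "parentid": 15},
]


def cities_of(province_id, areas):
    """Return the rows whose parentid equals the given province id."""
    return [area for area in areas if area["parentid"] == province_id]


print([city["name"] for city in cities_of(14, AREAS)])  # ['Fuzhou', 'Xiamen']
```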

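6. Initial approach

At this point we have an initial plan for automating the requests with Selenium.

Use Selenium to iterate over the province control, then select each city in the cascading city control, then click the search control,

and finally extract the response data from the DOM.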

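7. Data preparation

On your computer, create a city folder containing one file per province, named after the province ID.

Each file holds the city IDs of that province, one city ID per line.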
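The folder can also be generated by a script instead of by hand. The sketch below assumes a province-to-cities mapping is already available (for example from area.js); the helper names and the sample IDs are hypothetical, and the .txt extension matches what the crawler code later reads.

```python
import os
import tempfile


def write_city_files(cities_by_province: dict, folder: str) -> None:
    """Write one <province ID>.txt file per province, one city ID per line."""
    os.makedirs(folder, exist_ok=True)
    for province_id, city_ids in cities_by_province.items():
        path = os.path.join(folder, f"{province_id}.txt")
        with open(path, "w", encoding="utf-8") as city_file:
            city_file.write("\n".join(str(c) for c in city_ids) + "\n")


def read_city_file(folder: str, province_id) -> list:
    """Read the city IDs back, skipping blank lines."""
    path = os.path.join(folder, f"{province_id}.txt")
    with open(path, "r", encoding="utf-8") as city_file:
        return [line.strip() for line in city_file if line.strip()]


# Demo in a temporary directory with made-up city IDs.
demo_folder = os.path.join(tempfile.mkdtemp(), "city")
write_city_files({14: [1401, 1402]}, demo_folder)
print(read_city_file(demo_folder, 14))  # ['1401', '1402']
```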

8. First version of the Selenium code

With the data prepared as above, we can write a first version of the crawler:

import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service
from selenium.webdriver.support.ui import Select

# Location of the browser driver
service = Service(executable_path=r"G:\msedgedriver.exe")
driver = webdriver.Edge(service=service)
driver.get("your page URL")

# Pause 3 seconds to let the page finish loading
time.sleep(3)

# Find the province drop-down and wrap it in a Select object
province_element = driver.find_element(By.ID, 'province')
province_select = Select(province_element)
# Find the city drop-down and wrap it in a Select object
city_element = driver.find_element(By.ID, 'city')
city_select = Select(city_element)

# Input folder: one <province ID>.txt file per province
input_path = r"C:\Users\Administrator\Desktop\py\city"
# Output file
output_path = r"C:\Users\Administrator\Desktop\py\output.txt"

for province in range(1, 35):
    print('-----------Province [' + str(province) + '] start')
    # Select the province
    province_select.select_by_value(str(province))

    # City ID file for this province
    file_path = os.path.join(input_path, str(province) + '.txt')
    with open(file_path, 'r', encoding='utf-8') as input_file:
        for line in input_file:
            city = line.strip()
            if not city:
                continue  # skip blank lines
            print('---------------City [' + city + '] start')
            # Select the city
            city_select.select_by_value(city)
            # Find the search button and click it
            submit_element = driver.find_element(By.ID, 'submit')
            submit_element.click()
            # Pause 2 seconds to let the results load
            time.sleep(2)
            # Get every list item on the page
            li_elements = driver.find_elements(By.TAG_NAME, 'li')
            # Append the text of the matching list items to the output file
            with open(output_path, 'a', encoding='utf-8') as output_file:
                for li in li_elements:
                    # A page may contain more than one list, so keep only
                    # the items that carry the data-index attribute
                    if li.get_attribute("data-index") is not None:
                        output_file.write(li.text + '\n')

            print('---------------City [' + city + '] end')
    print('-----------Province [' + str(province) + '] end')

# Close the driver
driver.quit()
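The fixed time.sleep pauses in the code above either waste time or cut off slow pages. A common alternative is to poll until the results are actually present. The helper below is a generic, standard-library sketch of that idea, not part of the original crawler; wait_until and results_loaded are names invented for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page whose results only appear on the third check.
attempts = {"count": 0}


def results_loaded():
    attempts["count"] += 1
    return ["row"] if attempts["count"] >= 3 else []


print(wait_until(results_loaded, timeout=2, poll=0.01))  # ['row']
```

In the crawler, the condition could be something like lambda: driver.find_elements(By.CSS_SELECTOR, "li[data-index]"). Selenium also ships its own version of this pattern as WebDriverWait(driver, 10).until(...).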

Because the scraped data raises information security concerns, it is not shown here.

However, scraping data with the first-version code above has one problem.

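9. Selenium + Charles approach

In this setup the crawler code becomes simpler: Selenium only has to click through the controls, leaving the responses to be captured by Charles.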

import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service
from selenium.webdriver.support.ui import Select

# Location of the browser driver
service = Service(executable_path=r"G:\msedgedriver.exe")
driver = webdriver.Edge(service=service)
driver.get("your page URL")

# Pause 3 seconds to let the page finish loading
time.sleep(3)

# Find the province drop-down and wrap it in a Select object
province_element = driver.find_element(By.ID, 'province')
province_select = Select(province_element)
# Find the city drop-down and wrap it in a Select object
city_element = driver.find_element(By.ID, 'city')
city_select = Select(city_element)

# Input folder: one <province ID>.txt file per province
input_path = r"C:\Users\Administrator\Desktop\py\city"

for province in range(1, 35):
    print('-----------Province [' + str(province) + '] start')
    # Select the province
    province_select.select_by_value(str(province))

    # City ID file for this province
    file_path = os.path.join(input_path, str(province) + '.txt')
    with open(file_path, 'r', encoding='utf-8') as input_file:
        for line in input_file:
            city = line.strip()
            if not city:
                continue  # skip blank lines
            print('---------------City [' + city + '] start')
            # Select the city
            city_select.select_by_value(city)
            # Find the search button and click it
            submit_element = driver.find_element(By.ID, 'submit')
            submit_element.click()
            # Pause 2 seconds to let the request complete
            time.sleep(2)
            print('---------------City [' + city + '] end')
    print('-----------Province [' + str(province) + '] end')

# Close the driver
driver.quit()
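For Charles to see the requests, the browser's traffic has to pass through it. The article does not say how this was set up; Charles can register itself as the system proxy, in which case nothing more is needed. If it does not, one option is to point Edge at Charles explicitly. The sketch below assumes Charles is listening on its default address 127.0.0.1:8888, and the helper name charles_proxy_args is made up.

```python
def charles_proxy_args(host="127.0.0.1", port=8888):
    """Build the Chromium command-line switch that routes traffic via a proxy."""
    return [f"--proxy-server=http://{host}:{port}"]


print(charles_proxy_args())  # ['--proxy-server=http://127.0.0.1:8888']

# Possible use with Selenium 4 (not run here, since it needs a browser):
#
#   options = webdriver.EdgeOptions()
#   for arg in charles_proxy_args():
#       options.add_argument(arg)
#   driver = webdriver.Edge(service=service, options=options)
```

For HTTPS sites, Charles can only show the response bodies once SSL proxying is enabled and its root certificate is trusted on the machine.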