[Automation] [PyChromeDevTools in Practice] 01 - Scraping all files from a web page already open in Chrome - CFANZ Programming Community


Introduction

[Image]

Development environment

Item	Version	Description
Operating system	Win10-1607
Google Chrome	96.0.4664.110 (Official Build) (64-bit) (cohort: 97_Win_99)
Python (venv)	Python 3.8.6 (virtualenv)
PyChromeDevTools	0.4

Core principles

Preparation

Press Win+R to open the Run dialog.
Run the command "C:\Program Files\Google\Chrome\Application\chrome.exe" "https://www.baidu.com" --remote-debugging-port=9992 --headless
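The same launch can also be scripted. The following is a minimal sketch, assuming the Chrome install path from the command above; build_chrome_command and launch_chrome are hypothetical helper names, not part of PyChromeDevTools.

```python
import subprocess

# Assumed install location, copied from the Run-dialog command above
CHROME_PATH = r"C:\Program Files\Google\Chrome\Application\chrome.exe"


def build_chrome_command(url, port=9992, headless=True, chrome_path=CHROME_PATH):
    """Build the argument list equivalent to the Run-dialog command above."""
    cmd = [chrome_path, url, "--remote-debugging-port={}".format(port)]
    if headless:
        cmd.append("--headless")
    return cmd


def launch_chrome(url, port=9992):
    """Start Chrome in the background and return the Popen handle."""
    return subprocess.Popen(build_chrome_command(url, port))
```

Calling launch_chrome("https://www.baidu.com") would start the headless browser on port 9992; keeping the argument list in its own function makes it easy to check without actually starting Chrome.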

For how this works under the hood, see the article "PyChromeDevTools source code analysis".

Open http://localhost:9992/ in a browser and click the Baidu page entry to view what the headless browser is rendering.
[Image]
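The listing at http://localhost:9992/ is backed by a JSON endpoint, http://localhost:9992/json, which describes the debuggable targets. A minimal sketch for reading it from Python is shown below; parse_page_targets and list_page_targets are hypothetical helpers, and only the standard library is used.

```python
import json
import urllib.request


def parse_page_targets(json_text):
    """Return (title, url, webSocketDebuggerUrl) for every target of type 'page'."""
    targets = json.loads(json_text)
    return [
        (t.get("title"), t.get("url"), t.get("webSocketDebuggerUrl"))
        for t in targets
        if t.get("type") == "page"
    ]


def list_page_targets(port=9992, host="localhost"):
    """Fetch the target list from a Chrome started with --remote-debugging-port."""
    endpoint = "http://{}:{}/json".format(host, port)
    with urllib.request.urlopen(endpoint, timeout=5) as resp:
        return parse_page_targets(resp.read().decode("utf-8"))
```

With the headless Chrome from the preparation step running, list_page_targets() should return a single entry for the Baidu page.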

chrome.Page.getResourceTree()

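This command returns the resource tree of the page on the current connection, that is, its frames and the resources each frame has loaded. (In this example only the Baidu page is open, and PyChromeDevTools.ChromeInterface(port=9992) connects to it by default.) The response has the following structure:
[Image]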
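A sketch of the general shape of this response, as described by the DevTools protocol (a frameTree holding a frame, its resources, and optional childFrames), together with one way to walk it. The sample values are invented for illustration and iter_resources is a hypothetical helper; unlike the script later in this article, it also descends into child frames.

```python
# Invented sample data in the shape of a Page.getResourceTree response
SAMPLE_RESPONSE = {
    "id": 2,
    "result": {
        "frameTree": {
            "frame": {"id": "FRAME-1", "url": "https://www.baidu.com/"},
            "resources": [
                {"url": "https://www.baidu.com/a.js", "type": "Script",
                 "mimeType": "application/javascript"},
                {"url": "https://www.baidu.com/logo.png", "type": "Image",
                 "mimeType": "image/png"},
            ],
            "childFrames": [
                {
                    "frame": {"id": "FRAME-2", "url": "https://www.baidu.com/inner.html"},
                    "resources": [
                        {"url": "https://www.baidu.com/inner.css", "type": "Stylesheet",
                         "mimeType": "text/css"},
                    ],
                }
            ],
        }
    },
}


def iter_resources(frame_tree):
    """Yield (frame_id, resource) pairs for a frame and all of its child frames."""
    frame_id = frame_tree["frame"]["id"]
    for resource in frame_tree.get("resources", []):
        yield frame_id, resource
    for child in frame_tree.get("childFrames", []):
        yield from iter_resources(child)


pairs = [(fid, r["url"]) for fid, r in iter_resources(SAMPLE_RESPONSE["result"]["frameTree"])]
```

Each pair carries the frame id that owns the resource, which is the frameId that Page.getResourceContent expects for that url.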

chrome.Page.getResourceContent()

Parsing the getResourceTree response gives the list of resources; the next step is to fetch the content of each one.

The Page.getResourceContent command takes two parameters:

frameId: the frame id marked in the figure above
url: the url of each entry under resources in the figure above

A sample response is shown below. Whether content has to be base64-decoded depends on the base64Encoded flag:
[Image]
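The decoding rule can be sketched as a small function. This is a minimal illustration that assumes the result object carries the content and base64Encoded fields described above; decode_resource_content is a hypothetical name.

```python
import base64


def decode_resource_content(result):
    """Return the resource body as bytes, decoding base64 only when flagged."""
    content = result.get("content", "")
    if result.get("base64Encoded"):
        # Binary resources such as images are transported as base64 text
        return base64.b64decode(content)
    # Text resources such as scripts and stylesheets arrive as plain strings
    return content.encode("utf-8")


# Usage with two made-up results: one text resource, one binary resource
text_result = {"content": "body{}", "base64Encoded": False}
binary_result = {"content": base64.b64encode(b"\x89PNG").decode("ascii"),
                 "base64Encoded": True}
text_bytes = decode_resource_content(text_result)
binary_bytes = decode_resource_content(binary_result)
```

Returning bytes in both cases lets the caller write every resource to disk in binary mode.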

chrome.Page.enable()

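While writing the script, calling Page.getResourceContent produced the error {"error":{"code":-32000,"message":"Agent is not enabled."},"id":1}. The term "Agent" is somewhat misleading. The protocol documentation explains that a domain must be enabled before it can be debugged, so chrome.Page.enable() has to be called before this command.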
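To make the failure easier to spot, the raw message can be checked for this specific error. The sketch below parses the error quoted above; is_domain_not_enabled is a hypothetical helper, and the call order in the trailing comments is left as comments because it needs a running Chrome.

```python
import json

# The error message quoted in the paragraph above
RAW_ERROR = '{"error":{"code":-32000,"message":"Agent is not enabled."},"id":1}'


def is_domain_not_enabled(raw_message):
    """Return True if a raw protocol message is the 'not enabled' error."""
    message = json.loads(raw_message)
    error = message.get("error") or {}
    return error.get("code") == -32000 and "not enabled" in error.get("message", "")


# Call order that avoids the error:
#   chrome = PyChromeDevTools.ChromeInterface(port=9992)
#   chrome.Page.enable()
#   chrome.Page.getResourceContent(frameId=frame_id, url=url)
```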

Source code

import base64
import os
import urllib.parse  # "import urllib" alone does not guarantee that urllib.parse is loaded

import PyChromeDevTools


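def ez_get_object(j, list_keys):
    """
    Quickly fetch a nested element from an object parsed from JSON.

    j = {
        'a': {
            'b': {
                'b1': {'name': 'xiao'},
                'b2': {'name': '2b'},
            }
        }
    }

    >>> ez_get_object(j, 'a,b,b2')
    {'name': '2b'}

    j: the object parsed from JSON
    list_keys: a comma-separated string of keys

    Returns None if any key along the path is missing.
    """
    ret = j
    for k in list_keys.split(','):
        # Stop early instead of raising AttributeError on a missing level
        if not isinstance(ret, dict):
            return None
        ret = ret.get(k.strip())

    return ret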


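def 下载网页所有资源_unit_content(url, content):
    """Write one resource to disk under a directory named after its host."""
    url_parsed = urllib.parse.urlparse(url)
    # Strip the leading '/' so that os.path.join does not discard the base directory
    rel_path = url_parsed.path.lstrip('/')
    # A URL path that is empty or ends with '/' has no file name; use a default one
    if not rel_path or rel_path.endswith('/'):
        rel_path += 'index.html'
    full_path = os.path.join(r'G:/_TMP/_web_resources', url_parsed.hostname or '_no_host', rel_path)

    # Already downloaded: return immediately
    if os.path.isfile(full_path):
        return

    dirname = os.path.dirname(full_path)
    print('\t\tfull_path', full_path)

    # Create the target directory if it does not exist yet
    os.makedirs(dirname, exist_ok=True)

    with open(full_path, 'wb') as wf:
        return wf.write(content)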


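def 下载网页所有资源_unit(url, _type, unit):
    """Decode one Page.getResourceContent result and save it."""
    # Only scripts, stylesheets and images are saved; other resource types are skipped
    if _type in ['Script', 'Stylesheet', 'Image']:
        # Binary resources arrive base64-encoded; text resources arrive as plain strings
        if unit.get('base64Encoded'):
            content = base64.b64decode(unit.get('content'))
        else:
            content = unit.get('content').encode('utf-8')

        下载网页所有资源_unit_content(url, content)
    else:
        print('\t[Warning] unhandled _type ', _type)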


def 下载网页所有资源():
    """Download all resources of the page that Chrome currently has open."""
    chrome = PyChromeDevTools.ChromeInterface(port=9992)

    # The Page domain must be enabled first, otherwise "Agent is not enabled." is returned
    chrome.Page.enable()
    result, messages = chrome.Page.getResourceTree()
    frame_id = ez_get_object(result, 'result,frameTree,frame,id')
    # Fall back to an empty list if the response carries no resources
    resources = ez_get_object(result, 'result,frameTree,resources') or []

    for resource in resources:
        url = resource.get('url')
        _type = resource.get('type')
        result_, messages_ = chrome.Page.getResourceContent(frameId=frame_id, url=url)
        print('chrome.Page.getResourceContent', _type, url, result_)
        if result_ and result_.get('result'):
            下载网页所有资源_unit(url, _type, result_.get('result'))
        else:
            print('[Warning] chrome.Page.getResourceContent returned no result')


if __name__ == '__main__':
    下载网页所有资源()

References

Cause of the error "Agent is not enabled": https://stackoverflow.com/questions/38693379/how-to-get-webpage-resource-content-via-chrome-remote-debugging
[Automation] PyChromeDevTools source code analysis: https://blog.csdn.net/kinghzking/article/details/122650766