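Web Scraping Basics: httpx

Why use httpx

The requests module does not support HTTP/2. When a site only accepts HTTP/2 connections, requests cannot talk to it, and a client with HTTP/2 support such as httpx is needed.

# Accessing an HTTP/2-only site with requests raises an error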
import requests

url = 'https://spa16.scrape.center/'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}

resp = requests.get(url=url, headers=headers)
print(resp.text)
"""
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))
"""

The HTTP version a site uses can be checked in a packet-capture tool, for example in the Network panel of the browser's developer tools.

Installation

pip install "httpx[http2]"  # the [http2] extra is required, otherwise httpx cannot use HTTP/2

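Features and usage

Most of the httpx API is the same as in requests. The sections that follow cover the differences and the httpx-specific usage.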

Enabling HTTP/2 support

# By default, httpx does not enable HTTP/2
# To enable it, create the client like this
import httpx

url = 'https://spa16.scrape.center/'

# Set the http2 parameter to True
# The Client object here plays a role similar to the Session object in requests
with httpx.Client(http2=True) as client:
    resp = client.get(url)
    print(resp.text)

Checking the HTTP version

import httpx

url = 'https://spa16.scrape.center/'

with httpx.Client(http2=True) as client:
    resp = client.get(url)
    # The http_version attribute of the response holds the protocol version that was used
    print(resp.http_version)  # HTTP/2

Async support

import httpx
import asyncio


async def scrape_main(url):
    # Use AsyncClient for asynchronous requests
    async with httpx.AsyncClient(http2=True) as client:
        resp = await client.get(url)
        print(resp.text)

if __name__ == '__main__':
    main_url = 'https://spa16.scrape.center/'
    asyncio.run(scrape_main(main_url))
