NoSQL database Redis - CFANZ programming community

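1. Crawler overview

A web crawler, as the name suggests, is an automated program or script that follows URLs and fetches web page data. Put simply, we give the crawler a site's URL, it returns the page's source code, and we then filter out the data we need, for example with regular expressions. That is the whole purpose of a crawler; the so-called anti-crawling and counter-anti-crawling strategies are just the obstacles in this process and the responses to them.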

from urllib.request import urlopen

url="http://www.baidu.com"

response = urlopen(url)

print(response.read().decode("utf-8"))  # what we get back is the page source code
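The filtering step mentioned above can be sketched roughly as follows. This is a minimal illustration, not a robust HTML parser: the sample page, the two patterns, and the helper names `extract_title` and `extract_links` are all made up for the example, and the code works on a hard-coded string so it runs without network access.

```python
import re

# Hypothetical page source standing in for what urlopen(...).read().decode() returns.
SAMPLE_HTML = """<html><head><title>Example page</title></head>
<body><a href="/a.html">First</a> <a href="/b.html">Second</a></body></html>"""

# Non-greedy groups capture only the text we care about.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)
LINK_RE = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)


def extract_title(page):
    """Return the text inside <title>, or None if the page has no title."""
    match = TITLE_RE.search(page)
    return match.group(1).strip() if match else None


def extract_links(page):
    """Return a list of (href, link text) tuples found in the page."""
    return LINK_RE.findall(page)


print(extract_title(SAMPLE_HTML))
print(extract_links(SAMPLE_HTML))
```

Regular expressions are fine for small, predictable snippets like this; for irregular real-world pages a dedicated HTML parser is usually less fragile.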

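2. The web request process

The client sends a request to the server. The server receives it, checks whether it should be allowed, and if so assembles the HTML and returns it to the client. The client's browser then renders the HTML file into the page we see.

In practice, sites often split this work up: the HTML file and the data are returned to the client separately, and the browser fills the data into the page. This spreads the load on the server, so that it is less likely to go down when many people visit at once. For a crawler this means the data we want may not be in the page source at all, but in a separate request.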
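To make the "HTML and data returned separately" idea concrete, here is a small offline sketch. The skeleton, the JSON payload, and the `render` helper are all hypothetical; they only imitate what a browser does when it receives the page frame first and the data in a second response.

```python
import json
from string import Template

# Hypothetical first response: a page skeleton with no data in it.
SKELETON = Template("<html><body><ul>$items</ul></body></html>")

# Hypothetical second response: the data, delivered as JSON.
DATA_RESPONSE = '{"movies": ["A", "B"]}'


def render(skeleton, raw_json):
    """Combine the skeleton and the JSON data, roughly as the browser would."""
    data = json.loads(raw_json)
    items = "".join(f"<li>{name}</li>" for name in data["movies"])
    return skeleton.substitute(items=items)


print(render(SKELETON, DATA_RESPONSE))
```

Note that the skeleton alone contains none of the movie names, which is why fetching only the page URL can come back without the data a crawler is looking for.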

3. The HTTP protocol

HTTP (Hyper Text Transfer Protocol) is the transfer protocol used to deliver hypertext from World Wide Web servers to the local browser.
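An HTTP/1.1 message is plain text: a start line, then header lines, then a blank line, then the body. The sketch below builds a simple GET request and parses a hand-written response to show that layout. The helper names `build_request` and `parse_response` and the sample response are assumptions for illustration; real clients such as requests handle many more cases.

```python
# Hypothetical raw response text, written by hand for the example.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 14\r\n"
    "\r\n"
    "<h1>hello</h1>"
)


def build_request(host, path="/"):
    """Return the text of a minimal GET request: request line, headers, blank line."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def parse_response(raw):
    """Split a raw response into (status code, reason, headers dict, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


print(build_request("www.baidu.com"))
print(parse_response(RAW_RESPONSE))
```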

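4. The Requests module

Install the requests module (pip install requests), then import it:

import requests

# Fetch the source code of the Baidu home page
url = "http://www.baidu.com"

res = requests.get(url)  # printing res shows <Response [200]>; 200 is the status code, meaning the request succeeded
print(res.content.decode('utf-8'))  # the page source code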



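import requests

content = input("Enter what you want to search for: ")

url = f"https://www.sogou.com/web?query={content}"

# No User-Agent header is sent here, so the site may recognize the script
# and refuse to return the normal results page
response = requests.get(url)

print(response.text)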

import requests

content = input("Enter what you want to search for: ")
# Send a browser User-Agent so the request looks like it comes from a normal browser
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"}

url = f"https://www.sogou.com/web?query={content}"

response = requests.get(url, headers=headers)

print(response.text)
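The snippets above put the search text straight into the URL with an f-string. If the text contains spaces or non-ASCII characters it is safer to encode it first. The following is a minimal sketch using only the standard library; `build_search_url` is a hypothetical helper name, and the base URL is the one used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sogou.com/web"


def build_search_url(query):
    """Return the search URL with the query text percent-encoded."""
    return f"{BASE_URL}?{urlencode({'query': query})}"


print(build_search_url("web crawler"))
print(build_search_url("爬虫"))
```

With requests, passing `params={"query": content}` to `requests.get` performs the same kind of encoding, so the URL does not have to be assembled by hand.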

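5. POST requests

Open Baidu Translate, switch to an English input method, and open the Network panel of the browser's developer tools to watch the XHR (Ajax) requests. As a word is typed, the page sends a POST request to the sug address, with the word carried in the kw field of the form data.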

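import requests
import json

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"}

url = "https://fanyi.baidu.com/sug"

data = {
    "kw": input("Enter a word: ")
}
# Send the word as form data; pass the headers defined above so they are actually used
response = requests.post(url, data=data, headers=headers)
result = json.loads(response.text)  # the response body is JSON text
print(result)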
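What distinguishes the POST request above from the earlier GET requests is that the data travels in the request body instead of the URL. The sketch below shows this with the standard library only and without sending anything over the network; `build_post` is a hypothetical helper, and the word "dog" is just a sample input.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, form):
    """Build (but do not send) a POST request carrying form-encoded data."""
    body = urlencode(form).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_post("https://fanyi.baidu.com/sug", {"kw": "dog"})
print(req.get_method())  # the method urllib would use for this request
print(req.data)          # the encoded form body
```

Here the URL stays the same no matter what word is looked up; only the body changes. This is roughly what `requests.post(url, data=data)` prepares before sending.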