Web Scraping: The urllib Module - CFANZ Programming Community

urllib is Python's built-in HTTP request library, used to fetch web page content. Its main modules are:

urllib.request: request module
urllib.error: exception handling module
urllib.parse: URL parsing module

A simple GET request

import urllib.request
response = urllib.request.urlopen('http://baidu.com')
print(response.read().decode('utf-8'))

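decode() turns bytes into a str; encode() turns a str into bytes. read() returns bytes, which is why the example above decodes the result as UTF-8.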
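As a small illustration of the encode/decode pair, the sketch below round-trips a string through two encodings. The sample text and the variable names are made up for this example; the point is only that the same text gives different bytes under different encodings, so a page has to be decoded with the charset it was actually served in.

```python
text = "你好"  # sample text, chosen only for illustration

utf8_bytes = text.encode("utf-8")   # str -> bytes (3 bytes per character here)
gbk_bytes = text.encode("gbk")      # same text, different byte sequence (2 bytes per character)

print(utf8_bytes)
print(gbk_bytes)
print(utf8_bytes.decode("utf-8") == text)  # bytes -> str with the matching codec

# Decoding with the wrong codec raises an error (or, in other cases, yields garbage).
decode_failed = False
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_failed = True
    print("wrong codec:", exc.reason)
```

If a page comes back as unreadable text, checking the charset in the response headers or in the page's meta tag is usually the first thing to try.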

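The object returned by urlopen provides these methods:

- read(), readline(), readlines(), fileno(), close(): these work the same way as on a file object

- info(): returns an http.client.HTTPMessage object (httplib.HTTPMessage in Python 2) holding the headers sent back by the remote server

- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found. Note that urlopen raises urllib.error.HTTPError for error statuses such as 404, so in that case the code is read from the exception's code attribute

- geturl(): returns the URL that was actually retrieved, which can differ from the requested one after a redirect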
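The methods listed above can be exercised together roughly as follows. This is a minimal sketch: `summarize_response` and `fetch_summary` are hypothetical helper names introduced here, not part of urllib, and the 5-second default timeout is an arbitrary choice.

```python
import urllib.request


def summarize_response(response):
    """Collect the values exposed by the object that urlopen() returns."""
    headers = response.info()          # header object sent back by the server
    first_line = response.readline()   # file-like: one line of the body, as bytes
    rest = response.read()             # file-like: whatever is left of the body
    return {
        "status": response.getcode(),
        "url": response.geturl(),
        "content_type": headers.get("Content-Type"),
        "first_line": first_line,
        "rest": rest,
    }


def fetch_summary(url, timeout=5):
    """Open url and summarize the response; the with-block closes it afterwards."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_response(response)
```

Calling `fetch_summary('http://httpbin.org/get')` should return a dict whose "status" is 200, provided the site is reachable. Using the response in a `with` statement replaces an explicit close() call.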

A simple POST request

import urllib.parse 
import urllib.request 
data = bytes(urllib.parse.urlencode({'hello':'world'}),encoding='utf-8') 
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

The data argument must be bytes (a byte stream), not a str.
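To make the bytes requirement concrete, the sketch below splits the conversion into its two steps and inspects the result without sending anything. The extra form field "q" is a made-up value for illustration; building a Request object does not contact the server.

```python
import urllib.parse
import urllib.request

form = {"hello": "world", "q": "a b"}   # "q" is a hypothetical extra field
query = urllib.parse.urlencode(form)    # str: key=value pairs joined with '&'
data = query.encode("utf-8")            # bytes, as the data argument requires

# Building a Request does not touch the network; a non-None data makes it a POST.
request = urllib.request.Request("http://httpbin.org/post", data=data)

print(query)
print(type(data).__name__)
print(request.get_method())
```

Note that urlencode also escapes special characters, so the space in "a b" is sent as "+".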

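from urllib import parse
# url = 'xxx'
# quote() percent-encodes text for use in a URL; here the text is encoded as GBK first
a = parse.quote('文字', encoding='gbk')  # '%CE%C4%D7%D6'
# url = url + a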
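The effect of the encoding argument can be seen by quoting the same text twice and decoding it again. This is a sketch only: the example.com host and the "wd" parameter are placeholders, not something the original snippet specifies.

```python
from urllib import parse

text = "文字"

utf8_quoted = parse.quote(text)                  # UTF-8 is the default encoding
gbk_quoted = parse.quote(text, encoding="gbk")   # same text, GBK bytes

# unquote() needs the same encoding to recover the original text.
utf8_back = parse.unquote(utf8_quoted)
gbk_back = parse.unquote(gbk_quoted, encoding="gbk")

# Hypothetical search URL; the host and the "wd" parameter are placeholders.
url = "http://example.com/s?wd=" + gbk_quoted

print(utf8_quoted)
print(gbk_quoted)
print(url)
```

Which encoding to use depends on what the target site expects; older Chinese sites often expect GBK, while most current sites expect UTF-8.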

Timeout handling

import urllib.request 
response = urllib.request.urlopen('http://httpbin.org/get',timeout=1) 
print(response.read())

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check the cause of the error
        print('time out!')
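The check above can be wrapped in a reusable helper. The sketch below is one possible shape, with `is_timeout` and `fetch` as hypothetical names. It also catches a bare socket.timeout, because a timeout that happens while the response is being read may not be wrapped in a URLError.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc is a timeout, raised directly or wrapped in URLError."""
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


def fetch(url, timeout=1.0):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            print('time out!')
            return None
        raise  # some other failure, e.g. a DNS error or an HTTP error status
```

With this in place, `fetch('http://httpbin.org/get', timeout=0.01)` would be expected to print the message and return None, while other errors still propagate to the caller.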

Request object

Exception handling

URL handling