年迈的代码机器 · 2023-08-22

Contents

1 The urllib Library

2 The Beautiful Soup Library

3 Using Proxies

3.1 Proxy Types: HTTP, HTTPS, and SOCKS5

3.2 Using Proxies with urllib and requests

3.3 Case Study: Building Your Own Proxy Pool

4 Hands-On: Extracting and Analyzing Video Information


1 The urllib Library

Common usage:

1. Sending a GET request:
import urllib.request

url = "https://www.example.com"
response = urllib.request.urlopen(url)
content = response.read().decode("utf-8")
print(content)
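
Some sites reject requests that lack a browser-like User-Agent header. A minimal sketch of the same GET request sent through a urllib.request.Request object with a custom header (the header value here is just an example):

import urllib.request

url = "https://www.example.com"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response = urllib.request.urlopen(req)
content = response.read().decode("utf-8")
print(content)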

2. Sending a POST request:

import urllib.parse
import urllib.request

url = "https://www.example.com"
# Encoding form data and passing it via the data argument makes urlopen() issue a POST
data = urllib.parse.urlencode({"key": "value"}).encode("utf-8")
response = urllib.request.urlopen(url, data=data)
content = response.read().decode("utf-8")
print(content)

3. Practical examples:

Fetching a web page:

import urllib.request

url = "https://www.example.com"
response = urllib.request.urlopen(url)
content = response.read().decode("utf-8")
print(content)

Downloading a file:

import urllib.request

url = "https://www.example.com/sample.pdf"
urllib.request.urlretrieve(url, "sample.pdf")
print("File downloaded.")

Handling exceptions:

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("https://www.nonexistent-website.com")
except urllib.error.URLError as e:
    print("Error:", e)

Parsing URLs:

import urllib.parse

url = "https://www.example.com/page?param1=value1&param2=value2"
parsed_url = urllib.parse.urlparse(url)
print(parsed_url.scheme)  # scheme (protocol)
print(parsed_url.netloc)  # network location (domain)
print(parsed_url.query)   # query string
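
If you also need the query parameters as a dictionary, urllib.parse.parse_qs splits the query string for you; a small sketch based on the same URL:

import urllib.parse

url = "https://www.example.com/page?param1=value1&param2=value2"
query = urllib.parse.urlparse(url).query
params = urllib.parse.parse_qs(query)
print(params)  # {'param1': ['value1'], 'param2': ['value2']}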

4. Handlers and custom Openers:

Handlers let you customize how requests are processed to suit specific needs. The urllib.request module provides several default handlers, such as HTTPHandler and HTTPSHandler, for handling HTTP and HTTPS requests. You can also build a custom Opener that combines different handlers for more flexible request configuration.

Custom Opener example:

import urllib.request

# Build a custom opener (here with just the HTTPS handler)
opener = urllib.request.build_opener(urllib.request.HTTPSHandler())

# Send the request through the custom opener
response = opener.open("https://www.example.com")
content = response.read().decode("utf-8")
print(content)
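
The real value of build_opener() is combining several handlers at once. A minimal sketch that adds a ProxyHandler alongside the HTTPS handler (the proxy address is a placeholder):

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'https': 'http://proxy.example.com:8080'})
opener = urllib.request.build_opener(proxy_handler, urllib.request.HTTPSHandler())
response = opener.open("https://www.example.com")
print(response.status)

Calling urllib.request.install_opener(opener) registers the opener globally, so subsequent plain urlopen() calls also go through it.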

5. URLError and HTTPError

URLError example:

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("https://www.nonexistent-website.com")
except urllib.error.URLError as e:
    print("URLError:", e)

HTTPError example:

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("https://www.example.com/nonexistent-page")
except urllib.error.HTTPError as e:
    print("HTTPError:", e.code, e.reason)

2 The Beautiful Soup Library

Beautiful Soup is a powerful Python library for parsing HTML and XML documents and extracting data from them. Below are some commonly used Beautiful Soup syntax and methods:

from bs4 import BeautifulSoup

# Sample HTML
html = """
<html>
<head>
<title>Sample HTML</title>
</head>
<body>
<p class="intro">Hello, Beautiful Soup</p>
<p>Another paragraph</p>
<a href="https://www.example.com">Example</a>
</body>
</html>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html, "html.parser")

# Tag shortcut: soup.p returns the first <p> tag
intro_paragraph = soup.p
print("Intro Paragraph:", intro_paragraph)

# Method selector: find() returns the first matching tag
first_paragraph = soup.find("p")
print("First Paragraph:", first_paragraph)

# CSS selector
link = soup.select_one("a")
print("Link:", link)

# Get a node's text
text = intro_paragraph.get_text()
print("Text:", text)

# Get an attribute value
link_href = link["href"]
print("Link Href:", link_href)

# Traverse the document tree
for paragraph in soup.find_all("p"):
    print(paragraph.get_text())

# Get the parent node
parent = intro_paragraph.parent
print("Parent:", parent)

# Get the next sibling node
sibling = intro_paragraph.find_next_sibling()
print("Next Sibling:", sibling)

# Select multiple nodes with a CSS selector
selected_tags = soup.select("p.intro, a")
for tag in selected_tags:
    print("Selected Tag:", tag)

# Modify a node's text
intro_paragraph.string = "Modified Text"
print("Modified Paragraph:", intro_paragraph)

# Add a new node
new_paragraph = soup.new_tag("p")
new_paragraph.string = "New Paragraph"
soup.body.append(new_paragraph)

# Remove a node
link.extract()
print("Link Extracted:", link)

3 Using Proxies

3.1 Proxy Types: HTTP, HTTPS, and SOCKS5
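
An HTTP proxy forwards plain HTTP traffic, an HTTPS proxy tunnels TLS connections (typically via the CONNECT method), and SOCKS5 is a lower-level protocol that can carry arbitrary TCP traffic. With requests, HTTP/HTTPS proxies work out of the box, while SOCKS5 needs the optional PySocks dependency (pip install requests[socks]). A minimal sketch with a placeholder proxy address:

import requests

# socks5:// routes traffic through the proxy; socks5h:// also resolves DNS on the proxy side
proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080',
}
response = requests.get('https://www.example.com', proxies=proxies)
print(response.status_code)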

3.2 Using Proxies with urllib and requests

urllib:

import urllib.request

# Map both http and https traffic to the proxy, otherwise https URLs bypass it
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('https://www.example.com')

requests:

import requests

proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}
response = requests.get('https://www.example.com', proxies=proxies)
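
If the proxy requires authentication, requests accepts credentials embedded in the proxy URL; a sketch with placeholder credentials:

proxies = {
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080',
}
response = requests.get('https://www.example.com', proxies=proxies)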

3.3 Case Study: Building Your Own Proxy Pool

import requests
from bs4 import BeautifulSoup
import random

# Fetch a list of proxy IPs
def get_proxies():
    proxy_url = "https://www.example.com/proxy-list"
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.text, "html.parser")
    proxies = [proxy.text for proxy in soup.select(".proxy")]
    return proxies

# Pick a random proxy from the pool
def get_random_proxy(proxies):
    return random.choice(proxies)

# Send a request through the chosen proxy
def send_request_with_proxy(url, proxy):
    proxies = {'http': proxy, 'https': proxy}
    response = requests.get(url, proxies=proxies)
    return response.text

if __name__ == "__main__":
    proxy_list = get_proxies()
    random_proxy = get_random_proxy(proxy_list)
    
    target_url = "https://www.example.com"
    response_content = send_request_with_proxy(target_url, random_proxy)
    print(response_content)
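
In practice many scraped proxies are dead, so a real pool usually validates each entry before use. A small sketch that checks a proxy with a short timeout, reusing the requests import from the script above (httpbin.org/ip is just a common choice of test URL, not a requirement):

def is_proxy_alive(proxy, timeout=5):
    """Return True if the proxy answers a simple request within the timeout."""
    proxies = {'http': proxy, 'https': proxy}
    try:
        r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

# Keep only the proxies that respond, e.g.:
# live_proxies = [p for p in get_proxies() if is_proxy_alive(p)]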

4 Hands-On: Extracting and Analyzing Video Information

import urllib.request
from bs4 import BeautifulSoup

# Target page URL
url = 'https://www.example.com/videos'

# Proxy settings (only needed if you want to route requests through a proxy);
# urlopen() has no proxies argument, so proxies are configured via ProxyHandler
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
})
opener = urllib.request.build_opener(proxy_handler)

# Build the request with a User-Agent header and send it through the opener
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
response = opener.open(req)

# Parse the page content
soup = BeautifulSoup(response, 'html.parser')

# Create an empty list for the videos
videos = []

# Extract video information
video_elements = soup.find_all('div', class_='video')
for video_element in video_elements:
    title = video_element.find('h2').text
    video_link = video_element.find('a', class_='video-link')['href']
    videos.append({'title': title, 'video_link': video_link})

# Print the extracted video information
for video in videos:
    print(f"Title: {video['title']}")
    print(f"Video Link: {video['video_link']}")
    print()

# Analyze the video information
num_videos = len(videos)
print(f"Total Videos: {num_videos}")