A Practical Guide to Python Web Scraping: From Beginner to Advanced - CFANZ Programming Community

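I. Introduction

In an age of information overload, collecting and organizing useful data has become a core need in many industries. Whether the task is product analysis, competitor monitoring, data mining, or public-opinion research, a Python web crawler offers an efficient, automated way to gather the data.

This article walks through the fundamentals, key techniques, and more advanced practice of Python web scraping: building a complete data-collection system, working around anti-scraping mechanisms, and cleaning and storing the results.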

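II. Web Crawler Basics

1. What Is a Web Crawler?

A web crawler is a program that automatically visits websites and fetches page content. Its core functions are:

Sending HTTP requests the way a browser would
Parsing the page structure and extracting the useful information
Saving the results locally or to a database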

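2. HTTP Request Basics

The most common request methods are:

GET: retrieve a resource (a web page, an image, and so on)
POST: submit data (a login form, for example)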

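import requests

url = 'https://example.com'
# Set a timeout so a stalled server cannot hang the script indefinitely
response = requests.get(url, timeout=10)
print(response.status_code)  # HTTP status code, e.g. 200 on success
print(response.text)         # response body decoded as text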
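
The snippet above covers only GET. As a rough sketch of how the two methods differ, the following builds (but never sends) a GET request with query parameters and a POST request with a form body, using only the standard library so it runs offline. The URLs, the field names, and the helpers build_get and build_post are made up for illustration; with requests, the equivalent calls would be requests.get(url, params=...) and requests.post(url, data=...).

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(url, params):
    """Return a GET request whose parameters travel in the query string."""
    return Request(f"{url}?{urlencode(params)}", method="GET")


def build_post(url, form):
    """Return a POST request whose form fields travel in the request body."""
    body = urlencode(form).encode("utf-8")
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    # Hypothetical endpoints; nothing is sent over the network here.
    get_req = build_get("https://example.com/search", {"q": "python", "page": 2})
    post_req = build_post("https://example.com/login", {"user": "alice"})
    print(get_req.get_method(), get_req.full_url)
    print(post_req.get_method(), post_req.full_url, post_req.data)
```

The point of the comparison is that GET exposes its parameters in the URL, while POST carries them in the body, which is why login forms normally use POST.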

3. Common HTTP Status Codes

200: success
301/302: redirect
403: forbidden (a common sign of anti-scraping measures)
404: page not found
500: server error
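
One way a crawler might act on these codes is sketched below: map each code to a coarse category, and retry with exponential backoff only when the server itself failed. The category names, the helper names, and the convention that fetch returns a (status, body) tuple are assumptions made for this example, not part of any library.

```python
import time


def classify_status(code):
    """Map an HTTP status code to a coarse category a crawler can act on."""
    if 200 <= code < 300:
        return "ok"
    if code in (301, 302):
        return "redirect"
    if code == 403:
        return "blocked"
    if code == 404:
        return "missing"
    if code >= 500:
        return "server_error"
    return "other"


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url) and retry on server errors, doubling the wait each time.

    `fetch` is any callable returning a (status_code, body) tuple.
    """
    status, body = fetch(url)
    for attempt in range(retries):
        if classify_status(status) != "server_error":
            break
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
        status, body = fetch(url)
    return status, body


if __name__ == "__main__":
    for code in (200, 302, 403, 404, 500):
        print(code, classify_status(code))
```

Retrying a 403 or 404 rarely helps, so this sketch retries only the 5xx range.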

III. Parsing Web Page Data

1. BeautifulSoup: Quick Start

from bs4 import BeautifulSoup

html = '<html><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)

2. Extracting Page Elements

soup = BeautifulSoup(response.text, 'html.parser')
titles = soup.find_all('h2', class_='title')
for title in titles:
    print(title.text)

3. lxml: A Faster Parser (with XPath Support)

from lxml import etree

html = etree.HTML(response.text)
titles = html.xpath('//div[@class="post"]/h2/text()')
print(titles)
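
To experiment with the same XPath idea without a live response, the sketch below runs a similar query against an inline sample document. It swaps lxml for the standard library's xml.etree.ElementTree, which supports only a limited XPath subset (no text() step, so the text is read from each element) and requires well-formed markup. The sample HTML is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page: two posts and one unrelated sidebar block.
SAMPLE = """<html><body>
<div class="post"><h2>First post</h2></div>
<div class="post"><h2>Second post</h2></div>
<div class="sidebar"><h2>Ads</h2></div>
</body></html>"""


def post_titles(markup):
    """Return the text of every h2 that sits directly inside div.post."""
    root = ET.fromstring(markup)
    return [h2.text for h2 in root.findall(".//div[@class='post']/h2")]


if __name__ == "__main__":
    print(post_titles(SAMPLE))
```

The predicate [@class='post'] is what keeps the sidebar heading out of the result.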

IV. Hands-On 1: Scraping the Douban Movie Top 250

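import requests
from bs4 import BeautifulSoup

# Define the headers once: a browser-like User-Agent, since some sites reject the library default
headers = {'User-Agent': 'Mozilla/5.0'}

# The list is paginated 25 movies per page: start=0, 25, ..., 225
for page in range(0, 250, 25):
    url = f'https://movie.douban.com/top250?start={page}'
    res = requests.get(url, headers=headers, timeout=10)
    res.raise_for_status()  # stop early on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(res.text, 'html.parser')
    for item in soup.find_all('div', class_='item'):
        title = item.find('span', class_='title').text
        rating = item.find('span', class_='rating_num').text
        print(title, rating)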
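
The loop above requests all ten pages back to back. A minimal sketch of a gentler variant follows: it generates the page URLs up front and sleeps for a random interval between requests. The helper names and the delay bounds are assumptions for illustration, and fetch stands for whatever download function is in use (for example a thin wrapper around requests.get).

```python
import random
import time

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Build one URL per results page: start=0, 25, ..., 225."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0):
    """Fetch each URL in order, pausing a random interval between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            time.sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages


if __name__ == "__main__":
    # A stand-in fetch that only echoes the URL, so nothing is downloaded.
    for page in crawl_politely(page_urls()[:3], lambda url: url, 0.1, 0.2):
        print(page)
```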

V. Hands-On 2: Scraping Zhihu Hot List Titles and Links

import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/billboard'
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

for item in soup.find_all('a', class_='HotList-item'):
    title = item.text.strip()
    link = item['href']
    print(title, link)
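
An href taken from a page may be relative rather than a full URL. Assuming that can happen here, the sketch below resolves each link against the page address with urllib.parse.urljoin. The helper name absolutize and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.zhihu.com/billboard"


def absolutize(href, base=PAGE_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical hrefs: site-relative, protocol-relative, and already absolute.
    for href in ("/question/123", "//www.zhihu.com/question/456", "https://example.com/a"):
        print(absolutize(href))
```

Links that are already absolute pass through unchanged, so it is safe to apply the helper to every href.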

VI. Dealing with Anti-Scraping Measures

1. Adding Request Headers

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.baidu.com'
}

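2. Setting a Proxy (Can Help Avoid IP-Based Limits)

proxies = {
    # Placeholder address from a documentation-only range; replace with a real proxy
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080'
}
res = requests.get(url, headers=headers, proxies=proxies)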

3. Using Dynamic User-Agent and IP Rotation Libraries (Such as fake_useragent and scrapy-rotating-proxies)
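
The rotation idea can be sketched without either library: pick a User-Agent at random from a small hand-written pool and cycle through a list of proxies. The User-Agent strings and proxy addresses below are placeholders chosen for this example, and the helper names are hypothetical; the dedicated libraries maintain much larger and fresher pools.

```python
import itertools
import random

# Example browser User-Agent strings; a real pool would be larger and kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Placeholder proxies from a documentation-only address range.
PROXY_POOL = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
_proxy_cycle = itertools.cycle(PROXY_POOL)


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def next_proxies():
    """Return the next proxy in round-robin order, in the dict shape requests expects."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}


if __name__ == "__main__":
    for _ in range(3):
        print(random_headers()["User-Agent"][:40], next_proxies()["http"])
```

Each request would then pass headers=random_headers() and proxies=next_proxies() to requests.get.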

VII. Scraping Dynamic Pages (Ajax and JavaScript Rendering)

1. Simulating Browser Behavior with Selenium

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://weibo.com')
html = driver.page_source
driver.quit()

2. Controlling Element Clicks and Scroll Loading

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

search = driver.find_element(By.NAME, 'q')
search.send_keys('Python')
search.send_keys(Keys.RETURN)
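
The snippet above covers typing into a field; scroll loading is not shown. A minimal sketch of one common approach follows: scroll to the bottom repeatedly until the page height stops growing. The function name, the pause length, and the round limit are assumptions. The function only needs an object with an execute_script method, which a Selenium WebDriver provides, so a small stand-in driver is used here to show the control flow without launching a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the page height stops growing.

    Returns the number of scrolls that caused new content to load.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a WebDriver: reports a page that grows twice, then stops."""

    def __init__(self):
        self._heights = iter([1000, 2000, 3000, 3000])

    def execute_script(self, script):
        if script.startswith("return"):
            return next(self._heights)
        return None


if __name__ == "__main__":
    print(scroll_to_bottom(FakeDriver(), pause=0))
```

With a real browser, the call would be scroll_to_bottom(driver) placed before reading driver.page_source.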

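VIII. Data Cleaning and Storage

1. Cleaning Data with pandas

import pandas as pd

# Sample records standing in for scraped results (None marks a missing value)
data = [{'title': 'Movie A', 'rating': '9.7'}, {'title': 'Movie B', 'rating': None}]
df = pd.DataFrame(data)
df.dropna(inplace=True)  # drop rows that contain missing values
df.to_csv('cleaned_data.csv', index=False)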

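2. Storing in a Database (SQLite as an Example)

import sqlite3

conn = sqlite3.connect('data.db')
# Write the DataFrame to a table named movies, replacing it if it already exists
df.to_sql('movies', conn, if_exists='replace', index=False)
conn.close()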
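
When the scraped rows are plain tuples rather than a DataFrame, the sqlite3 module can be used on its own. The sketch below assumes a hypothetical two-column schema (title, rating) and uses the title as the primary key so that re-running the crawler updates rows instead of duplicating them. It works against an in-memory database so nothing is written to disk.

```python
import sqlite3


def save_movies(conn, movies):
    """Insert (title, rating) rows, replacing any row with the same title."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS movies (title TEXT PRIMARY KEY, rating REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO movies (title, rating) VALUES (?, ?)", movies
        )


def load_movies(conn):
    """Return all rows, highest rating first."""
    return conn.execute(
        "SELECT title, rating FROM movies ORDER BY rating DESC"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_movies(conn, [("Movie A", 9.7), ("Movie B", 9.6)])
    save_movies(conn, [("Movie A", 9.7)])  # a repeated run does not duplicate the row
    print(load_movies(conn))
    conn.close()
```

The ? placeholders let sqlite3 handle quoting, which matters because scraped titles can contain quotes or other special characters.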

IX. Multithreaded and Asynchronous Crawlers

1. Using threading to Improve Throughput

import threading

import requests

def fetch(url):
    # Each thread downloads one URL and reports its status code
    res = requests.get(url, timeout=10)
    print(res.status_code)

urls = ['https://example.com/page1', 'https://example.com/page2']
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]

for t in threads:
    t.start()
for t in threads:
    t.join()
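
Starting one thread per URL does not scale to long URL lists. As an alternative sketch, concurrent.futures.ThreadPoolExecutor caps the number of worker threads and returns results in input order. The function name fetch_all, the worker count, and the stand-in fake_fetch are assumptions for this example; in practice fetch would wrap a real download call.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a bounded pool of worker threads.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in download function that pretends every request succeeded."""
    return url, 200


if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(1, 6)]
    for url, status in fetch_all(urls, fake_fetch, max_workers=2):
        print(status, url)
```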

2. Asynchronous Crawling (aiohttp + asyncio)

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as res:
        print(await res.text())

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, f'https://example.com/{i}') for i in range(10)]
        await asyncio.gather(*tasks)

asyncio.run(main())
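
The example above launches all ten requests at once. A sketch of one way to cap concurrency with asyncio.Semaphore follows. To keep it runnable offline, the network call is replaced by a stand-in coroutine that sleeps briefly; the names crawl, fetch_limited, and fake_fetch and the limit of 3 are assumptions. With aiohttp, fetch would be a coroutine that performs session.get.

```python
import asyncio


async def fetch_limited(semaphore, fetch, url):
    """Run fetch(url) once a slot is free, so at most N run at the same time."""
    async with semaphore:
        return await fetch(url)


async def crawl(urls, fetch, max_concurrency=3):
    """Fetch every URL concurrently, bounded by max_concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_limited(semaphore, fetch, url) for url in urls]
    return await asyncio.gather(*tasks)  # results keep the order of `urls`


async def fake_fetch(url):
    """Stand-in for a real request: wait briefly, then return a fake body."""
    await asyncio.sleep(0.01)
    return f"body of {url}"


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    for body in asyncio.run(crawl(urls, fake_fetch)):
        print(body)
```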

X. Building a Complete Crawler Project (Brief Outline)

Example project structure:

my_spider/
├── main.py
├── spider.py
├── parser.py
├── save.py
└── config.py
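
The layout above does not say what each file contains. One plausible division of labor, shown here in a single file for brevity, is: config.py holds settings, spider.py downloads pages, parser.py extracts fields, save.py writes results, and main.py wires them together. All function names, the CONFIG keys, and the stand-in fetch are assumptions for this sketch.

```python
import csv
import io
import re

# config.py: settings shared by the other modules
CONFIG = {"start_urls": ["https://example.com/page1"], "fields": ["url", "title"]}


# spider.py: download a page (stand-in that returns canned HTML, no network)
def fetch(url):
    return f"<html><head><title>Demo for {url}</title></head></html>"


# parser.py: turn raw HTML into a record
def parse(url, html):
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return {"url": url, "title": match.group(1).strip() if match else ""}


# save.py: write records as CSV to any file-like object
def save(rows, out):
    writer = csv.DictWriter(out, fieldnames=CONFIG["fields"])
    writer.writeheader()
    writer.writerows(rows)


# main.py: wire the stages together
def main(fetch_func=fetch):
    rows = [parse(url, fetch_func(url)) for url in CONFIG["start_urls"]]
    buffer = io.StringIO()
    save(rows, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(main())
```

Keeping each stage behind its own function makes it straightforward to swap the stand-in fetch for a real downloader, or the CSV writer for a database, without touching the other stages.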

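XI. Summary

Python web crawlers are an important tool for automated data collection, and they are now used across many industries. From the basic requests + BeautifulSoup combination to more involved setups with Selenium, asynchronous I/O, and anti-scraping countermeasures, crawler development is both a technical challenge and a source of competitive advantage in data.

Once you are comfortable with these techniques, you can:

Monitor prices and listings on e-commerce, real-estate, and job platforms
Automatically collect trending content from forums, news sites, and Weibo hot searches
Build your own datasets for analysis, visualization, or even training AI models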
