Python Crawler: Implementing a Tsinghua University News Crawler - CFANZ编程社区

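I recently looked into web crawling with Python and was struck by how concise and powerful the language is. Below I walk through a crawler I wrote with Python 3.12. The libraries used are requests, BeautifulSoup, time, re, and jieba.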

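Detailed steps:

1. Crawl the URL of each news article:

This program takes a long time to run because it requests every article ID in the range; you can narrow the range to suit your needs.

The program is as follows: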

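# Import the required libraries
import requests
from bs4 import BeautifulSoup
import time
raw = 'https://www.tsinghua.edu.cn/info/1177/'
# Create the file that stores the URLs
with open('d:\\清华url.txt', 'w', encoding='utf-8') as f:
    urls = []
    # Set a browser-like request header
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
    }
    # Count down through the article IDs; a descending range needs step -1
    for i in range(109542, 102140 - 1, -1):
        # Build the URL
        url = f"{raw}{i}.htm"
        try:
            # Fetch the page
            r = requests.get(url, timeout=30, headers=headers)
            time.sleep(0.5)  # Pause between requests to limit load on the server
            r.raise_for_status()  # Raise on 4xx/5xx status codes
            r.encoding = r.apparent_encoding
            print(f'{url} fetched, processing----------')
        except requests.RequestException as e:  # Request-specific failures
            print('Request error:', e)
            continue  # Skip this ID; there is no response to parse
        except Exception as e:
            print('Error:', e)
            continue
        soup = BeautifulSoup(r.text, 'html.parser')  # 'lxml' also works but must be installed
        # Look for the <p class="vsbcontent_start"> tag this crawler uses to recognize an article page
        al = soup.find_all('p', class_='vsbcontent_start')
        temp = url + '\n'
        # Record the URL once if the marker tag was found
        if al and temp not in urls:
            urls.append(temp)
            f.write(temp)
            print(f'{url} recorded')
    print(f"\nCollected {len(urls)} URLs in total.")
    print("The URLs were saved to '清华url.txt' on drive D.")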
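
The ID window scanned above is large, which is why a full run takes so long. As a minimal sketch, the helper below builds the candidate URLs for a smaller, descending ID window. The name `candidate_urls` and the three-ID example window are illustrative assumptions, not part of the original program. Note that counting down requires an explicit step of -1; without it, `range` yields nothing.

```python
RAW = 'https://www.tsinghua.edu.cn/info/1177/'

def candidate_urls(newest_id, oldest_id, raw=RAW):
    """Return candidate article URLs from newest_id down to oldest_id, inclusive.

    Hypothetical helper for illustration only.
    """
    # A descending range needs step -1; the stop value is exclusive,
    # so subtract 1 to include oldest_id itself.
    return [f"{raw}{i}.htm" for i in range(newest_id, oldest_id - 1, -1)]

# Example: a window of only three IDs (assumed values)
for u in candidate_urls(109542, 109540):
    print(u)
```

Shrinking the window this way is the simplest way to shorten the run while testing the rest of the pipeline.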

The request header here is fixed. In practice it is safer to prepare a list of several request headers and pick one at random for each request.
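
A minimal sketch of that idea, assuming a small hand-written pool of User-Agent strings. The list `USER_AGENTS`, its contents, and the helper `random_headers` are illustrative assumptions; substitute the strings you actually want to send.

```python
import random

# Illustrative User-Agent strings (assumed values, not from the original program)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
]

def random_headers():
    """Return a headers dict with a User-Agent picked at random from the pool."""
    return {'User-Agent': random.choice(USER_AGENTS)}

print(random_headers())
```

In the crawler loop, the call would then become `requests.get(url, timeout=30, headers=random_headers())`, so each request may carry a different header.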

After the program finishes, a file named '清华url.txt' appears under D:\. It contains one article URL per line.

[Image: contents of 清华url.txt]

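2. Read the news content

This step requests each saved URL and appends the text of every <p> tag on the page to a single file.

The code is as follows: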

import requests
from bs4 import BeautifulSoup
count = 0   # Number of articles fetched successfully
with open('d:\\清华url.txt', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        try:
            r = requests.get(line, timeout=20)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
        except requests.RequestException as e:
            print('Error:', e)
            continue
        count += 1
        soup = BeautifulSoup(r.text, 'html.parser')
        print('Extracting text, please wait-------loading-------')
        s = soup.find_all('p')
        # Append the text of every <p> tag to the output file
        with open('d:\\清华新闻.txt', 'a+', encoding='utf-8') as c:
            for i in s:
                c.write(i.get_text())
        print(f"Article {count} saved")
print('Done')

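3. Remove digits with a regular expression

Regular-expression syntax can be hard to read at first, and it rewards deeper study if you are interested. Here the pattern \d+ matches each run of one or more digits, and re.sub replaces every match with an empty string.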

import re

# Open the file and read its contents
with open('d:\\清华新闻.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Remove every run of digits with a regular expression
content = re.sub(r'\d+', '', content)

# Write the processed text back to the same file
with open('d:\\清华新闻.txt', 'w', encoding='utf-8') as f:
    f.write(content)
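
To see what this substitution does, here is a small sketch run on a string instead of the file. The helper name `strip_digits` and the sample sentence are made up for illustration.

```python
import re

def strip_digits(text):
    """Remove every run of digits from text (hypothetical helper)."""
    # On str patterns in Python 3, \d matches any Unicode decimal digit,
    # so full-width digits such as '１' are removed as well.
    return re.sub(r'\d+', '', text)

sample = '2024年5月20日发布了3条新闻'  # invented sample sentence
print(strip_digits(sample))  # → 年月日发布了条新闻
```

Dates collapse to the fragment '年月日' once their digits are gone, which is likely why the next step filters that token out explicitly.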

4. Word segmentation with jieba

In this step we segment the text into words and then pick out the 100 most frequent ones.

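import jieba
# Read the file; the previous steps wrote it as UTF-8
with open('d:\\清华新闻.txt', encoding='utf-8') as f:
    txt_content = f.read()
# Stop-word list (change the encoding if your file is not GBK)
with open('d:\\停用词表.txt', encoding='gbk') as f:
    stopwords = {line.strip() for line in f}
# Segment the text with jieba
words = jieba.lcut(txt_content)
# Count word frequencies
counts = {}
for word in words:
    # Skip stop words, single characters, and '年月日'
    # (likely left over from dates after the digits were removed)
    if word in stopwords or word == '年月日' or len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
# Sort by frequency and print the results
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:100]:  # Slicing avoids an IndexError with fewer than 100 words
    print('{:<10}{:>7}'.format(word, count))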
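
The counting and sorting above can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the original approach: the helper name `top_words` is hypothetical, and the hand-written token list stands in for the output of `jieba.lcut` so the example runs without jieba installed.

```python
from collections import Counter

def top_words(words, stopwords, n=100):
    """Return the n most frequent words, skipping stop words and single characters."""
    kept = (w for w in words if len(w) > 1 and w not in stopwords)
    # most_common sorts by count, highest first, and tolerates n > number of words
    return Counter(kept).most_common(n)

# Stand-in for jieba.lcut(txt_content); these tokens are invented for the example
tokens = ['清华', '大学', '的', '新闻', '清华', '大学', '清华']
print(top_words(tokens, {'新闻'}, n=2))  # → [('清华', 3), ('大学', 2)]
```

`most_common(n)` simply returns fewer pairs when there are fewer than n distinct words, so no extra bounds check is needed.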