Using ChatGPT to help process HTML data

郝春妮 · 2023-05-30 · 86 reads

To use ChatGPT as a programming assistant, you first need to describe your requirements clearly, then refine the generated code through repeated debugging and optimization until the task is done. Having such a tool is both a good thing and a bad thing for programmers.

I. Define the requirements

Please write a program that does the following:

1. All HTML files are stored in the all folder.

2. Process every HTML file under all and extract the data from its tables.

3. Automatically extract the table header, with column names separated by commas (a minimal example of this extraction follows the list).

4. Write the processed data to a file named out plus a timestamp, so that outputs from different runs can be told apart.
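
For illustration, here is a minimal, self-contained sketch of what requirements 2 and 3 mean in practice; the table content below is made up for this example:

from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table')

# Header row: the <th> cells, later joined by commas in the CSV output
headers = [th.text.strip() for th in table.find_all('th')]
print(headers)            # ['Name', 'Score']

# Data rows: every <tr> after the header row
for tr in table.find_all('tr')[1:]:
    print([td.text.strip() for td in tr.find_all('td')])
    # ['Alice', '90'] then ['Bob', '85']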

II. Programming with ChatGPT

[Screenshot: the ChatGPT conversation that produced the code below]

The generated source code is as follows:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup

def extract_data(file):
    """Parse one HTML file and append its table rows to the global lists."""
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        table = soup.find('table')
        if table is None:  # skip files that contain no table
            return

        # Take the header row from the first file that has one
        if not headers:
            ths = table.find_all('th')
            headers.extend([th.text.strip().replace('\n', '') for th in ths])

        # Every <tr> after the header row is a data row
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all(['th', 'td'])
            row = []
            for td in tds:
                # Replace commas so cell values do not break the CSV columns
                text = td.text.strip().replace('\n', '').replace(',', ';')
                row.append(text)

            # Only keep rows whose cell count matches the header
            if len(row) == len(headers):
                rows.append(row)

def write_data(output_file):
    """Write the collected header and rows to a CSV file."""
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []

for file in html_files:
    extract_data(file)

# Timestamped output file name, e.g. out_2023-05-30_153000.csv
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
write_data(output_file)

print(f'Data has been saved to {output_file}')
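
To smoke-test the script, you can drop a throwaway HTML file into the all folder and run it. The file name test.html and the script name extract_tables.py below are made up for this example:

import os

os.makedirs('all', exist_ok=True)
# A throwaway test page; any HTML file containing a <table> works
with open(os.path.join('all', 'test.html'), 'w', encoding='utf-8') as f:
    f.write('<table>'
            '<tr><th>Name</th><th>Score</th></tr>'
            '<tr><td>Alice</td><td>90</td></tr>'
            '</table>')

# Then run the script, e.g.:  python extract_tables.py
# Expected console output (the timestamp varies per run):
#   Data has been saved to out_2023-05-30_153000.csv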

III. Further optimization

1. Improve efficiency.

2. Show a progress bar.

3. Save the data promptly so that an unexpected interruption does not lose all the work (a sketch of this is given after the execution result below).

The final code is as follows:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def extract_data(file):
    """Parse one HTML file and append its table rows to the shared lists."""
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        table = soup.find('table')
        if table is None:  # skip files that contain no table
            return

        # Take the header row from the first file that has one
        if not headers:
            ths = table.find_all('th')
            headers.extend([th.text.strip().replace('\n', '') for th in ths])

        # Every <tr> after the header row is a data row
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all(['th', 'td'])
            row = []
            for td in tds:
                # Replace commas so cell values do not break the CSV columns
                text = td.text.strip().replace('\n', '').replace(',', ';')
                row.append(text)

            # Only keep rows whose cell count matches the header
            # (list.append is effectively atomic under the GIL, so the
            # worker threads can share the rows list here)
            if len(row) == len(headers):
                rows.append(row)

def write_data(output_file):
    """Write the collected header and rows to a CSV file."""
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
current_time = datetime.now().strftime('%Y-%m-%d_%H%M%S')
headers = []
rows = []
output_file = f"out_{current_time}.csv"

# Thread pool with at most 4 worker threads
with ThreadPoolExecutor(max_workers=4) as executor:
    # Wrap executor.map with tqdm so the overall progress is shown
    for _ in tqdm(executor.map(extract_data, html_files), total=len(html_files), desc='Extracting data'):
        pass

write_data(output_file)

print(f'Data has been saved to {output_file}')

Execution result:

[Screenshot: console output of the final script]
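
The final code covers points 1 and 2, but point 3 (saving data promptly) is still open: if the script crashes part-way through, everything collected in memory is lost. One possible approach is to append rows to the CSV as each file is processed instead of writing everything at the end. The sketch below illustrates that idea with a simple sequential loop; it is an assumption-laden variant, not the code ChatGPT produced for this post:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup
from tqdm import tqdm

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"

headers_written = False
with open(output_file, 'w', encoding='utf-8', newline='') as out:
    writer = csv.writer(out)
    for file in tqdm(html_files, desc='Extracting data'):
        with open(file, encoding='utf-8') as f:
            table = BeautifulSoup(f.read(), 'html.parser').find('table')
        if table is None:  # skip files that contain no table
            continue
        if not headers_written:
            # Write the header row once, from the first table that has one
            writer.writerow([th.text.strip() for th in table.find_all('th')])
            headers_written = True
        for tr in table.find_all('tr')[1:]:
            writer.writerow([td.text.strip().replace(',', ';')
                             for td in tr.find_all(['th', 'td'])])
        # Flush after each file so a crash loses at most one file's rows
        out.flush()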
