Using ChatGPT to help process HTML data

郝春妮 · 2023-05-30 · 86 reads

To use ChatGPT as a programming assistant, you first need to describe your requirements clearly, then refine the generated code through repeated debugging and optimization until the task is done. Having such a tool is both a good thing and a bad thing for programmers.

I. Define the requirements

Please write a program that does the following:

1. All HTML files are stored in the all folder.

2. Process every HTML file under all and extract the data from its tables.

3. Automatically extract the table header, with column names separated by commas (a minimal example of this extraction follows the list).

4. Write the processed data to a file named out plus a timestamp, so that outputs from different runs can be told apart.
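
For illustration, here is a minimal, self-contained sketch of what requirements 2 and 3 mean in practice; the table content below is made up for this example:

from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table')

# Header row: the <th> cells, later joined by commas in the CSV output
headers = [th.text.strip() for th in table.find_all('th')]
print(headers)            # ['Name', 'Score']

# Data rows: every <tr> after the header row
for tr in table.find_all('tr')[1:]:
    print([td.text.strip() for td in tr.find_all('td')])
    # ['Alice', '90'] then ['Bob', '85']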

II. Programming with ChatGPT

[Screenshot: the ChatGPT conversation that produced the code below]

The generated source code is as follows:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup

def extract_data(file):
    """Parse one HTML file and append its table rows to the global lists."""
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        table = soup.find('table')
        if table is None:  # skip files that contain no table
            return

        # Take the header row from the first file that has one
        if not headers:
            ths = table.find_all('th')
            headers.extend([th.text.strip().replace('\n', '') for th in ths])

        # Every <tr> after the header row is a data row
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all(['th', 'td'])
            row = []
            for td in tds:
                # Replace commas so cell values do not break the CSV columns
                text = td.text.strip().replace('\n', '').replace(',', ';')
                row.append(text)

            # Only keep rows whose cell count matches the header
            if len(row) == len(headers):
                rows.append(row)

def write_data(output_file):
    """Write the collected header and rows to a CSV file."""
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []

for file in html_files:
    extract_data(file)

# Timestamped output file name, e.g. out_2023-05-30_153000.csv
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
write_data(output_file)

print(f'Data has been saved to {output_file}')
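
To smoke-test the script, you can drop a throwaway HTML file into the all folder and run it. The file name test.html and the script name extract_tables.py below are made up for this example:

import os

os.makedirs('all', exist_ok=True)
# A throwaway test page; any HTML file containing a <table> works
with open(os.path.join('all', 'test.html'), 'w', encoding='utf-8') as f:
    f.write('<table>'
            '<tr><th>Name</th><th>Score</th></tr>'
            '<tr><td>Alice</td><td>90</td></tr>'
            '</table>')

# Then run the script, e.g.:  python extract_tables.py
# Expected console output (the timestamp varies per run):
#   Data has been saved to out_2023-05-30_153000.csv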

III. Further optimization

1. Improve efficiency.

2. Show a progress bar.

3. Save the data promptly so that an unexpected interruption does not lose all the work (a sketch of this is given after the execution result below).

The final code is as follows:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def extract_data(file):
    """Parse one HTML file and append its table rows to the shared lists."""
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        table = soup.find('table')
        if table is None:  # skip files that contain no table
            return

        # Take the header row from the first file that has one
        if not headers:
            ths = table.find_all('th')
            headers.extend([th.text.strip().replace('\n', '') for th in ths])

        # Every <tr> after the header row is a data row
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all(['th', 'td'])
            row = []
            for td in tds:
                # Replace commas so cell values do not break the CSV columns
                text = td.text.strip().replace('\n', '').replace(',', ';')
                row.append(text)

            # Only keep rows whose cell count matches the header
            # (list.append is effectively atomic under the GIL, so the
            # worker threads can share the rows list here)
            if len(row) == len(headers):
                rows.append(row)

def write_data(output_file):
    """Write the collected header and rows to a CSV file."""
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
current_time = datetime.now().strftime('%Y-%m-%d_%H%M%S')
headers = []
rows = []
output_file = f"out_{current_time}.csv"

# Thread pool with at most 4 worker threads
with ThreadPoolExecutor(max_workers=4) as executor:
    # Wrap executor.map with tqdm so the overall progress is shown
    for _ in tqdm(executor.map(extract_data, html_files), total=len(html_files), desc='Extracting data'):
        pass

write_data(output_file)

print(f'Data has been saved to {output_file}')

Execution result:

[Screenshot: console output of the final script]
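
The final code covers points 1 and 2, but point 3 (saving data promptly) is still open: if the script crashes part-way through, everything collected in memory is lost. One possible approach is to append rows to the CSV as each file is processed instead of writing everything at the end. The sketch below illustrates that idea with a simple sequential loop; it is an assumption-laden variant, not the code ChatGPT produced for this post:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup
from tqdm import tqdm

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"

headers_written = False
with open(output_file, 'w', encoding='utf-8', newline='') as out:
    writer = csv.writer(out)
    for file in tqdm(html_files, desc='Extracting data'):
        with open(file, encoding='utf-8') as f:
            table = BeautifulSoup(f.read(), 'html.parser').find('table')
        if table is None:  # skip files that contain no table
            continue
        if not headers_written:
            # Write the header row once, from the first table that has one
            writer.writerow([th.text.strip() for th in table.find_all('th')])
            headers_written = True
        for tr in table.find_all('tr')[1:]:
            writer.writerow([td.text.strip().replace(',', ';')
                             for td in tr.find_all(['th', 'td'])])
        # Flush after each file so a crash loses at most one file's rows
        out.flush()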
