Recognizing Text in Images - CFANZ Programming Community

This mainly relies on the Tesseract tool, plus two Python libraries: pytesseract and Pillow.

First, download Tesseract from:

Index of /tesseract: https://digi.bib.uni-mannheim.de/tesseract/

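Download the installer that matches your operating system and install it. Be sure to add the installation directory to your PATH environment variable, so that the tesseract executable can be found.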
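A quick way to confirm that the PATH setting took effect is to look the executable up from Python. The following is only a minimal sketch: the helper name `find_tesseract` and its `fallback` parameter are made up for illustration and are not part of Tesseract or pytesseract.

```python
import shutil
from typing import Optional


def find_tesseract(fallback: Optional[str] = None) -> Optional[str]:
    """Return the path of the tesseract executable, or None if it cannot be found."""
    # shutil.which searches the directories listed in the PATH environment variable
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise try an explicitly supplied full path (hypothetical parameter)
    if fallback and shutil.which(fallback):
        return fallback
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print("tesseract found at", path)
```

If the executable cannot be put on PATH, pytesseract also lets you point to it directly by assigning the full path to `pytesseract.pytesseract.tesseract_cmd` before calling any recognition function.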

Then install the pytesseract and Pillow libraries; pip is all you need.
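To check that both libraries ended up in the interpreter you are actually running, something like the sketch below may help. The function name `missing_modules` is an illustrative assumption. Note that Pillow is installed under the name pillow but imported as `PIL`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names from `names` that cannot be found by this interpreter."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Pillow is imported as "PIL", not "pillow"
    missing = missing_modules(["pytesseract", "PIL"])
    print("missing:", missing if missing else "nothing")
```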

P.S. If you need to recognize Chinese text, download the Chinese language data from GitHub. It lives in the tesseract-ocr/tessdata repository (Trained models with support for legacy and LSTM OCR engine).

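chi_sim is Simplified Chinese. Once the file (chi_sim.traineddata) is downloaded, put it into the tessdata folder under the Tesseract installation directory from the earlier step.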
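If recognition later fails with a missing-language error, it is worth checking that the file really landed in the tessdata folder. Here is a small sketch of such a check; the function name `missing_languages` and the Windows install path in the demo are assumptions, so substitute your own directory.

```python
from pathlib import Path
from typing import Iterable, List, Union


def missing_languages(tessdata_dir: Union[str, Path], langs: Iterable[str]) -> List[str]:
    """Return the language codes whose <code>.traineddata file is absent from tessdata_dir."""
    folder = Path(tessdata_dir)
    return [lang for lang in langs if not (folder / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    # Hypothetical install location; replace it with your own Tesseract directory
    tessdata = Path(r"C:\Program Files\Tesseract-OCR") / "tessdata"
    print("missing language data:", missing_languages(tessdata, ["chi_sim"]))
```

An empty list means the chi_sim data file is in place and `lang='chi_sim'` should be usable.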

Now for the actual text recognition, using a message from 10086 as the example:

The program:

import pytesseract
from PIL import Image

pic = Image.open('12.png')  # load the image
s = pytesseract.image_to_string(pic, lang='chi_sim')  # recognize the text in the image
f = s.replace('\n', '').replace(' ', '')  # strip newlines and spaces
print(f)

Recognition result:

Compared with the original image, the recognition is fairly good, but some images cannot be recognized at all.
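For images that come out badly, cleaning them up before OCR sometimes helps. The sketch below is not part of the original program: it converts to grayscale, stretches the contrast, enlarges the image and binarizes it with Pillow. The function name `preprocess_for_ocr` and the default `scale` and `threshold` values are assumptions that would need tuning per image.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(img: Image.Image, scale: int = 2, threshold: int = 160) -> Image.Image:
    """Return a grayscale, enlarged, black-and-white copy of img for OCR."""
    gray = ImageOps.grayscale(img)       # drop color information
    gray = ImageOps.autocontrast(gray)   # stretch the contrast range
    # Enlarge the image so that small characters cover more pixels
    big = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Binarize: pixels brighter than the threshold become white, the rest black
    return big.point(lambda p: 255 if p > threshold else 0)
```

The result can be passed to `pytesseract.image_to_string` in place of `pic`, for example `pytesseract.image_to_string(preprocess_for_ocr(pic), lang='chi_sim')`. Whether this improves a given image depends on its background and font size, so compare the output with and without it.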
