Extracting Text from Images with Python on macOS

you的日常 2022-02-04

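Environment: macOS 10.14.6, Python 3.10. Python 3.10 was already installed in the previous post.

The goal of this post is to extract text from an image. Create a file named convertp2t.py: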

from PIL import Image
import pytesseract
import os
import sys

image_address = input('Enter the image path: ')
if not os.path.exists(image_address):
    print('{} does not exist'.format(image_address))
    sys.exit(1)  # stop here instead of failing later in Image.open

print('file name: {}'.format(image_address))
image = Image.open(image_address)  # open the image

# import pdb; pdb.set_trace()  # uncomment to step through in the debugger
text = pytesseract.image_to_string(image, lang='chi_sim')  # OCR with the Simplified Chinese model
print(text)  # print the result

Then the debugging began.

Two errors showed up, one after the other:

tesseract is not installed or it's not in your path

pytesseract.pytesseract.TesseractError: (2, 'Usage: pytesseract [-l lang] in

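Tesseract itself has to be installed, because pytesseract is only a wrapper that launches the tesseract executable. I tried pip3 install tesseract, but that did not help.

Editing /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytesseract/pytesseract.py to point tesseract_cmd at the tesseract location did not work either.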
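Before editing pytesseract.py, it can help to confirm that a tesseract executable exists at all. The following is a minimal sketch using only the standard library; the helper name find_tesseract and the extra /usr/local/bin search directory are assumptions made for illustration.

```python
import os
import shutil


def find_tesseract(name="tesseract", extra_dirs=("/usr/local/bin",)):
    """Return the full path of the executable, or None if it is not found.

    PATH is searched first, then each directory in extra_dirs.
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract executable not found; it has to be installed first")
    else:
        print("tesseract found at", path)
```

If this reports that nothing was found, changing tesseract_cmd cannot help yet, which matches what happened here.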

So I downloaded the tesseract source and built it myself:

git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
./autogen.sh

A few errors came up along the way.

1. m4, autoconf, automake, and libtool were not installed.
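Which of these build tools are missing can be checked up front by looking them up on PATH. The following is a minimal sketch; the helper name missing_tools is made up for illustration, and libtoolize is checked in place of libtool on the assumption that macOS already ships an unrelated /usr/bin/libtool.

```python
import shutil

# Build tools that the autogen step relies on. libtoolize (installed by GNU
# libtool) is checked instead of libtool, assuming /usr/bin/libtool on macOS
# is Apple's unrelated tool.
BUILD_TOOLS = ("m4", "autoconf", "automake", "libtoolize")


def missing_tools(tools=BUILD_TOOLS, path=None):
    """Return the tools that cannot be found on PATH (or on `path` if given)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not installed:", ", ".join(missing))
    else:
        print("all build tools found")
```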

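So I downloaded m4, autoconf, and automake from mirrors.kernel.org/gnu. I did not take the latest version of everything, mainly because the syntax might have changed. I used m4 1.4.6, autoconf 2.69, automake 1.16, and libtool 2.4.6.

Build and install each one in turn.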

tar xfz m4-1.4.6.tar.gz 
cd m4-1.4.6
./configure --prefix=/usr/local
make
sudo make install

cd ../
tar xfz autoconf-2.69.tar.gz
cd autoconf-2.69
./configure --prefix=/usr/local
make 
sudo make install

cd ../
tar xfz automake-1.16.tar.gz
cd automake-1.16
./configure --prefix=/usr/local
make 
sudo make install

Then go back to tesseract and rerun autogen.sh and configure.

cd tesseract
./autogen.sh
./configure --prefix=/usr/local

2. Error: libtool version mismatch

libtool: Version mismatch error.  This is libtool 2.4.6, but the
libtool: definition of this LT_INIT comes from libtool 2.2.8.
libtool: You should recreate aclocal.m4 with macros from libtool 2.4.6
libtool: and run autoconf again.

Run autoreconf -fiv to regenerate aclocal.m4, then run ./configure --prefix=/usr/local again.

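3. Error: Syntax error near unexpected token `LEPTONICA,'

This turned out to be a missing dependency: pkg-config needs to be installed (download page: Index of /releases).

I downloaded version 0.29.2 and built it from source. Note that --with-internal-glib is required here; otherwise configure fails with an error that pkg-config cannot be found.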

tar xfz pkg-config-0.29.2.tar.gz
cd pkg-config-0.29.2
./configure --prefix=/usr/local --with-internal-glib
make
sudo make install

4. Error: configure: error: Leptonica 1.74 or higher is required. Try to install libleptonica-dev package

I downloaded the Leptonica 1.76.0 release package (Release Leptonica version 1.76.0 · DanBloomberg/leptonica · GitHub):

tar xfz leptonica-1.76.0.tar.gz
cd leptonica-1.76.0
./configure --prefix=/usr/local
make
sudo make install

Back in tesseract, the build and install finally succeeded. In /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytesseract/pytesseract.py, set tesseract_cmd to /usr/local/bin/tesseract.
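As an alternative to editing the installed pytesseract.py, the same setting can be made from the calling script through the module attribute pytesseract.pytesseract.tesseract_cmd. The following is a minimal sketch, assuming the binary was installed to /usr/local/bin/tesseract as above; use_tesseract is a hypothetical helper name.

```python
import os


def use_tesseract(path="/usr/local/bin/tesseract"):
    """Point pytesseract at a specific tesseract binary for this process."""
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        raise FileNotFoundError("no executable at {}".format(path))
    # Imported here so the path check above also works without pytesseract.
    import pytesseract

    # The setting lasts only for the running process, so the file in
    # site-packages stays untouched and a pytesseract upgrade cannot undo it.
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling use_tesseract() once near the top of convertp2t.py, before image_to_string, would have the same effect as the manual edit.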

python3.10 convertp2t.py

Error:

pytesseract.pytesseract.TesseractError: (1, 'Error in pixReadMemTiff: function not present Error in pixReadMem: tiff: no pix returned Error in pixaGenerateFontFromString: pix not made Error in bmfCreate: font pixa not made Error opening data file /usr/local/share/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'chi_sim\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

First, install the language data:

https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_sim.traineddata

Move the downloaded chi_sim.traineddata to /usr/local/share/tessdata/.
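If the model is fetched from a script rather than through the browser, note that the /blob/ address above is an HTML page, while GitHub serves the file itself under /raw/. The following is a minimal sketch of that download; raw_url and download_traineddata are hypothetical helper names, and the destination file name is an assumption.

```python
import urllib.request

BLOB_URL = (
    "https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_sim.traineddata"
)


def raw_url(blob_url):
    """Turn a GitHub /blob/ page URL into the matching /raw/ file URL."""
    return blob_url.replace("/blob/", "/raw/", 1)


def download_traineddata(blob_url=BLOB_URL, dest="chi_sim.traineddata"):
    """Download the model file to dest (needs network access)."""
    urllib.request.urlretrieve(raw_url(blob_url), dest)
    return dest
```

The file lands in the current directory and still has to be moved into the tessdata directory, which may need sudo.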

Set the TESSDATA_PREFIX environment variable:

$ vi ~/.bash_profile

export TESSDATA_PREFIX=/usr/local/share/tessdata/

$ . ~/.bash_profile
$ echo $TESSDATA_PREFIX
/usr/local/share/tessdata/
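Whether the language file is really where tesseract will look can also be checked from Python before calling pytesseract. The following is a minimal sketch, assuming (as in this setup) that TESSDATA_PREFIX points directly at the directory holding the .traineddata files; missing_languages is a hypothetical helper name.

```python
import os


def missing_languages(langs, tessdata_dir=None):
    """Return the languages whose <lang>.traineddata file is not present.

    tessdata_dir defaults to the TESSDATA_PREFIX environment variable; if
    neither is set, every requested language is reported as missing.
    """
    if tessdata_dir is None:
        tessdata_dir = os.environ.get("TESSDATA_PREFIX", "")
    if not tessdata_dir:
        return list(langs)
    return [
        lang
        for lang in langs
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]


if __name__ == "__main__":
    print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX"))
    print("missing:", missing_languages(["chi_sim"]))
```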

Running it again gives a different error:

pytesseract.pytesseract.TesseractError: (1, 'Error in pixReadMemTiff: function not present Error in pixReadMem: tiff: no pix returned Error in pixaGenerateFontFromString: pix not made Error in bmfCreate: font pixa not made Error in pixReadStreamPng: function not present Error in pixReadStream: png: no pix returned Error in pixRead: pix not read Error during processing.')

This time the language error is gone. Instead, the jpeg, png, and tiff formats are not supported, because libjpeg, libpng, and libtiff are not installed.


curl --remote-name http://www.ijg.org/files/jpegsrc.v9.tar.gz
tar xfz jpegsrc.v9.tar.gz
cd jpeg-9
./configure --prefix=/usr/local
make
sudo make install

curl --remote-name --location http://download.sourceforge.net/libpng/libpng-1.6.35.tar.gz
tar xfz libpng-1.6.35.tar.gz
cd libpng-1.6.35
./configure --prefix=/usr/local
make
sudo make install

Then rebuild and reinstall leptonica-1.76.0:

$ cd leptonica-1.76.0
$ ./configure --prefix=/usr/local
$ make
$ sudo make install

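Along the way, the build also failed because libjpeg and leptonica both define FALSE and TRUE. I used a crude workaround: I changed leptonica's definitions to 0 and 1, after which the build went through.

Then run tesseract to check the result. The jpeg and png libraries now show up: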

$ tesseract -v
tesseract 5.0.1-9-g31a968
 leptonica-1.76.0
  libjpeg 9 : libpng 1.6.35 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
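Which image libraries leptonica was built with can also be read from this output programmatically. The following is a minimal sketch; image_libs and tesseract_version_output are hypothetical helper names, and the parser assumes the libraries appear on one line separated by " : " as in this build.

```python
import subprocess


def image_libs(version_output):
    """Map library name to version from the 'libjpeg 9 : libpng ...' line."""
    libs = {}
    for line in version_output.splitlines():
        if " : " not in line:
            continue  # only the library line uses this separator
        for part in line.split(" : "):
            fields = part.split()
            if len(fields) >= 2:
                libs[fields[0]] = fields[1]
    return libs


def tesseract_version_output(cmd="tesseract"):
    """Run `tesseract -v` and return what it prints (stdout plus stderr)."""
    result = subprocess.run([cmd, "-v"], capture_output=True, text=True)
    return result.stdout + result.stderr
```

For the output above this yields entries for libjpeg, libpng, and zlib, and no entry for libtiff.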

Running python3.10 convertp2t.py again finally worked!
