Environment: macOS 10.14.6, Python 3.10. The previous post covered installing Python 3.10.
The goal of this post is to extract text from an image. Create the file convertp2t.py:
from PIL import Image
import pytesseract
import os
import sys
import pdb

image_address = input('Enter the image path: ')
if not os.path.exists(image_address):
    print('{} does not exist'.format(image_address))
    sys.exit(1)  # stop here instead of failing later in Image.open
print('file name: {}'.format(image_address))
image = Image.open(image_address, 'r')  # open the image
# pdb.set_trace()
text = pytesseract.image_to_string(image, lang='chi_sim')  # convert the image to text
print(text)  # print the result
Then the debugging began.
The following errors appeared, one after the other:
tesseract is not installed or it's not in your path
pytesseract.pytesseract.TesseractError: (2, 'Usage: pytesseract [-l lang] in
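Tesseract itself has to be installed, because pytesseract is only a wrapper that calls the tesseract executable. I tried pip3 install tesseract, but that did not help.
Editing /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytesseract/pytesseract.py and pointing tesseract_cmd at the location of tesseract did not help either.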
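Before patching anything inside site-packages, it can help to confirm whether a tesseract executable is reachable at all. The following is a minimal sketch using only the standard library. The helper name tesseract_status is made up for illustration and is not part of pytesseract.

```python
import shutil
import subprocess


def tesseract_status(executable="tesseract"):
    """Return (path, first line of version output), or (None, None) if not found.

    `tesseract_status` is a hypothetical helper name, not a pytesseract API.
    """
    path = shutil.which(executable)
    if path is None:
        return None, None
    try:
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some builds print the version to stderr
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return path, None
    lines = result.stdout.splitlines()
    return path, (lines[0] if lines else None)


path, version = tesseract_status()
if path is None:
    print("tesseract is not on PATH; pytesseract cannot call it")
else:
    print("found {} ({})".format(path, version))
```

If this prints that nothing is on PATH, no change to tesseract_cmd can work until the executable itself is installed.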
So I downloaded the tesseract source code to build and install it myself:
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
./autogen.sh
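Several errors came up along the way.
1. m4, autoconf, automake and libtool were not installed.
So I went to mirrors.kernel.org/gnu and downloaded m4, autoconf and automake. I did not take the latest version of everything, mainly because the syntax may have changed between versions. I used m4 1.4.6, autoconf 2.69, automake 1.16 and libtool 2.4.6.
Build and install each one in turn: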
tar xfz m4-1.4.6.tar.gz
cd m4-1.4.6
./configure --prefix=/usr/local
make
sudo make install
cd ../
tar xfz autoconf-2.69.tar.gz
cd autoconf-2.69
./configure --prefix=/usr/local
make
sudo make install
cd ../
tar xfz automake-1.16.tar.gz
cd automake-1.16
./configure --prefix=/usr/local
make
sudo make install
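The three builds above repeat the same unpack, configure, make, install cycle. As a rough sketch, a small helper could print the commands for any tarball so they can be reviewed before being run by hand. The function name build_commands is hypothetical, and it assumes the archive is named <dir>.tar.gz and unpacks into <dir>, which does not hold for every package.

```python
import shlex


def build_commands(tarball, prefix="/usr/local", configure_args=()):
    """Return the commands for the usual unpack/configure/make/install cycle.

    Assumption: the tarball is named <dir>.tar.gz and unpacks into <dir>.
    """
    if not tarball.endswith(".tar.gz"):
        raise ValueError("expected a .tar.gz file: {}".format(tarball))
    source_dir = tarball[: -len(".tar.gz")]
    configure = ["./configure", "--prefix={}".format(prefix)] + list(configure_args)
    return [
        ["tar", "xfz", tarball],
        ["cd", source_dir],
        configure,
        ["make"],
        ["sudo", "make", "install"],
        ["cd", "../"],
    ]


# Dry run: only print the commands, nothing is executed.
for name in ("m4-1.4.6.tar.gz", "autoconf-2.69.tar.gz", "automake-1.16.tar.gz"):
    for command in build_commands(name):
        print(" ".join(shlex.quote(part) for part in command))
```

Printing rather than executing keeps the sudo step under manual control.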
Then go back to tesseract and rerun autogen.sh and configure:
cd tesseract
./autogen.sh
./configure --prefix=/usr/local
2. Error: libtool version mismatch
libtool: Version mismatch error. This is libtool 2.4.6, but the
libtool: definition of this LT_INIT comes from libtool 2.2.8.
libtool: You should recreate aclocal.m4 with macros from libtool 2.4.6
libtool: and run autoconf again.
Run autoreconf -fiv to regenerate aclocal.m4, then run ./configure --prefix=/usr/local again.
3. Error: Syntax error near unexpected token `LEPTONICA,'
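It turned out that the dependency check needs pkg-config to be installed; it is available from the pkg-config release index page (Index of /releases).
Download version 0.29.2, then build and install it. Note that --with-internal-glib has to be passed to configure here, otherwise configure fails with an error that pkg-config cannot be found.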
tar xfz pkg-config-0.29.2.tar.gz
cd pkg-config-0.29.2
./configure --prefix=/usr/local --with-internal-glib
make
sudo make install
4. Error: configure: error: Leptonica 1.74 or higher is required. Try to install libleptonica-dev package
I downloaded the Leptonica 1.76.0 release package (Release Leptonica version 1.76.0, DanBloomberg/leptonica on GitHub):
tar xfz leptonica-1.76.0.tar.gz
cd leptonica-1.76.0
./configure --prefix=/usr/local
make
sudo make install
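The configure error comes down to a version comparison against 1.74. As a small sketch, the same check can be reproduced in Python to confirm what pkg-config now reports. It assumes Leptonica registers itself with pkg-config under the module name lept; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Turn '1.76.0' into (1, 76, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def leptonica_is_new_enough(installed, minimum="1.74"):
    """Compare two dotted version strings numerically."""
    return parse_version(installed) >= parse_version(minimum)


def installed_leptonica_version():
    """Ask pkg-config for the 'lept' module version (assumed name), or None."""
    pkg_config = shutil.which("pkg-config")
    if pkg_config is None:
        return None
    try:
        result = subprocess.run(
            [pkg_config, "--modversion", "lept"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


version = installed_leptonica_version()
if version is None:
    print("pkg-config does not know about leptonica")
else:
    print(version, "ok" if leptonica_is_new_enough(version) else "too old")
```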
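Back in tesseract once more, I continued the build and install, and this time it finally succeeded. Then edit /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytesseract/pytesseract.py and set the tesseract_cmd path to /usr/local/bin/tesseract.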
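An alternative to editing the installed pytesseract.py is to set the path from the calling script, since pytesseract reads pytesseract.pytesseract.tesseract_cmd at call time. A minimal sketch follows. The helper names are hypothetical, and /usr/local/bin/tesseract is simply the install location used in this post.

```python
import os
import shutil


def resolve_tesseract_cmd(preferred="/usr/local/bin/tesseract"):
    """Return `preferred` if it is an executable file, else whatever is on PATH."""
    if os.path.isfile(preferred) and os.access(preferred, os.X_OK):
        return preferred
    return shutil.which("tesseract")  # may be None


def apply_tesseract_cmd(cmd):
    """Point pytesseract at `cmd` for the current process only."""
    import pytesseract  # imported lazily so the resolver works without it

    pytesseract.pytesseract.tesseract_cmd = cmd


# In convertp2t.py this would go before the image_to_string call:
#     cmd = resolve_tesseract_cmd()
#     if cmd is not None:
#         apply_tesseract_cmd(cmd)
print(resolve_tesseract_cmd())
```

Keeping the setting in the script means it survives a reinstall or upgrade of pytesseract.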
python3.10 convertp2t.py
This fails with:
pytesseract.pytesseract.TesseractError: (1, 'Error in pixReadMemTiff: function not present Error in pixReadMem: tiff: no pix returned Error in pixaGenerateFontFromString: pix not made Error in bmfCreate: font pixa not made Error opening data file /usr/local/share/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'chi_sim\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
First, install the language data:
https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_sim.traineddata
Move the downloaded chi_sim.traineddata to /usr/local/share/tessdata/.
Then set the TESSDATA_PREFIX environment variable:
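One thing to watch when downloading: the link above is a GitHub /blob/ page, so fetching it from a script returns HTML rather than the model file. As a sketch, the page URL can be rewritten to its /raw/ form first. The helper names are hypothetical, and download is defined but deliberately not called here because it needs network access.

```python
import shutil
import urllib.request


def raw_url(blob_url):
    """Convert a github.com '/blob/' page URL into its '/raw/' download form."""
    if "/blob/" not in blob_url:
        raise ValueError("not a GitHub blob URL: {}".format(blob_url))
    return blob_url.replace("/blob/", "/raw/", 1)


def download(url, destination):
    """Stream `url` into the file `destination` (requires network access)."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


page = "https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_sim.traineddata"
print(raw_url(page))
```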
$ vi ~/.bash_profile
export TESSDATA_PREFIX=/usr/local/share/tessdata/
$ . ~/.bash_profile
$ echo $TESSDATA_PREFIX
/usr/local/share/tessdata/
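The same check can be made from Python, which is useful because the script only sees the variable if it was exported in the shell that launches it. This is a minimal sketch with a hypothetical helper name; it looks for <lang>.traineddata directly under TESSDATA_PREFIX, matching the setup above.

```python
import os


def find_traineddata(lang="chi_sim", environ=None):
    """Return the path of <lang>.traineddata under TESSDATA_PREFIX, or None."""
    environ = os.environ if environ is None else environ
    prefix = environ.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    candidate = os.path.join(prefix, lang + ".traineddata")
    return candidate if os.path.isfile(candidate) else None


found = find_traineddata()
print(found or "chi_sim.traineddata not found via TESSDATA_PREFIX")
```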
Running it again gives a different error:
pytesseract.pytesseract.TesseractError: (1, 'Error in pixReadMemTiff: function not present Error in pixReadMem: tiff: no pix returned Error in pixaGenerateFontFromString: pix not made Error in bmfCreate: font pixa not made Error in pixReadStreamPng: function not present Error in pixReadStream: png: no pix returned Error in pixRead: pix not read Error during processing.')
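This time the error is no longer about the language. Instead, the JPEG, PNG and TIFF formats are not supported, because libjpeg, libpng and libtiff are not installed.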
curl --remote-name http://www.ijg.org/files/jpegsrc.v9.tar.gz
tar xfz jpegsrc.v9.tar.gz
cd jpeg-9
./configure --prefix=/usr/local
make
sudo make install
curl --remote-name --location http://download.sourceforge.net/libpng/libpng-1.6.35.tar.gz
tar xfz libpng-1.6.35.tar.gz
cd libpng-1.6.35
./configure --prefix=/usr/local
make
sudo make install
Then rebuild and reinstall leptonica-1.76.0:
$ cd leptonica-1.76.0
$ ./configure --prefix=/usr/local
$ make
$ sudo make install
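In fact, this build failed partway through because the FALSE and TRUE definitions in libjpeg and leptonica conflict. I used a crude workaround: I changed the leptonica definitions to 0 and 1, after which the build went through.
Then run tesseract to check the result; the jpeg and png libraries now show up: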
$ tesseract -v
tesseract 5.0.1-9-g31a968
leptonica-1.76.0
libjpeg 9 : libpng 1.6.35 : zlib 1.2.11
Found AVX2
Found AVX
Found FMA
Found SSE4.1
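Reading this output by eye works, but the check can also be scripted. The sketch below scans the version text for the image libraries mentioned earlier, using the output above as sample input. The function name is hypothetical, and the check is a plain substring match rather than real parsing.

```python
SAMPLE = """tesseract 5.0.1-9-g31a968
leptonica-1.76.0
libjpeg 9 : libpng 1.6.35 : zlib 1.2.11
"""


def image_libraries(version_output, wanted=("libjpeg", "libpng", "libtiff")):
    """Map each library name to True if it appears in the `tesseract -v` text."""
    return {name: name in version_output for name in wanted}


print(image_libraries(SAMPLE))
# {'libjpeg': True, 'libpng': True, 'libtiff': False}
```

In the output shown here libtiff does not appear, which matches the steps above: only libjpeg and libpng were built.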
Run python3.10 convertp2t.py. Oh yeah! It finally works!