Notes on using kenlm


yum install -y cmake
yum install -y boost
yum install -y boost-devel
yum install -y boost-doc
yum install -y zlib
yum install -y zlib-devel
yum install -y gcc gcc-c++ kernel-devel

wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
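
Once the build finishes, the tools used in the rest of this post should be in the bin/ folder of the build directory. Below is a minimal sketch for checking that, assuming the default CMake layout. The function name check_kenlm_tools is made up for illustration and is not part of kenlm.

```shell
# Hypothetical helper: report whether the kenlm tools used in this post
# exist and are executable inside the given bin directory.
check_kenlm_tools() {
    bin_dir="${1:-bin}"
    missing=0
    for tool in lmplz build_binary; do
        if [ -x "$bin_dir/$tool" ]; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Exit status 0 only when every tool was found.
    return "$missing"
}
```

Run from kenlm/build, `check_kenlm_tools bin` prints one line per tool and returns a non-zero status if either is missing.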

Train a model with the following commands:
cd bin/
mkdir result

Training
Character level:

Word level:
./lmplz -o 3 --verbose_header --text people2014_words.txt --arpa result/people2014_words.arpa
./lmplz -o 3 --verbose_header --text corpus_seg.txt --arpa result/corpus_seg.arpa
./lmplz -o 3 --verbose_header --text test.txt --arpa result/test.arpa
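
lmplz reads plain text with one sentence per line and tokens separated by whitespace, so the granularity of the model depends on how the corpus is split beforehand. For a character-level model, one option is to put a space between every character before training. Below is a minimal sketch, assuming a UTF-8 locale. The function name to_char_tokens and the file names in the comment are placeholders, not anything from the original post.

```shell
# Hypothetical helper: rewrite stdin so that every non-space character
# becomes its own whitespace-separated token.
# Needs a UTF-8 locale for multi-byte text (for example LC_ALL=en_US.UTF-8);
# in the C locale sed would split multi-byte characters into bytes.
to_char_tokens() {
    sed -e 's/[[:space:]]//g' -e 's/./& /g' -e 's/ $//'
}

# Example (file names are placeholders):
# to_char_tokens < corpus.txt > corpus_chars.txt
```

The resulting file can then be passed to lmplz through --text in the same way as the word-segmented files above.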

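Adjust the paths above to match the location of your own files. The parameters mean:
-o n: use n-grams up to order n
--verbose_header: add statistics to the header of the generated file
--text text_file: the txt file that holds the corpus
--arpa: the arpa file to write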
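
The ARPA output is plain text. It begins with a \data\ section that holds one "ngram N=count" line per order, so a model trained with -o 3 should show three such lines. Below is a small sketch for printing them as a sanity check after training. arpa_counts is a hypothetical helper name.

```shell
# Hypothetical helper: print the "ngram N=count" lines from the \data\
# header of an ARPA file, stopping at the first n-gram section so the
# (potentially very large) body is never scanned.
arpa_counts() {
    awk '/^\\1-grams:/ { exit } /^ngram [0-9]+=/ { print }' "$1"
}
```

For example, `arpa_counts result/test.arpa` lists the number of unigrams, bigrams and trigrams in that model.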

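Compression:
The models created above are ARPA text files. Converting them to kenlm's binary format makes them faster to load. The *.klm and *.bin extensions are only a naming choice, since both denote the same kind of binary file: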
./build_binary ./result/people2014_words.arpa ./result/people2014_words.klm
./build_binary ./result/corpus_seg.arpa ./result/corpus_seg.klm
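
With several ARPA files in result/, the conversion can be looped instead of typed out once per file. Below is a sketch, assuming build_binary is called exactly as above (input ARPA path, then output path). convert_all_arpa and its two arguments are hypothetical.

```shell
# Hypothetical helper: convert every *.arpa file in a directory to *.klm.
# $1 = directory with ARPA files (default: result)
# $2 = path to build_binary (default: ./build_binary)
convert_all_arpa() {
    dir="${1:-result}"
    tool="${2:-./build_binary}"
    for arpa in "$dir"/*.arpa; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$arpa" ] || continue
        # Stop at the first failed conversion.
        "$tool" "$arpa" "${arpa%.arpa}.klm" || return 1
    done
}
```

Called as `convert_all_arpa result ./build_binary` from the bin/ directory, this writes each .klm file next to its .arpa source.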

Reference: http://ftp.gnu.org/gnu/gcc/gcc-7.3.0/
