Generating Year-of-the-Tiger Acrostic Poems Automatically with Python PaddleNLP

卿卿如梦 · 2022-01-23

This article shows how to automatically generate Year-of-the-Tiger acrostic poems with Python and PaddleNLP. The example code is explained in detail; interested readers can follow along and try it out.

Contents

I. Data Processing

1. Upgrade paddlenlp

2. Extract the poem heads

3. Build the vocabulary

4. Define the dataset

II. Define and Train the Model

1. Model definition

2. Model training

3. Saving the model

III. Generate the Acrostic Poem

Summary


I. Data Processing

This project uses a classical-poetry dataset as the training set. The encoder receives the head character of each couplet, and the decoder generates the full couplet from the encoder's output. To keep consecutive couplets coherent, the preceding couplets are also prepended before the head character in the encoder input. For example:

"白日依山尽,黄河入海流。欲穷千里目,更上一层楼。" yields two samples:

Sample 1: encoder input "白"; decoder input "白日依山尽,黄河入海流。"

Sample 2: encoder input "白日依山尽,黄河入海流。欲"; decoder input "欲穷千里目,更上一层楼。"
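
This splitting rule is easy to state in code. Below is a minimal sketch of it; the helper name make_samples is ours for illustration and is not part of the original project:

# A minimal sketch of the sample-splitting rule described above.
# make_samples is an illustrative helper, not from the original project.
def make_samples(poem):
    # Split on the full stop "。" and drop empty trailing pieces.
    couplets = [c for c in poem.split("。") if c]
    samples = []
    for i, c in enumerate(couplets):
        prefix = "。".join(couplets[:i])
        enc = (prefix + "。" if prefix else "") + c[0]  # earlier couplets + head character
        dec = c + "。"                                  # the full couplet to be generated
        samples.append((enc, dec))
    return samples

for enc, dec in make_samples("白日依山尽,黄河入海流。欲穷千里目,更上一层楼。"):
    print("encoder:", enc, "| decoder:", dec)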

1. Upgrade paddlenlp


!pip install -U paddlenlp


Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple

Collecting paddlenlp

  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/17/9b/4535ccf0e96c302a3066bd2e4d0f44b6b1a73487c6793024475b48466c32/paddlenlp-2.2.3-py3-none-any.whl (1.2MB)

     |████████████████████████████████| 1.2MB 11.2MB/s eta 0:00:01

Requirement already satisfied, skipping upgrade: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (2.9.0)

Requirement already satisfied, skipping upgrade: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (4.1.0)

Requirement already satisfied, skipping upgrade: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.4.4)

Requirement already satisfied, skipping upgrade: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (1.2.2)

Requirement already satisfied, skipping upgrade: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.42.1)

Requirement already satisfied, skipping upgrade: multiprocess in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.70.11.1)

Requirement already satisfied, skipping upgrade: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.16.0)

Requirement already satisfied, skipping upgrade: numpy>=1.7 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.20.3)

Requirement already satisfied, skipping upgrade: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp) (0.24.2)

Requirement already satisfied, skipping upgrade: dill>=0.3.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from multiprocess->paddlenlp) (0.3.3)

Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (1.6.3)

Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (2.1.0)

Requirement already satisfied, skipping upgrade: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (0.14.1)

Installing collected packages: paddlenlp

  Found existing installation: paddlenlp 2.1.1

    Uninstalling paddlenlp-2.1.1:

      Successfully uninstalled paddlenlp-2.1.1

Successfully installed paddlenlp-2.2.3

2. Extract the poem heads


import re

poems_file = open("./data/data70759/poems_zh.txt", encoding="utf8")

# For each line read, collect the head character of every couplet
poems_samples = []
poems_prefix = []
poems_heads = []
for line in poems_file.readlines():
    line_ = re.sub('。', ' ', line)
    line_ = line_.split()
    # Build the training samples
    for i, p in enumerate(line_):
        poems_heads.append(p[0])
        poems_prefix.append('。'.join(line_[:i]))
        poems_samples.append(p + '。')

# Inspect a few samples
for i in range(20):
    print("poems heads:{}, poems_prefix: {}, poems:{}".format(poems_heads[i], poems_prefix[i], poems_samples[i]))


poems heads:欲, poems_prefix: , poems:欲出未出光辣达,千山万山如火发。

poems heads:须, poems_prefix: 欲出未出光辣达,千山万山如火发, poems:须臾走向天上来,逐却残星赶却月。

poems heads:未, poems_prefix: , poems:未离海底千山黑,才到天中万国明。

poems heads:满, poems_prefix: , poems:满目江山四望幽,白云高卷嶂烟收。

poems heads:日, poems_prefix: 满目江山四望幽,白云高卷嶂烟收, poems:日回禽影穿疏木,风递猿声入小楼。

poems heads:远, poems_prefix: 满目江山四望幽,白云高卷嶂烟收。日回禽影穿疏木,风递猿声入小楼, poems:远岫似屏横碧落,断帆如叶截中流。

poems heads:片, poems_prefix: , poems:片片飞来静又闲,楼头江上复山前。

poems heads:飘, poems_prefix: 片片飞来静又闲,楼头江上复山前, poems:飘零尽日不归去,帖破清光万里天。

poems heads:因, poems_prefix: , poems:因登巨石知来处,勃勃元生绿藓痕。

poems heads:静, poems_prefix: 因登巨石知来处,勃勃元生绿藓痕, poems:静即等闲藏草木,动时顷刻徧乾坤。

poems heads:横, poems_prefix: 因登巨石知来处,勃勃元生绿藓痕。静即等闲藏草木,动时顷刻徧乾坤, poems:横天未必朋元恶,捧日还曾瑞至尊。

poems heads:不, poems_prefix: 因登巨石知来处,勃勃元生绿藓痕。静即等闲藏草木,动时顷刻徧乾坤。横天未必朋元恶,捧日还曾瑞至尊, poems:不独朝朝在巫峡,楚王何事谩劳魂。

poems heads:若, poems_prefix: , poems:若教作镇居中国,争得泥金在泰山。

poems heads:才, poems_prefix: , poems:才闻暖律先偷眼,既待和风始展眉。

poems heads:嚼, poems_prefix: , poems:嚼处春冰敲齿冷,咽时雪液沃心寒。

poems heads:蒙, poems_prefix: , poems:蒙君知重惠琼实,薄起金刀钉玉深。

poems heads:深, poems_prefix: , poems:深妆玉瓦平无垅,乱拂芦花细有声。

poems heads:片, poems_prefix: , poems:片逐银蟾落醉觥。

poems heads:巧, poems_prefix: , poems:巧剪银花乱,轻飞玉叶狂。

poems heads:寒, poems_prefix: , poems:寒艳芳姿色尽明。

3. Build the vocabulary


# Build the vocabulary with PaddleNLP. Since the verses are short, we use
# single characters as the token unit.
from paddlenlp.data import Vocab

vocab = Vocab.build_vocab(poems_samples, unk_token="<unk>", pad_token="<pad>", bos_token="<", eos_token=">")
vocab_size = len(vocab)
print("vocab size", vocab_size)
print("word to idx:", vocab.token_to_idx)

4. Define the dataset


# Define the dataset reader
from paddle.io import Dataset, BatchSampler, DataLoader
import numpy as np

class PoemDataset(Dataset):
    def __init__(self, poems_data, poems_heads, poems_prefix, vocab, encoder_max_len=128, decoder_max_len=32):
        super(PoemDataset, self).__init__()
        self.poems_data = poems_data
        self.poems_heads = poems_heads
        self.poems_prefix = poems_prefix
        self.vocab = vocab
        self.tokenizer = lambda x: [vocab.token_to_idx[x_] for x_ in x]
        self.encoder_max_len = encoder_max_len
        self.decoder_max_len = decoder_max_len

    def __getitem__(self, idx):
        eos_id = self.vocab.token_to_idx[self.vocab.eos_token]
        bos_id = self.vocab.token_to_idx[self.vocab.bos_token]
        pad_id = self.vocab.token_to_idx[self.vocab.pad_token]
        # Make sure encoder and decoder inputs stay within the max lengths
        poet = self.poems_data[idx][:self.decoder_max_len - 2]  # -2 reserves room for bos_id and eos_id
        prefix = self.poems_prefix[idx][-(self.encoder_max_len - 3):]  # -3 reserves room for bos_id, eos_id, and the head token
        # Encode the inputs and outputs
        sample = [bos_id] + self.tokenizer(poet) + [eos_id]
        prefix = self.tokenizer(prefix) if prefix else []
        heads = prefix + [bos_id] + self.tokenizer(self.poems_heads[idx]) + [eos_id]
        sample_len = len(sample)
        heads_len = len(heads)
        sample = sample + [pad_id] * (self.decoder_max_len - sample_len)
        heads = heads + [pad_id] * (self.encoder_max_len - heads_len)
        mask = [1] * (sample_len - 1) + [0] * (self.decoder_max_len - sample_len)  # -1 so the mask length matches trg (out[2])
        out = [np.array(d, "int64") for d in [heads, heads_len, sample, sample, mask]]
        out[2] = out[2][:-1]             # trg: decoder input, drops the final token
        out[3] = out[3][1:, np.newaxis]  # label: decoder target, shifted right by one
        return out

    def shape(self):
        return [([None, self.encoder_max_len], 'int64', 'src'),
                ([None, 1], 'int64', 'src_length'),
                ([None, self.decoder_max_len - 1], 'int64', 'trg')], \
               [([None, self.decoder_max_len - 1, 1], 'int64', 'label'),
                ([None, self.decoder_max_len - 1], 'int64', 'trg_mask')]

    def __len__(self):
        return len(self.poems_data)

dataset = PoemDataset(poems_samples, poems_heads, poems_prefix, vocab)
batch_sampler = BatchSampler(dataset, batch_size=2048)
data_loader = DataLoader(dataset, batch_sampler=batch_sampler)
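
Before training, it is worth pulling one batch and confirming the tensor shapes match what shape() promises. A sanity-check sketch, assuming the loader yields [src, src_length, trg, label, trg_mask] in the order __getitem__ returns them:

# Fetch one batch and print each tensor's shape.
batch = next(iter(data_loader))
for name, t in zip(["src", "src_length", "trg", "label", "trg_mask"], batch):
    print(name, t.shape)
# With batch_size=2048 and the defaults above, this should show roughly:
# src [2048, 128], src_length [2048], trg [2048, 31],
# label [2048, 31, 1], trg_mask [2048, 31]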

II. Define and Train the Model

1. Model definition


from Seq2Seq.models import Seq2SeqModel
from Seq2Seq.loss import CrossEntropyCriterion
from paddlenlp.metrics import Perplexity
import paddle
from paddle.static import InputSpec

# Hyperparameters
lr = 1e-6
max_epoch = 20
models_save_path = "./checkpoints"
encoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "dropout": .2,
                 "direction": "bidirectional", "mode": "GRU"}
decoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "direction": "forward",
                 "dropout": .2, "mode": "GRU", "use_attention": True}
# Input and label specs, derived from the dataset's shape() method
inputs_shape, labels_shape = dataset.shape()
inputs_list = [InputSpec(input_shape[0], input_shape[1], input_shape[2]) for input_shape in inputs_shape]
labels_list = [InputSpec(label_shape[0], label_shape[1], label_shape[2]) for label_shape in labels_shape]
net = Seq2SeqModel(encoder_attrs, decoder_attrs)
model = paddle.Model(net, inputs_list, labels_list)
model.load("./final_models/model")  # resume from the provided pre-trained weights
opt = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters())
model.prepare(opt, CrossEntropyCriterion(), Perplexity())


W0122 21:03:30.616776   166 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1

W0122 21:03:30.620450   166 device_context.cc:465] device: 0, cuDNN Version: 7.6.

2. Model training


# Train. Training takes a while; a trained model is provided at ./final_models/model
model.fit(train_data=data_loader, epochs=max_epoch, eval_freq=1, save_freq=5, save_dir=models_save_path, shuffle=True)

3. Saving the model


# Save the trained weights
model.save("./final_models/model")
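
If you also want a deployable inference-format model, Paddle's high-level API can export one; a sketch (training=False asks paddle.Model.save for the inference format, although this tutorial instead reloads the trained weights into Seq2SeqInferModel below):

# Optional: export an inference-format model (a sketch; not required here,
# since the next section reloads the weights into Seq2SeqInferModel).
model.save("./final_models/infer", training=False)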

III. Generate the Acrostic Poem


import warnings

def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
    """
    Post-process a decoded sequence: truncate at the first eos and
    strip the bos/eos markers unless explicitly requested.
    """
    eos_pos = len(seq) - 1
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    seq = [idx for idx in seq[:eos_pos + 1]
           if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)]
    return seq

# A class for generating the greetings
class GenPoems():
    # content (str): the string to generate poems from, e.g. "恭喜发财"
    # vocab: an instance of paddlenlp.data.vocab.Vocab
    # model: the inference model
    def __init__(self, vocab, model):
        self.bos_id = vocab.token_to_idx[vocab.bos_token]
        self.eos_id = vocab.token_to_idx[vocab.eos_token]
        self.pad_id = vocab.token_to_idx[vocab.pad_token]
        self.tokenizer = lambda x: [vocab.token_to_idx[x_] for x_ in x]
        self.model = model
        self.vocab = vocab

    def gen(self, content, max_len=128):
        # max_len is the encoder_max_len of the Seq2Seq model.
        out = []
        vocab_list = list(self.vocab.token_to_idx.keys())
        content = re.sub("([。,])", '', content)  # drop punctuation from the head characters
        for w in content:
            if w in vocab_list:
                # Condition on the couplets generated so far, then the new head
                heads = out[-(max_len - 3):] + [self.bos_id] + self.tokenizer(w) + [self.eos_id]
                len_heads = len(heads)
                heads = heads + [self.pad_id] * (max_len - len_heads)
                x = paddle.to_tensor([heads], dtype="int64")
                len_x = paddle.to_tensor([len_heads], dtype='int64')
                pred = self.model.predict_batch(inputs=[x, len_x])[0]
                out += self._get_results(pred)[0]  # keep the top beam
            else:
                warnings.warn("{} is not in the vocab list, so it is skipped.".format(w))
        out = ''.join([self.vocab.idx_to_token[id] for id in out])
        return out

    def _get_results(self, pred):
        pred = pred[:, :, np.newaxis] if len(pred.shape) == 2 else pred
        pred = np.transpose(pred, [0, 2, 1])
        outs = []
        for beam in pred[0]:
            id_list = post_process_seq(beam, self.bos_id, self.eos_id)
            outs.append(id_list)
        return outs
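
To see what post_process_seq (defined above) actually does, here is a tiny example; the ids are made up purely for illustration. With bos=1 and eos=2, it truncates at the first eos and strips both markers:

# Illustrative only: the ids below are fabricated for this example.
print(post_process_seq([1, 5, 9, 4, 2, 7, 8], bos_idx=1, eos_idx=2))
# -> [5, 9, 4]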


# Load the inference model
from Seq2Seq.models import Seq2SeqInferModel
import paddle

encoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "dropout": .2,
                 "direction": "bidirectional", "mode": "GRU"}
decoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "direction": "forward",
                 "dropout": .2, "mode": "GRU", "use_attention": True}
infer_model = paddle.Model(Seq2SeqInferModel(encoder_attrs,
                                             decoder_attrs,
                                             bos_id=vocab.token_to_idx[vocab.bos_token],
                                             eos_id=vocab.token_to_idx[vocab.eos_token],
                                             beam_size=10,
                                             max_out_len=256))
infer_model.load("./final_models/model")


# Send New Year greetings
# (of course, it works for love confessions too)
generator = GenPoems(vocab, infer_model)
content = "生龙活虎"
poet = generator.gen(content)
for line in poet.strip().split('。'):
    try:
        print("{}\t{}。".format(line[0], line))
    except IndexError:  # skip empty fragments left by the split
        pass

Output (shown as an image in the original post, omitted here)

Summary

This project walks through training a model that generates acrostic poems. The results show the model has acquired some ability to produce verse. However, given the limited size of the training set and the training time, the generated poems still leave considerable room for improvement; the model will be optimized further in the future, so stay tuned.

That concludes this walkthrough of automatically generating Year-of-the-Tiger acrostic poems with Python PaddleNLP.
