Keywords: multilingual, data balance, data sampling, fine-tuning, curriculum training
Paper link: https://arxiv.org/abs/2007.03001
1. Shared sub-word tokens
To simplify this process, we create a shared token set across all languages using a SentencePiece model (SPM) [14]. Similar to [15], the shared sentence pieces are built by sampling sentences according to a multinomial distribution $\{s_i\}_{i=1..N}$, with

$$s_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}}, \qquad p_i = \frac{n_i}{\sum_{k=1}^{N} n_k},$$

where $n_i$ is the number of sentences available for language $i$ and the parameter $\alpha$ controls the sampling of languages with different frequencies.
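For concreteness, here is a minimal sketch of how this sampling distribution could be computed in Python; the language set, sentence counts, and the value of alpha are made-up illustrations, not values from the paper.

```python
import random

# Hypothetical per-language sentence counts (illustrative only).
sentence_counts = {"en": 1_000_000, "hi": 200_000, "sw": 20_000}
alpha = 0.5  # assumed value; alpha < 1 up-samples low-resource languages

# p_i: natural frequency of language i.
total = sum(sentence_counts.values())
p = {lang: n / total for lang, n in sentence_counts.items()}

# s_i: multinomial sampling distribution with exponent alpha.
z = sum(pi ** alpha for pi in p.values())
s = {lang: pi ** alpha / z for lang, pi in p.items()}

# Draw languages according to s when assembling the SPM training corpus.
langs, weights = zip(*s.items())
print(s)
print(random.choices(langs, weights=weights, k=10))
```

With alpha below 1, low-resource languages contribute more sentence pieces to the shared vocabulary than their natural frequency would allow.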
2. Curriculum training of the joint model
We faced convergence issues with the joint model when training on data from all languages. For these cases, we introduced a curriculum training [16] based approach, which incrementally adds each language after the model has been trained for a fixed number of iterations or the Character Error Rate (CER) drops below 50% for the previously added language. We found that training converges easily for up to 51 languages using this method.
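A schematic sketch of such a curriculum schedule is shown below; `train_for_iterations`, `compute_cer`, and the iteration budget are hypothetical stand-ins for the paper's actual training and evaluation routines.

```python
def curriculum_train(model, languages, train_for_iterations, compute_cer,
                     max_iterations=10_000, cer_threshold=0.5):
    """Incrementally add languages to the training mix.

    A new language joins only after the model has either trained for a fixed
    number of iterations or reached CER below the threshold on the language
    added most recently, mirroring the curriculum described above.
    """
    active = []
    for lang in languages:
        active.append(lang)
        iterations = 0
        while iterations < max_iterations:
            iterations += train_for_iterations(model, active)
            if compute_cer(model, lang) < cer_threshold:
                break  # newest language has converged enough; add the next one
    return model
```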
3. Data preprocessing
For each language, we normalize the text by performing NFKC normalization and removing all punctuation.
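One way this step might look in Python, using `unicodedata`; the exact punctuation set removed in the paper is not specified, so characters whose Unicode category starts with `P` are assumed here.

```python
import unicodedata

def preprocess_text(text: str) -> str:
    """Apply NFKC normalization and strip punctuation.

    Punctuation is taken to be any character whose Unicode category
    starts with 'P' (an assumption; the paper does not give the exact set).
    """
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))

print(preprocess_text("Ｈｅｌｌｏ, world!"))  # -> "Hello world"
```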
4. Monolingual baseline models
For the high- and mid-resource languages, we use 5000 and 2000 sub-word tokens respectively, generated with the SentencePiece toolkit [14]. For low-resource languages, we use graphemes as the modelling units, as they gave better performance than sub-word units.
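As an illustration, such a vocabulary could be built with the SentencePiece Python API roughly as follows; the corpus path, model prefix, and character coverage are assumptions, not settings reported in the paper.

```python
import sentencepiece as spm

# Train a sub-word model for one (hypothetical) high-resource language corpus.
spm.SentencePieceTrainer.train(
    input="high_resource_lang.txt",   # assumed corpus path
    model_prefix="high_resource_spm",
    vocab_size=5000,                  # 2000 for mid-resource languages
    character_coverage=1.0,           # assumed; not specified in the paper
)

# Tokenize a sentence with the resulting model.
sp = spm.SentencePieceProcessor(model_file="high_resource_spm.model")
print(sp.encode("hello world", out_type=str))
```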
5. Training data sampling for multilingual models
Each language $i$ contributes

$$\hat{n}_i = n_{max} \left(\frac{n_i}{n_{max}}\right)^{\beta}$$

training samples, where $n_i$ is the number of training samples for language $i$, $n_{max}$ is the maximum number of training samples across any language, and $\beta$ is a tunable parameter that allows us to adjust the sampling of languages from their natural frequency, when $\beta = 1$, to a uniform distribution across languages, when $\beta = 0$.
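A small sketch of the resulting sample-count computation, consistent with the two boundary cases described above; the per-language counts and the value of beta are invented for illustration.

```python
# Hypothetical training-sample counts per language (illustrative only).
train_counts = {"en": 1_000_000, "hi": 200_000, "sw": 20_000}
beta = 0.5  # assumed value between natural (beta = 1) and uniform (beta = 0)

n_max = max(train_counts.values())

# n_hat_i = n_max * (n_i / n_max) ** beta
sampled_counts = {
    lang: round(n_max * (n / n_max) ** beta) for lang, n in train_counts.items()
}

print(sampled_counts)
# With beta = 1 this reproduces the natural counts; with beta = 0 every
# language is sampled n_max times.
```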