Keywords: multilingual, data balance, data sampling, fine-tuning, curriculum training
Paper link: https://arxiv.org/abs/2007.03001
1. Shared sub-word tokens
To simplify this process, we create a shared token set across all languages using a SentencePiece model (SPM) [14]. Similar to [15], the shared sentence pieces are built by sampling sentences according to a multinomial distribution $\{s_i\}_{i=1..N}$, with

$$s_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}}, \qquad p_i = \frac{n_i}{\sum_{k=1}^{N} n_k},$$

where $n_i$ is the number of sentences available for language $i$ and the parameter $\alpha$ controls the sampling of languages with different frequencies.
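For concreteness, here is a minimal sketch of how this sampling distribution could be computed in Python; the language set, sentence counts, and the value of alpha are made-up illustrations, not values from the paper.

```python
import random

# Hypothetical per-language sentence counts (illustrative only).
sentence_counts = {"en": 1_000_000, "hi": 200_000, "sw": 20_000}
alpha = 0.5  # assumed value; alpha < 1 up-samples low-resource languages

# p_i: natural frequency of language i.
total = sum(sentence_counts.values())
p = {lang: n / total for lang, n in sentence_counts.items()}

# s_i: multinomial sampling distribution with exponent alpha.
z = sum(pi ** alpha for pi in p.values())
s = {lang: pi ** alpha / z for lang, pi in p.items()}

# Draw languages according to s when assembling the SPM training corpus.
langs, weights = zip(*s.items())
print(s)
print(random.choices(langs, weights=weights, k=10))
```

With alpha below 1, low-resource languages contribute more sentence pieces to the shared vocabulary than their natural frequency would allow.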
2. Curriculum training of the joint model
We faced convergence issues with the joint model when training on data from all languages. For these cases, we introduced a curriculum training [16] based approach, which incrementally adds each language after the model has been trained for a fixed number of iterations or the Character Error Rate (CER) drops below 50% for the previously added language. We found that training converges easily for up to 51 languages using this method.
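A schematic sketch of such a curriculum schedule is shown below; `train_for_iterations`, `compute_cer`, and the iteration budget are hypothetical stand-ins for the paper's actual training and evaluation routines.

```python
def curriculum_train(model, languages, train_for_iterations, compute_cer,
                     max_iterations=10_000, cer_threshold=0.5):
    """Incrementally add languages to the training mix.

    A new language joins only after the model has either trained for a fixed
    number of iterations or reached CER below the threshold on the language
    added most recently, mirroring the curriculum described above.
    """
    active = []
    for lang in languages:
        active.append(lang)
        iterations = 0
        while iterations < max_iterations:
            iterations += train_for_iterations(model, active)
            if compute_cer(model, lang) < cer_threshold:
                break  # newest language has converged enough; add the next one
    return model
```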
3. Data preprocessing
For each language, we normalize the text by performing NFKC normalization and removing all punctuation.
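One way this step might look in Python, using `unicodedata`; the exact punctuation set removed in the paper is not specified, so characters whose Unicode category starts with `P` are assumed here.

```python
import unicodedata

def preprocess_text(text: str) -> str:
    """Apply NFKC normalization and strip punctuation.

    Punctuation is taken to be any character whose Unicode category
    starts with 'P' (an assumption; the paper does not give the exact set).
    """
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))

print(preprocess_text("Ｈｅｌｌｏ, world!"))  # -> "Hello world"
```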
4. Monolingual baseline models
For the high- and mid-resource languages, we use 5000 and 2000 sub-word tokens respectively, generated with the SentencePiece toolkit [14]. For low-resource languages, we use graphemes as the modelling units, as they gave better performance than sub-word units.
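As an illustration, such a vocabulary could be built with the SentencePiece Python API roughly as follows; the corpus path, model prefix, and character coverage are assumptions, not settings reported in the paper.

```python
import sentencepiece as spm

# Train a sub-word model for one (hypothetical) high-resource language corpus.
spm.SentencePieceTrainer.train(
    input="high_resource_lang.txt",   # assumed corpus path
    model_prefix="high_resource_spm",
    vocab_size=5000,                  # 2000 for mid-resource languages
    character_coverage=1.0,           # assumed; not specified in the paper
)

# Tokenize a sentence with the resulting model.
sp = spm.SentencePieceProcessor(model_file="high_resource_spm.model")
print(sp.encode("hello world", out_type=str))
```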
5. Training data sampling for multilingual models
Each language $i$ contributes

$$\hat{n}_i = n_{max} \left(\frac{n_i}{n_{max}}\right)^{\beta}$$

training samples, where $n_i$ is the number of training samples for language $i$, $n_{max}$ is the maximum number of training samples across any language, and $\beta$ is a tunable parameter that allows us to adjust the sampling of languages from their natural frequency, when $\beta = 1$, to a uniform distribution across languages, when $\beta = 0$.
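A small sketch of the resulting sample-count computation, consistent with the two boundary cases described above; the per-language counts and the value of beta are invented for illustration.

```python
# Hypothetical training-sample counts per language (illustrative only).
train_counts = {"en": 1_000_000, "hi": 200_000, "sw": 20_000}
beta = 0.5  # assumed value between natural (beta = 1) and uniform (beta = 0)

n_max = max(train_counts.values())

# n_hat_i = n_max * (n_i / n_max) ** beta
sampled_counts = {
    lang: round(n_max * (n / n_max) ** beta) for lang, n in train_counts.items()
}

print(sampled_counts)
# With beta = 1 this reproduces the natural counts; with beta = 0 every
# language is sampled n_max times.
```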