

Data selection for machine translation

This repo contains code for experiments on data selection for machine translation. The model, data, and experiments are described in our paper (cited below). The main focus of this repo is to explore the tradeoffs and interactions between data selection and fine-tuning on out-of-domain and in-domain data. The code is written in Python with Flax. The model is a vanilla transformer. The data comes from TensorFlow Datasets (tfds); we use the WMT data, specifically Paracrawl and News Commentary.


All dependencies are listed in requirements.txt. Models are implemented using the Flax / JAX libraries or Hugging Face Transformers. Data is sourced from TensorFlow Datasets.


An example invocation of the main runner appears further below. There are also two helper runners, which compute selection scores using the Discriminative Classifier (DC) and Contrastive Data Selection (CDS) methods, respectively.
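To make the two kinds of selection score concrete, here is a minimal sketch of how they are commonly defined. It is not the code of the helper runners. It assumes that CDS scores an example by the difference between its log-likelihood under an in-domain model and under a general model, and that DC scores an example by the log-probability a binary classifier assigns to the in-domain class. The function names `cds_score` and `dc_score` are hypothetical.

```python
import math


def cds_score(logp_in_domain: float, logp_general: float) -> float:
    """Contrastive score (assumed form): higher means more in-domain.

    Both arguments are log-likelihoods of the same example, one under a
    model adapted to the in-domain data and one under a general model.
    """
    return logp_in_domain - logp_general


def dc_score(logit: float) -> float:
    """Classifier score (assumed form): log sigmoid(logit).

    `logit` is the raw output of a binary in-domain vs. out-of-domain
    classifier. The two branches keep the computation numerically stable.
    """
    if logit >= 0:
        return -math.log1p(math.exp(-logit))
    return logit - math.log1p(math.exp(logit))


# Toy values, not real model outputs.
print(cds_score(-10.0, -12.5))
print(dc_score(0.0))
```

Under these assumptions both scores are "higher is better", so the same selection step can consume either one.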


python -- model_dir=models/ --dataset_name='newscommentary_paracrawl'
--batch_size=128 --num_train_steps=15000
--emb_dim=512 --mlp_dim=2048 --num_heads=8 --paracrawl_size=4500000
--vocab_path 'tokenizer/sentencepiece_model' --restore_checkpoints
--data_dir='data/' --chkpts_to_keep=1
--checkpoint_freq=5000 --eval_frequency=100
--pretrained_model_dir='pretrained_models/' --save_checkpoints=False
--is_scores_path='scores/scores.csv' --data_selection_size=5e5 --compute_bleu=False

Note: If there is no tokenizer, one will be created. data_dir must be populated; you can download and prepare the data using the TensorFlow Datasets builder. The scores.csv file contains the selection score for each example in the dataset.
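As an illustration of how a scores file and a selection size could fit together, the sketch below keeps the indices of the highest-scoring examples. It is not the repo's implementation. It assumes one score per row, with the row position equal to the example index, and that a higher score means a better match to the target domain; the actual layout of scores.csv may differ. The helper names `read_scores` and `select_top_k` are hypothetical.

```python
import csv
import heapq
import io


def read_scores(csv_file):
    """Read one float score per non-empty row (assumed file layout)."""
    return [float(row[0]) for row in csv.reader(csv_file) if row]


def select_top_k(scores, k):
    """Return the indices of the k highest scores, in ascending index order."""
    top = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return sorted(index for index, _ in top)


# In-memory stand-in for a scores file with four examples.
example_csv = io.StringIO("0.2\n-1.5\n3.1\n0.9\n")
scores = read_scores(example_csv)
selected = select_top_k(scores, k=2)
print(selected)  # [2, 3]
```

A value such as 5e5 parses to 500000 with `int(float("5e5"))`, which under the same assumptions would be the `k` passed to `select_top_k`.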


@article{iter2021complementarity,
  title={On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation},
  author={Iter, Dan and Grangier, David},
  journal={arXiv preprint arXiv:2109.07591},
  year={2021},
  url={}
}

This code branches from the Flax WMT example:

@software{flax2020github,
  author = {Jonathan Heek and Anselm Levskaya and Avital Oliver and Marvin Ritter and Bertrand Rondepierre and Andreas Steiner and Marc van {Z}ee},
  title = {{F}lax: A neural network library and ecosystem for {JAX}},
  url = {},
  version = {0.3.4},
  year = {2020},
}
