google-research
BERTSeq2Seq
This repository contains the code to query our best models (served as TensorFlow Hub models) and their predictions on various academic text-generation benchmarks from our paper "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks" at TACL 2020.
Please cite our paper if you use our data or models.
@article{rothe_tacl20,
author = {Rothe, Sascha and Narayan, Shashi and Severyn, Aliaksei},
title = {Leveraging Pre-trained Checkpoints for Sequence Generation Tasks},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
number = {},
pages = {264-280},
year = {2020}
}
Introduction
Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints and achieved new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion. We believe that NLP researchers will find our dataset with model predictions as a valuable resource to compare pre-trained text generation models and to derive actionable insights.
Predictions
The dataset consists of our sequence-to-sequence model predictions on academic datasets for text generation: Sentence Fusion (DiscoFuse), Sentence Splitting (WikiSplit), Summarization (XSum, CNN/DailyMail and Gigaword) and Machine Translation (WMT 2014 and 2016). Our dataset will be a valuable resource to compare pre-trained text generation models.
The dataset consists of json files with lists of dictionaries
{
“target”: <string>,
“prediction”: <string>
}
Here, “prediction” is the model generated text and “target” is the reference text.
- MT(DE ->EN): WMT 2014 and WMT 2016
- MT(EN->DE): WMT 2014 and WMT 2016
- Sentence Fusion: DiscoFuse
- Sentence Splitting: WikiSplit
- Summarization: Gigaword, CNN/DailyMail and XSum
TFHub Modules
Here is the code to query our best models served as TensorFlow Hub models.
# TF1 version
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
import tensorflow_text as tf_text
MT(DE ->EN)
text_generator = hub.Module(
'[https://tfhub.dev/google/bertseq2seq/bert24_de_en/1](https://tfhub.dev/google/bertseq2seq/bert24_de_en/1)')
de_sents = ['Satz 1', 'Satz 2']
en_sents = text_generator(en_sents)
MT(EN->DE)
text_generator = hub.Module(
'[https://tfhub.dev/google/bertseq2seq/bert24_en_de/1](https://tfhub.dev/google/bertseq2seq/bert24_en_de/1)')
en_sents = ['Sentence 1', 'Sentence 2']
de_sents = text_generator(en_sents)
Sentence Fusion
text_generator = hub.Module(
'[https://tfhub.dev/google/bertseq2seq/roberta24_discofuse/1](https://tfhub.dev/google/bertseq2seq/roberta24_discofuse/1)')
input_texts = ['Sentence 1a Sentence 1b',
'Sentence 2a Sentence 2b Sentence 2c']
output_sents = text_generator(input_texts)
Sentence Splitting
text_generator = hub.Module(
'[https://tfhub.dev/google/bertseq2seq/roberta24_wikisplit/1](https://tfhub.dev/google/bertseq2seq/roberta24_wikisplit/1)')
input_sentences = ['Long Sentence 1', 'Long Sentence 2']
output_texts = text_generator(input_sentences)
Summarization(Title Generation)
text_generator = hub.Module(
'[https://tfhub.dev/google/bertseq2seq/roberta24_gigaword/1](https://tfhub.dev/google/bertseq2seq/roberta24_gigaword/1)')
input_sents = ['This is the first sentence.', 'This is the second sentence.']
output_summaries = text_generator(input_sents)
Summarization (Highlight Generation)
text_generator = hub.Module(
'[https://tfhub.dev/google/bertseq2seq/roberta24_cnndm/1](https://tfhub.dev/google/bertseq2seq/roberta24_cnndm/1)')
input_documents = ['This is text from the first document.',
'This is text from the second document.']
output_summaries = text_generator(input_documents)
Extreme Summarization
text_generator = hub.Module(
'[https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1](https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1)')
input_documents = ['This is text from the first document.',
'This is text from the second document.']
output_summaries = text_generator(input_documents)
Tokenizers
- SentencePiece Tokenizer: vocab file and model file.
- WordPiece Tokenizer: vocab file.
Contact us
If you have a technical question regarding the dataset or publication, please create an issue in this repository. This is the fastest way to reach us.
If you would like to share feedback or report concerns, please email us at berts2s@google.com.