Efficient content-based sparse attention with Routing Transformers

[Figure: Routing attention]

Codebase accompanying the paper, published in TACL 2021. See also the accompanying slides for a quick overview.

Table of Contents

  • Updates
  • Pre-trained PG-19 Checkpoint
  • Explanation of hyperparameters
  • Samples
  • Acknowledgments
  • How to Cite

Updates

  • Routing Transformer + REALM is now SOTA on long-form Question Answering (QA) on the ELI5 dataset from the Knowledge Intensive Language Tasks (KILT) benchmark from Facebook AI, with significant improvements in generation quality over BART, RAG and T5/Mesh TF (+4.11, +5.78 and +9.14 Rouge-L over T5/Mesh TF, BART + DPR and RAG respectively). Check out the source code and pre-trained model weights at Kalpesh's GitHub repository.

Pre-trained PG-19 Checkpoint

Model      | Hparams              | Context Length | Data-set | Vocab    | Download
-----------|----------------------|----------------|----------|----------|---------------
Local-base | pg19_local8k         | 8192           | PG-19    | vocab98K | checkpoint.zip
RT-base    | pg19_local_cluster8k | 8192           | PG-19    | vocab98K | checkpoint.zip
RT-base    | pg19_local_cluster8k | 8192           | ELI-5    | vocab98K | checkpoint.zip

Explanation of hyperparameters

Local Attention

  • local_num_heads: Number of local attention heads
  • query_shape: This represents the shape of the query block.
    • For 1-d local attention with block size b, this would be (b,)
  • memory_query_shape: This represents the query shape of the memory antecedent and is useful for encoder-decoder attention
    • This is usually set the same as query_shape by default
    • This is useful when inputs and targets are of different lengths
    • E.g., if inputs are of length 4096 and targets of length 8192
    • Plausible setting: query_shape = (256,), memory_flange = (256,) and memory_query_shape = (128,)
    • This is because with block size 256, the targets will have 32 blocks
    • To match this in enc-dec attention, the inputs must have 32 blocks
    • This is why we set memory_query_shape = (4096/32,) = (128,)
  • memory_flange: This represents the overlap of the memory block
    • Example setting: query_shape = (b,) and memory_flange = (m * b,)
    • Masked: Each query block attends to m previous blocks
    • Unmasked: Each query block attends to m previous & m subsequent blocks
    • Setting this to (0,) means all the blocks are independent of each other
    • Setting to (0,) is used for full attention, or for axial attention
    • This must be a multiple of query_shape in every dimension
  • Example setting can be found in sparse_transformer.py under pg19_local8k; a rough numeric sketch of these hyperparameters follows this list
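
The following is a minimal sketch, not code from this repository, that ties the local-attention hyperparameters above together. All numeric values are assumptions chosen to match the encoder-decoder example (inputs of length 4096, targets of length 8192), and the plain Python dict merely stands in for the actual hparams object defined in sparse_transformer.py.

# Hypothetical illustration of the local-attention hyperparameters described above.
# The dict stands in for the real hparams object; values are assumed, not taken from the repo.
local_hparams = {
    "local_num_heads": 8,          # assumed number of local attention heads
    "query_shape": (256,),         # 1-d query block of size 256
    "memory_flange": (256,),       # each block also attends to one neighbouring block (m = 1)
    "memory_query_shape": (128,),  # enc-dec setting for inputs of length 4096
}

targets_length, inputs_length = 8192, 4096

# With block size 256, the targets are split into 8192 / 256 = 32 blocks.
num_target_blocks = targets_length // local_hparams["query_shape"][0]

# For enc-dec attention the inputs must also form 32 blocks, so the memory
# query block size is 4096 / 32 = 128, matching memory_query_shape above.
assert inputs_length // num_target_blocks == local_hparams["memory_query_shape"][0]

# memory_flange must be a multiple of query_shape in every dimension.
assert local_hparams["memory_flange"][0] % local_hparams["query_shape"][0] == 0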

Routing Attention

  • sparsity_cluster_num_heads: Number of routing attention heads
  • sparsity_cluster_size: Number of clusters
  • sparsity_cluster_attention_window: Average size of each cluster
  • sparsity_skip_first: Number of initial layers to skip routing attention
    • sparsity_skip_first = 0 would have routing attention in every layer
    • sparsity_skip_first equal to the total number of layers would disable routing attention entirely
  • Example setting can be found in sparse_transformer.py under pg19_local_cluster8k; a rough numeric sketch of these hyperparameters follows this list
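
As a similar minimal sketch, not code from this repository, the snippet below shows how the routing-attention hyperparameters relate for an 8k-token sequence. The numeric values and the assumption of roughly balanced clusters (average cluster size ≈ sequence length / number of clusters) are illustrative only, and the dict again stands in for the hparams object in sparse_transformer.py.

# Hypothetical illustration of the routing-attention hyperparameters described above.
# Values are assumed for an 8192-token sequence; they are not taken from the repo.
routing_hparams = {
    "sparsity_cluster_num_heads": 8,           # assumed number of routing attention heads
    "sparsity_cluster_size": 16,               # assumed number of clusters
    "sparsity_cluster_attention_window": 512,  # assumed average cluster size
    "sparsity_skip_first": 0,                  # 0 => routing attention in every layer
}

sequence_length = 8192
num_layers = 24  # assumed total number of layers

# Assuming roughly balanced clusters, the average cluster size is about
# sequence_length / sparsity_cluster_size = 8192 / 16 = 512.
assert (sequence_length // routing_hparams["sparsity_cluster_size"]
        == routing_hparams["sparsity_cluster_attention_window"])

# sparsity_skip_first ranges from 0 (routing attention in every layer) to
# num_layers (no routing attention at all).
assert 0 <= routing_hparams["sparsity_skip_first"] <= num_layers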

Samples

PG-19 (sequence length 8k)

Unconditional Samples

Conditional Samples

Document Machine Translation (sequence length 4k)

Acknowledgments

The authors would like to thank Phillip Wang and Aran Komatsuzaki for a PyTorch implementation of the Routing Transformer, and Yonghui Wu, Weikang Zhou and Dehao Chen for helpful feedback in improving the implementation of this work. The authors would also like to thank the anonymous reviewers and the Action Editor Xavier Carreras of TACL for their constructive comments, which helped improve the exposition of this work.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@article{roy2020efficient,
  title={Efficient content-based sparse attention with routing transformers},
  author={Roy, Aurko and Saffar, Mohammad and Vaswani, Ashish and Grangier, David},
  journal={arXiv preprint arXiv:2003.05997},
  year={2020}
}
