Dedal: Deep embedding and alignment of protein sequences.

This package contains all the necessary tools to reproduce the experiments presented in the dedal paper.

Install

To install dedal, it is necessary to clone the google-research repo:

git clone https://github.com/google-research/google-research.git

From the google_research folder, you may install the necessary requirements by executing:

pip install -r dedal/requirements.txt
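As a quick sanity check, you can verify from a Python shell started in the google_research folder that the packages used throughout this README import cleanly (a minimal sketch; it only checks the imports):

import tensorflow as tf
import tensorflow_hub as hub
from dedal import infer  # fails if the repo folder is not on the Python path
print('TensorFlow version:', tf.__version__)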

Pre-trained model

DEDAL is available in TensorFlow Hub.

The model expects a tf.Tensor<tf.int32>[2B, 512] as inputs, representing a batch of B sequence pairs to be aligned right-padded to a maximum length of 512, including the special EOS token. Pairs are expected to be arranged consecutively in this batch, that is, inputs[2*b] and inputs[2*b + 1] represent the b-th sequence pair with b ranging from 0 up to B - 1 (inclusive).
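For illustration, here is a minimal sketch of how such a batch could be assembled, assuming the 2B sequences have already been tokenized as int32 vectors of length 512. The placeholder tensors below stand in for real tokenized sequences; infer.preprocess, shown later, produces this format for a single pair:

import tensorflow as tf

MAX_LEN = 512
# Placeholder tokenized sequences for illustration only; in practice these
# come from the DEDAL tokenizer, right-padded to MAX_LEN after the EOS token.
tokens_a0 = tf.zeros([MAX_LEN], dtype=tf.int32)
tokens_b0 = tf.zeros([MAX_LEN], dtype=tf.int32)
tokens_a1 = tf.zeros([MAX_LEN], dtype=tf.int32)
tokens_b1 = tf.zeros([MAX_LEN], dtype=tf.int32)
# Pairs are arranged consecutively: inputs[2*b] and inputs[2*b + 1] hold the
# b-th pair, so B = 2 pairs yield a tf.Tensor<tf.int32>[4, 512] batch.
inputs = tf.stack([tokens_a0, tokens_b0, tokens_a1, tokens_b1], axis=0)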

By default, the model runs in "alignment" mode and its output is a Python dict containing:

  • 'sw_scores': a tf.Tensor<tf.float32>[B] with the alignment scores
  • 'homology_logits': a tf.Tensor<tf.float32>[B] with the homology detection logits
  • 'paths': a tf.Tensor<tf.float32>[B, 512, 512, 9] representing the predicted alignments
  • 'sw_params': a tuple of three tf.Tensor<tf.float32>[B, 512, 512] containing the contextual Smith-Waterman parameters (substitution scores, gap open and gap extend penalties)

Additional signatures are provided to run the model in "embedding" mode, in which case it returns a single tf.Tensor<tf.float32>[2B, 512, 768] with the embeddings of each input sequence.

import tensorflow as tf
import tensorflow_hub as hub
from dedal import infer # Requires google_research/google-research.
dedal_model = hub.load('https://tfhub.dev/google/dedal/3')
# "Gorilla" and "Mallard" sequences from [1, Figure 3].
protein_a = 'SVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPRVPWVKMILNKL'
protein_b = 'VKCKCSRKGPKIRFSNVRKLEIKPRYPFCVEEMIIVTLWTRVRGEQQHCLNPKRQNTVRLLKWY'
# Represents sequences as a `tf.Tensor<tf.int32>[2, 512]` batch of tokens.
inputs = infer.preprocess(protein_a, protein_b)
# Aligns `protein_a` to `protein_b`.
align_out = dedal_model(inputs)
# Retrieves per-position embeddings of both sequences.
embeddings = dedal_model.call(inputs, embeddings_only=True)
# Postprocesses output and displays alignment.
output = infer.expand(
    [align_out['sw_scores'], align_out['paths'], align_out['sw_params']])
output = infer.postprocess(output, len(protein_a), len(protein_b))
alignment = infer.Alignment(protein_a, protein_b, *output)
print(alignment)
# Displays the raw Smith-Waterman score and the homology detection logits.
print('Smith-Waterman score (uncorrected):', align_out['sw_scores'].numpy())
print('Homology detection logits:', align_out['homology_logits'].numpy())
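Continuing the example above, and assuming the homology logits parameterize a binary homologous vs. non-homologous classifier, a sigmoid converts them into probabilities:

# Converts the per-pair homology logits into probabilities (assumes a binary
# homologous vs. non-homologous classifier).
homology_probs = tf.math.sigmoid(align_out['homology_logits'])
print('Homology probability:', homology_probs.numpy())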

Data

DEDAL was trained and evaluated on publicly available data, namely, the March 2018 release of UniRef50 and Pfam 34.0. The subdirectory preprocessing contains all of the code necessary to reproduce the preprocessing pipeline described in the paper. Sequence identifiers corresponding to each Pfam-A seed split can be downloaded here.

Training from scratch

The repository also contains all of the code necessary to reproduce the different training phases from scratch. For now, we assume the preprocessing pipeline has been run locally on the raw, publicly available data before training. We aim to make the preprocessed data available in the following weeks to simplify this process.

For example, to train a DEDAL model from scratch on the masked language modelling, pairwise sequence alignment and homology detection tasks, one may run the following command from the google_research folder:

python3 -m dedal.main -- \
  --base_dir /tmp/dedal \
  --gin_config data/uniref50.gin \
  --gin_config data/pfam34_alignment.gin \
  --gin_config data/pfam34_homology.gin \
  --gin_config model/dedal.gin \
  --gin_config task/finetune.gin \
  --gin_bindings UNIREF50_DATA_DIR=/path/to/masked_lm/data \
  --gin_bindings PFAM34_ALIGNMENT_DATA_DIR=/path/to/alignment/data \
  --gin_bindings PFAM34_HOMOLOGY_DATA_DIR=/path/to/homology/data \
  --gin_bindings MAIN_VOCAB_PATH=/path/to/main/vocab \
  --gin_bindings TOKEN_REPLACE_VOCAB_PATH=/path/to/token_replace/vocab \
  --gin_bindings ALIGNMENT_PATH_VOCAB_PATH=/path/to/alignment_path/vocab \
  --task train \
  --alsologtostderr

Note that the transformer architecture may be slow to train on CPU; running on accelerators greatly improves training speed. The pre-trained DEDAL model was fit using 32 TPU v3 cores.
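Before launching a long run, you can check which accelerators TensorFlow can see (a minimal sketch using standard TensorFlow APIs):

import tensorflow as tf
# Lists the accelerators visible to TensorFlow; empty lists mean CPU-only.
print('GPUs:', tf.config.list_physical_devices('GPU'))
print('TPUs:', tf.config.list_physical_devices('TPU'))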

The base_dir flag specifies the folder where checkpoints are written and metrics are logged; in the example above, this is /tmp/dedal. To visualize the logged metrics, start TensorBoard pointing at that folder, for example:

%tensorboard --logdir /tmp/dedal

If training is interrupted, rerunning the same command will not start training over from scratch but will resume from the last available checkpoint. The frequency of checkpointing and logging can be changed in the task Gin config files, e.g., task/finetune.gin.

The task flag selects whether to run training, evaluation, or a downstream training with its own eval. In evaluation mode, training checkpoints are loaded on the fly until the last one has been reached, so an evaluation process can run alongside a training process without slowing training down. Alternatively, one can set separate_eval=False in the training loop so that eval and train are run alternately.

License

Licensed under the Apache 2.0 License.

Disclaimer

This is not an official Google product.
