# Contrack
In human-human conversations, Context Tracking deals with identifying important entities and keeping track of their properties and relationships. This is a challenging problem involving several subtasks such as entity recognition, attribute classification, coreference resolution and resolving plural mentions. The Contrack tool approaches this problem as an end-to-end modeling task where the conversational context is represented by an entity repository containing the entities mentioned so far, their properties and relationships between them. The repository is updated incrementally turn-by-turn, thus making it computationally efficient and capable of handling long conversations.
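To make the idea of an incrementally updated entity repository concrete, here is a minimal Python sketch. Note that this is an illustration only, not Contrack's actual API or data model: the names `Entity`, `EntityRepository`, and `update` are hypothetical, and the real system operates on learned representations rather than plain dictionaries.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an entity repository updated turn by turn.
# These names do not come from the Contrack codebase; they only
# illustrate the incremental context-tracking idea described above.

@dataclass
class Entity:
    name: str
    properties: dict = field(default_factory=dict)

class EntityRepository:
    """Holds all entities mentioned so far in the conversation."""

    def __init__(self):
        self.entities = {}

    def update(self, mentions):
        """Merge the mentions found in one turn into the repository.

        `mentions` maps entity names to property dicts. Existing entities
        are updated in place, so the per-turn cost depends on the turn,
        not on the length of the whole conversation.
        """
        for name, props in mentions.items():
            entity = self.entities.setdefault(name, Entity(name))
            entity.properties.update(props)

repo = EntityRepository()
repo.update({"Alice": {"gender": "female"}})               # turn 1
repo.update({"Alice": {"location": "Paris"}, "Bob": {}})   # turn 2
# repo now holds both entities, with Alice's properties merged across turns.
```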
Contributions to the codebase are welcome, and we would love to hear from you if you find this codebase useful. Finally, if you use Contrack for a research publication, please consider citing:
- Towards a Unified Approach to Entity-Centric Context Tracking in Conversations, Ulrich Rückert, Srinivas Sunkara, Abhinav Rastogi, Sushant Prakash, Pranav Khaitan
## Installation
The following instructions are for installing on Ubuntu 18.04.
-   Make sure you have `python3` and `bazel` installed. Follow the instructions here to install bazel.
-   Download the contrack subdirectory:

    ```
    svn export https://github.com/google-research/google-research/trunk/contrack
    # Or
    git clone https://github.com/google-research/google-research.git
    ```
-   Create and enter a virtual environment (optional but preferred):

    ```
    virtualenv -p python3 contrack_env
    source ./contrack_env/bin/activate
    ```
-   Install the dependencies:

    ```
    cd contrack
    python3 configure.py
    ```

    If you want to use an existing installation of tensorflow and gensim, run the configuration tool with the `--no-deps` flag to skip dependency installation:

    ```
    python3 configure.py --no-deps
    ```
-   Compile the source code:

    ```
    bazel build //:preprocess //:train //:predict
    ```
## Usage
Here is an example of how to preprocess a small example data file and train a model on it.
-   Download the word2vec data used during preprocessing:

    ```
    mkdir /tmp/contrack_data
    export DATA_DIR=/tmp/contrack_data
    wget -c -P $DATA_DIR "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
    gunzip $DATA_DIR/GoogleNews-vectors-negative300.bin.gz
    ```
-   Run the preprocess tool to convert text conversations to TFRecord format:

    ```
    mkdir /tmp/contrack_example
    export BASE_DIR=/tmp/contrack_example
    ./bazel-bin/preprocess --input_file=data/example_conversations.txt \
      --output_dir=$BASE_DIR \
      --tokenizer_handle="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3" \
      --bert_handle="https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3" \
      --wordvec_path=$DATA_DIR/GoogleNews-vectors-negative300.bin \
      --logtostderr
    ```
-   Train a model on the TFRecord data. (A GPU is not necessary, but is recommended for faster training.)

    ```
    cp data/example_config.json $BASE_DIR/config.json
    ./bazel-bin/train --train_data_glob $BASE_DIR/example_conversations.tfrecord \
      --config_path $BASE_DIR/config.json --model_path $BASE_DIR/model \
      --mode=two_steps --logtostderr
    ```
-   Apply the model to a dataset. Accuracy measures on the dataset are written to the log file:

    ```
    ./bazel-bin/predict --input_data_glob $BASE_DIR/example_conversations.tfrecord \
      --model_path $BASE_DIR/model --logtostderr
    ```