
Indexing with fastRAG

fastRAG can be used with any Haystack-based indexing store (that is, one built on Haystack's DocumentStore class). fastRAG includes a directory, scripts/indexing/, with scripts for creating indexes for all of the pipelines fastRAG supports.

1. Elasticsearch:

To create an Elasticsearch index (used with the BM25 sparse retriever), use the following script:

python scripts/indexing/create_elastic.py \
--store config/store/elastic-local.yaml \
--data config/data/wikipedia_w100_hfdataset.yaml
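
For context on what a BM25 sparse retriever computes over such an index, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not the code that Elasticsearch or fastRAG runs. The function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    Tokenization is a naive lowercase whitespace split (an assumption for
    this sketch; real engines use configurable analyzers).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Eiffel Tower is in Paris",
]
scores = bm25_scores("capital of France", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Documents that share no terms with the query score zero, which is why a sparse index only needs to look up the query's own terms.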

2. FAISS:

To create a FAISS-based dense index with DPR as the embedder/retriever, use the following script:

python scripts/indexing/create_faiss.py \
--store config/store/faiss.yaml \
--data config/data/wikipedia_w100_hfdataset.yaml \
--embedder config/retriever/dpr.yaml \
--index-save-path <save path>

3. Qdrant + SentenceTransformers:

To create a Qdrant-based dense index with a sentence-transformer model as the embedder/retriever, use the following script:

python scripts/indexing/create_embeddings.py \
--data config/data/wikipedia_hf_6M.yaml \
--embedder config/embedder/sentence-transformer.yaml \
--store config/store/qdrant.yaml \
--batch_size 64
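
The `--batch_size 64` flag suggests that documents are embedded in fixed-size batches. The sketch below shows that general pattern in plain Python; it is not the code of `create_embeddings.py`. The helpers `batched` and `embed_corpus` and the stand-in `fake_embed` function are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


def embed_corpus(
    texts: Iterable[str],
    embed: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed all texts batch by batch and return the vectors in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed(batch))
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for a sentence-transformer model:
    # maps each text to a 2-dimensional "vector".
    return [[float(len(text)), float(text.count(" "))] for text in batch]


texts = [f"passage number {i}" for i in range(150)]
vectors = embed_corpus(texts, fake_embed, batch_size=64)
print(len(vectors))
```

A larger batch size generally improves throughput at the cost of more memory per step, so the value is worth tuning to the available hardware.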

4. PLAID:

PLAID (based on this paper) is a dense retrieval index engine that stores token vectors using an efficient algorithm. PLAID must be used with a dense token embedder such as ColBERT, which embeds individual tokens and ranks documents using a token-to-token similarity method. More information on PLAID can be found on our models page.
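
To make the token-to-token ranking idea concrete, here is a small pure-Python sketch of the late-interaction ("MaxSim") score associated with ColBERT: each query token is matched to its most similar document token, and those maxima are summed. This shows the scoring rule only, not PLAID's index structure or fastRAG's implementation. The function names and the toy 2-dimensional vectors are made up for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token vector, take its best
    match among the document token vectors, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy unit-length token embeddings (invented values, not model output).
query = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]  # exact match for both query tokens
doc_b = [(0.6, 0.8), (0.8, 0.6)]              # partial matches only

ranked = sorted(
    {"doc_a": doc_a, "doc_b": doc_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Because every document token needs its own vector, the index is much larger than a single-vector dense index, which is the storage problem an engine like PLAID addresses.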

PLAID Requirements:

  1. GPU indexing is supported on an RTX 3090 (Ampere) or newer. PyTorch should be installed with CUDA support using:

    pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
  2. PLAID uses faiss to run k-means clustering. For higher performance, install faiss-gpu (for both CPU and GPU backends) via the conda package manager. See this page for detailed instructions.
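
Before indexing, it can help to confirm that PyTorch sees a suitable GPU. The following is a minimal sketch, assuming that "Ampere or newer" corresponds to a CUDA compute capability of 8.0 or higher (the RTX 3090 reports 8.6). The function names `is_ampere_or_newer` and `check_gpu` are hypothetical and not part of fastRAG.

```python
def is_ampere_or_newer(capability):
    """Return True if a CUDA compute capability (major, minor) is 8.0 or higher.

    Assumption of this sketch: Ampere-generation GPUs such as the RTX 3090
    report compute capability 8.x, and newer generations report higher values.
    """
    major, _minor = capability
    return major >= 8


def check_gpu():
    """Return a short message on whether this machine looks usable for GPU indexing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch cannot see a CUDA device"
    capability = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if is_ampere_or_newer(capability):
        return f"{name}: compute capability {capability}, Ampere or newer"
    return f"{name}: compute capability {capability}, older than Ampere"


print(check_gpu())
```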

To create a PLAID-based dense index, a ColBERT checkpoint is required in addition to a corpus and a store configuration. The following script can be used to create such an index:

python scripts/indexing/create_plaid.py \
--checkpoint=<path-to-colbert-model-checkpoint> \
--collection=<path to tsv collection> \
--index_save_path=<index-save-path> \
--gpus=0 \
--ranks=1 \
--name=plaid_test \
--kmeans_iterations=4
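
The `--collection` argument takes a TSV file, but its layout is not described here. The sketch below assumes the convention used by the upstream ColBERT project: one passage per line in the form `pid<TAB>text`, with `pid` equal to the zero-based line index. Verify this against `create_plaid.py` before relying on it. The helper names `write_collection` and `read_collection` are hypothetical.

```python
import tempfile
from pathlib import Path


def write_collection(passages, path):
    """Write passages as pid<TAB>text lines, with pid equal to the line index."""
    with open(path, "w", encoding="utf-8") as f:
        for pid, text in enumerate(passages):
            # Tabs and newlines inside a passage would break the TSV layout,
            # so collapse all whitespace runs to single spaces.
            clean = " ".join(text.split())
            f.write(f"{pid}\t{clean}\n")


def read_collection(path):
    """Read the file back and check that ids are sequential from 0."""
    passages = []
    with open(path, encoding="utf-8") as f:
        for line_idx, line in enumerate(f):
            pid, text = line.rstrip("\n").split("\t", 1)
            assert int(pid) == line_idx, f"unexpected pid {pid} on line {line_idx}"
            passages.append(text)
    return passages


passages = [
    "Paris is the capital\tof France.",
    "Berlin is the capital\nof Germany.",
]
path = Path(tempfile.mkdtemp()) / "collection.tsv"
write_collection(passages, path)
print(path.read_text(encoding="utf-8"))
```

The resulting file path is what would be passed as `--collection=<path to tsv collection>` under this assumption.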
