Indexing with fastRAG
fastRAG can be used with any Haystack-based index store, that is, any store that builds on Haystack's DocumentStore class.
fastRAG includes a directory, scripts/indexing/, with scripts for creating indexes for all pipelines that fastRAG supports.
1. Elasticsearch:
To create an Elasticsearch index (used with the BM25 sparse retriever), use the following script:
python scripts/indexing/create_elastic.py \
    --store config/store/elastic-local.yaml \
    --data config/data/wikipedia_w100_hfdataset.yaml
2. FAISS:
To create a FAISS-based dense index with DPR as the embedder/retriever, use the following script:
python scripts/indexing/create_faiss.py \
    --store config/store/faiss.yaml \
    --data config/data/wikipedia_w100_hfdataset.yaml \
    --embedder config/retriever/dpr.yaml \
    --index-save-path <save path>
3. Qdrant + SentenceTransformers:
To create a Qdrant-based dense index with a sentence-transformer model as the embedder/retriever, use the following script:
python scripts/indexing/create_embeddings.py \
    --data config/data/wikipedia_hf_6M.yaml \
    --embedder config/embedder/sentence-transformer.yaml \
    --store config/store/qdrant.yaml \
    --batch_size 64
4. PLAID:
PLAID (based on this paper) is a dense retrieval index engine that stores token vectors using an efficient algorithm. PLAID must be used with a dense token embedder such as ColBERT, which embeds individual tokens and utilizes a token-to-token similarity method for ranking documents. More information on PLAID can be found on our models page.
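To make the token-to-token ranking idea concrete, here is a minimal sketch of ColBERT-style late interaction (often called MaxSim): each query token is matched to its most similar document token, and the per-token maxima are summed. The vectors and the names `dot`, `maxsim_score`, `doc_a`, and `doc_b` are made up for illustration; this is not fastRAG's or PLAID's code, and it omits the compression and pruning that make PLAID efficient.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token, take the similarity of
    its best-matching document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # matches only the first query token

# Rank documents by descending score.
ranked = sorted(
    {"doc_a": doc_a, "doc_b": doc_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # prints ['doc_a', 'doc_b']
```

Because the score depends on every document token rather than a single pooled vector, the index has to store one vector per token, which is why an engine that stores token vectors efficiently matters here.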
PLAID Requirements:
- Indexing with a GPU is supported on an RTX 3090 (Ampere) or newer, and PyTorch should be installed with CUDA support using:
  pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
- PLAID uses faiss for running k-means clustering. For higher performance, it is required to install faiss-gpu (for both CPU and GPU backends) via the conda package manager. See this page for detailed instructions.
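Before indexing, it can help to confirm that the environment matches the requirements above. The following is a rough sketch, not part of fastRAG: the function name `check_plaid_requirements` and the report keys are hypothetical, and treating CUDA compute capability 8.x or higher as "Ampere or newer" is an assumption made for this check.

```python
import importlib.util


def check_plaid_requirements() -> dict:
    """Report which of the PLAID requirements appear to be met."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Both the CPU and GPU builds of faiss are imported as `faiss`.
        "faiss_installed": importlib.util.find_spec("faiss") is not None,
        "cuda_available": False,
        "ampere_or_newer": False,
    }
    if report["torch_installed"]:
        import torch

        report["cuda_available"] = bool(torch.cuda.is_available())
        if report["cuda_available"]:
            # Assumption: a major compute capability of 8 or more means
            # the first GPU is Ampere-class or newer.
            major, _minor = torch.cuda.get_device_capability(0)
            report["ampere_or_newer"] = major >= 8
    return report


if __name__ == "__main__":
    for name, ok in check_plaid_requirements().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

The check only inspects the first GPU (device 0) and does not verify that the installed faiss build has GPU support.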
To create a PLAID-based dense index, a ColBERT checkpoint is required in addition to a corpus and store configuration. The following script creates such an index:
python scripts/indexing/create_plaid.py \
    --checkpoint=<path-to-colbert-model-checkpoint> \
    --collection=<path to tsv collection> \
    --index_save_path=<index-save-path> \
    --gpus=0 \
    --ranks=1 \
    --name=plaid_test \
    --kmeans_iterations=4
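When the indexing step is driven from Python, for example inside a larger data-preparation job, the same command can be assembled as an argument list. This is a sketch only: the helper `build_plaid_command` is hypothetical, it uses just the flags shown above with the same example values as defaults, and the placeholder paths are left as placeholders.

```python
import shlex
from typing import List


def build_plaid_command(
    checkpoint: str,
    collection: str,
    index_save_path: str,
    gpus: int = 0,
    ranks: int = 1,
    name: str = "plaid_test",
    kmeans_iterations: int = 4,
) -> List[str]:
    """Build the argument list for scripts/indexing/create_plaid.py."""
    return [
        "python",
        "scripts/indexing/create_plaid.py",
        f"--checkpoint={checkpoint}",
        f"--collection={collection}",
        f"--index_save_path={index_save_path}",
        f"--gpus={gpus}",
        f"--ranks={ranks}",
        f"--name={name}",
        f"--kmeans_iterations={kmeans_iterations}",
    ]


cmd = build_plaid_command(
    checkpoint="<path-to-colbert-model-checkpoint>",
    collection="<path to tsv collection>",
    index_save_path="<index-save-path>",
)
# Print a shell-quoted form of the command for inspection.
print(shlex.join(cmd))

# To run it from the repository root once real paths are filled in:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string means paths containing spaces reach the script as one argument each, with no manual quoting.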