treerag
Description
Tree-based Retrieval Augmented Generation
Languages
- Python 91%
- Jupyter Notebook 9%
Tree-based Retrieval Augmented Generation
A research project that implements hybrid search for documents containing a large number of facts, named entities, numerical values, dates, and other precise information.
Overview
The project is built around the following ideas:
- The data warehouse is based on a vector database (Milvus).
- Records in the warehouse are organized as a forest of trees, with the root node representing the entire document (file) and child nodes representing fragments of the document. Each fragment can be further divided into smaller parts and stored in corresponding nodes in the document tree.
- Search is performed in two stages: a semantic search using embeddings, followed by a precise search that keeps only the stage 1 results whose entities fully match the query.
The library is divided into modules by functionality:
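The two-stage search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the `Chunk` class, the `two_stage_search` function, and the toy three-dimensional vectors are all hypothetical, and a real setup would obtain embeddings from a model and run stage 1 inside the vector database.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical record: a text fragment with its embedding and entities."""
    text: str
    embedding: list[float]  # toy vector; a real system would use a model
    entities: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(chunks, query_embedding, query_entities, top_k=3):
    # Stage 1: semantic search, keep the top_k chunks by cosine similarity.
    ranked = sorted(
        chunks,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    candidates = ranked[:top_k]

    # Stage 2: keep only candidates that contain every query entity
    # (exact match, case-insensitive; the case folding is an assumption).
    wanted = {e.lower() for e in query_entities}
    return [c for c in candidates if wanted <= {e.lower() for e in c.entities}]


chunks = [
    Chunk("Milvus 2.5 was released in 2024.", [0.9, 0.1, 0.0], {"Milvus", "2024"}),
    Chunk("Vector databases store embeddings.", [0.8, 0.2, 0.1], {"embeddings"}),
    Chunk("The invoice is dated 12 March 2023.", [0.1, 0.9, 0.2], {"12 March 2023"}),
]

hits = two_stage_search(chunks, [1.0, 0.0, 0.0], {"milvus"}, top_k=2)
for hit in hits:
    print(hit.text)
```

Stage 1 narrows the corpus cheaply by meaning, and stage 2 removes semantically similar fragments that do not actually mention the requested entity.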
Components
Loaders
- PDFLoader: Converts PDF files to Markdown format, with CUDA acceleration support
Extractors
- EntitiesExtractor: Extracts named entities from text documents
Storage
- TreeStorage: A hierarchical storage system that organizes document chunks in a tree structure for efficient retrieval
  - Supports embedding-based search
  - Configurable chunk sizes
  - Branching factor customization
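To illustrate how chunk size and branching factor shape a document tree, here is a minimal sketch. It is not the TreeStorage API: `Node`, `build_tree`, and `leaves` are hypothetical names, and splitting by character count is an assumption made for brevity (a real splitter would more likely respect sentence or Markdown boundaries).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: the root holds the whole document."""
    text: str
    children: list["Node"] = field(default_factory=list)


def build_tree(text: str, chunk_size: int = 40, branching: int = 2) -> Node:
    """Recursively split text until every leaf fits within chunk_size."""
    if chunk_size < 1 or branching < 2:
        raise ValueError("chunk_size must be >= 1 and branching must be >= 2")

    node = Node(text)
    if len(text) <= chunk_size:
        return node  # small enough: this node is a leaf

    # Split into at most `branching` roughly equal parts (ceiling division).
    step = -(-len(text) // branching)
    for start in range(0, len(text), step):
        part = text[start:start + step]
        node.children.append(build_tree(part, chunk_size, branching))
    return node


def leaves(node: Node) -> list[Node]:
    """Collect leaf nodes in left-to-right (document) order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


doc = "".join(str(i % 10) for i in range(100))  # 100-character toy document
tree = build_tree(doc, chunk_size=30, branching=2)
print(len(tree.children), [len(leaf.text) for leaf in leaves(tree)])
```

A larger branching factor produces a shallower, wider tree, while a smaller chunk size produces more and finer-grained leaves.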
Models
- NomicEmbedTextV2Moe: Text embedding model used for semantic representation
Pipelines
- Pipelines: Combines storage and extraction capabilities for streamlined document processing and retrieval. It currently provides the following methods:
  - semantic_search(): runs a semantic search for the given query.
  - load_pdf(): loads PDF data into storage from a file or folder.
  - find(): finds documents containing the given query.
Quick start
Install
Create storage and load documents
Create a pipeline and load PDF documents
Software stack
TreeRag is built on the following technologies:
Core Libraries
- PyTorch (≥2.6.0): Deep learning framework
- Flash Attention (≥2.7.4): Efficient attention mechanism implementation
NLP & Language Processing
- spaCy (≥3.8.4): Industrial-strength NLP
- Language Models:
  - en_core_web_sm: English language model (small)
  - en_core_web_trf: English transformer-based model
  - ru_core_news_sm: Russian language model (small)
- Sentence Transformers (≥3.4.1): State-of-the-art text embeddings
- Docling (≥2.25.2): PDF document processing toolkit
Vector Database & Retrieval
- PyMilvus (≥2.5.4): Vector database client with model integration
- Datasets (≥3.3.2): Hugging Face datasets library for data management
LLM Integration
- LiteLLM (≥1.61.20): Simplified LLM API access
Utilities
- Jinja2 (≥3.1.5): Template engine for text generation
- Loguru (≥0.7.3): Logging library
The project requires Python 3.11 or higher.