Tree-based Retrieval Augmented Generation

A research project that implements hybrid search over documents dense with facts, named entities, numerical values, dates, and similar details.

Overview

The project rests on the following ideas:

Data is stored in a vector database (Milvus).
Records in storage are organized as a forest of trees: the root node of each tree represents an entire document (file), and child nodes represent fragments of it. Each fragment can be split further into smaller parts, which are stored in its own child nodes.
Search runs in two stages: semantic search using embeddings, followed by a precise search for full entity matches within the stage 1 results.
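
The two-stage search can be sketched in a few lines. This is a minimal, self-contained illustration, not the project's implementation: the `Chunk` class, the `two_stage_search` function, the toy 3-dimensional vectors, and the hand-written entity sets are all hypothetical. It also assumes that "full match" means every entity in the query must appear in the chunk. In the project itself, embeddings come from the embedding model and stage 1 runs against Milvus.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Chunk:
    """Hypothetical record: a text fragment, its embedding, and its entities."""
    text: str
    embedding: list[float]
    entities: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(chunks, query_embedding, query_entities, top_k=3):
    # Stage 1: semantic search, keep the top_k chunks closest to the query.
    ranked = sorted(
        chunks, key=lambda c: cosine(c.embedding, query_embedding), reverse=True
    )
    candidates = ranked[:top_k]
    # Stage 2 (assumed semantics): keep candidates containing every query entity.
    wanted = {e.lower() for e in query_entities}
    return [c for c in candidates if wanted <= {e.lower() for e in c.entities}]


chunks = [
    Chunk("Revenue grew 12% in 2023.", [0.9, 0.1, 0.0], {"2023", "12%"}),
    Chunk("Revenue grew 8% in 2022.", [0.8, 0.2, 0.0], {"2022", "8%"}),
    Chunk("The office moved to Berlin.", [0.0, 0.1, 0.9], {"Berlin"}),
]
hits = two_stage_search(chunks, [1.0, 0.0, 0.0], {"2022"}, top_k=2)
print([c.text for c in hits])  # ['Revenue grew 8% in 2022.']
```

Both revenue chunks are semantically close to the query, so embeddings alone cannot tell 2022 from 2023. The entity filter in stage 2 resolves that, which is the motivation for combining the two stages.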

The library is divided into modules by functionality:

Components

Loaders

PDFLoader: Converts PDF files to markdown format, with CUDA acceleration support

Extractors

EntitiesExtractor: Extracts named entities from text documents

Storage

TreeStorage: A hierarchical storage system that organizes document chunks in a tree structure for efficient retrieval
- Supports embedding-based search
- Configurable chunk sizes
- Branching factor customization
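
The tree layout can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Node` class, the `build_tree` and `leaves` helpers, the `chunk_size` and `branching` parameter names, and splitting by character count. TreeStorage's actual API and splitting strategy may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a text fragment and its sub-fragments."""
    text: str
    children: list["Node"] = field(default_factory=list)


def build_tree(text: str, chunk_size: int = 40, branching: int = 2) -> Node:
    """Split text into `branching` parts recursively until parts fit chunk_size."""
    if chunk_size < 1 or branching < 2:
        raise ValueError("chunk_size must be >= 1 and branching must be >= 2")
    node = Node(text)
    if len(text) <= chunk_size:
        return node  # small enough: this node is a leaf
    step = -(-len(text) // branching)  # ceiling division
    for start in range(0, len(text), step):
        part = text[start:start + step]
        node.children.append(build_tree(part, chunk_size, branching))
    return node


def leaves(node: Node) -> list[str]:
    """Collect leaf fragments from left to right."""
    if not node.children:
        return [node.text]
    return [leaf for child in node.children for leaf in leaves(child)]


doc = "a" * 100
root = build_tree(doc, chunk_size=30, branching=2)
print(len(root.children), [len(t) for t in leaves(root)])
```

In this sketch the root holds the whole document, a smaller `chunk_size` produces a deeper tree, and a larger `branching` value produces a wider, shallower one.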

Models

NomicEmbedTextV2Moe: Text embedding model used for semantic representation

Pipelines

Pipelines: Combines storage and extraction capabilities for streamlined document processing and retrieval. It currently provides the following methods:
- semantic_search(): runs a semantic search for the given query
- load_pdf(): loads PDF data into storage from a file or folder
- find(): finds documents containing the given query

Quick start

Install

Create storage and load documents

Create pipeline and load pdf documents

Software stack

TreeRag is built on the following technologies:

Core Libraries

PyTorch (≥2.6.0): Deep learning framework
Flash Attention (≥2.7.4): Efficient attention mechanism implementation

NLP & Language Processing

spaCy (≥3.8.4): Industrial-strength NLP
- Language Models:
  - en_core_web_sm: English language model (small)
  - en_core_web_trf: English transformer-based model
  - ru_core_news_sm: Russian language model (small)
Sentence Transformers (≥3.4.1): State-of-the-art text embeddings
Docling (≥2.25.2): PDF document processing toolkit

Vector Database & Retrieval

PyMilvus (≥2.5.4): Vector database client with model integration
Datasets (≥3.3.2): Hugging Face datasets library for data management

LLM Integration

LiteLLM (≥1.61.20): Simplified LLM API access

Utilities

Jinja2 (≥3.1.5): Template engine for text generation
Loguru (≥0.7.3): Logging library

The project requires Python 3.11 or higher.
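
To check an environment against these requirements, something like the following sketch may help. It is not part of the project: the `MINIMUMS` table, `as_tuple`, and `check_environment` are hypothetical names, and the PyPI distribution names are assumptions derived from the library names listed above. Version parsing is deliberately simplified and treats pre-releases like final releases.

```python
import re
import sys
from importlib import metadata

# Minimum versions from the list above; the distribution names are assumptions.
MINIMUMS = {
    "torch": "2.6.0",
    "flash-attn": "2.7.4",
    "spacy": "3.8.4",
    "sentence-transformers": "3.4.1",
    "docling": "2.25.2",
    "pymilvus": "2.5.4",
    "datasets": "3.3.2",
    "litellm": "1.61.20",
    "jinja2": "3.1.5",
    "loguru": "0.7.3",
}


def as_tuple(version: str) -> tuple[int, ...]:
    """Parse up to three leading numeric components, e.g. '2.6.0+cu124'."""
    nums = [int(n) for n in re.findall(r"\d+", version.split("+")[0])[:3]]
    return tuple(nums + [0] * (3 - len(nums)))  # pad '2.6' to (2, 6, 0)


def check_environment() -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(f"Python 3.11+ required, found {sys.version.split()[0]}")
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if as_tuple(installed) < as_tuple(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


for problem in check_environment():
    print(problem)
```

The script prints one line per missing or outdated package and nothing when every requirement is met.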

Repository: treerag. Latest commit: ba02177 by gurgutan ("+docstrings"), a year ago.
