treerag

Description

Tree-based Retrieval Augmented Generation

Languages

  • Python 91%
  • Jupyter Notebook 9%
README.md

Tree-based Retrieval Augmented Generation

A research project that implements hybrid search for documents containing a large number of facts, named entities, numerical values, dates, and similar details.

Overview

The project is built around the following main ideas:

  • The data warehouse is based on a vector database (Milvus).
  • Records in the warehouse are organized as a forest of trees, with the root node representing the entire document (file) and child nodes representing fragments of the document. Each fragment can be further divided into smaller parts and stored in corresponding nodes in the document tree.
  • Search is performed in two stages: a semantic search using embeddings, followed by a precise search that requires an exact match of entities within the results of the first stage.
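
The two-stage search described above can be illustrated with a minimal, self-contained sketch. It is not the project's implementation: the bag-of-words `embed()` function stands in for a real embedding model, a plain Python list stands in for Milvus, and the names `two_stage_search`, `embed`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(query: str, entities: list[str],
                     chunks: list[str], top_k: int = 2) -> list[str]:
    # Stage 1: rank all chunks by similarity to the query, keep the top_k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: keep only candidates that contain every entity verbatim.
    return [c for c in candidates
            if all(e.lower() in c.lower() for e in entities)]


chunks = [
    "Revenue in 2021 was 4.2 million dollars.",
    "Revenue in 2022 was 5.1 million dollars.",
    "The office moved to Berlin in 2022.",
    "Employees enjoy free coffee.",
]
hits = two_stage_search("revenue in dollars", ["2022"], chunks)
print(hits)
```

In this sketch the first stage cannot tell the 2021 and 2022 revenue chunks apart, because they are equally similar to the query; the exact match on the entity "2022" in the second stage is what selects the right one.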

The library is divided into modules by functionality:

Components

Loaders

  • PDFLoader: Converts PDF files to Markdown format, with CUDA acceleration support

Extractors

  • EntitiesExtractor: Extracts named entities from text documents
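
To make the idea of entity extraction concrete, here is a deliberately simplified sketch. It does not reproduce EntitiesExtractor: it swaps in plain regular expressions for an NLP model, and the labels (`DATE`, `NUMBER`, `NAME`) and the function `extract_entities` are assumptions chosen for illustration.

```python
import re

# One alternation with named groups; earlier alternatives win, so an ISO date
# is labelled DATE before its digits can be read as separate numbers.
ENTITY = re.compile(
    r"(?P<DATE>\b\d{4}-\d{2}-\d{2}\b)"
    r"|(?P<NUMBER>\b\d+(?:\.\d+)?\b)"
    r"|(?P<NAME>\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b)"
)


def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, surface form) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in ENTITY.finditer(text)]


entities = extract_entities(
    "Acme Corp paid 1250.50 dollars to John Smith on 2024-03-15."
)
print(entities)
```

The surface forms collected this way are the kind of strings that an exact-match search stage can later look for in retrieved fragments.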

Storage

  • TreeStorage: A hierarchical storage system that organizes document chunks in a tree structure for efficient retrieval
    • Supports embedding-based search
    • Configurable chunk sizes
    • Branching factor customization
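
How chunk size and branching factor could shape a document tree is sketched below. This is an assumption-laden illustration, not the TreeStorage code: the `Node` class, `build_tree`, `leaves`, and the parameter names `chunk_size` and `branching` are hypothetical, and sizes are measured in characters for simplicity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)


def build_tree(text: str, chunk_size: int = 40, branching: int = 2) -> Node:
    """Recursively split text into up to `branching` parts per level
    until every leaf fits within `chunk_size` characters."""
    if branching < 2:
        raise ValueError("branching must be at least 2")
    node = Node(text)
    words = text.split()
    # A fragment that fits, or a single word that cannot be split, is a leaf.
    if len(text) <= chunk_size or len(words) < 2:
        return node
    per_child = math.ceil(len(words) / branching)
    for i in range(0, len(words), per_child):
        part = " ".join(words[i:i + per_child])
        node.children.append(build_tree(part, chunk_size, branching))
    return node


def leaves(node: Node) -> list[str]:
    """Collect leaf texts from left to right."""
    if not node.children:
        return [node.text]
    return [leaf for child in node.children for leaf in leaves(child)]


root = build_tree("one two three four five six seven eight",
                  chunk_size=12, branching=2)
print(leaves(root))
```

In this sketch the root holds the whole document and every level halves the fragment, so a smaller `chunk_size` gives a deeper tree while a larger `branching` gives a wider, shallower one.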

Models

  • NomicEmbedTextV2Moe: Text embedding model used for semantic representation

Pipelines

  • Pipelines: Combines storage and extraction capabilities for streamlined document processing and retrieval. It currently provides the following methods:
    • semantic_search(): runs a semantic search for the given query.
    • load_pdf(): loads PDF data into storage from a file or folder.
    • find(): finds documents containing the given query.

Quick start

Install

Create storage and load documents

Create pipeline and load pdf documents

Software stack

TreeRag is built on the following technologies:

Core Libraries

  • PyTorch (≥2.6.0): Deep learning framework
  • Flash Attention (≥2.7.4): Efficient attention mechanism implementation

NLP & Language Processing

  • spaCy (≥3.8.4): Industrial-strength NLP
    • Language Models:
      • en_core_web_sm: English language model (small)
      • en_core_web_trf: English transformer-based model
      • ru_core_news_sm: Russian language model (small)
  • Sentence Transformers (≥3.4.1): State-of-the-art text embeddings
  • Docling (≥2.25.2): PDF document processing toolkit

Vector Database & Retrieval

  • PyMilvus (≥2.5.4): Vector database client with model integration
  • Datasets (≥3.3.2): Hugging Face datasets library for data management

LLM Integration

  • LiteLLM (≥1.61.20): Simplified LLM API access

Utilities

  • Jinja2 (≥3.1.5): Template engine for text generation
  • Loguru (≥0.7.3): Logging library

The project requires Python 3.11 or higher.