md_rag

Description

Vectorizes a set of markdown documents and stores them in a ChromaDB database. The resulting database can then be used to build augmented context for LLM queries (a RAG application).

Languages

  • Python 100%
README.md

Markdown Vectorizer for RAG Applications

Overview

This project vectorizes markdown documents and stores them in ChromaDB for use in Retrieval-Augmented Generation (RAG) applications. The script recursively processes markdown files, splits their content into chunks, generates an embedding for each chunk, and stores the results in a vector database for semantic search.

Features

  • 🔍 Recursive markdown file processing
  • 📄 Text extraction from markdown files
  • 🧩 Configurable text chunking
  • 💡 Sentence transformer embeddings
  • 🗃️ ChromaDB vector storage
  • 📝 Flexible YAML configuration

Prerequisites

  • Python 3.8+
  • Basic understanding of markdown files and vector databases

Installation

  1. Clone the repository.
  2. Create a virtual environment (recommended).
  3. Install dependencies.

Configuration

Edit the config.yaml file to customize the embedding model, the chunking parameters, and the input directory.
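The contents of config.yaml are not reproduced in this README, so the following is only a sketch of the settings it appears to hold, written as the Python dictionary a YAML loader would produce. The key names come from the Advanced Configuration section; the nesting, the example values (including the model name `all-MiniLM-L6-v2` and the `./docs` path), and the `validate_config` helper are assumptions for illustration.

```python
from typing import Any, Dict

# Hypothetical in-memory equivalent of config.yaml. Key names follow the
# README; every value here is a placeholder, not the project's default.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "input_directory": "./docs",
    "embedding_model": "all-MiniLM-L6-v2",
    "chunking": {"size": 500, "overlap": 50},
}


def validate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Check that the settings named in the README are present and sane."""
    for key in ("input_directory", "embedding_model", "chunking"):
        if key not in config:
            raise ValueError(f"missing required setting: {key}")

    chunking = config["chunking"]
    size = chunking.get("size")
    overlap = chunking.get("overlap", 0)

    if not isinstance(size, int) or size <= 0:
        raise ValueError("chunking.size must be a positive integer")
    # An overlap as large as the chunk itself would never advance the window.
    if not isinstance(overlap, int) or not 0 <= overlap < size:
        raise ValueError("chunking.overlap must satisfy 0 <= overlap < size")
    return config
```

Validating the overlap up front avoids a chunking loop that never moves forward, which is the most likely failure from a mistyped value.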

Usage

Vectorizing Documents

Run the main script to process the markdown files under the configured input_directory.
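The script's entry point is not shown in this README. As a rough picture of the recursive discovery step it describes, here is a minimal sketch using only the standard library; the function name `iter_markdown_files` and the `*.md` pattern are assumptions, and the real script may filter or parse files differently.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_markdown_files(input_directory: str) -> Iterator[Tuple[Path, str]]:
    """Yield (path, text) for every .md file below input_directory."""
    root = Path(input_directory)
    if not root.is_dir():
        raise FileNotFoundError(f"input directory not found: {root}")

    # rglob walks subdirectories; sorting keeps the order reproducible.
    for path in sorted(root.rglob("*.md")):
        if path.is_file():
            yield path, path.read_text(encoding="utf-8")
```

Because this is a generator, files are read one at a time rather than all at once, which keeps memory use proportional to the largest single file.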

Querying Documents

The script also includes a method for querying the vectorized documents.
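The query method itself is not reproduced in this README, and in the project the lookup is handled by ChromaDB. To show the idea behind a semantic query, the sketch below ranks stored chunks by cosine similarity in plain Python. It is a stand-in for illustration, not ChromaDB's API or index; the names `cosine_similarity` and `top_k` are hypothetical, and the vectors would normally come from the embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vector: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k stored chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of chunks; a vector database exists precisely to avoid that cost on large collections.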

Advanced Configuration

  • Change embedding_model to use different sentence transformers
  • Adjust chunking.size and chunking.overlap for different text segmentation
  • Modify input_directory to point to your markdown document source
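To make the effect of chunking.size and chunking.overlap concrete, here is a minimal sliding-window sketch. It assumes both settings are measured in characters, which this README does not confirm; the project may split on tokens or sentences instead, and `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.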

Troubleshooting

  • Ensure markdown files are UTF-8 encoded
  • Check that input directory exists
  • Verify dependencies are correctly installed
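The checks above can be sketched as a small preflight script. This is an illustration using only the standard library, not part of the project; the function names are hypothetical, and the module names passed to `missing_modules` (for instance `chromadb`, `sentence_transformers`, `yaml`) are assumptions based on the features listed, since the dependency list is not shown here.

```python
import importlib.util
from pathlib import Path
from typing import Iterable, List


def find_non_utf8_files(input_directory: str) -> List[Path]:
    """Return markdown files under input_directory that fail to decode as UTF-8."""
    root = Path(input_directory)
    if not root.is_dir():
        raise FileNotFoundError(f"input directory not found: {root}")

    bad: List[Path] = []
    for path in sorted(root.rglob("*.md")):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["chromadb", "sentence_transformers", "yaml"])` would list whichever of those assumed dependencies are absent from the active environment.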

Performance Considerations

  • Processing time grows with the total amount of text, since every chunk must be embedded
  • Memory usage depends on the number and size of markdown files
  • Choose an embedding model that balances accuracy and speed; larger models are generally slower to run

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Your Name - your.email@example.com

Project Link: https://github.com/yourusername/markdown-vectorizer
