md_rag
Description
Vectorizes a set of markdown documents and stores them in a ChromaDB database. The resulting database can then be used to build context that augments queries to an LLM (a RAG application).
Languages
- Python 100%
Markdown Vectorizer for RAG Applications
Overview
This project vectorizes markdown documents and stores them in ChromaDB for use in Retrieval-Augmented Generation (RAG) applications. The script recursively processes markdown files, chunks their content, generates embeddings, and stores them in a vector database for semantic search.
Features
- 🔍 Recursive markdown file processing
- 📄 Text extraction from markdown files
- 🧩 Configurable text chunking
- 💡 Sentence transformer embeddings
- 🗃️ ChromaDB vector storage
- 📝 Flexible YAML configuration
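The first two features, recursive discovery and text extraction, can be sketched with the standard library alone. The helper names `find_markdown_files` and `read_markdown` are hypothetical and not taken from the project, and the `.md` suffix filter is an assumption about which files count as markdown.

```python
from pathlib import Path
from typing import Iterator


def find_markdown_files(root: Path) -> Iterator[Path]:
    """Yield every .md file under root in sorted order (hypothetical helper)."""
    # rglob walks the whole directory tree, not just the top level.
    yield from sorted(p for p in root.rglob("*.md") if p.is_file())


def read_markdown(path: Path) -> str:
    """Return the file's raw text, decoded as UTF-8."""
    return path.read_text(encoding="utf-8")
```

Sorting the paths keeps the processing order stable between runs, which makes repeated runs easier to compare.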
Prerequisites
- Python 3.8+
- Basic understanding of markdown files and vector databases
Installation
- Clone the repository.
- Create a virtual environment (recommended).
- Install dependencies.
Configuration
Edit the YAML configuration file to customize the vectorizer's settings.
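The settings this README names under Advanced Configuration are `embedding_model`, `chunking.size`, `chunking.overlap`, and `input_directory`. As a rough sketch, a parsed configuration could be validated as shown here. The plain dictionary stands in for whatever a YAML parser would return; the class names, the validation rules, and the example values are assumptions, not the project's actual defaults.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ChunkingConfig:
    size: int
    overlap: int


@dataclass(frozen=True)
class VectorizerConfig:
    input_directory: str
    embedding_model: str
    chunking: ChunkingConfig


def parse_config(raw: Mapping[str, Any]) -> VectorizerConfig:
    """Validate a mapping such as the one a YAML parser would return."""
    chunking = raw.get("chunking", {})
    cfg = ChunkingConfig(size=int(chunking["size"]), overlap=int(chunking["overlap"]))
    if cfg.size <= 0:
        raise ValueError("chunking.size must be positive")
    if not 0 <= cfg.overlap < cfg.size:
        raise ValueError("chunking.overlap must be in [0, chunking.size)")
    return VectorizerConfig(
        input_directory=str(raw["input_directory"]),
        embedding_model=str(raw["embedding_model"]),
        chunking=cfg,
    )


# Placeholder values for illustration only.
config = parse_config({
    "input_directory": "./docs",
    "embedding_model": "all-MiniLM-L6-v2",
    "chunking": {"size": 500, "overlap": 50},
})
```

Rejecting an overlap that is not smaller than the chunk size up front avoids a chunker that never advances.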
Usage
Vectorizing Documents
Run the main script to process markdown files:
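The script's internals are not reproduced in this README. A minimal sketch of what a vectorizing run might do, assuming the `chromadb` and `sentence-transformers` packages, is given here. The function names, the ID scheme, and the metadata fields are hypothetical; only `PersistentClient`, `get_or_create_collection`, `Collection.add`, and `SentenceTransformer.encode` are real library calls.

```python
import hashlib
from typing import Dict, List, Tuple


def build_records(chunks_by_file: Dict[str, List[str]]) -> Tuple[List[str], List[str], List[dict]]:
    """Flatten {source path: chunks} into parallel ids, documents, metadatas lists."""
    ids, documents, metadatas = [], [], []
    for source, chunks in chunks_by_file.items():
        for index, chunk in enumerate(chunks):
            # Stable ID: the same source and position always give the same ID.
            digest = hashlib.sha1(f"{source}:{index}".encode("utf-8")).hexdigest()[:16]
            ids.append(digest)
            documents.append(chunk)
            metadatas.append({"source": source, "chunk_index": index})
    return ids, documents, metadatas


def store(chunks_by_file: Dict[str, List[str]], model_name: str,
          db_path: str, collection_name: str) -> int:
    """Embed all chunks and add them to a persistent ChromaDB collection."""
    # Imported lazily so build_records stays usable without these packages.
    import chromadb
    from sentence_transformers import SentenceTransformer

    ids, documents, metadatas = build_records(chunks_by_file)
    if not ids:
        return 0
    embeddings = SentenceTransformer(model_name).encode(documents).tolist()
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(ids=ids, documents=documents,
                   embeddings=embeddings, metadatas=metadatas)
    return len(ids)
```

Keeping the source path and chunk index in the metadata lets a later query report where each retrieved chunk came from.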
Querying Documents
The script includes a method to query the vectorized documents:
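The project's own query method is not shown here. One way such a method could look, assuming a ChromaDB collection and a sentence transformer model are already loaded, is sketched in this block. `query_documents` and `flatten_results` are hypothetical names; `Collection.query` with `query_embeddings` and `n_results` is the real ChromaDB call, and it returns one inner list per query for each field.

```python
from typing import Any, Dict, List


def flatten_results(results: Dict[str, Any]) -> List[dict]:
    """Turn the nested lists returned for a single query into flat rows."""
    # Each field holds one inner list per query; index 0 is the only query here.
    rows = zip(results["ids"][0], results["documents"][0],
               results["metadatas"][0], results["distances"][0])
    return [
        {"id": i, "text": doc, "metadata": meta, "distance": dist}
        for i, doc, meta, dist in rows
    ]


def query_documents(collection, model, question: str, n_results: int = 3) -> List[dict]:
    """Embed the question and return the nearest stored chunks."""
    embedding = model.encode([question]).tolist()
    results = collection.query(query_embeddings=embedding, n_results=n_results)
    return flatten_results(results)
```

The question has to be embedded with the same model that produced the stored vectors, otherwise the distances are not meaningful.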
Advanced Configuration
- Change embedding_model to use a different sentence transformer model
- Adjust chunking.size and chunking.overlap for different text segmentation
- Modify input_directory to point to your markdown document source
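To illustrate how a chunk size and an overlap interact, here is a minimal sliding-window chunker. It measures both values in characters, which is an assumption: the project may count tokens or words instead, and `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of more chunks to embed and store.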
Troubleshooting
- Ensure markdown files are UTF-8 encoded
- Check that input directory exists
- Verify dependencies are correctly installed
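These three checks can be automated before a run. The sketch below uses only the standard library; `preflight` is a hypothetical helper, and the import names passed to it (for example `chromadb`, `sentence_transformers`, `yaml`) are assumptions about the project's dependencies.

```python
import importlib.util
from pathlib import Path
from typing import List


def preflight(input_directory: str, packages: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    root = Path(input_directory)
    if not root.is_dir():
        problems.append(f"input directory not found: {root}")
    else:
        for path in sorted(p for p in root.rglob("*.md") if p.is_file()):
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError as exc:
                problems.append(f"not valid UTF-8: {path} ({exc.reason} at byte {exc.start})")
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    return problems
```

Collecting every problem instead of stopping at the first one gives a complete picture in a single pass.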
Performance Considerations
- Larger documents may require more processing time
- Memory usage depends on the number and size of markdown files
- Choose an embedding model that balances accuracy and performance
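One common way to keep memory bounded on large inputs is to embed and store chunks in fixed-size batches rather than all at once. Whether the project does this is not stated; the `batched` helper below is a generic sketch with a hypothetical name.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch
```

Each batch can be embedded and written to the database before the next one is read, so only one batch of embeddings is held in memory at a time.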
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Distributed under the MIT License. See the repository's license file for more information.
Contact
Your Name - your.email@example.com
Project Link: https://github.com/yourusername/markdown-vectorizer