FireFeed - AI-powered RSS parser and aggregator
A modern RSS parser with AI support for automatic collection, processing, and distribution of news in multiple languages.
Official website: https://firefeed.net
Table of Contents
- Project Overview
- Key Features
- Technology Stack
- Architecture
- Installation and Setup
- Configuration
- API Documentation
- Development
- Project Structure
- License
Project Overview
FireFeed is a high-performance parsing system for automatic collection, processing, and distribution of news content. The project uses modern machine learning technologies for intelligent text processing and provides multilingual support for international audiences.
Key Features
AI-powered Content Processing
- Automatic news translation into four languages (Russian, German, French, English) using Helsinki-NLP OPUS-MT and M2M100 models (optional, controlled by TRANSLATION_ENABLED)
- Duplicate detection using semantic analysis and vector embeddings from Sentence Transformers (optional, controlled by DUPLICATE_DETECTOR_ENABLED)
- Intelligent image processing with automatic extraction and optimization
Multilingual Support
- Fully localized Telegram bot with support for 4 languages
- REST API with multilingual interface
- Adaptive translation system that takes terminology into account
Flexible RSS System
- Automatic parsing of over 50 RSS feeds from various sources
- News categorization by topics (world news, technology, sports, economy, etc.)
- Personalized user subscriptions to categories and sources
- Custom RSS feeds - ability to add personal sources
Secure Architecture
- JWT authentication for API
- Password encryption using bcrypt
- Email validation with confirmation codes
- Secure secret storage through environment variables
High Performance
- Asynchronous architecture based on asyncio
- PostgreSQL connection pool for efficient database operations
- Task queues for parallel translation processing
- ML model caching for memory optimization
Technology Stack
Backend
- Python 3.11+ with asyncio
- FastAPI for REST API
- PostgreSQL with pgvector for semantic search
- Redis for storing API key usage data
- aiopg for asynchronous database queries
AI/ML
- Transformers (Hugging Face)
- Sentence Transformers for embeddings
- SpaCy for text processing
- Torch for computations
Integrations
- Telegram Bot API
- SMTP for email notifications
- Webhook support
Infrastructure
- Docker containerization
- systemd for service management
- nginx for proxying
Architecture
The project consists of several key components:
- Telegram Bot (apps/telegram_bot/) - main user interaction interface
- RSS Parser Service (apps/rss_parser/) - background service for RSS feed parsing
- REST API (apps/api/) - web API for external integrations
- Translation Services (services/translation/) - translation system with caching
- Text Analysis (services/text_analysis/) - ML-based duplicate detection and text analysis
- User Management (services/user/) - user and subscription management service
Telegram Bot
The Telegram bot serves as the primary interface for users to interact with the FireFeed system. It provides personalized news delivery, subscription management, and multilingual support.
Key Features
- Personalized News Delivery: Users receive news based on their category subscriptions in their preferred language
- Multilingual Interface: Full localization support for English, Russian, German, and French
- Subscription Management: Easy category-based subscription configuration through inline keyboards
- Automatic Publishing: News items are automatically published to configured Telegram channels
Publication Rate Limiting
To prevent spam and ensure fair usage, the bot rate-limits news publications on a per-feed basis:
Feed-Level Limits
Each RSS feed has configurable limits:
- cooldown_minutes: Minimum time between publications from this feed (default: 60 minutes)
- max_news_per_hour: Maximum number of news items per hour from this feed (default: 10)
Telegram Publication Checks
Before publishing any news item to Telegram channels, the system performs two types of checks:
- Count-based Limiting:
  - Counts publications from the same feed within the last 60 minutes
  - If count >= max_news_per_hour, skips publication
  - Uses data from the rss_items_telegram_bot_published table
- Time-based Limiting:
  - Checks time since last publication from the same feed
  - If elapsed time < cooldown_minutes, skips publication
How It Works
Together, these checks ensure that even if several news items from the same feed are processed simultaneously, only the allowed number is published to Telegram. This prevents rate limit violations and keeps the user experience consistent.
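As a rough illustration, the two checks could be combined as shown below. This is a minimal sketch, not the project's code: the function name may_publish and the in-memory list of timestamps are assumptions made for the example, whereas the real bot reads publication history from the rss_items_telegram_bot_published table.

```python
from datetime import datetime, timedelta, timezone


def may_publish(published_at, now, cooldown_minutes=60, max_news_per_hour=10):
    """Return True if a new item from this feed may be published now.

    `published_at` is a list of timezone-aware datetimes of earlier
    publications from the same feed (hypothetical input for this sketch).
    """
    # Count-based check: publications within the last 60 minutes.
    window_start = now - timedelta(minutes=60)
    recent = [ts for ts in published_at if ts > window_start]
    if len(recent) >= max_news_per_hour:
        return False

    # Time-based check: time elapsed since the most recent publication.
    if published_at:
        elapsed = now - max(published_at)
        if elapsed < timedelta(minutes=cooldown_minutes):
            return False

    return True
```

Note that with the default values the cooldown is the stricter of the two limits; the count-based limit only becomes the deciding factor when cooldown_minutes is set well below 60.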
Scalability and Reliability
- Horizontal scaling through microservice architecture
- Fault tolerance with automatic restarts and logging
- Performance monitoring with detailed telemetry
- Graceful shutdown for proper service termination
Service Architecture
The project uses a service-oriented architecture with dependency injection to keep components testable and maintainable.
RSS Services
RSSFetcher (apps/rss_parser/services/rss_fetcher.py)
Service for fetching and parsing RSS feeds.
Key Features:
- Asynchronous RSS feed fetching with semaphore support for concurrency control
- XML structure parsing with extraction of titles, content, and metadata
- Duplicate detection through built-in detector
- Media content extraction (images, videos)
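Semaphore-based concurrency control can be sketched as follows. The names fetch_all, fetch_one, and max_concurrent are hypothetical and are not taken from rss_fetcher.py; fetch_one stands in for whatever coroutine actually downloads and parses a single feed.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrent=5):
    """Fetch every URL while keeping at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        # The semaphore blocks here once `max_concurrent` fetches are running.
        async with semaphore:
            return await fetch_one(url)

    # gather() preserves the order of `urls` in the returned list.
    return await asyncio.gather(*(guarded(url) for url in urls))
```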
Configuration:
RSSValidator (apps/rss_parser/services/rss_validator.py)
Service for RSS feed validation.
Key Features:
- URL availability checking with timeouts
- Validation result caching
- RSS structure validation
Configuration:
RSSStorage (apps/rss_parser/services/rss_storage.py)
Service for database operations on RSS data.
Key Features:
- Saving RSS items to database
- News translation management
- RSS feed settings retrieval (cooldowns, limits)
MediaExtractor (apps/rss_parser/services/media_extractor.py)
Service for extracting media content from RSS items.
Key Features:
- Image URL extraction from various RSS formats (media:thumbnail, enclosure)
- Video URL extraction with size checking
- Atom and RSS format support
Translation Services
ModelManager (services/translation/model_manager.py)
ML model manager for translations.
Key Features:
- Lazy loading of translation models
- In-memory model caching with automatic cleanup
- GPU/CPU memory management
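Lazy loading combined with an in-memory cache might look like the sketch below. It assumes a least-recently-used eviction policy and a max_models limit, neither of which is confirmed by this README; the class name LazyModelCache and the loader callable are likewise hypothetical.

```python
from collections import OrderedDict


class LazyModelCache:
    """Load models on first use and keep only the most recently used ones."""

    def __init__(self, loader, max_models=2):
        self._loader = loader          # callable: model name -> loaded model
        self._max_models = max_models  # assumed limit, for illustration only
        self._models = OrderedDict()

    def get(self, name):
        if name in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(name)
            return self._models[name]
        # Cache miss: load lazily, then evict the least recently used models.
        model = self._loader(name)
        self._models[name] = model
        while len(self._models) > self._max_models:
            self._models.popitem(last=False)
        return model
```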
Configuration:
TranslationService (services/translation/translation_service.py)
Main service for performing translations.
Key Features:
- Batch translation processing for performance optimization
- Text preprocessing and postprocessing
- Translation concurrency management
Configuration:
TranslationCache (services/translation/translation_cache.py)
Translation result caching.
Key Features:
- Translation caching with TTL
- Cache size limitation
- Automatic cleanup of expired entries
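The three behaviours listed above (TTL, size limit, cleanup of expired entries) can be illustrated with a small in-memory cache. This is a sketch under assumptions: the class name TTLCache, its parameters, and the oldest-first eviction rule are invented for the example and do not describe translation_cache.py.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-memory cache with per-entry expiry and a maximum size."""

    def __init__(self, ttl_seconds=3600, max_size=1000, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max_size = max_size
        self._clock = clock            # injectable so expiry can be tested
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]     # expired: drop and report a miss
            return None
        return value

    def set(self, key, value):
        self.cleanup()
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the oldest entry

    def cleanup(self):
        """Remove every entry whose TTL has elapsed."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._entries.items() if now >= exp]
        for key in expired:
            del self._entries[key]
```

A key such as (text, source_language, target_language) would be a natural choice for cached translations, although the actual key format is not documented here.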
Configuration:
User Services
TelegramUserService (services/user/telegram_user_service.py)
Service for managing Telegram bot users and their preferences.
Key Features:
- User settings management (subscriptions, language)
- Category-based subscriber retrieval
- User language preferences
- Database operations for Telegram bot users
Interface:
WebUserService (services/user/web_user_service.py)
Service for managing web users and Telegram account linking.
Key Features:
- Telegram link code generation and validation
- Web user to Telegram user association
- Secure linking process with expiration
- Database operations for web user management
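A linking flow with expiring codes could be sketched like this. The function names, the 15-minute lifetime, and the code length are assumptions for illustration only; the README does not specify how WebUserService generates or stores codes.

```python
import secrets
from datetime import datetime, timedelta, timezone


def generate_link_code(now, ttl_minutes=15):
    """Return a random link code and the moment it stops being valid."""
    code = secrets.token_urlsafe(16)
    return code, now + timedelta(minutes=ttl_minutes)


def is_link_code_valid(submitted, stored, expires_at, now):
    """Accept the code only if it matches and has not expired."""
    if now >= expires_at:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return secrets.compare_digest(submitted.encode(), stored.encode())
```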
Interface:
UserManager (services/user/user_manager.py)
Backward compatibility wrapper that delegates to specialized services.
Key Features:
- Unified interface for both Telegram and web users
- Automatic delegation to appropriate service
- Maintains existing API compatibility
Interface:
Dependency Injection System
DI Container (di_container.py)
Dependency injection container for service management.
Key Features:
- Service and factory registration
- Automatic dependency resolution
- Service lifecycle management
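To make the idea concrete, here is a minimal container sketch. It is not the contents of di_container.py: the class name Container and its method names are hypothetical, and dependencies are resolved explicitly by passing the container to each factory, which is only one of several ways to implement automatic resolution.

```python
class Container:
    """Tiny dependency injection container with singleton lifetimes."""

    def __init__(self):
        self._instances = {}
        self._factories = {}

    def register_instance(self, key, instance):
        """Register an already constructed service."""
        self._instances[key] = instance

    def register_factory(self, key, factory):
        """Register a callable that builds the service on first use."""
        self._factories[key] = factory

    def resolve(self, key):
        if key in self._instances:
            return self._instances[key]
        if key not in self._factories:
            raise KeyError(f"No service registered for {key!r}")
        # The factory receives the container so it can resolve its own
        # dependencies; the result is cached and reused afterwards.
        instance = self._factories[key](self)
        self._instances[key] = instance
        return instance
```

In tests, a mock can be supplied with register_instance under the same key, so the code that calls resolve does not need to change.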
Service Configuration (config/services_config.py)
Centralized configuration of all services through environment variables.
Interfaces (interfaces/)
Abstract interfaces for all services, providing:
- Dependency Inversion Principle
- Easy testing through mock objects
- Implementation replacement flexibility
Error Handling (exceptions/)
Hierarchy of custom exceptions for different error types:
- RSSException - RSS processing errors
- TranslationException - translation errors
- DatabaseException - database errors
- CacheException - caching errors
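A hierarchy with these four exception types could be declared as in the sketch below. The common base class FireFeedError is an assumption; the README names only the four specific exceptions and does not say what they inherit from.

```python
class FireFeedError(Exception):
    """Hypothetical common base class (not named in the README)."""


class RSSException(FireFeedError):
    """RSS processing errors."""


class TranslationException(FireFeedError):
    """Translation errors."""


class DatabaseException(FireFeedError):
    """Database errors."""


class CacheException(FireFeedError):
    """Caching errors."""
```

With a shared base class, callers can catch one specific error type or handle every project error in a single except clause.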
Installation and Setup
Prerequisites
- Python 3.11 or higher
- PostgreSQL 12+ with pgvector extension
- Telegram Bot API token
Installing Dependencies
Basic Setup
- Copy .env.example to .env
- Set real values for the variables in the .env file
Running via Scripts
Configuration
Environment Variables
Create a .env file in the project root directory by copying the provided .env.example file and adjusting the values as needed. The example file contains all available environment variables with their default values and descriptions.
Optional AI Features Configuration
FireFeed provides optional AI-powered features that can be enabled or disabled based on your needs:
TRANSLATION_ENABLED
- Default: true
- Description: Controls automatic translation of news articles to multiple languages
- Impact: When disabled, news items will only be available in their original language
- Use case: Disable to reduce computational load or when translations are not needed
DUPLICATE_DETECTOR_ENABLED
- Default: true
- Description: Controls ML-based duplicate detection using semantic analysis
- Impact: When disabled, all news items will be processed without duplicate checking
- Use case: Disable for faster processing or when duplicate detection is handled externally
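Embedding-based duplicate checking generally comes down to comparing vectors by cosine similarity. The sketch below uses plain Python lists instead of real Sentence Transformers output, and the 0.9 threshold, the function names, and the enabled parameter (standing in for DUPLICATE_DETECTOR_ENABLED) are all assumptions rather than project values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(candidate, known_embeddings, threshold=0.9, enabled=True):
    """Return True if `candidate` is close enough to any known embedding."""
    if not enabled:
        # Detector switched off: every item is treated as new.
        return False
    return any(
        cosine_similarity(candidate, known) >= threshold
        for known in known_embeddings
    )
```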
RSS_PARSER_MIN_ITEM_TITLE_WORDS_LENGTH
- Default: 0
- Description: Minimum number of words required in RSS item title
- Impact: RSS items with titles containing fewer words than this threshold will be skipped
- Use case: Filter out low-quality or incomplete news items with very short titles
RSS_PARSER_MIN_ITEM_CONTENT_WORDS_LENGTH
- Default: 0
- Description: Minimum number of words required in RSS item content/description
- Impact: RSS items with content containing fewer words than this threshold will be skipped
- Use case: Filter out low-quality or incomplete news items with very short content
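The effect of the two word-count thresholds can be sketched as a simple filter. Counting words by splitting on whitespace is an assumption, as is the function name; the parser may tokenize differently.

```python
def passes_length_filter(title, content, min_title_words=0, min_content_words=0):
    """Return True if the item meets both minimum word counts.

    The two limits correspond to RSS_PARSER_MIN_ITEM_TITLE_WORDS_LENGTH and
    RSS_PARSER_MIN_ITEM_CONTENT_WORDS_LENGTH; 0 lets every item through.
    """
    title_ok = len(title.split()) >= min_title_words
    content_ok = len(content.split()) >= min_content_words
    return title_ok and content_ok
```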
RSS_PARSER_CLEANUP_INTERVAL_HOURS
- Default: 0
- Description: Controls how long news items, translations, telegram publications and associated media files are kept
- Impact: When set to 0, automatic cleanup is disabled and data is stored indefinitely. When set to a positive number (e.g., 24), old data is automatically cleaned up after the specified number of hours
- Use case: Enable periodic cleanup to manage storage space and database size, or disable for permanent data retention
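One possible way to read the variables above, using the documented defaults, is sketched here. The helper names and the set of strings accepted as true are assumptions; config/services_config.py may parse values differently.

```python
import os
from datetime import datetime, timedelta, timezone

TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed spellings of "enabled"


def env_bool(environ, name, default):
    raw = environ.get(name)
    return default if raw is None else raw.strip().lower() in TRUE_VALUES


def env_int(environ, name, default):
    raw = environ.get(name)
    return default if raw is None or not raw.strip() else int(raw)


def load_settings(environ=None):
    """Read the optional feature settings, falling back to documented defaults."""
    environ = os.environ if environ is None else environ
    return {
        "TRANSLATION_ENABLED": env_bool(environ, "TRANSLATION_ENABLED", True),
        "DUPLICATE_DETECTOR_ENABLED": env_bool(
            environ, "DUPLICATE_DETECTOR_ENABLED", True),
        "RSS_PARSER_MIN_ITEM_TITLE_WORDS_LENGTH": env_int(
            environ, "RSS_PARSER_MIN_ITEM_TITLE_WORDS_LENGTH", 0),
        "RSS_PARSER_MIN_ITEM_CONTENT_WORDS_LENGTH": env_int(
            environ, "RSS_PARSER_MIN_ITEM_CONTENT_WORDS_LENGTH", 0),
        "RSS_PARSER_CLEANUP_INTERVAL_HOURS": env_int(
            environ, "RSS_PARSER_CLEANUP_INTERVAL_HOURS", 0),
    }


def cleanup_cutoff(now, interval_hours):
    """Return the timestamp before which data is old, or None if cleanup is off."""
    if interval_hours <= 0:
        return None
    return now - timedelta(hours=interval_hours)
```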
AI Model Configuration
FireFeed allows customization of the AI models used for translation, embeddings, and text processing:
TRANSLATION_MODEL
- Default: facebook/m2m100_418M
- Description: Specifies the translation model from Hugging Face Transformers
- Supported models: M2M100, Helsinki-NLP OPUS-MT, MarianMT, MBart, etc.
- Example: Helsinki-NLP/opus-mt-en-ru for Helsinki-NLP models
EMBEDDING_SENTENCE_TRANSFORMER_MODEL
- Default: paraphrase-multilingual-MiniLM-L12-v2
- Description: Sentence transformer model for generating text embeddings
- Supported models: Any SentenceTransformer-compatible model from Hugging Face
- Example: all-MiniLM-L6-v2 for a faster, smaller model
SPACY_MODELS
- Default: {"en": "en_core_web_sm", "ru": "ru_core_news_sm", "de": "de_core_news_sm", "fr": "fr_core_news_sm"}
- Description: Unified configuration for spaCy language models used for text processing and linguistic analysis
- Supported models: Any spaCy model compatible with the language
- Example: {"en": "en_core_web_trf", "ru": "ru_core_news_sm", "de": "de_core_news_sm", "fr": "fr_core_news_sm"} for a transformer-based English model
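Because SPACY_MODELS is a JSON object stored in a single environment variable, it has to be decoded before use. The following sketch shows one way to parse and validate it; the function name parse_spacy_models and the validation rules are assumptions, not the project's implementation.

```python
import json


def parse_spacy_models(raw):
    """Decode a SPACY_MODELS value into a {language: model name} mapping."""
    models = json.loads(raw)
    if not isinstance(models, dict):
        raise ValueError("SPACY_MODELS must be a JSON object")
    for lang, model in models.items():
        if not isinstance(model, str) or not model:
            raise ValueError(f"Invalid spaCy model name for language {lang!r}")
    return models
```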
Systemd Services
For production environments, systemd services are recommended.
RSS Parser Service:
API Service:
Telegram Bot Service:
Nginx Configuration
Example configuration for webhook and FastAPI operation:
API Documentation
After starting the API server, documentation is available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Development
Development Setup
Running Tests
All tests
Specific module
Stop on first failure
Short output