Implementation Details
Technical implementation details for advanced feed fetching and analytics
Source: apps/web/content/docs/development/implementation.mdx
Overview
This document describes the technical implementation of the comprehensive feed fetching and analytics system added to AI Web Feeds in version 1.0.
Architecture
The enhanced system consists of three main components:
Feed URL → AdvancedFeedFetcher → FeedMetadata + Items
                  ↓
           DatabaseManager
                  ↓
            FeedAnalytics
                  ↓
            CLI Commands
Core Components
1. Advanced Feed Fetcher
Location: packages/ai_web_feeds/src/ai_web_feeds/fetcher.py (820 lines)
A feed fetcher that extracts more than 100 metadata fields from RSS, Atom, and JSON feeds and scores each feed for quality.
Key Features
100+ Metadata Fields
The fetcher extracts comprehensive metadata organized in groups:
Basic Feed Information:
- Title, subtitle, description
- Homepage link
- Language and copyright
- Generator information
Author/Publisher Data:
- Author name and email
- Publisher information
- Managing editor
- Webmaster contact
Visual Assets:
- Feed images (URL, title, link)
- Logo and icon URLs
- Dimensions and alt text
Technical Metadata:
- TTL (Time To Live)
- Skip hours and skip days
- Cloud configuration
- PubSubHubbub hub URLs
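Skip hours and skip days tell a client when not to poll a feed. The sketch below shows one way to honor them, assuming the RSS 2.0 conventions (hours 0-23 in GMT, English weekday names). The function name `is_skipped` and its parameters are hypothetical and not part of the fetcher's API.

```python
from datetime import datetime, timezone

# Fixed weekday table: strftime("%A") would depend on the process locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def is_skipped(now: datetime, skip_hours: set[int], skip_days: set[str]) -> bool:
    """Return True if the feed asks clients not to poll at `now`.

    `now` should be timezone-aware; it is converted to UTC because
    RSS 2.0 expresses skipHours in GMT.
    """
    now_utc = now.astimezone(timezone.utc)
    if now_utc.hour in skip_hours:
        return True
    return DAY_NAMES[now_utc.weekday()] in skip_days
```

A scheduler could call this before each poll and simply defer the fetch while it returns True.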
Content Statistics:
- Total item count
- Items with full content
- Items with authors
- Items with enclosures/media
- Average title/description/content lengths
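The content statistics above can be sketched as a single pass over the parsed items. This is a minimal illustration, assuming each item is a plain dict with optional `title`, `description`, `content`, `author`, and `enclosures` keys; that item shape and the function name `content_statistics` are assumptions, not the fetcher's actual data model.

```python
from statistics import mean


def content_statistics(items: list[dict]) -> dict:
    """Summarize a list of feed items (hypothetical dict shape)."""

    def avg_len(key: str) -> float:
        # Average only over items that actually carry the field.
        lengths = [len(item[key]) for item in items if item.get(key)]
        return mean(lengths) if lengths else 0.0

    return {
        "total_items": len(items),
        "items_with_content": sum(1 for i in items if i.get("content")),
        "items_with_authors": sum(1 for i in items if i.get("author")),
        "items_with_enclosures": sum(1 for i in items if i.get("enclosures")),
        "avg_title_length": avg_len("title"),
        "avg_description_length": avg_len("description"),
        "avg_content_length": avg_len("content"),
    }
```

Averaging only over items that have the field keeps a few empty items from dragging the lengths toward zero; averaging over all items would be an equally defensible choice.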
Three-Dimensional Quality Scoring
Each feed receives scores (0-1) across three dimensions:
1. Completeness Score
Measures how complete the feed metadata is:
- ✅ Has title
- ✅ Has description
- ✅ Has link
- ✅ Has language
- ✅ Has timestamps
- ✅ Has author/publisher
- ✅ Has canonical topics and raw feed labels
- ✅ Has image/logo
# Example calculation
completeness = sum([
    bool(feed.title),        # 1/8
    bool(feed.description),  # 1/8
    bool(feed.link),         # 1/8
    bool(feed.language),     # 1/8
    # ... etc
]) / 8.0
2. Richness Score
Measures content quality and depth:
- Items have content
- Content coverage percentage
- Author attribution
- Average content length
- Full content availability
- Media/images present
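One way to combine the six richness signals listed above is an equally weighted average. The weights, the 500-character threshold for "full content", and the function name `richness_score` are all assumptions made for illustration; the actual scoring in fetcher.py may differ.

```python
def richness_score(
    total_items: int,
    items_with_content: int,
    items_with_authors: int,
    items_with_media: int,
    avg_content_length: float,
    full_content_threshold: int = 500,  # assumed cut-off, in characters
) -> float:
    """Return a 0-1 richness score as the mean of six signals."""
    if total_items == 0:
        return 0.0
    signals = [
        1.0 if items_with_content > 0 else 0.0,                  # items have content
        items_with_content / total_items,                        # content coverage
        items_with_authors / total_items,                        # author attribution
        min(avg_content_length / full_content_threshold, 1.0),   # average content length
        1.0 if avg_content_length >= full_content_threshold else 0.0,  # full content
        1.0 if items_with_media > 0 else 0.0,                    # media present
    ]
    return sum(signals) / len(signals)
```

Because every signal is already in the 0-1 range, the mean stays in that range without further normalization.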
3. Structure Score
Measures feed structure quality:
- No parsing errors
- Has items
- Items have GUIDs
- Has timestamps
- Has links
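The structure checks above can be sketched the same way. This example assumes items are dicts with optional `guid`, `published`, and `link` keys and weights the five checks equally; both the item shape and the weighting are hypothetical.

```python
def structure_score(parse_error: bool, items: list[dict]) -> float:
    """Return a 0-1 structure score as the mean of five checks."""

    def share(key: str) -> float:
        # Fraction of items that carry a non-empty value for `key`.
        if not items:
            return 0.0
        return sum(1 for item in items if item.get(key)) / len(items)

    checks = [
        0.0 if parse_error else 1.0,  # no parsing errors
        1.0 if items else 0.0,        # has items
        share("guid"),                # items have GUIDs
        share("published"),           # has timestamps
        share("link"),                # has links
    ]
    return sum(checks) / len(checks)
```

Using per-item shares rather than all-or-nothing flags lets a feed with a few malformed items still score close to 1.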
Publishing Frequency Detection
Automatically analyzes item publication patterns to estimate update frequency:
| Frequency | Pattern |
|---|---|
| Hourly | New items every hour or less |
| Daily | New items published daily |
| Weekly | Weekly publication schedule |
| Monthly | Monthly updates |
| Infrequent | Longer intervals between posts |
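To make the categories in the table concrete, the sketch below computes the gaps between consecutive publication times and classifies their median. The one-hour and one-day cut-offs match the algorithm outline; the weekly (7 days) and monthly (31 days) cut-offs and both function names are assumptions.

```python
from datetime import datetime
from statistics import median


def interval_seconds(timestamps: list[datetime]) -> list[float]:
    """Gaps between consecutive publication times, in seconds."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def classify_frequency(timestamps: list[datetime]) -> str:
    intervals = interval_seconds(timestamps)
    if not intervals:  # fewer than two items
        return "unknown"
    typical = median(intervals)
    if typical < 3600:
        return "hourly"
    if typical < 86400:
        return "daily"
    if typical < 7 * 86400:   # assumed weekly cut-off
        return "weekly"
    if typical < 31 * 86400:  # assumed monthly cut-off
        return "monthly"
    return "infrequent"
```

The median is used rather than the mean so that a single long gap, such as a holiday break, does not push an otherwise regular feed into a slower category.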
# Algorithm outline
from statistics import median

def estimate_update_frequency(items):
    if not items or len(items) < 2:
        return "unknown"
    # Calculate time between publications, in seconds
    intervals = calculate_intervals(items)
    typical_interval = median(intervals)
    # Classify based on the median interval
    if typical_interval < 3600:  # < 1 hour
        return "hourly"
    elif typical_interval < 86400:  # < 1 day
        return "daily"
    # ... etc
Extension Support
Full support for popular RSS extensions:
iTunes Podcast Metadata:
- Author, owner, genre labels
- Explicit flag
- Episode information
- Artwork URLs
Dublin Core Metadata:
- Contributor, coverage
- Creator, date
- Format, identifier
- Rights, source
Media RSS:
- Thumbnails with dimensions
- Media content
- Keywords and descriptions
- Credit information
GeoRSS:
- Location coordinates
- Geographic regions
- Place names
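Extension fields live in their own XML namespaces, so reading them comes down to namespace-aware lookups. The sketch below uses only the standard library's ElementTree rather than whatever parser the fetcher uses internally; the function name `extract_extensions` and the choice of fields are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace URIs published by each extension.
NAMESPACES = {
    "itunes": "http://www.itunes.com/dtds/podcast-1.0.dtd",
    "dc": "http://purl.org/dc/elements/1.1/",
    "media": "http://search.yahoo.com/mrss/",
    "georss": "http://www.georss.org/georss",
}


def extract_extensions(xml_text: str) -> dict:
    """Pull a few extension fields from an RSS 2.0 document."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")

    def text(path: str):
        node = channel.find(path, NAMESPACES)
        return node.text.strip() if node is not None and node.text else None

    thumbnail = channel.find("item/media:thumbnail", NAMESPACES)
    point = text("item/georss:point")  # "lat lon", space-separated
    return {
        "itunes_author": text("itunes:author"),
        "itunes_explicit": text("itunes:explicit"),
        "dc_creator": text("item/dc:creator"),
        "media_thumbnail": thumbnail.get("url") if thumbnail is not None else None,
        "geo_point": tuple(float(p) for p in point.split()) if point else None,
    }
```

Only the first item is inspected here; a real extractor would iterate over every item and handle Atom documents as well.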
Usage Example
import asyncio

from ai_web_feeds.fetcher import AdvancedFeedFetcher
from ai_web_feeds.storage import DatabaseManager

async def main():
    # Initialize
    db = DatabaseManager("sqlite:///data/ai-web-feeds.db")
    fetcher = AdvancedFeedFetcher()
    # Fetch feed
    fetch_log, metadata, items = await fetcher.fetch_feed(
        "https://example.com/feed.xml"
    )
    # Access quality scores
    print(f"Completeness: {metadata.completeness_score:.2f}")
    print(f"Richness: {metadata.richness_score:.2f}")
    print(f"Structure: {metadata.structure_score:.2f}")
    # Access metadata
    print(f"Update frequency: {metadata.estimated_update_frequency}")
    print(f"Total items: {metadata.total_items}")
    print(f"Found {len(items)} items")
    # Save to database
    session = db.get_session()
    session.add(fetch_log)
    session.commit()

# fetch_feed is a coroutine, so it must run inside an event loop
asyncio.run(main())
Conditional Requests
The fetcher supports conditional HTTP requests to reduce bandwidth:
# Use ETag and Last-Modified from previous fetch
fetch_log, metadata, items = await fetcher.fetch_feed(
    url="https://example.com/feed.xml",
    etag="33a64df551425fcc55e4d42a148795d9f25f89d4",
    last_modified="Wed, 15 Nov 2023 12:00:00 GMT"
)
# Returns 304 Not Modified if feed hasn't changed
if fetch_log.status_code == 304:
    print("Feed unchanged")
Retry Logic
Built-in exponential backoff for transient failures:
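# Automatic retries (configured via tenacity)
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def fetch_with_retry(url):
    # Makes up to 3 attempts in total (the first call plus 2 retries)
    # Waits between attempts grow exponentially, clamped to 2-10s
    pass
2. Analytics Engine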
Location: packages/ai_web_feeds/src/ai_web_feeds/analytics.py (600 lines)
Comprehensive analytics engine providing 8 different analytical views of feed data.
Generate Full Report
import json

# Export everything to JSON (analytics is a FeedAnalytics instance)
report = analytics.generate_full_report()
# Save to file
with open("analytics.json", "w") as f:
    json.dump(report, f, indent=2)
# Report includes all 8 analytics views
3. CLI Commands
Fetch Commands
Location: apps/cli/ai_web_feeds/cli/commands/fetch.py
Fetch Single Feed
uv run ai-web-feeds fetch one <feed-id>
Fetches a single feed through the current polling pipeline:
# Basic fetch
uv run ai-web-feeds fetch one openai-blog
Features:
- Progress indicator
- Error reporting
- Article storage through the v3 articles table
- Response-time and discovered-article summary
Fetch All Feeds
uv run ai-web-feeds fetch all [--limit N] [--verified-only]
Batch fetch with progress tracking:
# Fetch all feeds
uv run ai-web-feeds fetch all
# Fetch first 10 feeds
uv run ai-web-feeds fetch all --limit 10
# Fetch only verified feeds
uv run ai-web-feeds fetch all --verified-only
Features:
- Rich progress bar
- Real-time stats
- Error summary table
- Success/failure counts
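The batch behavior described above (an optional limit, per-feed error capture, and success/failure counts) can be sketched with asyncio. This is not the CLI's implementation: `fetch_batch`, the `fetch_one` callable, and the concurrency cap are hypothetical names chosen for the example.

```python
import asyncio


async def fetch_batch(feed_ids, fetch_one, limit=None, concurrency=5):
    """Fetch feeds concurrently and tally successes and failures.

    `fetch_one` is any coroutine function taking a feed ID; an
    exception from it counts as a failure for that feed only.
    """
    selected = list(feed_ids)[:limit] if limit is not None else list(feed_ids)
    semaphore = asyncio.Semaphore(concurrency)  # cap simultaneous requests
    errors: dict[str, str] = {}

    async def guarded(feed_id):
        async with semaphore:
            try:
                await fetch_one(feed_id)
                return True
            except Exception as exc:  # record the error and keep going
                errors[feed_id] = str(exc)
                return False

    results = await asyncio.gather(*(guarded(f) for f in selected))
    succeeded = sum(results)
    return {
        "succeeded": succeeded,
        "failed": len(results) - succeeded,
        "errors": errors,
    }
```

The `errors` mapping holds what an error summary table would need, and catching exceptions per feed keeps one broken feed from aborting the whole batch.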
Analytics Commands
Location: apps/cli/ai_web_feeds/cli/commands/analytics.py (400 lines)
Summary Metrics
ai-web-feeds analytics summary --date-range 30d
Displays summary metrics:
- Total feeds and active feeds
- Validation success rate
- Average response time
- Feed health distribution
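As an illustration of how such summary metrics might be derived, the sketch below aggregates a list of validation records. The record shape (`success`, `response_time`, `health_score`), the health bucket thresholds, and the function name `summarize` are all assumptions, not the analytics module's actual schema.

```python
from collections import Counter
from statistics import mean


def health_bucket(score: float) -> str:
    # Assumed thresholds for illustration only.
    if score >= 0.8:
        return "healthy"
    if score >= 0.5:
        return "moderate"
    return "unhealthy"


def summarize(validations: list[dict]) -> dict:
    """Aggregate validation records into summary metrics."""
    if not validations:
        return {"success_rate": 0.0, "avg_response_time": None,
                "health_distribution": {}}
    # Failed requests may have no response time; leave them out of the mean.
    times = [v["response_time"] for v in validations
             if v.get("response_time") is not None]
    return {
        "success_rate": sum(1 for v in validations if v["success"]) / len(validations),
        "avg_response_time": mean(times) if times else None,
        "health_distribution": dict(
            Counter(health_bucket(v["health_score"]) for v in validations)
        ),
    }
```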
Trending Topics
ai-web-feeds analytics trending --limit 10 --date-range 30d
Shows topic activity:
- Topic IDs
- Feed counts
- Validation frequency
- Average health score
Publication Velocity
ai-web-feeds analytics velocity --date-range 30d
Publishing metrics:
- Total article count
- Average articles per day
- Daily publication buckets
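The three velocity metrics follow from bucketing article timestamps by calendar day. A minimal sketch, assuming articles are represented only by their publication datetimes and that the window is given as a start date plus a day count; `publication_velocity` is a hypothetical name.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def publication_velocity(published: list[datetime], start: date, days: int) -> dict:
    """Count articles per day over a window of `days` starting at `start`."""
    window = [start + timedelta(days=n) for n in range(days)]
    counts = Counter(ts.date() for ts in published)
    # Include empty days so gaps in publishing stay visible.
    buckets = {day.isoformat(): counts.get(day, 0) for day in window}
    total = sum(buckets.values())
    return {
        "total_articles": total,
        "avg_per_day": total / days if days else 0.0,
        "daily": buckets,
    }
```

Articles published outside the window are ignored, so the total and the average always describe the same date range.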
Daily Snapshot
ai-web-feeds analytics snapshot
Stores summary and trending-topic records for historical analytics.
CSV Export
ai-web-feeds analytics export --output reports/analytics.csv
Export analytics records for reports and dashboards.
Database Schema
The enhanced system uses the existing database schema with full utilization of flexible JSON columns:
FeedFetchLog Enhancements
class FeedFetchLog(SQLModel, table=True):
    # ... existing fields ...
    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Complete HTTP headers
    # - Detailed error information
    # - Item statistics
    # - Quality scores
    # - Extension metadata
ArticleEntry Enhancements
class ArticleEntry(SQLModel, table=True):
    __tablename__ = "articles"
    topics: list[str] = Field(
        default_factory=list,
        sa_column=Column(JSON)
    )
    raw_categories: list[str] = Field(
        default_factory=list,
        sa_column=Column(JSON)
    )
    nlp_failures: dict[str, int] = Field(
        default_factory=dict,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Canonical topic IDs
    # - Raw feed category labels as ingress metadata
    # - NLP retry/failure accounting
Dependencies
New Dependencies Added
Core Library Dependencies
File: packages/ai_web_feeds/pyproject.toml
dependencies = [
    # ... existing ...
    "beautifulsoup4>=4.12.0",  # NEW: HTML parsing
]
Purpose:
- HTML parsing for feed discovery
- Extracting feed URLs from web pages
- Parsing HTML content in feed items
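Feed discovery usually means scanning a page's link elements for alternate representations with a feed MIME type. The sketch below shows the idea with the standard library's `html.parser` so that it runs without extra packages; the project itself uses beautifulsoup4 for this, and the names `FeedLinkParser` and `discover_feeds` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/feed+json",
}


class FeedLinkParser(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags that point at feeds."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.feed_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feed_urls.append(urljoin(self.base_url, href))


def discover_feeds(html: str, base_url: str) -> list[str]:
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feed_urls
```

With beautifulsoup4 the same filter can be written as a search over link tags; the matching rules (rel, type, href) stay the same.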
CLI Tool Dependencies
File: apps/cli/pyproject.toml
dependencies = [
    # ... existing ...
    "rich>=13.7.0",  # NEW: Rich terminal output
]
Purpose:
- Beautiful terminal tables
- Progress bars and spinners
- Colored output and styling
- Markdown rendering in terminal
Performance Considerations
Conditional Requests
Reduce bandwidth and processing for unchanged feeds:
# Store from previous fetch
etag = fetch_log.etag
last_modified = fetch_log.last_modified
# Use in next fetch
new_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    etag=etag,
    last_modified=last_modified
)
# Server returns 304 Not Modified if unchanged
if new_log.status_code == 304:
    # No processing needed
    return
Retry Logic
Exponential backoff for reliability:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential
)

@retry(
    stop=stop_after_attempt(3),  # Max 3 attempts in total
    wait=wait_exponential(
        multiplier=1,
        min=2,   # Wait at least 2s between attempts
        max=10   # Wait at most 10s between attempts
    )
)
async def fetch_with_retry(url):
    # Automatic retry on failure
    pass
Timeouts
Prevent hanging on slow feeds:
# Configurable timeout (default 30s)
fetcher = AdvancedFeedFetcher(timeout=30.0)
# Per-request timeout
fetch_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    timeout=60.0  # Override for slow feed
)
Best Practices
Use Conditional Requests
Always pass etag and last_modified from previous fetches to reduce bandwidth:
# Save from previous fetch
session.add(fetch_log)
# Use in next fetch (fetch_feed returns a log, metadata, and items)
new_log, metadata, items = await fetcher.fetch_feed(
    url=url,
    etag=fetch_log.etag,
    last_modified=fetch_log.last_modified
)
Respect TTL Values
Honor feed TTL (Time To Live) for update frequency:
from datetime import datetime, timedelta

if metadata.ttl:
    # Wait TTL minutes before next fetch
    next_fetch = datetime.now() + timedelta(minutes=metadata.ttl)
Monitor Health Regularly
Check feed health scores to identify issues:
# Daily health check
ai-web-feeds analytics summary --topic ai
# Weekly CSV report
ai-web-feeds analytics export --output weekly-report.csv
Track Trends
Use analytics to identify patterns:
# Monthly trend analysis
ai-web-feeds analytics trending --date-range 30d
# Publication velocity monitoring
ai-web-feeds analytics velocity --date-range 30d
Generate Periodic Reports
Export analytics for monitoring:
# Weekly reports
ai-web-feeds analytics export --output reports/week-$(date +%U).csv
# Archive for historical analysis
Installation
Quick Setup Script
Use the automated setup script:
# Make executable
chmod +x setup-enhanced-features.sh
# Run setup
./setup-enhanced-features.sh
The script will:
- Install core library with dependencies
- Install CLI tool with dependencies
- Verify installation
- Display next steps
Manual Installation
Install each component separately:
# 1. Install workspace packages
uv sync
# 2. Verify installation
ai-web-feeds --version
ai-web-feeds fetch --help
ai-web-feeds analytics --help
Code Organization
packages/ai_web_feeds/src/ai_web_feeds/
├── fetcher.py # AdvancedFeedFetcher class
│ ├── FeedMetadata # Metadata container (100+ fields)
│ ├── fetch_feed() # Main fetch method
│ ├── _extract_*() # Extraction helpers
│ └── _calculate_*() # Quality scoring
│
├── analytics.py # FeedAnalytics class
│ ├── get_overview_stats()
│ ├── get_*_distribution()
│ ├── get_quality_metrics()
│ ├── get_fetch_performance_stats()
│ ├── get_content_statistics()
│ ├── get_publishing_trends()
│ ├── get_feed_health_report()
│ ├── get_top_contributors()
│ └── generate_full_report()
│
apps/cli/ai_web_feeds/cli/commands/
├── fetch.py # Fetch CLI commands
│ ├── fetch_one() # Single feed fetch
│ └── fetch_all() # Batch fetch
│
└── analytics.py # Analytics CLI commands
├── show_overview()
├── show_distributions()
├── show_quality()
├── show_performance()
├── show_content()
├── show_trends()
├── show_health()
├── show_contributors()
    └── generate_report()
Future Enhancements
Potential additions for future versions:
- Web UI dashboard with real-time metrics
- Machine learning for content classification
- Real-time monitoring with webhooks
- GraphQL API for analytics
- Advanced deduplication algorithms
- Content similarity analysis
- Multi-language NLP support
- Anomaly detection in publishing patterns
- Automated quality recommendations
Support
For technical questions or issues:
- Review this documentation
- Check inline code documentation
- Explore CLI help: ai-web-feeds --help
- Open an issue on GitHub
Related Documentation
- Feature Overview - High-level feature list
- Getting Started - Setup and quickstart
- Analytics Guide - Analytics usage guide