Implementation Details

Overview

This document describes the technical implementation of the comprehensive feed fetching and analytics system added to AI Web Feeds in version 1.0.

This is the first version of these capabilities. They were designed from scratch with performance and extensibility in mind.

Architecture

The system consists of three main components (the fetcher, the analytics engine, and the CLI) that share a database layer:

Feed URL → AdvancedFeedFetcher → FeedMetadata + Items
                                        ↓
                                  DatabaseManager
                                        ↓
                                  FeedAnalytics
                                        ↓
                                  CLI Commands

Core Components

1. Advanced Feed Fetcher

Location: packages/ai_web_feeds/src/ai_web_feeds/fetcher.py (820 lines)

A feed fetcher that extracts detailed metadata from RSS, Atom, and JSON feeds.

Key Features

100+ Metadata Fields

The fetcher extracts comprehensive metadata organized in groups:

Basic Feed Information:

Title, subtitle, description
Homepage link
Language and copyright
Generator information

Author/Publisher Data:

Author name and email
Publisher information
Managing editor
Webmaster contact

Visual Assets:

Feed images (URL, title, link)
Logo and icon URLs
Dimensions and alt text

Technical Metadata:

TTL (Time To Live)
Skip hours and skip days
Cloud configuration
PubSubHubbub hub URLs

Content Statistics:

Total item count
Items with full content
Items with authors
Items with enclosures/media
Average title/description/content lengths

Three-Dimensional Quality Scoring

Each feed receives scores (0-1) across three dimensions:

1. Completeness Score

Measures how complete the feed metadata is:

✅ Has title
✅ Has description
✅ Has link
✅ Has language
✅ Has timestamps
✅ Has author/publisher
✅ Has canonical topics and raw feed labels
✅ Has image/logo

# Example calculation
completeness = sum([
    bool(feed.title),      # 1/8
    bool(feed.description), # 1/8
    bool(feed.link),       # 1/8
    bool(feed.language),   # 1/8
    # ... etc
]) / 8.0

2. Richness Score

Measures content quality and depth:

Items have content
Content coverage percentage
Author attribution
Average content length
Full content availability
Media/images present
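
As a rough illustration of how the signals above could be combined into one number, the sketch below averages six normalized signals. The `Item` type, its field names, the equal weighting, and the 2,000-character length cap are all assumptions made for this example, not the fetcher's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical, minimal stand-in for a parsed feed item."""
    content: str = ""
    author: str = ""
    has_media: bool = False
    is_full_content: bool = False


def richness_score(items: list[Item], target_length: int = 2000) -> float:
    """Average six signals, each normalized to the range 0-1."""
    if not items:
        return 0.0
    n = len(items)
    with_content = sum(1 for i in items if i.content)
    avg_length = sum(len(i.content) for i in items) / n
    signals = [
        1.0 if with_content else 0.0,                     # items have content
        with_content / n,                                 # content coverage
        sum(1 for i in items if i.author) / n,            # author attribution
        min(avg_length / target_length, 1.0),             # average length, capped
        sum(1 for i in items if i.is_full_content) / n,   # full content availability
        sum(1 for i in items if i.has_media) / n,         # media/images present
    ]
    return sum(signals) / len(signals)
```

Equal weights keep the sketch easy to read; a real scorer might weight content coverage more heavily than media presence.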

3. Structure Score

Measures feed structure quality:

No parsing errors
Has items
Items have GUIDs
Has timestamps
Has links
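
The five checks above can be sketched as a simple average. `ParsedItem`, its field names, and the equal weighting are hypothetical and only illustrate the idea; the real scoring code may differ.

```python
from dataclasses import dataclass


@dataclass
class ParsedItem:
    """Hypothetical parsed item; field names are illustrative."""
    guid: str = ""
    published: str = ""
    link: str = ""


def structure_score(items: list[ParsedItem], parse_error: bool = False) -> float:
    """Return the mean of five structure checks, each in the range 0-1."""
    n = len(items)

    def fraction(attr: str) -> float:
        # Share of items with a non-empty value for the given attribute.
        return sum(1 for i in items if getattr(i, attr)) / n if n else 0.0

    checks = [
        0.0 if parse_error else 1.0,   # no parsing errors
        1.0 if n else 0.0,             # has items
        fraction("guid"),              # items have GUIDs
        fraction("published"),         # has timestamps
        fraction("link"),              # has links
    ]
    return sum(checks) / len(checks)
```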

Publishing Frequency Detection

Automatically analyzes item publication patterns to estimate update frequency:

Frequency	Pattern
Hourly	New items every hour or less
Daily	New items published daily
Weekly	Weekly publication schedule
Monthly	Monthly updates
Infrequent	Longer intervals between posts

# Algorithm outline
from statistics import median

def estimate_update_frequency(items):
    if not items or len(items) < 2:
        return "unknown"

    # Calculate time between consecutive publications, in seconds
    intervals = calculate_intervals(items)
    median_interval = median(intervals)

    # Classify based on the median interval
    if median_interval < 3600:      # < 1 hour
        return "hourly"
    elif median_interval < 86400:   # < 1 day
        return "daily"
    # ... etc
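
For readers who want something runnable, here is a self-contained sketch of the same idea that works directly on publication timestamps. The hourly and daily thresholds follow the outline above; the weekly and monthly cut-offs (7 and 31 days) are assumptions, since the outline does not state them.

```python
from datetime import datetime, timedelta
from statistics import median


def estimate_update_frequency(published: list[datetime]) -> str:
    """Classify a feed by the median gap between consecutive publication times."""
    if len(published) < 2:
        return "unknown"
    ordered = sorted(published)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    gap = median(gaps)
    if gap < 3600:           # under an hour
        return "hourly"
    if gap < 86400:          # under a day
        return "daily"
    if gap < 7 * 86400:      # under a week (assumed cut-off)
        return "weekly"
    if gap < 31 * 86400:     # under about a month (assumed cut-off)
        return "monthly"
    return "infrequent"
```

The median is used instead of the mean so that a single long pause does not dominate the result. With strict comparisons, a gap of exactly 24 hours falls into the next class, so a production version may want some tolerance around each boundary.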

Extension Support

Full support for popular RSS extensions:

iTunes Podcast Metadata:

Author, owner, genre labels
Explicit flag
Episode information
Artwork URLs

Dublin Core Metadata:

Contributor, coverage
Creator, date
Format, identifier
Rights, source

Media RSS:

Thumbnails with dimensions
Media content
Keywords and descriptions
Credit information

GeoRSS:

Location coordinates
Geographic regions
Place names

Usage Example

import asyncio

from ai_web_feeds.fetcher import AdvancedFeedFetcher
from ai_web_feeds.storage import DatabaseManager


async def main():
    # Initialize
    db = DatabaseManager("sqlite:///data/ai-web-feeds.db")
    fetcher = AdvancedFeedFetcher()

    # Fetch feed (fetch_feed is a coroutine, so it must be awaited)
    fetch_log, metadata, items = await fetcher.fetch_feed(
        "https://example.com/feed.xml"
    )

    # Access quality scores
    print(f"Completeness: {metadata.completeness_score:.2f}")
    print(f"Richness: {metadata.richness_score:.2f}")
    print(f"Structure: {metadata.structure_score:.2f}")

    # Access metadata
    print(f"Update frequency: {metadata.estimated_update_frequency}")
    print(f"Total items: {metadata.total_items}")
    print(f"Found {len(items)} items")

    # Save the fetch log to the database
    session = db.get_session()
    session.add(fetch_log)
    session.commit()


asyncio.run(main())

Conditional Requests

The fetcher supports conditional HTTP requests to reduce bandwidth:

# Use ETag and Last-Modified from previous fetch
fetch_log, metadata, items = await fetcher.fetch_feed(
    url="https://example.com/feed.xml",
    etag="33a64df551425fcc55e4d42a148795d9f25f89d4",
    last_modified="Wed, 15 Nov 2023 12:00:00 GMT"
)

# Returns 304 Not Modified if feed hasn't changed
if fetch_log.status_code == 304:
    print("Feed unchanged")
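
Under the hood, a conditional request is an ordinary GET that carries validators from the previous response. The helper below is a minimal sketch of that mapping; the function name is hypothetical and it is not part of the fetcher's API.

```python
from typing import Optional


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> dict[str, str]:
    """Build the request headers for a conditional GET."""
    headers: dict[str, str] = {}
    if etag:
        # Echo the ETag back exactly as the server sent it.
        headers["If-None-Match"] = etag
    if last_modified:
        # Echo the Last-Modified value back unchanged.
        headers["If-Modified-Since"] = last_modified
    return headers
```

A server that supports these validators answers with 304 and an empty body when the feed is unchanged, which is why the status code check above is enough to skip parsing.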

Retry Logic

Built-in exponential backoff for transient failures:

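# Automatic retries (configured via tenacity)
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def fetch_with_retry(url):
    # Makes at most 3 attempts in total (the first try plus 2 retries)
    # Waits grow exponentially between attempts, clamped to 2-10s
    pass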

2. Analytics Engine

Location: packages/ai_web_feeds/src/ai_web_feeds/analytics.py (600 lines)

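An analytics engine that provides eight views of feed data: overview statistics, distribution analysis, quality metrics, fetch performance, content statistics, publishing trends, feed health reports, and contributor analytics.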

Generate Full Report

import json

# `analytics` is an existing FeedAnalytics instance
# Collect everything into one report
report = analytics.generate_full_report()

# Save to file as JSON
with open("analytics.json", "w") as f:
    json.dump(report, f, indent=2)

# Report includes all 8 analytics views

3. CLI Commands

Fetch Commands

Location: apps/cli/ai_web_feeds/cli/commands/fetch.py

Fetch Single Feed

uv run ai-web-feeds fetch one <feed-id>

Fetches a single feed through the current polling pipeline:

# Basic fetch
uv run ai-web-feeds fetch one openai-blog

Features:

Progress indicator
Error reporting
Article storage through the v3 articles table
Response-time and discovered-article summary

Fetch All Feeds

uv run ai-web-feeds fetch all [--limit N] [--verified-only]

Batch fetch with progress tracking:

# Fetch all feeds
uv run ai-web-feeds fetch all

# Fetch first 10 feeds
uv run ai-web-feeds fetch all --limit 10

# Fetch only verified feeds
uv run ai-web-feeds fetch all --verified-only

Features:

Rich progress bar
Real-time stats
Error summary table
Success/failure counts

Analytics Commands

Location: apps/cli/ai_web_feeds/cli/commands/analytics.py (400 lines)

Summary Metrics

ai-web-feeds analytics summary --date-range 30d

Displays summary metrics:

Total feeds and active feeds
Validation success rate
Average response time
Feed health distribution

Trending Topics

ai-web-feeds analytics trending --limit 10 --date-range 30d

Shows topic activity:

Topic IDs
Feed counts
Validation frequency
Average health score

Publication Velocity

ai-web-feeds analytics velocity --date-range 30d

Publishing metrics:

Total article count
Average articles per day
Daily publication buckets
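
The three metrics above follow from a simple per-day count. The sketch below shows one way to compute them from a list of publication timestamps; the function name and the shape of the returned dictionary are assumptions for illustration, not the command's actual output format.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def publication_velocity(published: list[datetime], start: date, days: int) -> dict:
    """Bucket articles by calendar day over a window of `days` days from `start`."""
    window = [start + timedelta(days=i) for i in range(days)]
    counts = Counter(ts.date() for ts in published)
    # Days without articles still get a bucket, with a count of zero.
    buckets = {day.isoformat(): counts.get(day, 0) for day in window}
    total = sum(buckets.values())
    return {
        "total_articles": total,
        "avg_per_day": total / days if days else 0.0,
        "daily": buckets,
    }
```

Articles published outside the window are ignored, so the total and the average always describe the same date range.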

Daily Snapshot

ai-web-feeds analytics snapshot

Stores summary and trending-topic records for historical analytics.

CSV Export

ai-web-feeds analytics export --output reports/analytics.csv

Export analytics records for reports and dashboards.

Database Schema

The system reuses the existing database schema and relies on its JSON columns to store flexible metadata:

FeedFetchLog Enhancements

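from typing import Any, Dict, Optional

from sqlalchemy import JSON, Column
from sqlmodel import Field, SQLModel


class FeedFetchLog(SQLModel, table=True):
    # ... existing fields ...

    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )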
    # Now stores:
    # - Complete HTTP headers
    # - Detailed error information
    # - Item statistics
    # - Quality scores
    # - Extension metadata
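
To make the list above concrete, the sketch below shows what an `extra_data` payload might look like. Every key name and value here is invented for illustration; the actual layout is defined by the fetcher.

```python
import json

# Hypothetical payload; all key names and values are assumptions.
extra_data = {
    "headers": {"content-type": "application/rss+xml", "etag": '"abc123"'},
    "error": None,
    "item_stats": {"total": 25, "with_content": 20, "with_authors": 18},
    "quality": {"completeness": 0.88, "richness": 0.61, "structure": 1.0},
    "extensions": {"itunes": {"explicit": False}},
}

# A JSON column only holds JSON-serializable values, so a round trip
# through the json module should return an equal structure.
restored = json.loads(json.dumps(extra_data))
```

Values such as `datetime` objects are not JSON-serializable by default and would need converting (for example to ISO 8601 strings) before being stored this way.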

ArticleEntry Enhancements

class ArticleEntry(SQLModel, table=True):
    __tablename__ = "articles"

    topics: list[str] = Field(
        default_factory=list,
        sa_column=Column(JSON)
    )
    raw_categories: list[str] = Field(
        default_factory=list,
        sa_column=Column(JSON)
    )
    nlp_failures: dict[str, int] = Field(
        default_factory=dict,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Canonical topic IDs
    # - Raw feed category labels as ingress metadata
    # - NLP retry/failure accounting

The v3 contract is applied through reviewed Alembic migrations; JSON columns remain only where the public model needs structured metadata.

Core Library Dependencies

dependencies = [
    # ... existing ...
    "beautifulsoup4>=4.12.0",  # NEW: HTML parsing
]

Purpose:

HTML parsing for feed discovery
Extracting feed URLs from web pages
Parsing HTML content in feed items

CLI Tool Dependencies

File: apps/cli/pyproject.toml

dependencies = [
    # ... existing ...
    "rich>=13.7.0",  # NEW: Rich terminal output
]

Purpose:

Beautiful terminal tables
Progress bars and spinners
Colored output and styling
Markdown rendering in terminal

Performance Considerations

Conditional Requests

Reduce bandwidth and processing for unchanged feeds:

# Store from previous fetch
etag = fetch_log.etag
last_modified = fetch_log.last_modified

# Use in next fetch
new_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    etag=etag,
    last_modified=last_modified
)

# Server returns 304 Not Modified if unchanged
if new_log.status_code == 304:
    # No processing needed
    return

Retry Logic

Exponential backoff for reliability:

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential
)

@retry(
    stop=stop_after_attempt(3),  # Max 3 attempts
    wait=wait_exponential(
        multiplier=1,
        min=2,    # Wait 2s after first failure
        max=10    # Wait max 10s
    )
)
async def fetch_with_retry(url):
    # Automatic retry on failure
    pass
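
To see what a clamped exponential backoff looks like in numbers, the sketch below computes a generic schedule in plain Python. It is an illustration of the technique, not tenacity's implementation: the exponent tenacity applies to a given attempt may differ between versions, so treat the exact values as indicative. The function and parameter names are hypothetical.

```python
def backoff_schedule(attempts: int, multiplier: float = 1.0,
                     min_wait: float = 2.0, max_wait: float = 10.0) -> list[float]:
    """Waits between attempts for a generic clamped exponential backoff.

    The raw wait after the k-th failed attempt is multiplier * 2**k,
    clamped to the range [min_wait, max_wait]. There is one fewer wait
    than there are attempts, because nothing follows the final attempt.
    """
    return [
        max(min_wait, min(multiplier * 2 ** k, max_wait))
        for k in range(1, attempts)
    ]
```

With three attempts this yields two waits; raising the attempt count shows the cap taking effect once the raw value exceeds `max_wait`.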

Timeouts

Prevent hanging on slow feeds:

# Configurable timeout (default 30s)
fetcher = AdvancedFeedFetcher(timeout=30.0)

# Per-request timeout
fetch_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    timeout=60.0  # Override for slow feed
)

Best Practices

Use Conditional Requests

Always pass etag and last_modified from previous fetches to reduce bandwidth:

# Save from previous fetch
session.add(fetch_log)
session.commit()

# Use in next fetch (fetch_feed returns a log, metadata, and items)
new_log, metadata, items = await fetcher.fetch_feed(
    url=url,
    etag=fetch_log.etag,
    last_modified=fetch_log.last_modified
)

Respect TTL Values

Honor feed TTL (Time To Live) for update frequency:

from datetime import datetime, timedelta

if metadata.ttl:
    # Wait TTL minutes before next fetch
    next_fetch = datetime.now() + timedelta(minutes=metadata.ttl)
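
Feeds often omit the TTL, so a scheduler usually needs a fallback. The helper below is a minimal sketch of that logic; the function name and the one-hour default are assumptions, not part of the library.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_fetch_time(ttl_minutes: Optional[int], now: datetime,
                    default: timedelta = timedelta(hours=1)) -> datetime:
    """Return the earliest time the feed should be fetched again."""
    if ttl_minutes and ttl_minutes > 0:
        # The feed declared a TTL (in minutes), so honor it.
        return now + timedelta(minutes=ttl_minutes)
    # No usable TTL: fall back to the default polling interval.
    return now + default
```

Passing `now` in explicitly, for example `datetime.now(timezone.utc)`, keeps the function easy to test and avoids mixing naive and timezone-aware timestamps.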

Monitor Health Regularly

Check feed health scores to identify issues:

# Daily health check
ai-web-feeds analytics summary --topic ai

# Weekly CSV report
ai-web-feeds analytics export --output weekly-report.csv

Track Trends

Use analytics to identify patterns:

# Monthly trend analysis
ai-web-feeds analytics trending --date-range 30d

# Publication velocity monitoring
ai-web-feeds analytics velocity --date-range 30d

Generate Periodic Reports

Export analytics for monitoring:

# Weekly reports, named by week number so they can be archived for historical analysis
ai-web-feeds analytics export --output reports/week-$(date +%U).csv

Installation

Quick Setup Script

Use the automated setup script:

# Make executable
chmod +x setup-enhanced-features.sh

# Run setup
./setup-enhanced-features.sh

The script will:

Install core library with dependencies
Install CLI tool with dependencies
Verify installation
Display next steps

Manual Installation

Install the workspace packages with uv, then verify the CLI:

# 1. Install workspace packages
uv sync

# 2. Verify installation
ai-web-feeds --version
ai-web-feeds fetch --help
ai-web-feeds analytics --help

Code Organization

packages/ai_web_feeds/src/ai_web_feeds/
├── fetcher.py          # AdvancedFeedFetcher class
│   ├── FeedMetadata    # Metadata container (100+ fields)
│   ├── fetch_feed()    # Main fetch method
│   ├── _extract_*()    # Extraction helpers
│   └── _calculate_*()  # Quality scoring
│
├── analytics.py        # FeedAnalytics class
│   ├── get_overview_stats()
│   ├── get_*_distribution()
│   ├── get_quality_metrics()
│   ├── get_fetch_performance_stats()
│   ├── get_content_statistics()
│   ├── get_publishing_trends()
│   ├── get_feed_health_report()
│   ├── get_top_contributors()
│   └── generate_full_report()
│
apps/cli/ai_web_feeds/cli/commands/
├── fetch.py            # Fetch CLI commands
│   ├── fetch_one()     # Single feed fetch
│   └── fetch_all()     # Batch fetch
│
└── analytics.py        # Analytics CLI commands
    ├── show_overview()
    ├── show_distributions()
    ├── show_quality()
    ├── show_performance()
    ├── show_content()
    ├── show_trends()
    ├── show_health()
    ├── show_contributors()
    └── generate_report()

Future Enhancements

Potential additions for future versions:

Web UI dashboard with real-time metrics
Machine learning for content classification
Real-time monitoring with webhooks
GraphQL API for analytics
Advanced deduplication algorithms
Content similarity analysis
Multi-language NLP support
Anomaly detection in publishing patterns
Automated quality recommendations

Support

For technical questions or issues:

Review this documentation
Check inline code documentation
Explore CLI help: ai-web-feeds --help
Open an issue on GitHub

Feature Overview - High-level feature list
Getting Started - Setup and quickstart
Analytics Guide - Analytics usage guide
