
    Implementation Details

    Technical implementation details for advanced feed fetching and analytics

    Source: apps/web/content/docs/development/implementation.mdx

    Overview

    This document describes the technical implementation of the comprehensive feed fetching and analytics system added to AI Web Feeds in version 1.0.

    This is the first version of these capabilities, designed from scratch with performance and extensibility in mind.

    Architecture

    The enhanced system consists of three main components:

    Feed URL → AdvancedFeedFetcher → FeedMetadata + Items
                                            ↓
                                      DatabaseManager
                                            ↓
                                      FeedAnalytics
                                            ↓
                                      CLI Commands

    Core Components

    1. Advanced Feed Fetcher

    Location: packages/ai_web_feeds/src/ai_web_feeds/fetcher.py (820 lines)

    A feed fetcher that extracts detailed metadata from RSS, Atom, and JSON feeds.

    Key Features

    100+ Metadata Fields

    The fetcher extracts comprehensive metadata organized in groups:

    Basic Feed Information:

    • Title, subtitle, description
    • Homepage link
    • Language and copyright
    • Generator information

    Author/Publisher Data:

    • Author name and email
    • Publisher information
    • Managing editor
    • Webmaster contact

    Visual Assets:

    • Feed images (URL, title, link)
    • Logo and icon URLs
    • Dimensions and alt text

    Technical Metadata:

    • TTL (Time To Live)
    • Skip hours and skip days
    • Cloud configuration
    • PubSubHubbub hub URLs
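
    Skip hours and skip days tell aggregators when a feed should not be polled. The sketch below shows one way a scheduler might honor them, assuming the values have already been parsed into a set of GMT hours (0-23) and a set of English weekday names, as RSS 2.0 defines skipHours and skipDays. The helper name should_skip_fetch is hypothetical and is not part of the fetcher's API.

```python
from datetime import datetime, timezone

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def should_skip_fetch(now, skip_hours=frozenset(), skip_days=frozenset()):
    """Return True when `now` falls in a publisher-declared quiet period.

    `now` should be timezone-aware; it is converted to UTC because
    skipHours and skipDays are expressed in GMT.
    """
    now_utc = now.astimezone(timezone.utc)
    if now_utc.hour in skip_hours:
        return True
    return DAY_NAMES[now_utc.weekday()] in skip_days

# Monday 2023-11-13, 03:30 UTC
moment = datetime(2023, 11, 13, 3, 30, tzinfo=timezone.utc)
print(should_skip_fetch(moment, skip_hours={3}))
print(should_skip_fetch(moment, skip_days={"Sunday"}))
```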

    Content Statistics:

    • Total item count
    • Items with full content
    • Items with authors
    • Items with enclosures/media
    • Average title/description/content lengths
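
    As an illustration of how such statistics can be derived, the sketch below summarizes a list of items represented as plain dicts. The key names ("title", "content", "author", "enclosures") and the function name are assumptions made for the example; the fetcher's real item model may differ.

```python
from statistics import mean

def content_statistics(items):
    """Summarize a list of item dicts (hypothetical keys, for illustration)."""
    titles = [len(item.get("title") or "") for item in items]
    contents = [len(item.get("content") or "") for item in items]
    return {
        "total_items": len(items),
        "items_with_content": sum(1 for n in contents if n > 0),
        "items_with_authors": sum(1 for item in items if item.get("author")),
        "items_with_enclosures": sum(1 for item in items if item.get("enclosures")),
        # mean() raises on an empty list, so fall back to 0.0
        "avg_title_length": mean(titles) if titles else 0.0,
        "avg_content_length": mean(contents) if contents else 0.0,
    }

sample = [
    {"title": "Hello", "content": "Full text here", "author": "Ada"},
    {"title": "Hi", "content": "", "enclosures": ["https://example.com/a.mp3"]},
]
stats = content_statistics(sample)
print(stats)
```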

    Three-Dimensional Quality Scoring

    Each feed receives scores (0-1) across three dimensions:

    1. Completeness Score

    Measures how complete the feed metadata is:

    • ✅ Has title
    • ✅ Has description
    • ✅ Has link
    • ✅ Has language
    • ✅ Has timestamps
    • ✅ Has author/publisher
    • ✅ Has canonical topics and raw feed labels
    • ✅ Has image/logo

    # Example calculation
    completeness = sum([
        bool(feed.title),      # 1/8
        bool(feed.description), # 1/8
        bool(feed.link),       # 1/8
        bool(feed.language),   # 1/8
        # ... etc
    ]) / 8.0
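
    The outline above elides four of the eight checks. A self-contained sketch that spells out all eight might look like the following. The FeedInfo class is a hypothetical stand-in for the fetcher's metadata object, and the field names beyond title, description, link, and language are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FeedInfo:
    """Hypothetical stand-in for the fetcher's metadata object."""
    title: Optional[str] = None
    description: Optional[str] = None
    link: Optional[str] = None
    language: Optional[str] = None
    updated: Optional[str] = None      # assumed name for the timestamp field
    author: Optional[str] = None       # assumed name for author/publisher
    topics: list = field(default_factory=list)
    image_url: Optional[str] = None    # assumed name for image/logo

def completeness_score(feed):
    """Fraction of the eight checklist items that are present (0.0 to 1.0)."""
    checks = [
        feed.title, feed.description, feed.link, feed.language,
        feed.updated, feed.author, feed.topics, feed.image_url,
    ]
    return sum(bool(value) for value in checks) / len(checks)

feed = FeedInfo(title="Example", description="Demo feed",
                link="https://example.com", language="en")
print(completeness_score(feed))  # 4 of 8 checks pass
```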

    2. Richness Score

    Measures content quality and depth:

    • Items have content
    • Content coverage percentage
    • Author attribution
    • Average content length
    • Full content availability
    • Media/images present
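
    How these signals are weighted is not specified here. One plausible way to combine them is an equal-weight average of ratios, sketched below. The equal weights, the target_length of 1000 characters, and the function name are all assumptions for illustration, not the project's actual formula.

```python
def richness_score(total_items, items_with_content, items_with_authors,
                   items_with_media, avg_content_length, target_length=1000):
    """Average of four ratios in [0, 1]; weights and target_length are assumed."""
    if total_items == 0:
        return 0.0
    ratios = [
        items_with_content / total_items,              # content coverage
        items_with_authors / total_items,              # author attribution
        items_with_media / total_items,                # media/images present
        min(avg_content_length / target_length, 1.0),  # depth, capped at 1.0
    ]
    return sum(ratios) / len(ratios)

print(richness_score(10, 8, 5, 2, 500))
```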

    3. Structure Score

    Measures feed structure quality:

    • No parsing errors
    • Has items
    • Items have GUIDs
    • Has timestamps
    • Has links

    Publishing Frequency Detection

    Automatically analyzes item publication patterns to estimate update frequency:

    Frequency     Pattern
    Hourly        New items every hour or less
    Daily         New items published daily
    Weekly        Weekly publication schedule
    Monthly       Monthly updates
    Infrequent    Longer intervals between posts

    # Algorithm outline
    def estimate_update_frequency(items):
        if not items or len(items) < 2:
            return "unknown"

        # Time between consecutive publications, in seconds
        intervals = calculate_intervals(items)
        # The median is less sensitive to a few long gaps than the mean
        median_interval = median(intervals)

        # Classify based on the median interval
        if median_interval < 3600:      # < 1 hour
            return "hourly"
        elif median_interval < 86400:   # < 1 day
            return "daily"
        # ... etc
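
    A runnable version of this outline, under stated assumptions, is sketched below. It takes publication datetimes directly rather than item objects, and the weekly and monthly cutoffs (7 and 31 days) are assumptions that extend the hourly/daily pattern; the source does not give them.

```python
from datetime import datetime, timedelta
from statistics import median

HOUR, DAY = 3600, 86400

def estimate_update_frequency(timestamps):
    """Classify a feed by the median gap between consecutive publication times."""
    if len(timestamps) < 2:
        return "unknown"
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    typical_gap = median(gaps)
    if typical_gap < HOUR:
        return "hourly"
    if typical_gap < DAY:
        return "daily"
    if typical_gap < 7 * DAY:    # assumed cutoff
        return "weekly"
    if typical_gap < 31 * DAY:   # assumed cutoff
        return "monthly"
    return "infrequent"

start = datetime(2024, 1, 1)
every_three_days = [start + timedelta(days=3 * i) for i in range(5)]
print(estimate_update_frequency(every_three_days))
```

    Because the comparisons are strict, a feed that publishes exactly once per period lands in the next slower class; a production classifier would probably add some tolerance around each cutoff.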

    Extension Support

    Full support for popular RSS extensions:

    iTunes Podcast Metadata:

    • Author, owner, genre labels
    • Explicit flag
    • Episode information
    • Artwork URLs

    Dublin Core Metadata:

    • Contributor, coverage
    • Creator, date
    • Format, identifier
    • Rights, source

    Media RSS:

    • Thumbnails with dimensions
    • Media content
    • Keywords and descriptions
    • Credit information

    GeoRSS:

    • Location coordinates
    • Geographic regions
    • Place names

    Usage Example

    import asyncio

    from ai_web_feeds.fetcher import AdvancedFeedFetcher
    from ai_web_feeds.storage import DatabaseManager


    async def main():
        # Initialize
        db = DatabaseManager("sqlite:///data/ai-web-feeds.db")
        fetcher = AdvancedFeedFetcher()

        # Fetch feed (fetch_feed is awaited, so it must run inside a coroutine)
        fetch_log, metadata, items = await fetcher.fetch_feed(
            "https://example.com/feed.xml"
        )

        # Access quality scores
        print(f"Completeness: {metadata.completeness_score:.2f}")
        print(f"Richness: {metadata.richness_score:.2f}")
        print(f"Structure: {metadata.structure_score:.2f}")

        # Access metadata
        print(f"Update frequency: {metadata.estimated_update_frequency}")
        print(f"Total items: {metadata.total_items}")
        print(f"Found {len(items)} items")

        # Save the fetch log to the database
        session = db.get_session()
        session.add(fetch_log)
        session.commit()


    asyncio.run(main())

    Conditional Requests

    The fetcher supports conditional HTTP requests to reduce bandwidth:

    # Use ETag and Last-Modified from previous fetch
    fetch_log, metadata, items = await fetcher.fetch_feed(
        url="https://example.com/feed.xml",
        etag="33a64df551425fcc55e4d42a148795d9f25f89d4",
        last_modified="Wed, 15 Nov 2023 12:00:00 GMT"
    )
    
    # Returns 304 Not Modified if feed hasn't changed
    if fetch_log.status_code == 304:
        print("Feed unchanged")
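
    How the fetcher maps these arguments onto the request is not shown here. Under standard HTTP semantics, a stored ETag is sent back as If-None-Match and a stored Last-Modified value as If-Modified-Since. A minimal sketch of that mapping follows; the helper name conditional_headers is hypothetical.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build the request headers that let a server answer 304 Not Modified."""
    headers = {}
    if etag:
        # Send the validator back exactly as the server issued it
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

headers = conditional_headers(
    etag='"33a64df551425fcc55e4d42a148795d9f25f89d4"',
    last_modified="Wed, 15 Nov 2023 12:00:00 GMT",
)
print(headers)
```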

    Retry Logic

    Built-in exponential backoff for transient failures:

    # Automatic retries (configured via tenacity)
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10)
    )
    async def fetch_with_retry(url):
        # Makes at most 3 attempts in total (the first call plus 2 retries)
        # Waits between attempts grow exponentially, clamped to 2-10s
        pass
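
    To make the schedule concrete, the sketch below computes the waits for a generic exponential backoff clamped between a minimum and a maximum. It is an illustration of the technique using only the standard library, not a reproduction of tenacity's internal formula, so the exact wait values should be confirmed against the installed tenacity version. The function name and parameters are hypothetical.

```python
def backoff_delays(attempts, base=2.0, minimum=2.0, maximum=10.0):
    """Delays in seconds between consecutive attempts.

    Each delay is base**n clamped to [minimum, maximum]. With `attempts`
    total tries there are only `attempts - 1` waits.
    """
    return [max(minimum, min(base ** n, maximum)) for n in range(1, attempts)]

print(backoff_delays(3))  # → [2.0, 4.0]
print(backoff_delays(6))  # → [2.0, 4.0, 8.0, 10.0, 10.0]
```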

    2. Analytics Engine

    Location: packages/ai_web_feeds/src/ai_web_feeds/analytics.py (600 lines)

    An analytics engine that provides eight analytical views of feed data.

    Generate Full Report

    import json

    # `analytics` is an existing FeedAnalytics instance (see analytics.py)
    # Export everything to JSON
    report = analytics.generate_full_report()

    # Save to file
    with open("analytics.json", "w") as f:
        json.dump(report, f, indent=2)
    
    # Report includes all 8 analytics views
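
    The structure of the report is not shown here. If it contains values that the json module cannot encode natively, such as datetime objects (an assumption), json.dump raises TypeError. A sketch of a fallback encoder passed through the standard default parameter follows; the function name and the report fragment are hypothetical.

```python
import json
from datetime import datetime, timezone

def encode_extra(value):
    """Fallback for json.dumps: ISO 8601 for datetimes, TypeError otherwise."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")

# Hypothetical report fragment
report = {
    "generated_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "total_feeds": 3,
}
text = json.dumps(report, default=encode_extra, indent=2)
print(text)
```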

    3. CLI Commands

    Fetch Commands

    Location: apps/cli/ai_web_feeds/cli/commands/fetch.py

    Fetch Single Feed

    uv run ai-web-feeds fetch one <feed-id>

    Fetches a single feed through the current polling pipeline:

    # Basic fetch
    uv run ai-web-feeds fetch one openai-blog

    Features:

    • Progress indicator
    • Error reporting
    • Article storage through the v3 articles table
    • Response-time and discovered-article summary

    Fetch All Feeds

    uv run ai-web-feeds fetch all [--limit N] [--verified-only]

    Batch fetch with progress tracking:

    # Fetch all feeds
    uv run ai-web-feeds fetch all
    
    # Fetch first 10 feeds
    uv run ai-web-feeds fetch all --limit 10
    
    # Fetch only verified feeds
    uv run ai-web-feeds fetch all --verified-only

    Features:

    • Rich progress bar
    • Real-time stats
    • Error summary table
    • Success/failure counts

    Analytics Commands

    Location: apps/cli/ai_web_feeds/cli/commands/analytics.py (400 lines)

    Summary Metrics

    ai-web-feeds analytics summary --date-range 30d

    Displays summary metrics:

    • Total feeds and active feeds
    • Validation success rate
    • Average response time
    • Feed health distribution

    Trending Topics

    ai-web-feeds analytics trending --limit 10 --date-range 30d

    Shows topic activity:

    • Topic IDs
    • Feed counts
    • Validation frequency
    • Average health score

    Publication Velocity

    ai-web-feeds analytics velocity --date-range 30d

    Publishing metrics:

    • Total article count
    • Average articles per day
    • Daily publication buckets
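
    These three metrics can be derived from article timestamps alone. The sketch below buckets articles by calendar day and averages over the window; the function name, its parameters, and the output keys are assumptions for illustration and do not describe the command's actual implementation.

```python
from collections import Counter
from datetime import date, datetime

def publication_velocity(published, start, end):
    """Count articles per calendar day in [start, end] and average over the window."""
    buckets = Counter(ts.date() for ts in published if start <= ts.date() <= end)
    window_days = (end - start).days + 1   # inclusive window
    total = sum(buckets.values())
    return {
        "total_articles": total,
        "avg_per_day": total / window_days,
        "daily": dict(sorted(buckets.items())),
    }

articles = [datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 17), datetime(2024, 1, 3, 8)]
result = publication_velocity(articles, date(2024, 1, 1), date(2024, 1, 4))
print(result["total_articles"], result["avg_per_day"])  # → 3 0.75
```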

    Daily Snapshot

    ai-web-feeds analytics snapshot

    Stores summary and trending-topic records for historical analytics.

    CSV Export

    ai-web-feeds analytics export --output reports/analytics.csv

    Export analytics records for reports and dashboards.

    Database Schema

    The enhanced system keeps the existing database schema and makes full use of its flexible JSON columns:

    FeedFetchLog Enhancements

    class FeedFetchLog(SQLModel, table=True):
        # ... existing fields ...
    
        # Enhanced usage of extra_data
        extra_data: Optional[Dict[str, Any]] = Field(
            default=None,
            sa_column=Column(JSON)
        )
        # Now stores:
        # - Complete HTTP headers
        # - Detailed error information
        # - Item statistics
        # - Quality scores
        # - Extension metadata

    ArticleEntry Enhancements

    class ArticleEntry(SQLModel, table=True):
        __tablename__ = "articles"
    
        topics: list[str] = Field(
            default_factory=list,
            sa_column=Column(JSON)
        )
        raw_categories: list[str] = Field(
            default_factory=list,
            sa_column=Column(JSON)
        )
        nlp_failures: dict[str, int] = Field(
            default_factory=dict,
            sa_column=Column(JSON)
        )
        # Now stores:
        # - Canonical topic IDs
        # - Raw feed category labels as ingress metadata
        # - NLP retry/failure accounting

    The v3 contract is applied through reviewed Alembic migrations; JSON columns remain only where the public model needs structured metadata.

    Dependencies

    New Dependencies Added

    Core Library Dependencies

    File: packages/ai_web_feeds/pyproject.toml

    dependencies = [
        # ... existing ...
        "beautifulsoup4>=4.12.0",  # NEW: HTML parsing
    ]

    Purpose:

    • HTML parsing for feed discovery
    • Extracting feed URLs from web pages
    • Parsing HTML content in feed items

    CLI Tool Dependencies

    File: apps/cli/pyproject.toml

    dependencies = [
        # ... existing ...
        "rich>=13.7.0",  # NEW: Rich terminal output
    ]

    Purpose:

    • Beautiful terminal tables
    • Progress bars and spinners
    • Colored output and styling
    • Markdown rendering in terminal

    Performance Considerations

    Conditional Requests

    Reduce bandwidth and processing for unchanged feeds:

    # Store from previous fetch
    etag = fetch_log.etag
    last_modified = fetch_log.last_modified
    
    # Use in next fetch
    new_log, metadata, items = await fetcher.fetch_feed(
        url=feed_url,
        etag=etag,
        last_modified=last_modified
    )
    
    # Server returns 304 Not Modified if unchanged
    if new_log.status_code == 304:
        # No processing needed
        return

    Retry Logic

    Exponential backoff for reliability:

    from tenacity import (
        retry,
        stop_after_attempt,
        wait_exponential
    )
    
    @retry(
        stop=stop_after_attempt(3),  # Max 3 attempts
        wait=wait_exponential(
            multiplier=1,
            min=2,    # Wait at least 2s between attempts
            max=10    # Wait at most 10s between attempts
        )
    )
    async def fetch_with_retry(url):
        # Automatic retry on failure
        pass

    Timeouts

    Prevent hanging on slow feeds:

    # Configurable timeout (default 30s)
    fetcher = AdvancedFeedFetcher(timeout=30.0)
    
    # Per-request timeout
    fetch_log, metadata, items = await fetcher.fetch_feed(
        url=feed_url,
        timeout=60.0  # Override for slow feed
    )
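
    How the fetcher enforces its timeout internally is not described here. In asyncio code generally, one common way to bound a slow call is asyncio.wait_for, sketched below with a stand-in coroutine instead of a real network request. Both function names are hypothetical.

```python
import asyncio

async def slow_fetch(delay):
    """Stand-in for a network call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return "ok"

async def fetch_with_timeout(delay, timeout):
    """Return the result, or None if the call exceeds `timeout` seconds."""
    try:
        return await asyncio.wait_for(slow_fetch(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising
        return None

print(asyncio.run(fetch_with_timeout(delay=0.01, timeout=1.0)))
print(asyncio.run(fetch_with_timeout(delay=1.0, timeout=0.01)))
```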

    Best Practices

    Use Conditional Requests

    Always pass etag and last_modified from previous fetches to reduce bandwidth:

    # Persist the log from the previous fetch
    session.add(fetch_log)

    # Reuse its validators in the next fetch (fetch_feed returns a 3-tuple)
    new_log, metadata, items = await fetcher.fetch_feed(
        url=url,
        etag=fetch_log.etag,
        last_modified=fetch_log.last_modified
    )

    Respect TTL Values

    Honor feed TTL (Time To Live) for update frequency:

    from datetime import datetime, timedelta

    if metadata.ttl:
        # Wait TTL minutes before next fetch (RSS TTL is expressed in minutes)
        next_fetch = datetime.now() + timedelta(minutes=metadata.ttl)
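
    Many feeds do not declare a TTL, so a scheduler also needs a fallback. The sketch below wraps the same calculation in a helper with a default interval. The one-hour default and the function name are assumptions, not project settings.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_INTERVAL = timedelta(hours=1)  # assumed fallback, not a project setting

def next_fetch_time(last_fetch, ttl_minutes=None, default=DEFAULT_INTERVAL):
    """Schedule the next fetch from the feed's TTL, or a default when it has none."""
    if ttl_minutes and ttl_minutes > 0:
        return last_fetch + timedelta(minutes=ttl_minutes)
    return last_fetch + default

fetched = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_fetch_time(fetched, ttl_minutes=90))
print(next_fetch_time(fetched))
```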

    Monitor Health Regularly

    Check feed health scores to identify issues:

    # Daily health check
    ai-web-feeds analytics summary --topic ai
    
    # Weekly CSV report
    ai-web-feeds analytics export --output weekly-report.csv

    Use analytics to identify patterns:

    # Monthly trend analysis
    ai-web-feeds analytics trending --date-range 30d
    
    # Publication velocity monitoring
    ai-web-feeds analytics velocity --date-range 30d

    Generate Periodic Reports

    Export analytics for monitoring:

    # Weekly reports
    ai-web-feeds analytics export --output reports/week-$(date +%U).csv
    
    # Archive for historical analysis

    Installation

    Quick Setup Script

    Use the automated setup script:

    # Make executable
    chmod +x setup-enhanced-features.sh
    
    # Run setup
    ./setup-enhanced-features.sh

    The script will:

    1. Install core library with dependencies
    2. Install CLI tool with dependencies
    3. Verify installation
    4. Display next steps

    Manual Installation

    Sync the workspace packages, then verify the installation:

    # 1. Install workspace packages
    uv sync
    
    # 2. Verify installation
    ai-web-feeds --version
    ai-web-feeds fetch --help
    ai-web-feeds analytics --help

    Code Organization

    packages/ai_web_feeds/src/ai_web_feeds/
    ├── fetcher.py          # AdvancedFeedFetcher class
    │   ├── FeedMetadata    # Metadata container (100+ fields)
    │   ├── fetch_feed()    # Main fetch method
    │   ├── _extract_*()    # Extraction helpers
    │   └── _calculate_*()  # Quality scoring
    
    ├── analytics.py        # FeedAnalytics class
    │   ├── get_overview_stats()
    │   ├── get_*_distribution()
    │   ├── get_quality_metrics()
    │   ├── get_fetch_performance_stats()
    │   ├── get_content_statistics()
    │   ├── get_publishing_trends()
    │   ├── get_feed_health_report()
    │   ├── get_top_contributors()
    │   └── generate_full_report()
    
    apps/cli/ai_web_feeds/cli/commands/
    ├── fetch.py            # Fetch CLI commands
    │   ├── fetch_one()     # Single feed fetch
    │   └── fetch_all()     # Batch fetch
    
    └── analytics.py        # Analytics CLI commands
        ├── show_overview()
        ├── show_distributions()
        ├── show_quality()
        ├── show_performance()
        ├── show_content()
        ├── show_trends()
        ├── show_health()
        ├── show_contributors()
        └── generate_report()

    Future Enhancements

    Potential additions for future versions:

    • Web UI dashboard with real-time metrics
    • Machine learning for content classification
    • Real-time monitoring with webhooks
    • GraphQL API for analytics
    • Advanced deduplication algorithms
    • Content similarity analysis
    • Multi-language NLP support
    • Anomaly detection in publishing patterns
    • Automated quality recommendations

    Support

    For technical questions or issues:

    1. Review this documentation
    2. Check inline code documentation
    3. Explore CLI help: ai-web-feeds --help
    4. Open an issue on GitHub