    Feed Schema Reference

    Source: apps/web/content/docs/guides/feed-schema.mdx

    Complete reference documentation for the feeds.yaml schema (feeds-1.0.0).

    Overview

    The feed schema balances contributor ergonomics with strict machine validation. It supports:

    • Direct feed URLs or site-based discovery
    • Canonical topic classification
    • Platform-specific configurations (Reddit, Medium, YouTube, etc.)
    • Rich metadata and curation tracking
    • Cross-feed relationships and deduplication

    Schema Location: data/feeds.schema.json

    Top-Level Structure

    schema_version: "feeds-1.0.0"
    
    document_meta:
      created: "2025-10-15"
      updated: "2025-10-15"
      generated_with:
        tool: "aiwebfeeds-cli"
        version: "0.1.0"
      notes: "Optional description"
    
    sources:
      - id: "feed-1"
        feed: "https://example.com/feed.xml"
        # ... feed properties
    
      - id: "feed-2"
        site: "https://example.org"
        discover: true
        # ... feed properties

    Alternative: Grouped Structure

    schema_version: "feeds-1.0.0"
    
    groups:
      OpenAI:
        - id: "openai-blog"
          feed: "https://openai.com/blog/rss/"
          # ...
    
      HuggingFace:
        - id: "hf-blog"
          feed: "huggingface/blog"
          # ...
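
    Tools that read the file may prefer one flat list regardless of which layout was used. A minimal Python sketch, assuming groups maps a group name to a list of source records as in the example above. The helper name flatten_groups and the added "group" key are illustrative and not part of the schema:

```python
def flatten_groups(document: dict) -> list[dict]:
    """Return every source record from either layout as one flat list."""
    records = [dict(record) for record in document.get("sources") or []]
    for group_name, entries in (document.get("groups") or {}).items():
        for entry in entries:
            # "group" is a hypothetical bookkeeping key, not a schema field.
            records.append({**entry, "group": group_name})
    return records


document = {
    "schema_version": "feeds-1.0.0",
    "groups": {
        "OpenAI": [{"id": "openai-blog", "feed": "https://openai.com/blog/rss/"}],
        "HuggingFace": [{"id": "hf-blog", "feed": "huggingface/blog"}],
    },
}
flat = flatten_groups(document)
print([(record["group"], record["id"]) for record in flat])
```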

    Required Properties

    Every source record must include at least one of the following properties:

    Property    Type              Description
    feed        String            Direct feed URL, alias, or CURIE
    site        String            Homepage URL (triggers discovery)
    discover    Boolean/Object    Discovery configuration

    Additional Requirements

    • id - Recommended (not required) so the record has a stable reference
    • topics - Required, with at least one ID from the canonical list
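
    The two rules above can be checked before running full schema validation. This is a rough sketch, not the project's validator. It reads "one of" as "at least one of" (site may coexist with feed), and the function name missing_requirements is hypothetical:

```python
def missing_requirements(record: dict) -> list[str]:
    """List the required-property rules that a source record breaks."""
    problems = []
    if not any(key in record for key in ("feed", "site", "discover")):
        problems.append("needs at least one of: feed, site, discover")
    if not record.get("topics"):
        problems.append("needs at least one topic")
    return problems


complete = {"id": "example-minimal", "feed": "https://example.com/feed.xml", "topics": ["ml"]}
incomplete = {"id": "incomplete"}
print(missing_requirements(complete))
print(missing_requirements(incomplete))
```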

    Feed Source Properties

    Core Identification

    id

    Stable unique identifier (slug format).

    id: "example-blog"

    Rules:

    • Pattern: ^[a-z0-9._-]+$
    • Lowercase alphanumeric, dots, underscores, hyphens
    • Should be stable (don't change once published)
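
    A quick way to test an ID against this pattern in Python. The helper name is_valid_id is illustrative:

```python
import re

# Same character class as the schema pattern ^[a-z0-9._-]+$
ID_PATTERN = re.compile(r"[a-z0-9._-]+")


def is_valid_id(value: str) -> bool:
    # fullmatch anchors both ends. With re.match and "$", a value with a
    # trailing newline such as "example-blog\n" would also be accepted.
    return ID_PATTERN.fullmatch(value) is not None


for candidate in ("example-blog", "Example Blog", "arxiv.cs_ai-2"):
    print(candidate, is_valid_id(candidate))
```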

    feed

    Direct feed URL, short alias, or CURIE reference.

    Examples:

    # Direct URL
    feed: "https://openai.com/blog/rss/"
    
    # Short alias (resolved via data/feed_aliases.yaml)
    feed: "huggingface/blog"
    
    # CURIE reference
    feed: "wikidata:Q2539"

    Formats:

    • HTTP(S) URLs: ^https?://
    • Aliases: ^[a-z0-9._-]+/[a-z0-9._-]+$
    • CURIEs: ^[a-z][a-z0-9._-]*:[^\s]+$
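
    The three patterns overlap: a URL such as https://openai.com/blog/rss/ also satisfies the CURIE pattern, because "https" is a valid CURIE prefix. A resolver therefore has to test for URLs first. The sketch below shows one way to do that. The function name classify_feed and the "unknown" fallback are assumptions, not part of the schema:

```python
import re

URL_PREFIX = re.compile(r"https?://")
ALIAS = re.compile(r"[a-z0-9._-]+/[a-z0-9._-]+")
CURIE = re.compile(r"[a-z][a-z0-9._-]*:[^\s]+")


def classify_feed(value: str) -> str:
    """Return "url", "alias", "curie", or "unknown" for a feed value."""
    if URL_PREFIX.match(value):  # prefix test, checked before CURIE
        return "url"
    if ALIAS.fullmatch(value):
        return "alias"
    if CURIE.fullmatch(value):
        return "curie"
    return "unknown"


for value in ("https://openai.com/blog/rss/", "huggingface/blog", "wikidata:Q2539"):
    print(value, "->", classify_feed(value))
```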

    site

    Homepage or section URL. When provided without feed, triggers discovery.

    site: "https://example.com/blog"
    discover: true

    Rules:

    • Must be valid HTTP(S) URL
    • Used for discovery if feed is not provided
    • Can coexist with feed for cross-reference

    title

    Descriptive title for the feed.

    title: "OpenAI Blog - Latest Research"

    Rules:

    • Min length: 1
    • Max length: 160 characters
    • Should be clear and descriptive

    Discovery Configuration

    discover

    Controls automatic feed discovery.

    Simple Boolean:

    discover: true   # Enable default discovery
    discover: false  # Disable discovery

    Advanced Object:

    discover:
      backend: "default" # default | feedparser | rsshub | browserless
      strategy: "html-link" # auto | html-link | rsshub | well-known
      strategy_detail: "Optional hint for tuning"
      hints: ["rss", "atom", "blog"]
      limit: 3

    Properties:

    Property           Type       Description
    backend            String     Discovery backend engine
    strategy           String     Discovery method
    strategy_detail    String     Freeform hint (max 160 chars)
    hints              Array      Search keywords (max 8)
    limit              Integer    Max feeds to find (1-10)
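
    Because discover accepts two shapes, consuming code typically normalizes it first. A minimal sketch that also enforces the limits listed above. Treating the object form as "enabled" is an assumption, and normalize_discover is a hypothetical name:

```python
def normalize_discover(value):
    """Return (enabled, options) for either form of the discover property."""
    if isinstance(value, bool):
        return value, {}
    if not isinstance(value, dict):
        raise TypeError("discover must be a boolean or an object")
    limit = value.get("limit")
    if limit is not None and not (isinstance(limit, int) and 1 <= limit <= 10):
        raise ValueError("limit must be an integer from 1 to 10")
    if len(value.get("hints", [])) > 8:
        raise ValueError("at most 8 hints are allowed")
    if len(value.get("strategy_detail", "")) > 160:
        raise ValueError("strategy_detail is limited to 160 characters")
    # Assumption: supplying an object means discovery is enabled.
    return True, dict(value)


print(normalize_discover(True))
print(normalize_discover({"strategy": "html-link", "hints": ["rss", "atom"], "limit": 3}))
```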

    Topics and Classification

    topics

    Array of 1-6 canonical topic IDs from data/topics.yaml.

    topics: ["ml", "nlp", "open-source"]

    Rules:

    • Min items: 1
    • Max items: 6
    • Each ID must match: ^[a-z0-9]+(?:-[a-z0-9]+)*$
    • Must exist in canonical topics list

    Common Topics:

    ID             Description
    ml             Machine Learning
    nlp            Natural Language Processing
    cv             Computer Vision
    rl             Reinforcement Learning
    llm            Large Language Models
    research       Academic Research
    industry       Industry News
    open-source    Open Source Projects
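
    The topic rules can be checked together. In this sketch the canonical set is hard-coded from the table above for illustration. In practice it would be loaded from data/topics.yaml, whose layout is not described here. The name topic_errors is hypothetical:

```python
import re

TOPIC_ID = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# Illustrative stand-in for the IDs defined in data/topics.yaml.
CANONICAL = {"ml", "nlp", "cv", "rl", "llm", "research", "industry", "open-source"}


def topic_errors(topics: list[str], canonical: set[str] = CANONICAL) -> list[str]:
    """Return one message per broken rule; an empty list means valid."""
    errors = []
    if not 1 <= len(topics) <= 6:
        errors.append(f"expected 1-6 topics, got {len(topics)}")
    for topic in topics:
        if not TOPIC_ID.fullmatch(topic):
            errors.append(f"{topic!r} does not match the topic ID pattern")
        elif topic not in canonical:
            errors.append(f"{topic!r} is not in the canonical list")
    return errors


print(topic_errors(["ml", "nlp", "open-source"]))
print(topic_errors(["ML", "robotics"]))
```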

    topic_weights

    Optional relevance weights per topic (0-1 scale).

    topic_weights:
      ml: 0.95
      nlp: 0.80
      open-source: 0.60

    Rules:

    • Keys must be valid topic IDs
    • Values: 0.0 to 1.0
    • Higher = more relevant
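
    One use for the weights is ordering a feed's topics by relevance. A small sketch that checks the 0.0 to 1.0 range and then sorts. The function name rank_by_weight is illustrative:

```python
def rank_by_weight(topic_weights: dict[str, float]) -> list[str]:
    """Validate weights and return topic IDs from most to least relevant."""
    for topic, weight in topic_weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {topic!r} must be between 0.0 and 1.0")
    return sorted(topic_weights, key=topic_weights.get, reverse=True)


ranked = rank_by_weight({"nlp": 0.80, "ml": 0.95, "open-source": 0.60})
print(ranked)
```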

    Content Classification

    source_type

    Primary source type.

    source_type: "blog"

    Valid Types:

    Type            Description
    blog            Blog or article site
    newsletter      Email newsletter
    podcast         Audio podcast
    journal         Academic journal
    preprint        Preprint server (arXiv, etc.)
    organization    Company/org announcements
    aggregator      News aggregator
    video           Video platform
    docs            Documentation site
    forum           Discussion forum
    dataset         Dataset repository
    code-repo       Code repository
    newsroom        News organization
    education       Educational content
    reddit          Reddit community
    medium          Medium publication
    youtube         YouTube channel
    github          GitHub repository
    substack        Substack newsletter
    devto           Dev.to publication
    hackernews      Hacker News

    mediums

    Content modalities (max 5).

    mediums: ["text", "code", "video"]

    Valid Values:

    • text - Written content
    • audio - Podcasts, audio recordings
    • video - Video content
    • code - Source code, notebooks
    • data - Datasets, data files

    tags

    Freeform tags for filtering (max 12).

    tags: ["official", "community", "tutorials"]

    Rules:

    • Pattern: ^[a-z0-9-]{1,32}$
    • Lowercase, alphanumeric, hyphens
    • Max 12 tags
    • Unique values
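
    Free-text labels can be brought into this shape before they are written to the file. The normalization below (lowercasing, replacing runs of other characters with hyphens, dropping duplicates) is a possible pre-processing step and not something the schema performs. The name clean_tags is hypothetical:

```python
import re

TAG = re.compile(r"[a-z0-9-]{1,32}")


def clean_tags(raw: list[str]) -> list[str]:
    """Slugify labels, drop duplicates in order, then apply the tag rules."""
    tags = []
    for label in raw:
        slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
        if slug and slug not in tags:
            tags.append(slug)
    too_long = [tag for tag in tags if not TAG.fullmatch(tag)]
    if too_long:
        raise ValueError(f"tags do not match the pattern: {too_long}")
    if len(tags) > 12:
        raise ValueError("at most 12 tags are allowed")
    return tags


cleaned = clean_tags(["Official", "community", "official", "How To"])
print(cleaned)
```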

    Platform-Specific Configuration

    platform_config

    Platform-specific settings for Reddit, Medium, YouTube, GitHub, etc.

    Reddit Example:

    platform_config:
      platform: "reddit"
      reddit:
        subreddit: "MachineLearning"
        sort: "hot" # hot | new | top | rising
        time: "day" # hour | day | week | month | year | all

    Medium Example:

    platform_config:
      platform: "medium"
      medium:
        publication: "towards-data-science"
        # OR
        username: "@username"
        # OR
        tag: "machine-learning"

    YouTube Example:

    platform_config:
      platform: "youtube"
      youtube:
        channel_id: "UCbfYPyITQ-7l4upoX8nvctg"
        # OR
        playlist_id: "PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf"
        # OR
        username: "TwoMinutePapers"

    GitHub Example:

    platform_config:
      platform: "github"
      github:
        owner: "pytorch"
        repo: "pytorch"
        feed_type: "releases" # releases | commits | tags | activity
        branch: "main" # optional, for commits

    Substack Example:

    platform_config:
      platform: "substack"
      substack:
        publication: "importai"

    Dev.to Example:

    platform_config:
      platform: "devto"
      devto:
        username: "username"
        # OR
        organization: "org-name"
        # OR
        tag: "machinelearning"

    Hacker News Example:

    platform_config:
      platform: "hackernews"
      hackernews:
        username: "pg"
        # OR
        feed_type: "frontpage" # frontpage | newest | best | ask | show | jobs
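
    For several platforms these settings are enough to derive a feed URL. The sketch below uses the public feed URL patterns of YouTube, GitHub, and Substack. Whether AI Web Feeds resolves platform_config this way is an assumption, as are the function name platform_feed_url and the fallback values for feed_type and branch:

```python
from urllib.parse import urlencode


def platform_feed_url(config: dict) -> str:
    """Build a feed URL from a platform_config mapping (partial sketch)."""
    platform = config["platform"]
    opts = config.get(platform, {})
    if platform == "youtube":
        for key in ("channel_id", "playlist_id"):
            if key in opts:
                query = urlencode({key: opts[key]})
                return f"https://www.youtube.com/feeds/videos.xml?{query}"
    if platform == "github":
        base = f"https://github.com/{opts['owner']}/{opts['repo']}"
        feed_type = opts.get("feed_type", "releases")  # assumed default
        if feed_type == "commits":
            return f"{base}/commits/{opts.get('branch', 'main')}.atom"
        if feed_type in ("releases", "tags"):
            return f"{base}/{feed_type}.atom"
    if platform == "substack":
        return f"https://{opts['publication']}.substack.com/feed"
    raise ValueError(f"no URL rule for this {platform!r} configuration")


github = {"platform": "github", "github": {"owner": "pytorch", "repo": "pytorch", "feed_type": "releases"}}
print(platform_feed_url(github))
```

    Configurations the sketch does not cover (for example a YouTube username) raise ValueError instead of guessing a URL.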

    Metadata

    meta

    Feed-level metadata.

    meta:
      language: "en"
      format: "rss"
      updated: "2025-10-15"
      last_validated: "2025-10-15"
      verified: true
      contributor: "wyattowalsh"

    Properties:

    Property          Type       Description
    language          String     IETF BCP-47 code (e.g., 'en', 'en-US')
    format            String     rss | atom | jsonfeed | unknown
    updated           String     Last human review date (YYYY-MM-DD)
    last_validated    String     Last automated validation (YYYY-MM-DD)
    verified          Boolean    Trust/accuracy flag
    contributor       String     GitHub username (1-80 chars)
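
    The two date fields are plain YYYY-MM-DD strings, so they can be parsed with the standard library. A short sketch that rejects other date spellings and reports the age of the last automated validation. Both helper names are hypothetical:

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_meta_date(value: str) -> date:
    """Parse a YYYY-MM-DD string, rejecting any other spelling."""
    if not DATE_PATTERN.fullmatch(value):
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    return date.fromisoformat(value)  # also rejects impossible dates


def days_since_validation(meta: dict, today: date) -> int:
    """Number of days between meta.last_validated and the given day."""
    return (today - parse_meta_date(meta["last_validated"])).days


meta = {"updated": "2025-10-15", "last_validated": "2025-10-15"}
print(days_since_validation(meta, today=date(2025, 11, 14)))
```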

    Curation

    curation

    Curation status and quality metrics.

    curation:
      status: "verified"
      since: "2025-10-15"
      by: "wyattowalsh"
      quality_score: 0.95
      notes: "High-quality official blog"

    Properties:

    Property         Type      Description
    status           String    verified | unverified | archived | experimental | inactive
    since            String    Status assignment date (YYYY-MM-DD)
    by               String    Curator GitHub username
    quality_score    Number    0.0 to 1.0 quality rating
    notes            String    Curation notes (max 500 chars)

    Relationships

    relations

    Typed relationships between feeds.

    relations:
      mirror_of: "https://example.com/feed.json"
      derived_from: "example/parent"
      syndicates:
        - "https://feedburner.com/example"
        - "https://medium.com/feed/example"
      related_feeds:
        - "https://example.org/related.xml"

    Properties:

    Property         Type      Description
    mirror_of        String    Feed is a mirror/copy of another
    derived_from     String    Feed is derived from another source
    syndicates       Array     Syndicated to these feeds (max 8)
    related_feeds    Array     Related feeds (max 8)
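
    One way mirror_of could support deduplication is to drop a record when the feed it mirrors is itself listed. This is an illustration only. How AI Web Feeds actually deduplicates is not specified here, and drop_mirrors is a hypothetical name:

```python
def drop_mirrors(records: list[dict]) -> list[dict]:
    """Drop records that mirror a feed already present in the list."""
    known_feeds = {record["feed"] for record in records if "feed" in record}
    kept = []
    for record in records:
        target = record.get("relations", {}).get("mirror_of")
        if target is not None and target in known_feeds:
            continue  # the mirrored feed is listed, so skip the copy
        kept.append(record)
    return kept


records = [
    {"id": "example-json", "feed": "https://example.com/feed.json"},
    {
        "id": "example-mirror",
        "feed": "https://feedburner.com/example",
        "relations": {"mirror_of": "https://example.com/feed.json"},
    },
]
print([record["id"] for record in drop_mirrors(records)])
```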

    Provenance

    provenance

    Origin and licensing information.

    provenance:
      source: "manual" # manual | automation | import
      from: "https://example.com"
      license: "CC-BY-4.0"

    External Mappings

    mappings

    Links to external identifiers.

    mappings:
      schema_org: "https://schema.org/Blog"
      wikidata: "Q123456"
      huggingface: "datasets/example"
      crossref: "10.1234/example"

    Extensions

    extensions

    Forward-compatible custom fields.

    extensions:
      custom_field: "value"
      analytics:
        subscribers: 10000
        avg_posts_per_week: 3

    Rules:

    • Any structure allowed
    • Reserved for future features
    • Won't cause validation errors

    Notes

    notes

    Freeform notes about the feed.

    notes: "Official blog with weekly ML research summaries"

    Rules:

    • Max 500 characters
    • Markdown not supported

    Complete Examples

    Minimal Feed Entry

    - id: "example-minimal"
      feed: "https://example.com/feed.xml"
      topics: ["ml"]

    Comprehensive Feed Entry

    - id: "huggingface-blog"
      feed: "huggingface/blog"
      site: "https://huggingface.co/blog"
      title: "Hugging Face Blog"
    
      topics: ["open-source", "nlp", "ml"]
      topic_weights:
        open-source: 0.95
        nlp: 0.90
        ml: 0.80
    
      source_type: "blog"
      mediums: ["text", "code"]
      tags: ["official", "community", "tutorials"]
    
      meta:
        language: "en"
        format: "rss"
        updated: "2025-10-15"
        verified: true
        contributor: "wyattowalsh"
    
      curation:
        status: "verified"
        since: "2025-10-15"
        by: "wyattowalsh"
        quality_score: 0.98
        notes: "High-quality official blog"
    
      provenance:
        source: "manual"
        from: "https://huggingface.co"
        license: "CC-BY-4.0"
    
      mappings:
        wikidata: "Q107561822"
    
      notes: "Official Hugging Face blog with ML tutorials and research"

    Discovery-Based Entry

    - id: "arxiv-cs-ai"
      site: "https://arxiv.org/list/cs.AI/recent"
      discover:
        backend: "default"
        strategy: "html-link"
        hints: ["rss", "atom"]
        limit: 3
      title: "arXiv: Artificial Intelligence"
      topics: ["research", "papers", "ml"]
      source_type: "preprint"
      mediums: ["text", "data"]

    Platform-Specific Entry (Reddit)

    - id: "machinelearning-subreddit"
      site: "https://www.reddit.com/r/MachineLearning"
      title: "r/MachineLearning"
      topics: ["ml", "community"]
      source_type: "reddit"
    
      platform_config:
        platform: "reddit"
        reddit:
          subreddit: "MachineLearning"
          sort: "hot"
    
      meta:
        language: "en"
        updated: "2025-10-15"
    
      notes: "Active ML community discussions"

    Validation

    Schema Validation

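    # Validate with Python through uv
    # (requires the pyyaml and jsonschema packages in the environment)
    uv run python -c "
    import json, yaml
    from jsonschema import validate
    
    with open('data/feeds.yaml', encoding='utf-8') as f:
        feeds = yaml.safe_load(f)
    with open('data/feeds.schema.json', encoding='utf-8') as f:
        schema = json.load(f)
    
    # Raises jsonschema.ValidationError if the document violates the schema
    validate(instance=feeds, schema=schema)
    print('✅ Valid')
    "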

    Common Validation Errors

    Error: "Additional properties are not allowed"

    You've included a field not in the schema. Check spelling and nesting.

    Error: "'topics' is a required property"

    Every feed must have at least one topic.

    Error: "Pattern mismatch for 'id'"

    Feed IDs must be lowercase alphanumeric with hyphens/underscores/dots only.

    Error: "Maximum items exceeded for 'topics'"

    Use at most 6 topics per feed.

    Best Practices

    Choosing Feed vs Site

    Use feed when:

    • You know the exact feed URL
    • The feed is stable and unlikely to change
    • You want maximum reliability

    Use site + discover when:

    • Feed URL is unknown
    • Site may have multiple feeds
    • You want automatic feed updates

    Topic Selection

    1. Be Specific - Choose the most relevant topics
    2. Limit Count - 1-3 topics are usually sufficient
    3. Use Weights - Add topic_weights for fine-tuning
    4. Check Canonical List - Ensure topics exist in data/topics.yaml

    Quality Guidelines

    High-Quality Entries:

    • ✅ Accurate, verified feed URLs
    • ✅ Descriptive titles
    • ✅ Relevant topics with weights
    • ✅ Complete metadata
    • ✅ Curation status set
    • ✅ Notes explaining value

    Avoid:

    • ❌ Generic titles
    • ❌ Too many topics
    • ❌ Unverified feeds
    • ❌ Missing contributor info