llms-full.txt Format

The /llms-full.txt endpoint provides a comprehensive, structured format optimized for AI agents and RAG systems.

Overview

The enhanced format includes:

Metadata header with generation info
Table of contents for navigation
Structured page sections with clear separators
Individual metadata for each page
AI-friendly formatting for easy parsing

The format is both human-readable and machine-parsable, which suits retrieval pipelines, embedding generation, and automated analysis.

Format Structure

The document follows this hierarchical structure:

================================================================================
HEADER SECTION
================================================================================
├── Metadata (date, page count, base URL)
├── Description
├── Structure explanation
└── Table of Contents

================================================================================
DOCUMENTATION CONTENT
================================================================================
├── PAGE 1
│   ├── Page metadata (title, URL, description, path)
│   ├── Content separator
│   ├── Full markdown content
│   └── End marker
├── PAGE 2
│   └── ...
└── PAGE N

================================================================================
FOOTER SECTION
================================================================================
└── Summary and access information
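The three-part layout above can be separated with two string searches. The following is a minimal sketch, assuming the content banner reads `DOCUMENTATION CONTENT` and the footer banner reads `END OF DOCUMENTATION`, each framed by 80-character `=` lines as in the examples on this page. `split_sections` is a hypothetical helper name, not part of the project.

```python
SEP = "=" * 80

def split_sections(doc: str) -> dict:
    """Split an llms-full.txt document into header, content, and footer text."""
    content_banner = f"{SEP}\nDOCUMENTATION CONTENT\n{SEP}"
    footer_banner = f"{SEP}\nEND OF DOCUMENTATION\n{SEP}"

    # First content banner and last footer banner, in case a page body
    # happens to quote one of them (assumption about the layout)
    content_at = doc.find(content_banner)
    footer_at = doc.rfind(footer_banner)
    if content_at == -1 or footer_at == -1 or footer_at < content_at:
        raise ValueError("document does not follow the expected layout")

    return {
        "header": doc[:content_at].strip(),
        "content": doc[content_at + len(content_banner):footer_at].strip(),
        "footer": doc[footer_at:].strip(),
    }
```

Splitting first keeps later steps simple: metadata and the table of contents are read from `header` only, and page parsing runs on `content` only.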

Header Section

Metadata Block

Essential information about the documentation:

================================================================================
AI WEB FEEDS - COMPLETE DOCUMENTATION
================================================================================

METADATA
--------------------------------------------------------------------------------
Generated: 2025-10-14T12:00:00.000Z
Total Pages: 5
Base URL: https://yourdomain.com
Format: Markdown
Encoding: UTF-8
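
The metadata block is a run of `Key: Value` lines, so it can be read without a full parser. A minimal sketch, assuming the block starts with `METADATA` followed by an 80-character dashed line and ends at the first blank line; `parse_metadata_block` is a hypothetical helper name.

```python
def parse_metadata_block(doc: str) -> dict:
    """Read the 'Key: Value' lines that follow the METADATA heading."""
    marker = "METADATA\n" + "-" * 80 + "\n"
    start = doc.find(marker)
    if start == -1:
        return {}

    fields = {}
    for line in doc[start + len(marker):].split("\n"):
        if not line.strip():
            break  # assumption: a blank line ends the block
        # Split on the first colon only, so URLs and timestamps stay intact
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

Values are returned as strings; callers can convert `Total Pages` to an integer where needed.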

Description Block

Project overview for context:

DESCRIPTION
--------------------------------------------------------------------------------
A comprehensive collection of curated RSS/Atom feeds optimized for AI agents
and large language models. This document contains the complete documentation
for the AI Web Feeds project, including setup guides, API references, and
usage examples.

Structure Explanation

Format guide for parsers:

STRUCTURE
--------------------------------------------------------------------------------
Each page section follows this format:
  - Page separator (===)
  - Page number (X OF Y)
  - Page metadata (title, URL, description, path)
  - Content separator (---)
  - Full markdown content

Table of Contents

Complete navigation index:

NAVIGATION
--------------------------------------------------------------------------------
Table of Contents:

  1. Getting Started - /docs
  2. PDF Export - /docs/features/pdf-export
  3. AI Integration - /docs/features/ai-integration
  4. Testing Guide - /docs/guides/testing
  5. Quick Reference - /docs/guides/quick-reference

================================================================================
DOCUMENTATION CONTENT
================================================================================
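
Each table of contents entry has the shape `N. Title - /path`, which maps onto a single regular expression. A minimal sketch, assuming entries keep that shape and the list ends at the next `=` separator; `parse_toc` is a hypothetical helper name.

```python
import re

def parse_toc(doc: str) -> list:
    """Return (number, title, path) tuples from the Table of Contents."""
    start = doc.find("Table of Contents:")
    if start == -1:
        return []
    end = doc.find("=" * 80, start)
    block = doc[start:end] if end != -1 else doc[start:]

    # "  1. Getting Started - /docs" -> (1, "Getting Started", "/docs")
    entry = re.compile(
        r"^[ \t]*(\d+)\.[ \t]+(.+?)[ \t]+-[ \t]+(/\S*)[ \t]*$",
        re.MULTILINE,
    )
    return [(int(n), title, path) for n, title, path in entry.findall(block)]
```

The parsed paths can be joined with the `Base URL` metadata field to fetch individual pages.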

Page Section Format

Each page follows a consistent structure:

================================================================================
PAGE 1 OF 5
================================================================================

TITLE: Getting Started
URL: https://yourdomain.com/docs
MARKDOWN: https://yourdomain.com/docs.mdx
DESCRIPTION: Quick start guide for AI Web Feeds
PATH: /

--------------------------------------------------------------------------------
CONTENT
--------------------------------------------------------------------------------

# Getting Started

[Full markdown content of the page...]

--------------------------------------------------------------------------------
END OF PAGE 1
--------------------------------------------------------------------------------

Page Metadata Fields

Field	Description	Example
`TITLE`	Page title	`Getting Started`
`URL`	Full page URL	`https://yourdomain.com/docs`
`MARKDOWN`	Markdown endpoint	`https://yourdomain.com/docs.mdx`
`DESCRIPTION`	Page description	`Quick start guide...`
`PATH`	Relative path	`/`

Footer Section

Summary and access instructions:

================================================================================
END OF DOCUMENTATION
================================================================================

Total pages processed: 5
Generated: 2025-10-14T12:00:00.000Z
Format: Plain text with markdown content

For individual pages, append .mdx to any documentation URL.
For the discovery file, visit /llms.txt

================================================================================
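
The footer's rule for individual pages (append `.mdx` to a documentation URL) can be wrapped in a small helper. A minimal sketch: `markdown_url` is a hypothetical name, and stripping a trailing slash before appending the suffix is an assumption, not documented behavior.

```python
from urllib.parse import urlsplit, urlunsplit

def markdown_url(page_url: str) -> str:
    """Build the Markdown endpoint for a documentation page URL."""
    parts = urlsplit(page_url)
    # Assumption: "/docs/" and "/docs" refer to the same page
    path = parts.path.rstrip("/")
    if not path.endswith(".mdx"):
        path += ".mdx"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

For the example page above, `markdown_url("https://yourdomain.com/docs")` produces the same value as its `MARKDOWN` field.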

Benefits for AI Agents

Clear Structure

Consistent separators - 80-character wide = and - lines
Numbered pages - PAGE X OF Y format
Hierarchical organization - Header → Content → Footer
Predictable format - Easy to parse with regex

Rich Metadata

Generation timestamp - Know when docs were created
Total page count - Plan context window usage
Base URL - Resolve relative links
Per-page metadata - Title, URL, description, path

Multiple Access Patterns

Complete documentation - Single request for all content
Table of contents - Quick overview of structure
Individual pages - URLs for targeted access
Markdown endpoints - Source content links

Parser-Friendly

Fixed-width separators - 80 characters for consistency
Clear section markers - Unmistakable boundaries
Predictable structure - Same format every time
UTF-8 encoding - Universal character support

HTTP Headers

Enhanced response headers provide additional metadata:

Content-Type: text/plain; charset=utf-8
Cache-Control: public, max-age=0, must-revalidate
X-Content-Pages: 5
X-Generated-Date: 2025-10-14T12:00:00.000Z

Custom headers allow clients to access metadata without parsing the document body.
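
The two custom headers can be converted to typed values before use. A minimal sketch, assuming the header names shown above and an ISO 8601 timestamp with a trailing `Z`; `read_doc_headers` is a hypothetical helper that accepts any mapping with a `.get()` method.

```python
from datetime import datetime

def read_doc_headers(headers) -> dict:
    """Pull page count and generation time from the custom response headers."""
    pages = headers.get("X-Content-Pages")
    generated = headers.get("X-Generated-Date")
    return {
        "pages": int(pages) if pages is not None else None,
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so it is rewritten as an explicit UTC offset first
        "generated": (
            datetime.fromisoformat(generated.replace("Z", "+00:00"))
            if generated else None
        ),
    }
```

With the `requests` library, passing `requests.head(url).headers` to this function reads the metadata without downloading the document body.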

Usage Examples

RAG System Integration

import requests

# Fetch complete documentation
response = requests.get('https://yourdomain.com/llms-full.txt')
content = response.text

# Parse metadata from headers
total_pages = int(response.headers['X-Content-Pages'])
generated = response.headers['X-Generated-Date']

# Split by page separators
separator = '=' * 80 + '\nPAGE '
pages = content.split(separator)

# Extract table of contents
toc_start = content.find('Table of Contents:')
toc_end = content.find('=' * 80 + '\nDOCUMENTATION CONTENT')
toc = content[toc_start:toc_end]

# Process individual pages
for i, page in enumerate(pages[1:], 1):
    if 'TITLE:' in page:
        # Extract page metadata
        title = page.split('TITLE: ')[1].split('\n')[0]
        url = page.split('URL: ')[1].split('\n')[0]

        # Extract content, skipping past the CONTENT marker itself
        marker = 'CONTENT\n' + '-' * 80 + '\n\n'
        content_start = page.find(marker) + len(marker)
        content_end = page.find('\n\n' + '-' * 80 + '\nEND OF PAGE', content_start)
        page_content = page[content_start:content_end]

        print(f"Page {i}: {title}")

// Fetch complete documentation
const response = await fetch('https://yourdomain.com/llms-full.txt');
const content = await response.text();

// Parse metadata from headers
const totalPages = parseInt(response.headers.get('X-Content-Pages'), 10);
const generated = response.headers.get('X-Generated-Date');

// Split by page separators
const separator = '='.repeat(80) + '\nPAGE ';
const pages = content.split(separator);

// Extract table of contents
const tocStart = content.indexOf('Table of Contents:');
const tocEnd = content.indexOf('='.repeat(80) + '\nDOCUMENTATION CONTENT');
const toc = content.substring(tocStart, tocEnd);

// Process individual pages
pages.slice(1).forEach((page, index) => {
  if (page.includes('TITLE:')) {
    // Extract page metadata
    const title = page.split('TITLE: ')[1].split('\n')[0];
    const url = page.split('URL: ')[1].split('\n')[0];

    // Extract content, skipping past the CONTENT marker itself
    const marker = 'CONTENT\n' + '-'.repeat(80) + '\n\n';
    const contentStart = page.indexOf(marker) + marker.length;
    const contentEnd = page.indexOf('\n\n' + '-'.repeat(80) + '\nEND OF PAGE', contentStart);
    const pageContent = page.substring(contentStart, contentEnd);

    console.log(`Page ${index + 1}: ${title}`);
  }
});

# Download complete documentation
curl https://yourdomain.com/llms-full.txt -o docs.txt

# View headers
curl -I https://yourdomain.com/llms-full.txt

# Extract table of contents
curl https://yourdomain.com/llms-full.txt | \
  sed -n '/Table of Contents:/,/^===/p'

# Count pages
curl https://yourdomain.com/llms-full.txt | \
  grep -c "^PAGE [0-9]"

# Extract first page
curl https://yourdomain.com/llms-full.txt | \
  sed -n '/^PAGE 1 OF/,/^END OF PAGE 1/p'

Parsing Tips

Regular Expressions

import re

# Extract page numbers
page_pattern = r'PAGE (\d+) OF (\d+)'
matches = re.findall(page_pattern, content)

# Extract metadata fields
title_pattern = r'TITLE: (.+)'
url_pattern = r'URL: (.+)'
desc_pattern = r'DESCRIPTION: (.+)'

# Split by separators
separator_80 = r'={80}'
separator_dash = r'-{80}'

Content Extraction

import re

def extract_pages(content: str) -> list:
    """Extract individual pages from llms-full.txt"""
    pages = []

    # Match each page header, then capture the body up to the next page or EOF
    page_pattern = r'={80}\nPAGE (\d+) OF (\d+)\n={80}\n(.+?)(?=={80}\nPAGE |\Z)'

    for match in re.finditer(page_pattern, content, re.DOTALL):
        page_num, total, page_content = match.groups()

        # Extract metadata from the lines above the first dashed separator
        metadata = {}
        header = re.split(r'-{80}', page_content, maxsplit=1)[0]
        for line in header.split('\n'):
            key, sep, value = line.partition(':')
            if sep and key.isupper():
                metadata[key.strip()] = value.strip()

        # Extract content between the CONTENT marker and the END OF PAGE marker
        content_match = re.search(
            r'CONTENT\n-{80}\n\n(.+?)\n\n-{80}\nEND OF PAGE',
            page_content,
            re.DOTALL
        )

        if content_match:
            pages.append({
                'page_number': int(page_num),
                'total_pages': int(total),
                'metadata': metadata,
                'content': content_match.group(1).strip()
            })

    return pages

Token Counting

def count_tokens_per_page(content: str) -> dict:
    """Estimate token count for each page"""
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    pages = extract_pages(content)

    token_counts = {}
    for page in pages:
        page_content = page['content']
        tokens = len(enc.encode(page_content))
        token_counts[page['metadata']['TITLE']] = tokens

    return token_counts
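
Per-page token counts can then be used to plan context window usage. A minimal sketch of greedy batching, assuming a title-to-token-count mapping like the one returned by `count_tokens_per_page`; `batch_pages_by_tokens` and `max_tokens` are hypothetical names.

```python
def batch_pages_by_tokens(token_counts: dict, max_tokens: int) -> list:
    """Group page titles into batches whose token totals fit max_tokens."""
    batches, current, used = [], [], 0
    for title, tokens in token_counts.items():
        # Start a new batch when adding this page would exceed the budget
        if current and used + tokens > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(title)  # an oversized page still gets its own batch
        used += tokens
    if current:
        batches.append(current)
    return batches
```

Pages keep their document order, so each batch is a contiguous run of pages. A page larger than the budget is not split here and would need further chunking.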

Comparison with Previous Format

Before Enhancement

# Page Title (url)

Content...

# Another Page (url)

Content...

Limitations:

No metadata header
No table of contents
Basic separators
No page numbers
No HTTP headers

After Enhancement

================================================================================
HEADER WITH METADATA
================================================================================
...
Table of Contents: [all pages]
================================================================================
PAGE 1 OF 5
================================================================================
TITLE: ...
URL: ...
MARKDOWN: ...
...

Improvements:

✅ Rich metadata header
✅ Complete table of contents
✅ 80-character separators
✅ Page numbers (X OF Y)
✅ Custom HTTP headers
✅ Structured format

Best Practices

For RAG Systems

Parse metadata first - Get page count and base URL
Use table of contents - Quick overview of structure
Extract pages individually - Process one at a time
Respect token limits - Use the page count and per-page token counts to budget context
Cache the response - Revalidate periodically

For Embeddings

Chunk by pages - Natural boundaries
Include metadata - Title, URL, description in embeddings
Cross-reference - Use URLs for linking
Update regularly - Check X-Generated-Date header
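
The first three points can be combined into one record per page. A minimal sketch, assuming pages in the dictionary shape returned by `extract_pages` above; `build_embedding_records` and the record keys (`id`, `text`, `metadata`) are hypothetical and should be adapted to the vector store in use.

```python
def build_embedding_records(pages: list) -> list:
    """Turn parsed pages into one text-plus-metadata record per page."""
    records = []
    for page in pages:
        meta = page.get("metadata", {})
        title = meta.get("TITLE", "")
        description = meta.get("DESCRIPTION", "")  # optional field
        # Prefix the body with title and description so they are embedded too
        preamble = "\n".join(part for part in (title, description) if part)
        records.append({
            # The page URL doubles as a stable identifier for cross-referencing
            "id": meta.get("URL") or f"page-{page.get('page_number')}",
            "text": f"{preamble}\n\n{page['content']}".strip(),
            "metadata": {
                "title": title,
                "url": meta.get("URL"),
                "path": meta.get("PATH"),
            },
        })
    return records
```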

For Analysis

Validate structure - Check separator consistency
Handle missing fields - Descriptions are optional and may be absent
Use HTTP headers - Metadata without parsing
Test parsing - Verify on sample data first
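
A structure check can run before any parsing. A minimal sketch, assuming separators are exactly 80 characters and pages are numbered `PAGE X OF Y` from 1; `validate_structure` is a hypothetical helper, and treating any run of 10 or more `=` or `-` characters as an intended separator is a heuristic.

```python
import re

def validate_structure(doc: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Heuristic: a line made only of 10+ '=' or '-' is meant to be a separator
    for lineno, line in enumerate(doc.split("\n"), 1):
        if re.fullmatch(r"={10,}|-{10,}", line) and len(line) != 80:
            problems.append(f"line {lineno}: separator is {len(line)} chars, expected 80")

    # Page headers should count 1..N and agree on the total
    headers = [
        (int(n), int(t))
        for n, t in re.findall(r"^PAGE (\d+) OF (\d+)$", doc, re.MULTILINE)
    ]
    totals = {t for _, t in headers}
    if len(totals) > 1:
        problems.append(f"inconsistent totals: {sorted(totals)}")
    if [n for n, _ in headers] != list(range(1, len(headers) + 1)):
        problems.append("page numbers are not sequential from 1")
    if totals and len(headers) != max(totals):
        problems.append(f"found {len(headers)} pages, header says {max(totals)}")
    return problems
```

Page bodies that contain long dashed lines of their own (for example, Markdown setext headings) can trigger false positives, so the result is best treated as a warning list.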

Testing

Verify Format

# Download and inspect
curl https://yourdomain.com/llms-full.txt > docs.txt

# Check header
head -50 docs.txt

# Count separators (should be consistent)
grep -c "^====" docs.txt
grep -c "^----" docs.txt

# Verify page numbers
grep "^PAGE [0-9]" docs.txt

Validate Headers

# Check custom headers (case-insensitive: HTTP/2 responses list header names in lowercase)
curl -sI https://yourdomain.com/llms-full.txt | grep -i "^x-"

# Expected output:
# X-Content-Pages: 5
# X-Generated-Date: 2025-10-14T12:00:00.000Z

Related Documentation

AI Integration - Complete AI/LLM guide
Testing Guide - Verify your setup
Quick Reference - Commands and endpoints