Search Architecture
How the web app splits source search and post search across the checked-in catalog and generated article library.
Source: apps/web/content/docs/development/search-architecture.mdx
The public web app uses two local data sources for search:
- the checked-in catalog for source discovery
- the generated article library for recent-post browsing and post search
Public Search Surfaces
| Route | Backing data | Purpose |
|---|---|---|
| /api/articles | data/articles.generated.json | Filtered post browsing with cursor pagination |
| /api/search?scope=articles | data/articles.generated.json | Ranked post search |
| /api/search?scope=sources | Catalog files under data/ | Source discovery |
| /api/search/autocomplete | Catalog plus generated article library | Shared suggestions |
| POST /api/search | Optional backend proxy | Analytics logging only |
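As a rough illustration of the cursor pagination listed for /api/articles, the sketch below pages through an in-memory array of rows. The cursor format (a stringified offset) and the names `Page` and `paginate` are assumptions made for this example; the real route may encode its cursor differently.

```typescript
// Hypothetical page shape: a slice of rows plus an opaque cursor for the next call.
interface Page<T> {
  items: T[];
  nextCursor: string | null;
}

// Assumption: the cursor is simply the offset of the next row, as a string.
function paginate<T>(rows: T[], cursor: string | null, limit: number): Page<T> {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error("limit must be a positive integer");
  }
  const start = cursor === null ? 0 : Number.parseInt(cursor, 10);
  if (!Number.isInteger(start) || start < 0) {
    throw new Error("invalid cursor");
  }
  const items = rows.slice(start, start + limit);
  const next = start + items.length;
  // A null cursor tells the caller that there are no further pages.
  return { items, nextCursor: next < rows.length ? String(next) : null };
}
```

A caller would pass `null` for the first page and then feed each returned `nextCursor` back in until it comes back as `null`.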
Why the Split Exists
| Route | Backing data | Purpose |
|---|---|---|
| /api/articles | data/articles.generated.json | filtered article browse with cursor pagination |
| /api/search?scope=articles | data/articles.generated.json | ranked article search across the exported corpus |
| /api/search?scope=sources | checked-in catalog files | source discovery and filtering |
| /api/search/autocomplete | shared catalog + article corpus index | feed, article, and topic suggestions |
| POST /api/search | optional backend proxy | analytics logging only |
Data Flow
Shared article corpus behavior
apps/web/lib/article-corpus.ts is the shared server-only adapter for:
- loading data/articles.generated.json
- normalizing article filters (feed, topics, source_type, verified)
- applying browse sort order (latest, oldest, source)
- serving article search and autocomplete from the same normalized article rows
That means article browse, article search, and article autocomplete suggestions now share one authoritative dataset instead of mixing bounded live fetches and catalog-only matching.
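The filter and sort behavior described above might look roughly like the following sketch. It is not the code in apps/web/lib/article-corpus.ts: the row shape, the `published` field, the any-topic matching rule, and the meaning of the `source` sort (feed name, then newest first) are all assumptions chosen for illustration.

```typescript
// Hypothetical row shape; the real fields in data/articles.generated.json may differ.
interface ArticleRow {
  title: string;
  feed: string;
  topics: string[];
  source_type: string;
  verified: boolean;
  published: string; // assumed ISO 8601 timestamp
}

// The four filters named in the list above; all optional.
interface ArticleFilters {
  feed?: string;
  topics?: string[];
  source_type?: string;
  verified?: boolean;
}

type SortOrder = "latest" | "oldest" | "source";

function matchesFilters(row: ArticleRow, f: ArticleFilters): boolean {
  if (f.feed !== undefined && row.feed !== f.feed) return false;
  if (f.source_type !== undefined && row.source_type !== f.source_type) return false;
  if (f.verified !== undefined && row.verified !== f.verified) return false;
  // Assumption: a row matches when it carries at least one requested topic.
  if (f.topics && f.topics.length > 0 && !f.topics.some((t) => row.topics.includes(t))) {
    return false;
  }
  return true;
}

function browseArticles(
  rows: ArticleRow[],
  filters: ArticleFilters,
  sort: SortOrder = "latest",
): ArticleRow[] {
  const byOldest = (a: ArticleRow, b: ArticleRow) =>
    Date.parse(a.published) - Date.parse(b.published);
  // filter() returns a new array, so sorting in place leaves the input untouched.
  const filtered = rows.filter((row) => matchesFilters(row, filters));
  switch (sort) {
    case "oldest":
      return filtered.sort(byOldest);
    case "source":
      // Assumption: group by feed name, newest first within each feed.
      return filtered.sort((a, b) => a.feed.localeCompare(b.feed) || byOldest(b, a));
    default:
      return filtered.sort((a, b) => byOldest(b, a));
  }
}
```

Because browse, search, and autocomplete would all read the same normalized rows, a filter fix in one place applies to every article surface.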
Source catalog behavior
Source search still uses the repository catalog because feed discovery is an authored-data concern.
The web loader prefers these files in order:
1. data/feeds.enriched.yaml
2. data/feeds.yaml
3. data/feeds.json
This keeps source discovery available even when the runtime database or corpus has not been refreshed yet.
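The preference order can be sketched as a first-match lookup over the three candidate paths. The function name `resolveCatalogPath` and the injected `exists` predicate are hypothetical, introduced only to keep the example self-contained; the actual web loader may be structured differently.

```typescript
// Candidate catalog files, in the preference order stated above.
const CATALOG_CANDIDATES = [
  "data/feeds.enriched.yaml",
  "data/feeds.yaml",
  "data/feeds.json",
] as const;

// Returns the first candidate that exists under repoRoot, or null if none do.
// `exists` is injected so the sketch runs without touching the file system.
function resolveCatalogPath(
  repoRoot: string,
  exists: (path: string) => boolean,
): string | null {
  for (const relative of CATALOG_CANDIDATES) {
    const full = `${repoRoot}/${relative}`;
    if (exists(full)) return full;
  }
  return null;
}
```

In a Node.js server context, one option would be to pass `existsSync` from `node:fs` as the predicate.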
Operational distinction
If source search looks correct but article browse/search is empty, check the generated corpus path before changing catalog files. The reader workspace depends on data/articles.generated.json, not on direct SQLite access.
Runtime Search Features
The Python runtime still has its own search stack for CLI and backend work, including SQLite FTS, autocomplete indexes, and optional embedding-based search. Those features remain useful operationally, but they are not required for the public / experience.
| Layer | Implementation | Purpose |
|---|---|---|
| Full-text | SQLite FTS5 virtual table + triggers | source search over stored runtime data |
| Autocomplete | trie index built from titles and topics | fast prefix suggestions |
| Semantic | embeddings + cosine similarity | meaning-based retrieval |
Those runtime capabilities are still useful for CLI workflows, but they are no
longer a prerequisite for the public / experience.
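To make the semantic layer in the table above concrete, here is a minimal sketch of ranking by cosine similarity. It is written in TypeScript to match the web app's code rather than the Python runtime, and the names `cosineSimilarity`, `rankBySimilarity`, and `EmbeddedDoc` are hypothetical; the runtime's own implementation may differ.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), defined as 0 when either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical document record: an id plus a precomputed embedding.
interface EmbeddedDoc {
  id: string;
  embedding: number[];
}

// Returns document ids ordered from most to least similar to the query embedding.
function rankBySimilarity(query: number[], docs: EmbeddedDoc[]): string[] {
  return docs
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .map((scored) => scored.id);
}
```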
Search operations checklist
```bash
uv run ai-web-feeds corpus refresh
uv run ai-web-feeds corpus export
uv run ai-web-feeds search init
uv run ai-web-feeds search query "llm agents" --type full_text --limit 10
```