Topic Modeling

Topic Modeling automatically discovers subtopics within parent topics using Latent Dirichlet Allocation (LDA) and tracks topic evolution over time.

Overview

The topic modeler:

Discovers subtopics using LDA clustering
Tracks topic evolution (splits, merges, emergence, decline)
Enables manual curation of discovered subtopics
Computes topic coherence scores for quality assessment

Architecture

LDA Topic Modeling

Algorithm

Latent Dirichlet Allocation (LDA) discovers latent topics in document collections:

Preprocessing: Tokenize, remove stopwords, apply TF-IDF
Model Training: Learn topic distributions using Gensim LDA
Topic Extraction: Extract keywords and descriptions
Coherence Scoring: Validate topic quality using C_v coherence
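
The preprocessing step can be illustrated with a small standard-library sketch. This is not the project's actual pipeline: the stopword list, the `tokenize` and `tfidf` helpers, and the sample articles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic runs, drop stopwords and tokens of 1-2 letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by relative term frequency times log inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical sample articles
articles = [
    "Attention is the core of the transformer architecture",
    "The transformer uses attention for translation",
    "Reinforcement learning with human feedback",
]
tokenized = [tokenize(a) for a in articles]
weights = tfidf(tokenized)
```

With this weighting, a term that appears in every document gets a weight of zero, so it can be filtered out before training.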

Model Parameters

lda_config = {
    "num_topics": 10,              # Number of subtopics per parent
    "passes": 10,                  # Full passes over the corpus during training
    "iterations": 400,             # Maximum inference iterations per document
    "alpha": "auto",               # Document-topic prior, learned from the data
    "eta": "auto",                 # Topic-word prior, learned from the data
    "minimum_probability": 0.01,   # Topics below this probability are filtered out
}
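
The keys above match keyword arguments of Gensim's `LdaModel`, so the configuration can be passed through directly. The sketch below is an assumption about how that wiring might look: `build_lda_kwargs` and `train_lda` are hypothetical helper names, not part of ai-web-feeds.

```python
DEFAULT_LDA_CONFIG = {
    "num_topics": 10,
    "passes": 10,
    "iterations": 400,
    "alpha": "auto",
    "eta": "auto",
    "minimum_probability": 0.01,
}


def build_lda_kwargs(overrides=None):
    """Return a copy of the defaults with overrides applied, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_LDA_CONFIG)
    if unknown:
        raise ValueError(f"Unknown LDA parameters: {sorted(unknown)}")
    return {**DEFAULT_LDA_CONFIG, **overrides}


def train_lda(bow_corpus, dictionary, overrides=None):
    """Train a Gensim LdaModel on a bag-of-words corpus.

    gensim is imported lazily so build_lda_kwargs stays usable without it.
    """
    from gensim.models import LdaModel

    return LdaModel(corpus=bow_corpus, id2word=dictionary, **build_lda_kwargs(overrides))


# Example: request fewer subtopics while keeping the other defaults
kwargs = build_lda_kwargs({"num_topics": 5})
```

One design note: `alpha="auto"` is accepted by `LdaModel` but not by the multicore variant `LdaMulticore`, so switching to the parallel trainer would likely require a fixed prior such as `"symmetric"`.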

Usage

CLI Commands

Run Topic Modeling

ai-web-feeds nlp topics

Options:

--parent-topic: Parent topic to model (default: all)
--topic: Topic ID to model (default: all)
--min-articles: Minimum articles required (default: 10)
--force: Reprocess existing topic-modeling records

# Discover subtopics in NLP with minimum 50 articles
ai-web-feeds nlp topics --topic "nlp" --min-articles 50

Review Processing Stats

ai-web-feeds nlp stats

Interactive Workflow:

Unapproved Subtopics (3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[1] NLP > Transformer Architectures
    Keywords: transformer, attention, bert, gpt, architecture
    Articles: 45
    Coherence: 0.68

    Actions: [a]pprove, [r]ename, [d]elete, [s]kip

> a

✓ Approved: Transformer Architectures

Search Topic Mentions

# Search articles and source metadata by topic language
ai-web-feeds search query "AI Safety"

Reprocess Topic Modeling

ai-web-feeds nlp topics --force --min-articles 20

Output:

Topic Evolution Events (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Date       Event      Source Topic       Target Topics
2023-10-15 split      Transformers       [BERT-variants, GPT-variants]
2023-10-22 emergence  -                  [Constitutional AI]
2023-10-28 merge      [RLHF, HHH]        Alignment Techniques

Python API

from ai_web_feeds.nlp import TopicModeler
from ai_web_feeds.storage import Storage

modeler = TopicModeler()
storage = Storage()

# Get articles for parent topic
articles = storage.get_articles_by_topic("NLP", limit=1000)

# Train LDA model
subtopics = modeler.discover_subtopics(
    topic="NLP",
    articles=articles,
    num_topics=10
)

# subtopics = [
#     {
#         "name": "Transformer Architectures",
#         "keywords": ["transformer", "attention", "bert", "gpt"],
#         "description": "Articles about transformer models...",
#         "article_count": 45,
#         "coherence": 0.68
#     },
#     ...
# ]

# Store subtopics
for subtopic_data in subtopics:
    storage.create_subtopic(
        topic="NLP",
        name=subtopic_data["name"],
        keywords=subtopic_data["keywords"],
        description=subtopic_data["description"],
        article_count=subtopic_data["article_count"]
    )

Batch Processing

Topic modeling runs monthly (1st of month, 3 AM):

from ai_web_feeds.nlp.scheduler import NLPScheduler

# `scheduler` is the application's existing scheduler instance, created elsewhere
nlp_scheduler = NLPScheduler(scheduler)
nlp_scheduler.register_jobs()
# Registers: Topic modeling job (monthly)

Database Schema

subtopics Table

CREATE TABLE subtopics (
    id TEXT PRIMARY KEY,  -- UUID
    parent_topic TEXT NOT NULL,
    name TEXT NOT NULL,
    keywords TEXT NOT NULL,  -- JSON array
    description TEXT,
    article_count INTEGER DEFAULT 0,
    detected_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    approved BOOLEAN DEFAULT FALSE,
    created_by TEXT DEFAULT 'system',
    UNIQUE(parent_topic, name)
);
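
To show how the JSON `keywords` column and the `UNIQUE(parent_topic, name)` constraint behave, here is a minimal sketch against an in-memory SQLite database. `insert_subtopic` is a hypothetical helper written for this example, not the project's storage API.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE subtopics (
    id TEXT PRIMARY KEY,
    parent_topic TEXT NOT NULL,
    name TEXT NOT NULL,
    keywords TEXT NOT NULL,
    description TEXT,
    article_count INTEGER DEFAULT 0,
    detected_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    approved BOOLEAN DEFAULT FALSE,
    created_by TEXT DEFAULT 'system',
    UNIQUE(parent_topic, name)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)


def insert_subtopic(conn, parent_topic, name, keywords, description=None, article_count=0):
    """Insert one subtopic, serializing keywords as a JSON array; returns the new UUID."""
    subtopic_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO subtopics (id, parent_topic, name, keywords, description, article_count) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (subtopic_id, parent_topic, name, json.dumps(keywords), description, article_count),
    )
    return subtopic_id


insert_subtopic(conn, "NLP", "Transformer Architectures",
                ["transformer", "attention", "bert", "gpt"], article_count=45)

# A second subtopic with the same parent and name violates the UNIQUE constraint
try:
    insert_subtopic(conn, "NLP", "Transformer Architectures", ["duplicate"])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row = conn.execute("SELECT keywords, approved, created_by FROM subtopics").fetchone()
keywords = json.loads(row[0])
```

New rows start unapproved and attributed to `system`, which matches the manual curation flow where a reviewer approves each discovered subtopic.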

topic_evolution_events Table

CREATE TABLE topic_evolution_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL CHECK(event_type IN ('split', 'merge', 'emergence', 'decline')),
    source_topic TEXT,
    target_topics TEXT,  -- JSON array
    article_count INTEGER NOT NULL,
    growth_rate REAL,
    detected_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
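
The following sketch records evolution events and reads back those from the last 30 days, again using an in-memory SQLite database. `record_event` is a hypothetical helper, and the article counts and growth rate in the sample rows are made-up values.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE topic_evolution_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL CHECK(event_type IN ('split', 'merge', 'emergence', 'decline')),
    source_topic TEXT,
    target_topics TEXT,
    article_count INTEGER NOT NULL,
    growth_rate REAL,
    detected_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
""")


def record_event(conn, event_type, source_topic, target_topics, article_count, growth_rate=None):
    """Insert one event, storing target_topics as a JSON array; returns the row id."""
    cur = conn.execute(
        "INSERT INTO topic_evolution_events "
        "(event_type, source_topic, target_topics, article_count, growth_rate) "
        "VALUES (?, ?, ?, ?, ?)",
        (event_type, source_topic, json.dumps(target_topics), article_count, growth_rate),
    )
    return cur.lastrowid


record_event(conn, "split", "Transformers", ["BERT-variants", "GPT-variants"], 120)
record_event(conn, "emergence", None, ["Constitutional AI"], 50, growth_rate=1.5)

# Events detected within the last 30 days, oldest first
recent = conn.execute(
    "SELECT event_type, source_topic, target_topics FROM topic_evolution_events "
    "WHERE detected_at >= datetime('now', '-30 days') ORDER BY id"
).fetchall()

# The CHECK constraint rejects event types outside the four documented ones
try:
    record_event(conn, "rename", "A", ["B"], 1)
    invalid_rejected = False
except sqlite3.IntegrityError:
    invalid_rejected = True
```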

Topic Evolution Detection

Evolution Types

Split: One topic divides into multiple subtopics

Transformers → [BERT-variants, GPT-variants, ViT]

Merge: Multiple subtopics combine into one

[Supervised Learning, Unsupervised Learning] → Machine Learning Fundamentals

Emergence: New topic appears (growth rate > 100%)

- → Constitutional AI (50 articles in 1 month)

Decline: Topic activity decreases (growth rate < -50%)

GANs → (declining mention frequency)
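
The growth-rate thresholds can be made concrete with a short sketch. The formula here, relative change in article count between two periods, is an assumption: the document does not define how growth rate is computed. `growth_rate` and `classify_trend` are hypothetical names.

```python
def growth_rate(previous_count: int, current_count: int) -> float:
    """Relative change in article count; 1.0 means +100%, -0.5 means -50%."""
    if previous_count == 0:
        # A topic with no prior articles is treated as unbounded growth
        return float("inf") if current_count > 0 else 0.0
    return (current_count - previous_count) / previous_count


def classify_trend(previous_count: int, current_count: int) -> str:
    """Apply the documented thresholds: > 100% is emergence, < -50% is decline."""
    rate = growth_rate(previous_count, current_count)
    if rate > 1.0:
        return "emergence"
    if rate < -0.5:
        return "decline"
    return "stable"
```

For example, a topic going from 20 to 50 articles has a growth rate of 1.5 and crosses the emergence threshold, while one dropping from 100 to 40 has a rate of -0.6 and counts as declining. Growth rate alone is only half of the emergence test, since the topic must also have no similar predecessor.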

Detection Algorithm

from __future__ import annotations

from typing import List

# Subtopic, find_similar_topics, is_similar, and compute_growth_rate
# are project helpers defined elsewhere.


def detect_subtopic_set_changes(
    current_topics: List[Subtopic],
    previous_topics: List[Subtopic]
) -> List[dict]:
    """Compare the current month's topics with the previous month's.

    This excerpt covers split and emergence detection.
    """

    events = []

    # Split: one previous topic now resembles two or more current topics
    for prev_topic in previous_topics:
        similar_topics = find_similar_topics(prev_topic, current_topics)
        if len(similar_topics) >= 2:
            events.append({
                "type": "split",
                "source": prev_topic.name,
                "targets": [t.name for t in similar_topics]
            })

    # Emergence: a current topic with no similar predecessor and >100% growth
    for curr_topic in current_topics:
        if not any(is_similar(curr_topic, pt) for pt in previous_topics):
            growth_rate = compute_growth_rate(curr_topic)
            if growth_rate > 1.0:  # >100% growth
                events.append({
                    "type": "emergence",
                    "source": None,
                    "targets": [curr_topic.name],
                    "growth_rate": growth_rate
                })

    return events
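
The detection code calls `find_similar_topics` and `is_similar` without showing them. One plausible way to implement them is Jaccard overlap of keyword sets, sketched below. The similarity measure, the 0.3 threshold, the simplified `Subtopic` dataclass, and the sample keywords are all assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Subtopic:
    """Simplified stand-in with only the fields this sketch needs."""
    name: str
    keywords: list[str] = field(default_factory=list)


def keyword_overlap(a: Subtopic, b: Subtopic) -> float:
    """Jaccard similarity of the two keyword sets (0.0 when both are empty)."""
    set_a, set_b = set(a.keywords), set(b.keywords)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def is_similar(a: Subtopic, b: Subtopic, threshold: float = 0.3) -> bool:
    return keyword_overlap(a, b) >= threshold


def find_similar_topics(topic: Subtopic, candidates: list[Subtopic],
                        threshold: float = 0.3) -> list[Subtopic]:
    return [c for c in candidates if is_similar(topic, c, threshold)]


# Hypothetical data: one previous topic and three current ones
previous = Subtopic("Transformers", ["transformer", "attention", "bert", "gpt"])
current = [
    Subtopic("BERT-variants", ["bert", "transformer", "masked"]),
    Subtopic("GPT-variants", ["gpt", "transformer", "generation"]),
    Subtopic("Constitutional AI", ["constitution", "alignment", "harmlessness"]),
]
matches = find_similar_topics(previous, current)
```

Here the previous topic shares two of five distinct keywords with each of the first two current topics (overlap 0.4), so both match and the detector would report a split. The third topic shares none and would be a candidate for emergence.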

Topic Coherence

Coherence Metric

Topic coherence (C_v) measures topic quality:

Range: 0.0 (poor) to 1.0 (excellent)
Threshold: Topics with coherence < 0.5 (the topic_coherence_min default) are rejected by default
Interpretation:
- 0.7+: Excellent, semantically coherent
- 0.5-0.7: Good, acceptable
- <0.5: Poor, review manually before keeping
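
As a small illustration, the bands above map to a helper like the following. `interpret_coherence` is a hypothetical name, and treating exactly 0.7 as "excellent" and exactly 0.5 as "good" is an assumption about how the boundaries are handled.

```python
def interpret_coherence(score: float) -> str:
    """Map a C_v coherence score to the documented quality bands."""
    if score >= 0.7:
        return "excellent"
    if score >= 0.5:
        return "good"
    return "poor"


# The sample subtopic from the CLI output (coherence 0.68) lands in the "good" band
label = interpret_coherence(0.68)
```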

Computation

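from gensim.models.coherencemodel import CoherenceModel

# lda_model, tokenized_docs, and dictionary come from the training step
coherence_model = CoherenceModel(
    model=lda_model,
    texts=tokenized_docs,
    dictionary=dictionary,
    coherence='c_v'
)

# Aggregate score across all topics in the model
coherence_score = coherence_model.get_coherence()

# One score per topic, in topic-index order
per_topic_scores = coherence_model.get_coherence_per_topic()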

Configuration

class Phase5Settings(BaseSettings):
    topic_modeling_cron: str = "0 3 1 * *"  # 3 AM on 1st of month
    topic_model: str = "lda"  # Algorithm: lda, nmf, or bertopic
    topic_coherence_min: float = 0.5
    nlp_workers: int = 4  # Parallel processing

Environment Variables:

PHASE5_TOPIC_MODEL=lda
PHASE5_TOPIC_COHERENCE_MIN=0.5
PHASE5_NLP_WORKERS=4
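
To make the mapping between settings fields and `PHASE5_`-prefixed variables concrete, here is a standard-library sketch. It is a stand-in for illustration only, not how `BaseSettings` is implemented. `TopicSettings` and `load_settings` are hypothetical names, and the rule "prefix plus upper-cased field name" is inferred from the variables listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSettings:
    topic_modeling_cron: str = "0 3 1 * *"
    topic_model: str = "lda"
    topic_coherence_min: float = 0.5
    nlp_workers: int = 4


def load_settings(environ=None, prefix: str = "PHASE5_") -> TopicSettings:
    """Build settings from environment variables, falling back to the defaults.

    Each raw string is converted with the type of the field's default value.
    """
    environ = os.environ if environ is None else environ
    defaults = TopicSettings()
    values = {}
    for name, default in vars(defaults).items():
        raw = environ.get(prefix + name.upper())
        values[name] = default if raw is None else type(default)(raw)
    return TopicSettings(**values)


# Example with an explicit mapping instead of the real process environment
settings = load_settings({"PHASE5_TOPIC_COHERENCE_MIN": "0.6", "PHASE5_NLP_WORKERS": "8"})
```

Passing the environment as a parameter keeps the loader easy to test without touching the real process environment.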

Performance

Training Time: ~5-10 minutes for 1000 articles
Memory: ~1GB peak during training
Storage: ~200 bytes per subtopic

Manual Curation Workflow

1. Run Topic Modeling

ai-web-feeds nlp topics --topic "ai-safety"

2. Review Processing Stats

ai-web-feeds nlp stats

3. Search Discovered Topic Language

ai-web-feeds search query "AI Safety"

Use Cases

Discover Emerging Subtopics

Monitor new research areas:

# Monthly check for new subtopics in "AI"
ai-web-feeds nlp topics --topic "ai"
ai-web-feeds nlp stats

Track Topic Fragmentation

Identify when broad topics split:

# Check if "Deep Learning" has fragmented
ai-web-feeds nlp topics --topic "deep-learning" --force

Content Organization

Use subtopics for navigation and filtering:

# Show articles in specific subtopic
ai-web-feeds search query "Transformer Architectures"

Troubleshooting

Low Coherence Scores

Symptom: All subtopics have coherence < 0.5.

Causes:

Too few articles (< 100)
Too many subtopics requested
Poor text preprocessing

Solutions:

# Reduce number of topics
ai-web-feeds nlp topics --num-topics 5

# Increase minimum articles
ai-web-feeds nlp topics --min-articles 200
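
Rather than guessing a topic count, one option is to try several values and keep the one with the best coherence. The sketch below assumes a caller-supplied `score_fn` that trains a model for a given topic count and returns its coherence. `pick_num_topics` is a hypothetical helper, and the scores in the example are made-up placeholders, not measured results.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_num_topics(
    candidates: Iterable[int],
    score_fn: Callable[[int], float],
    min_coherence: float = 0.5,
) -> Optional[Tuple[int, float]]:
    """Return (num_topics, coherence) for the best candidate.

    Returns None when there are no candidates or none reaches min_coherence.
    """
    scored = [(score_fn(k), k) for k in candidates]
    if not scored:
        return None
    best_score, best_k = max(scored)
    if best_score < min_coherence:
        return None
    return best_k, best_score


# Placeholder scores standing in for real training runs
fake_scores = {5: 0.62, 10: 0.55, 15: 0.43}
best = pick_num_topics([5, 10, 15], fake_scores.__getitem__)
```

A `None` result suggests the problem is not the topic count, which points back to the other listed causes: too few articles or poor preprocessing.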

Topics Too Broad

Symptom: Subtopics are generic and overlap.

Solution: Increase num_topics parameter to get more specific clusters:

ai-web-feeds nlp topics --num-topics 15

Model Training Fails

Symptom: MemoryError or training hangs.

Solution:

Reduce the training batch size (the chunksize parameter of Gensim's LdaModel)
Limit article count: --max-articles 500
Increase system memory or use cloud instance

Advanced Features

BERTopic (Future)

Alternative to LDA using transformer embeddings:

# Planned: BERTopic support
modeler = TopicModeler(algorithm="bertopic")
subtopics = modeler.discover_subtopics(topic="NLP", articles=articles)

Advantages:

Better semantic understanding
No need to specify number of topics
Often higher coherence scores

Trade-offs:

Slower training (GPU recommended)
Higher memory usage (~2GB)
