Topic Modeling
LDA-based topic discovery and evolution tracking
Source: apps/web/content/docs/features/topic-modeling.mdx
Topic Modeling automatically discovers subtopics within parent topics using Latent Dirichlet Allocation (LDA) and tracks topic evolution over time.
Overview
The topic modeler:
- Discovers subtopics using LDA clustering
- Tracks topic evolution (splits, merges, emergence, decline)
- Enables manual curation of discovered subtopics
- Computes topic coherence scores for quality assessment
Architecture
LDA Topic Modeling
Algorithm
Latent Dirichlet Allocation (LDA) discovers latent topics in document collections:
- Preprocessing: Tokenize, remove stopwords, apply TF-IDF
- Model Training: Learn topic distributions using Gensim LDA
- Topic Extraction: Extract keywords and descriptions
- Coherence Scoring: Validate topic quality using C_v coherence
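The preprocessing step above can be illustrated with the standard library alone. This is a minimal sketch, not the project's actual pipeline: the `tokenize` and `tfidf` helpers, the tiny stopword list, and the sample articles are all assumptions made for illustration, and a real run would hand the tokenized documents to Gensim instead.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic runs, drop stopwords and one-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by term frequency times log inverse document frequency."""
    n_docs = len(docs)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical sample articles
articles = [
    "Attention is the core of the transformer architecture",
    "The transformer uses attention in every layer",
    "Reward models are trained for alignment",
]
tokenized = [tokenize(a) for a in articles]
weights = tfidf(tokenized)
```

Terms that occur in fewer documents receive larger weights, so in the first article `core` outweighs `attention`, which also appears in the second.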
Model Parameters
lda_config = {
    "num_topics": 10,             # Number of subtopics per parent
    "passes": 10,                 # Full training passes over the corpus
    "iterations": 400,            # Maximum inference iterations per document
    "alpha": "auto",              # Document-topic density (learned from data)
    "eta": "auto",                # Topic-word density (learned from data)
    "minimum_probability": 0.01,  # Topics below this probability are filtered out
}
Usage
CLI Commands
Run Topic Modeling
ai-web-feeds nlp topics
Options:
- --parent-topic: Parent topic to model (default: all)
- --topic: Topic ID to model (default: all)
- --min-articles: Minimum articles required (default: 10)
- --force: Reprocess existing topic-modeling records
# Discover subtopics in NLP with minimum 50 articles
ai-web-feeds nlp topics --topic "nlp" --min-articles 50
Review Processing Stats
ai-web-feeds nlp stats
Interactive Workflow:
Unapproved Subtopics (3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[1] NLP > Transformer Architectures
Keywords: transformer, attention, bert, gpt, architecture
Articles: 45
Coherence: 0.68
Actions: [a]pprove, [r]ename, [d]elete, [s]kip
> a
✓ Approved: Transformer Architectures
Search Topic Mentions
# Search articles and source metadata by topic language
ai-web-feeds search query "AI Safety"
Reprocess Topic Modeling
ai-web-feeds nlp topics --force --min-articles 20
Output:
Topic Evolution Events (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Date        Event      Source Topic   Target Topics
2023-10-15  split      Transformers   [BERT-variants, GPT-variants]
2023-10-22  emergence  -              [Constitutional AI]
2023-10-28  merge      [RLHF, HHH]    Alignment Techniques
Python API
from ai_web_feeds.nlp import TopicModeler
from ai_web_feeds.storage import Storage
modeler = TopicModeler()
storage = Storage()
# Get articles for parent topic
articles = storage.get_articles_by_topic("NLP", limit=1000)
# Train LDA model
subtopics = modeler.discover_subtopics(
    topic="NLP",
    articles=articles,
    num_topics=10
)
# subtopics = [
# {
# "name": "Transformer Architectures",
# "keywords": ["transformer", "attention", "bert", "gpt"],
# "description": "Articles about transformer models...",
# "article_count": 45,
# "coherence": 0.68
# },
# ...
# ]
# Store subtopics
for subtopic_data in subtopics:
    storage.create_subtopic(
        topic="NLP",
        name=subtopic_data["name"],
        keywords=subtopic_data["keywords"],
        description=subtopic_data["description"],
        article_count=subtopic_data["article_count"]
    )
Batch Processing
Topic modeling runs monthly (1st of month, 3 AM):
from ai_web_feeds.nlp.scheduler import NLPScheduler
# `scheduler` is assumed to be the application's existing scheduler instance (not shown here)
nlp_scheduler = NLPScheduler(scheduler)
nlp_scheduler.register_jobs()
# Registers: Topic modeling job (monthly)
Database Schema
subtopics Table
CREATE TABLE subtopics (
    id TEXT PRIMARY KEY,            -- UUID
    parent_topic TEXT NOT NULL,
    name TEXT NOT NULL,
    keywords TEXT NOT NULL,         -- JSON array
    description TEXT,
    article_count INTEGER DEFAULT 0,
    detected_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    approved BOOLEAN DEFAULT FALSE,
    created_by TEXT DEFAULT 'system',
    UNIQUE(parent_topic, name)
);
topic_evolution_events Table
CREATE TABLE topic_evolution_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL CHECK(event_type IN ('split', 'merge', 'emergence', 'decline')),
    source_topic TEXT,
    target_topics TEXT,             -- JSON array
    article_count INTEGER NOT NULL,
    growth_rate REAL,
    detected_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
Topic Evolution Detection
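Detected events are persisted in the topic_evolution_events table defined above. As a rough sketch of what writing one might look like, assuming SQLite (suggested by the AUTOINCREMENT syntax) and a hypothetical `record_event` helper that is not part of the documented API:

```python
import json
import sqlite3

# DDL copied from the schema above
SCHEMA = """
CREATE TABLE topic_evolution_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL CHECK(event_type IN ('split', 'merge', 'emergence', 'decline')),
    source_topic TEXT,
    target_topics TEXT,
    article_count INTEGER NOT NULL,
    growth_rate REAL,
    detected_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
"""


def record_event(conn, event_type, source_topic, target_topics,
                 article_count, growth_rate=None):
    """Insert one event (hypothetical helper); target_topics is stored as a JSON array."""
    cur = conn.execute(
        "INSERT INTO topic_evolution_events "
        "(event_type, source_topic, target_topics, article_count, growth_rate) "
        "VALUES (?, ?, ?, ?, ?)",
        (event_type, source_topic, json.dumps(target_topics), article_count, growth_rate),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.executescript(SCHEMA)

# Illustrative values only
event_id = record_event(conn, "split", "Transformers",
                        ["BERT-variants", "GPT-variants"], 45)
row = conn.execute(
    "SELECT event_type, target_topics FROM topic_evolution_events WHERE id = ?",
    (event_id,),
).fetchone()
targets = json.loads(row[1])  # decode the JSON array back into a list
```

The CHECK constraint rejects any event_type outside the four listed values, so a typo surfaces as `sqlite3.IntegrityError` at insert time.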
Evolution Types
Split: One topic divides into multiple subtopics
Transformers → [BERT-variants, GPT-variants, ViT]
Merge: Multiple subtopics combine into one
[Supervised Learning, Unsupervised Learning] → Machine Learning Fundamentals
Emergence: New topic appears (growth rate > 100%)
- → Constitutional AI (50 articles in 1 month)
Decline: Topic activity decreases (growth rate < -50%)
GANs → (declining mention frequency)
Detection Algorithm
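The detection routine in this section calls `find_similar_topics`, `is_similar`, and `compute_growth_rate` without defining them. One plausible reading, offered only as a sketch: similarity is the Jaccard overlap of keyword sets, and growth rate is the relative change in article count between periods. The `Subtopic` fields, the 0.3 threshold, and `classify_growth` are assumptions for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Subtopic:
    name: str
    keywords: list[str] = field(default_factory=list)
    article_count: int = 0
    previous_article_count: int = 0  # hypothetical field: count in the prior period


def keyword_overlap(a: Subtopic, b: Subtopic) -> float:
    """Jaccard similarity of the two keyword sets (0.0 when both are empty)."""
    set_a, set_b = set(a.keywords), set(b.keywords)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def is_similar(a: Subtopic, b: Subtopic, threshold: float = 0.3) -> bool:
    """Assumed rule: topics are similar when keyword overlap reaches the threshold."""
    return keyword_overlap(a, b) >= threshold


def find_similar_topics(topic, candidates, threshold: float = 0.3):
    """Return the candidates that are similar to the given topic."""
    return [c for c in candidates if is_similar(topic, c, threshold)]


def compute_growth_rate(topic: Subtopic) -> float:
    """Relative change in article count; 1.0 means +100%."""
    if topic.previous_article_count == 0:
        # No baseline: treat any articles as unbounded growth (assumption)
        return float("inf") if topic.article_count > 0 else 0.0
    delta = topic.article_count - topic.previous_article_count
    return delta / topic.previous_article_count


def classify_growth(growth_rate: float) -> str:
    """Apply the thresholds from Evolution Types: > 100% and < -50%."""
    if growth_rate > 1.0:
        return "emergence"
    if growth_rate < -0.5:
        return "decline"
    return "stable"  # hypothetical label for everything in between
```

With these definitions, a topic that goes from 10 to 30 articles has a growth rate of 2.0 and counts as emergence, while one that drops from 10 to 4 falls below the -50% line and counts as decline.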
from typing import List

# Subtopic, EvolutionEvent, find_similar_topics, is_similar, and
# compute_growth_rate are not defined in this excerpt.
def detect_subtopic_set_changes(
    current_topics: List[Subtopic],
    previous_topics: List[Subtopic]
) -> List[EvolutionEvent]:
    """Compare current vs previous month's topics"""
    events = []
    # Detect splits: one previous topic matches two or more current topics
    for prev_topic in previous_topics:
        similar_topics = find_similar_topics(prev_topic, current_topics)
        if len(similar_topics) >= 2:
            events.append({
                "type": "split",
                "source": prev_topic.name,
                "targets": [t.name for t in similar_topics]
            })
    # Detect emergence: no similar previous topic and fast growth
    for curr_topic in current_topics:
        if not any(is_similar(curr_topic, pt) for pt in previous_topics):
            growth_rate = compute_growth_rate(curr_topic)
            if growth_rate > 1.0:  # >100% growth
                events.append({
                    "type": "emergence",
                    "target": curr_topic.name,
                    "growth_rate": growth_rate
                })
    # Merge and decline detection are omitted from this excerpt
    return events
Topic Coherence
Coherence Metric
Topic coherence (C_v) measures topic quality:
- Range: 0.0 (poor) to 1.0 (excellent)
- Threshold: Reject topics with coherence < 0.5
- Interpretation:
- 0.7+: Excellent, semantically coherent
- 0.5-0.7: Good, acceptable
- <0.5: Poor, review manually
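The bands above translate directly into a small filter. The following is a sketch under stated assumptions: `coherence_band` and `split_by_coherence` are hypothetical helper names, and subtopics are assumed to be dicts with a `coherence` key, as in the Python API example output.

```python
def coherence_band(score: float, minimum: float = 0.5) -> str:
    """Map a C_v coherence score to the interpretation bands listed above."""
    if score >= 0.7:
        return "excellent"
    if score >= minimum:
        return "good"
    return "poor"


def split_by_coherence(subtopics: list[dict], minimum: float = 0.5):
    """Separate subtopics that pass the threshold from those needing manual review."""
    keep = [s for s in subtopics if s["coherence"] >= minimum]
    review = [s for s in subtopics if s["coherence"] < minimum]
    return keep, review


# Illustrative input; the second entry is made up for the example
candidates = [
    {"name": "Transformer Architectures", "coherence": 0.68},
    {"name": "Misc", "coherence": 0.41},
]
keep, review = split_by_coherence(candidates)
```

Keeping the threshold as a parameter lets it follow a configured minimum instead of a hard-coded 0.5.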
Computation
from gensim.models.coherencemodel import CoherenceModel
coherence_model = CoherenceModel(
    model=lda_model,
    texts=tokenized_docs,
    dictionary=dictionary,
    coherence='c_v'
)
coherence_score = coherence_model.get_coherence()
Configuration
class Phase5Settings(BaseSettings):
    topic_modeling_cron: str = "0 3 1 * *"  # 3 AM on 1st of month
    topic_model: str = "lda"  # Algorithm: lda, nmf, or bertopic
    topic_coherence_min: float = 0.5
    nlp_workers: int = 4  # Parallel processing
Environment Variables:
PHASE5_TOPIC_MODEL=lda
PHASE5_TOPIC_COHERENCE_MIN=0.5
PHASE5_NLP_WORKERS=4
Performance
- Training Time: ~5-10 minutes for 1000 articles
- Memory: ~1GB peak during training
- Storage: ~200 bytes per subtopic
Manual Curation Workflow
1. Run Topic Modeling
ai-web-feeds nlp topics --topic "ai-safety"
2. Review Processing Stats
ai-web-feeds nlp stats
3. Search Discovered Topic Language
ai-web-feeds search query "AI Safety"
Use Cases
Discover Emerging Subtopics
Monitor new research areas:
# Monthly check for new subtopics in "AI"
ai-web-feeds nlp topics --topic "ai"
ai-web-feeds nlp stats
Track Topic Fragmentation
Identify when broad topics split:
# Check if "Deep Learning" has fragmented
ai-web-feeds nlp topics --topic "deep-learning" --force
Content Organization
Use subtopics for navigation and filtering:
# Show articles in specific subtopic
ai-web-feeds search query "Transformer Architectures"
Troubleshooting
Low Coherence Scores
Symptom: All subtopics have coherence < 0.5.
Causes:
- Too few articles (< 100)
- Too many subtopics requested
- Poor text preprocessing
Solutions:
# Reduce number of topics
ai-web-feeds nlp topics --num-topics 5
# Increase minimum articles
ai-web-feeds nlp topics --min-articles 200
Topics Too Broad
Symptom: Subtopics are generic and overlap.
Solution: Increase num_topics parameter to get more specific clusters:
ai-web-feeds nlp topics --num-topics 15
Model Training Fails
Symptom: MemoryError or training hangs.
Solution:
- Reduce batch size
- Limit article count: --max-articles 500
- Increase system memory or use a cloud instance
Advanced Features
BERTopic (Future)
Alternative to LDA using transformer embeddings:
# Planned: BERTopic support
modeler = TopicModeler(algorithm="bertopic")
subtopics = modeler.discover_subtopics(topic="NLP", articles=articles)
Advantages:
- Better semantic understanding
- No need to specify number of topics
- Higher coherence scores
Trade-offs:
- Slower training (GPU recommended)
- Higher memory usage (~2GB)
See Also
- Quality Scoring - Article quality assessment
- Entity Extraction - Named entity recognition
- Sentiment Analysis - Sentiment classification