Entity Extraction
Named Entity Recognition and normalization using spaCy NER
Source: apps/web/content/docs/features/entity-extraction.mdx
Entity Extraction identifies and tracks people, organizations, techniques, datasets, and concepts mentioned in articles using spaCy's Named Entity Recognition (NER) models.
Overview
The entity extractor:
- Extracts entities from article text using spaCy NER
- Normalizes entity names to canonical forms (e.g., "G. Hinton" → "Geoffrey Hinton")
- Tracks entity mentions across articles with confidence scores
- Enables full-text search across entities and aliases
Architecture
Entity Types
Supported entity types:
- person: Geoffrey Hinton, Yann LeCun, Ilya Sutskever
- organization: OpenAI, Google Brain, Anthropic
- technique: Transformers, RLHF, LoRA, BERT
- dataset: ImageNet, COCO, WikiText-103
- concept: Attention mechanism, Backpropagation
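Stock spaCy English pipelines tag spans with OntoNotes labels such as PERSON and ORG, and have no built-in labels for techniques, datasets, or concepts, so some mapping onto the five types above is needed. How the project does this is not shown here. The sketch below is one possible shape: `EntityType`, `SPACY_LABEL_TO_TYPE`, and `to_entity_type` are hypothetical names, not part of the documented API.

```python
from enum import Enum
from typing import Optional


class EntityType(str, Enum):
    """The five entity types listed above (hypothetical enum, for illustration)."""
    PERSON = "person"
    ORGANIZATION = "organization"
    TECHNIQUE = "technique"
    DATASET = "dataset"
    CONCEPT = "concept"


# Assumed mapping from spaCy's OntoNotes labels to project types.
# Only labels with an obvious counterpart are mapped; the rest are dropped.
SPACY_LABEL_TO_TYPE = {
    "PERSON": EntityType.PERSON,
    "ORG": EntityType.ORGANIZATION,
}


def to_entity_type(spacy_label: str) -> Optional[EntityType]:
    """Return the project entity type for a spaCy label, or None if unmapped."""
    return SPACY_LABEL_TO_TYPE.get(spacy_label)
```

Under this assumption, technique, dataset, and concept entities would have to come from another source, such as domain-specific rules.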
Features
Named Entity Recognition
Uses spaCy's en_core_web_sm model to detect entities:
from ai_web_feeds.nlp import EntityExtractor
extractor = EntityExtractor()
article = {
    "id": 1,
    "title": "GPT-4 by OpenAI",
    "content": "OpenAI released GPT-4, led by Sam Altman...",
}
entities = extractor.extract_entities(article)
# Returns: [
#   {"text": "OpenAI", "type": "organization", "confidence": 0.91},
#   {"text": "GPT-4", "type": "technique", "confidence": 0.96},
#   {"text": "Sam Altman", "type": "person", "confidence": 0.89}
# ]
Entity Normalization
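Automatically merges near-duplicate entity names using Levenshtein (edit) distance:
# "Geoffrey Hinton" vs "Geoffery Hinton" → Merged (distance = 2)
# "OpenAI" vs "Open AI" → Merged (distance = 1)
# "Geoffrey Hinton" vs "G. Hinton" → Not merged by edit distance alone (distance = 7); abbreviations need an alias
Algorithm: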
- Title-case normalization
- Compare to existing entities of same type
- If Levenshtein distance ≤ 2, use existing canonical name
- Otherwise, create new entity
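The steps above can be sketched in plain Python. This is a minimal illustration, not the project's `normalize_entity` implementation: the function names and signature are hypothetical, `existing` is assumed to be already filtered to entities of the same type, and the comparison is done case-insensitively (an assumption, so that title-casing "OpenAI" does not inflate the distance).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalize_entity(name: str, existing: list[str], max_distance: int = 2) -> str:
    """Return the closest existing canonical name within max_distance,
    otherwise the title-cased input as a new canonical name."""
    candidate = name.strip().title()
    best_name, best_distance = None, max_distance + 1
    for canonical in existing:
        distance = levenshtein(candidate.lower(), canonical.lower())
        if distance < best_distance:
            best_name, best_distance = canonical, distance
    return best_name if best_name is not None else candidate
```

With a threshold of 2, a misspelling such as "Geoffery Hinton" resolves to an existing "Geoffrey Hinton", while "G. Hinton" (distance 7) does not and would become a new entity unless an alias handles it.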
Full-Text Search
SQLite FTS5 virtual table enables fast entity search:
# Search the indexed corpus for entity mentions
ai-web-feeds search query "hinton"
Usage
CLI Commands
Extract Entities
ai-web-feeds nlp entities
Options:
- --batch-size: Number of articles to process (default: 50)
- --force: Reprocess all articles
# Process 25 articles
ai-web-feeds nlp entities --batch-size 25
NLP Processing Stats
ai-web-feeds nlp stats
Search Entity Mentions
ai-web-feeds search query "Geoffrey Hinton"
Search results include:
- matching feed or article text
- source metadata
- relevance ranking
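How the CLI implements search is not shown here. If it is backed by the entities_fts table described under Database Schema, a ranked lookup might resemble the following sketch using Python's built-in sqlite3 module. It assumes the SQLite build includes FTS5; the sample rows and the `search_entities` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same column layout as the entities_fts table in this document.
conn.execute(
    """
    CREATE VIRTUAL TABLE entities_fts USING fts5(
        entity_id UNINDEXED,
        canonical_name,
        aliases,
        description
    )
    """
)
# Illustrative rows only, not real project data.
conn.executemany(
    "INSERT INTO entities_fts (entity_id, canonical_name, aliases, description) "
    "VALUES (?, ?, ?, ?)",
    [
        ("e1", "Geoffrey Hinton", '["G. Hinton", "Geoff Hinton"]', "person"),
        ("e2", "OpenAI", '["Open AI"]', "organization"),
    ],
)


def search_entities(connection, query, limit=10):
    """Full-text search over names, aliases, and descriptions, best match first."""
    rows = connection.execute(
        "SELECT entity_id, canonical_name FROM entities_fts "
        "WHERE entities_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


print(search_entities(conn, "hinton"))
```

Ordering by FTS5's built-in `rank` column is what provides relevance ranking in this sketch. The default tokenizer is case-insensitive, so "hinton" matches "Hinton".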
Reprocess Entity Extraction
ai-web-feeds nlp entities --force
Python API
from ai_web_feeds.nlp import EntityExtractor
from ai_web_feeds.storage import Storage
extractor = EntityExtractor()
storage = Storage()
# Extract entities
article = storage.get_article_by_id(123)
entities = extractor.extract_entities(article)
# Store entities
for entity_data in entities:
    # Normalize name
    canonical_name = extractor.normalize_entity(
        entity_data["text"],
        entity_data["type"],
        existing_entities=storage.list_all_entity_names(),
    )
    # Get or create entity
    entity = storage.get_entity_by_name(canonical_name)
    if not entity:
        entity = storage.create_entity(
            canonical_name=canonical_name,
            entity_type=entity_data["type"],
        )
    # Record mention
    storage.create_entity_mention(
        entity_id=entity.id,
        article_id=article["id"],
        confidence=entity_data["confidence"],
        extraction_method="ner_model",
        context=entity_data["context"],
    )
Batch Processing
Entity extraction runs hourly via APScheduler:
from ai_web_feeds.nlp.scheduler import NLPScheduler
# `scheduler` is assumed to be an APScheduler scheduler instance created elsewhere
nlp_scheduler = NLPScheduler(scheduler)
nlp_scheduler.register_jobs()
# Registers: Entity extraction job (every hour)
Database Schema
entities Table
CREATE TABLE entities (
    id TEXT PRIMARY KEY, -- UUID
    canonical_name TEXT NOT NULL UNIQUE,
    entity_type TEXT NOT NULL CHECK(entity_type IN ('person', 'organization', 'technique', 'dataset', 'concept')),
    aliases TEXT, -- JSON array
    description TEXT,
    metadata TEXT, -- JSON object
    frequency_count INTEGER DEFAULT 0,
    first_seen DATETIME DEFAULT CURRENT_TIMESTAMP,
    last_seen DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
entity_mentions Table
CREATE TABLE entity_mentions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id TEXT NOT NULL REFERENCES entities(id),
    article_id INTEGER NOT NULL,
    confidence REAL NOT NULL CHECK(confidence BETWEEN 0 AND 1),
    extraction_method TEXT NOT NULL CHECK(extraction_method IN ('ner_model', 'rule_based', 'manual')),
    context TEXT, -- Surrounding text snippet
    mentioned_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (entity_id) REFERENCES entities(id),
    FOREIGN KEY (article_id) REFERENCES articles(id)
);
FTS5 Virtual Table
CREATE VIRTUAL TABLE entities_fts USING fts5(
    entity_id UNINDEXED,
    canonical_name,
    aliases,
    description
);
Model Installation
The first run will download the spaCy model (~13MB):
# Manual download (optional)
uv run python -m spacy download en_core_web_sm
Model Info:
- Name: en_core_web_sm
- Size: 13MB
- Language: English
- Accuracy: ~85% F1 score on OntoNotes 5.0
Configuration
class Phase5Settings(BaseSettings):
    entity_batch_size: int = 50
    entity_cron: str = "0 * * * *"  # Every hour
    entity_confidence_threshold: float = 0.7
    spacy_model: str = "en_core_web_sm"
Environment Variables:
PHASE5_ENTITY_BATCH_SIZE=50
PHASE5_ENTITY_CONFIDENCE_THRESHOLD=0.7
PHASE5_SPACY_MODEL=en_core_web_sm
Performance
- Throughput: ~50 articles/hour
- Memory: ~200MB (spaCy model loaded)
- Storage: ~50 bytes per entity mention
Use Cases
Track Influential Researchers
# Find top AI researchers by mention frequency
ai-web-feeds search query "AI researcher"
Discover Emerging Techniques
# Find recently mentioned techniques
ai-web-feeds search query "technique"
Build Knowledge Graphs
Connect entities by co-occurrence in articles:
# Articles mentioning both "GPT-4" and "RLHF"
storage.get_articles_mentioning_entities(["GPT-4", "RLHF"])
Troubleshooting
Low Extraction Accuracy
Symptom: Many entities missed or incorrectly classified.
Solutions:
- Use a larger spaCy model: en_core_web_md (~40MB) or en_core_web_lg (several hundred MB) for better accuracy
- Add domain-specific rules for AI terminology
- Re-run extraction with ai-web-feeds nlp entities --force
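Domain-specific rules can be as simple as a term list matched with regular expressions. The sketch below uses only the standard library and is not the project's implementation (spaCy also offers its own rule-based components). The `AI_TERMS` gazetteer and both function names are hypothetical; the terms come from the entity type examples in this document, and the "rule_based" label matches the extraction_method values in the schema.

```python
import re

# Hypothetical gazetteer, keyed by entity type.
AI_TERMS = {
    "technique": ["RLHF", "LoRA", "BERT", "Transformers"],
    "dataset": ["ImageNet", "COCO", "WikiText-103"],
}


def build_patterns(terms_by_type):
    """Compile one case-sensitive, word-bounded regex per entity type."""
    patterns = {}
    for entity_type, terms in terms_by_type.items():
        # Longest terms first so a longer name wins over a shorter prefix.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        patterns[entity_type] = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return patterns


def extract_rule_based(text, patterns):
    """Return gazetteer matches ordered by their position in the text."""
    found = []
    for entity_type, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.append({
                "text": match.group(0),
                "type": entity_type,
                "start": match.start(),
                "extraction_method": "rule_based",
            })
    return sorted(found, key=lambda item: item["start"])


patterns = build_patterns(AI_TERMS)
for hit in extract_rule_based("We fine-tuned BERT with LoRA on ImageNet.", patterns):
    print(hit["type"], hit["text"])
```

The word-boundary lookarounds keep "COCO" from matching inside a longer word such as "COCOA". Matching is case-sensitive here because acronyms like BERT and LoRA are usually written in a fixed form.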
Duplicate Entities
Symptom: "Geoffrey Hinton" and "Geoff Hinton" are stored as separate entities (their edit distance is 3, above the merge threshold of 2).
Solution:
# Reprocess entity records after curation data changes
ai-web-feeds nlp entities --force
spaCy Model Not Found
Symptom: OSError: Can't find model 'en_core_web_sm'
Solution:
uv run python -m spacy download en_core_web_sm
See Also
- Quality Scoring - Article quality assessment
- Sentiment Analysis - Sentiment classification
- Topic Modeling - Discover subtopics