Entity Extraction

Entity Extraction identifies and tracks people, organizations, techniques, datasets, and concepts mentioned in articles using spaCy's Named Entity Recognition (NER) models.

Overview

The entity extractor:

Extracts entities from article text using spaCy NER
Normalizes entity names to canonical forms (e.g., "G. Hinton" → "Geoffrey Hinton")
Tracks entity mentions across articles with confidence scores
Enables full-text search across entities and aliases

Architecture

Entity Types

Supported entity types:

person: Geoffrey Hinton, Yann LeCun, Ilya Sutskever
organization: OpenAI, Google Brain, Anthropic
technique: Transformers, RLHF, LoRA, BERT
dataset: ImageNet, COCO, WikiText-103
concept: Attention mechanism, Backpropagation

Features

Named Entity Recognition

Uses spaCy's en_core_web_sm model to detect entities:

from ai_web_feeds.nlp import EntityExtractor

extractor = EntityExtractor()

article = {
    "id": 1,
    "title": "GPT-4 by OpenAI",
    "content": "OpenAI released GPT-4, led by Sam Altman..."
}

entities = extractor.extract_entities(article)
# Returns: [
#     {"text": "OpenAI", "type": "organization", "confidence": 0.91},
#     {"text": "GPT-4", "type": "technique", "confidence": 0.96},
#     {"text": "Sam Altman", "type": "person", "confidence": 0.89}
# ]
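spaCy's stock English models emit OntoNotes-style labels such as PERSON and ORG, and have no built-in labels for techniques, datasets, or concepts. The sketch below shows one way those labels could be mapped onto the entity types listed above. The mapping table and the helper names `map_label` and `to_entity_record` are illustrative assumptions, not the project's actual implementation, and how the confidence value is derived is not specified on this page.

```python
from typing import Optional

# Hypothetical mapping from spaCy's stock labels to this page's entity types.
SPACY_LABEL_TO_TYPE = {
    "PERSON": "person",
    "ORG": "organization",
    "PRODUCT": "technique",      # assumption: model/product names count as techniques
    "WORK_OF_ART": "technique",  # assumption
}


def map_label(label: str) -> Optional[str]:
    """Return the entity type for a spaCy label, or None if the label is not tracked."""
    return SPACY_LABEL_TO_TYPE.get(label)


def to_entity_record(text: str, label: str, confidence: float) -> Optional[dict]:
    """Build a record shaped like the extract_entities() output shown above."""
    entity_type = map_label(label)
    if entity_type is None:
        return None  # e.g. GPE, DATE, MONEY are dropped in this sketch
    return {"text": text, "type": entity_type, "confidence": confidence}
```

Labels without a mapping (for example GPE or DATE) are simply discarded here; dataset and concept entities would need additional rules or a custom model.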

Entity Normalization

Automatically merges similar entities using Levenshtein distance:

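# "OpenAI" vs "Open AI" → Merged (distance = 1)
# "Geoffrey Hinton" vs "Geoffery Hinton" → Merged (distance = 2)
# "Geoffrey Hinton" vs "G. Hinton" → Not merged by distance alone (distance = 7); abbreviations like this need an alias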

Algorithm:

1. Title-case normalization
2. Compare to existing entities of same type
3. If Levenshtein distance ≤ 2, use existing canonical name
4. Otherwise, create new entity
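The steps above can be sketched in plain Python. This is a minimal illustration, not the project's `normalize_entity` implementation: the function names are hypothetical, the caller is assumed to pass only names of the same entity type, and the comparison is done case-insensitively (an assumption, since title-casing alone would make "OpenAI" and "Open AI" differ by more than one edit).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalize_name(name: str, same_type_names: list, max_distance: int = 2) -> str:
    """Return an existing canonical name within max_distance, else the cleaned name."""
    candidate = " ".join(name.split()).title()  # step 1: collapse spaces, title-case
    for canonical in same_type_names:           # step 2: compare within the same type
        if levenshtein(candidate.casefold(), canonical.casefold()) <= max_distance:
            return canonical                    # step 3: reuse the canonical name
    return candidate                            # step 4: treat as a new entity
```

For example, `normalize_name("Open AI", ["OpenAI"])` reuses "OpenAI", while `normalize_name("G. Hinton", ["Geoffrey Hinton"])` does not merge because the edit distance is well above the threshold.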

Full-Text Search

SQLite FTS5 virtual table enables fast entity search:

# Search the indexed corpus for entity mentions
ai-web-feeds search query "hinton"

Usage

CLI Commands

Extract Entities

ai-web-feeds nlp entities

Options:

--batch-size: Number of articles (default: 50)
--force: Reprocess all articles

# Process 25 articles
ai-web-feeds nlp entities --batch-size 25

NLP Processing Stats

ai-web-feeds nlp stats

Search Entity Mentions

ai-web-feeds search query "Geoffrey Hinton"

Search results include:

matching feed or article text
source metadata
relevance ranking

Reprocess Entity Extraction

ai-web-feeds nlp entities --force

Python API

from ai_web_feeds.nlp import EntityExtractor
from ai_web_feeds.storage import Storage

extractor = EntityExtractor()
storage = Storage()

# Extract entities
article = storage.get_article_by_id(123)
entities = extractor.extract_entities(article)

# Store entities
for entity_data in entities:
    # Normalize name
    canonical_name = extractor.normalize_entity(
        entity_data["text"],
        entity_data["type"],
        existing_entities=storage.list_all_entity_names()
    )

    # Get or create entity
    entity = storage.get_entity_by_name(canonical_name)
    if not entity:
        entity = storage.create_entity(
            canonical_name=canonical_name,
            entity_type=entity_data["type"]
        )

    # Record mention
    storage.create_entity_mention(
        entity_id=entity.id,
        article_id=article["id"],
        confidence=entity_data["confidence"],
        extraction_method="ner_model",
        context=entity_data.get("context")  # "context" may be missing; the column allows NULL
    )

Batch Processing

Entity extraction runs hourly via APScheduler:

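from apscheduler.schedulers.background import BackgroundScheduler
from ai_web_feeds.nlp.scheduler import NLPScheduler

# Assumption: an APScheduler scheduler instance is passed in; BackgroundScheduler is one option
scheduler = BackgroundScheduler()
nlp_scheduler = NLPScheduler(scheduler)
nlp_scheduler.register_jobs()
# Registers: Entity extraction job (every hour)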

Database Schema

entities Table

CREATE TABLE entities (
    id TEXT PRIMARY KEY,  -- UUID
    canonical_name TEXT NOT NULL UNIQUE,
    entity_type TEXT NOT NULL CHECK(entity_type IN ('person', 'organization', 'technique', 'dataset', 'concept')),
    aliases TEXT,  -- JSON array
    description TEXT,
    metadata TEXT,  -- JSON object
    frequency_count INTEGER DEFAULT 0,
    first_seen DATETIME DEFAULT CURRENT_TIMESTAMP,
    last_seen DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

entity_mentions Table

CREATE TABLE entity_mentions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id TEXT NOT NULL REFERENCES entities(id),
    article_id INTEGER NOT NULL,
    confidence REAL NOT NULL CHECK(confidence BETWEEN 0 AND 1),
    extraction_method TEXT NOT NULL CHECK(extraction_method IN ('ner_model', 'rule_based', 'manual')),
    context TEXT,  -- Surrounding text snippet
    mentioned_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (entity_id) REFERENCES entities(id),
    FOREIGN KEY (article_id) REFERENCES articles(id)
);
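To see how the two tables work together, the following sketch loads a trimmed copy of the schema into an in-memory SQLite database, inserts a few mentions, and ranks entities by mention count. The sample rows and the helper names (`build_demo_db`, `top_entities`, `rejects_bad_confidence`) are made up for illustration, and several columns are omitted for brevity.

```python
import sqlite3

# Trimmed copy of the schema above (some columns and the articles FK omitted).
SCHEMA = """
CREATE TABLE entities (
    id TEXT PRIMARY KEY,
    canonical_name TEXT NOT NULL UNIQUE,
    entity_type TEXT NOT NULL CHECK(entity_type IN
        ('person', 'organization', 'technique', 'dataset', 'concept'))
);
CREATE TABLE entity_mentions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id TEXT NOT NULL REFERENCES entities(id),
    article_id INTEGER NOT NULL,
    confidence REAL NOT NULL CHECK(confidence BETWEEN 0 AND 1),
    extraction_method TEXT NOT NULL CHECK(extraction_method IN
        ('ner_model', 'rule_based', 'manual')),
    context TEXT
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with illustrative sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO entities (id, canonical_name, entity_type) VALUES (?, ?, ?)",
        [("e1", "Geoffrey Hinton", "person"), ("e2", "OpenAI", "organization")],
    )
    conn.executemany(
        "INSERT INTO entity_mentions (entity_id, article_id, confidence, extraction_method) "
        "VALUES (?, ?, ?, 'ner_model')",
        [("e1", 1, 0.89), ("e1", 2, 0.93), ("e2", 1, 0.91)],
    )
    return conn


def top_entities(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Rank entities by number of recorded mentions."""
    return conn.execute(
        "SELECT e.canonical_name, COUNT(m.id) AS mentions "
        "FROM entities e JOIN entity_mentions m ON m.entity_id = e.id "
        "GROUP BY e.id ORDER BY mentions DESC, e.canonical_name LIMIT ?",
        (limit,),
    ).fetchall()


def rejects_bad_confidence(conn: sqlite3.Connection) -> bool:
    """Return True if the CHECK constraint blocks a confidence outside [0, 1]."""
    try:
        conn.execute(
            "INSERT INTO entity_mentions (entity_id, article_id, confidence, extraction_method) "
            "VALUES ('e1', 3, 1.5, 'ner_model')"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

The CHECK constraints mean invalid confidence values or extraction methods fail at insert time rather than surfacing later in queries.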

FTS5 Virtual Table

CREATE VIRTUAL TABLE entities_fts USING fts5(
    entity_id UNINDEXED,
    canonical_name,
    aliases,
    description
);
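A query against this virtual table could look like the sketch below, which uses Python's built-in `sqlite3` module and assumes the underlying SQLite build includes FTS5. The sample rows and the helper names `build_fts_index` and `search_entities` are illustrative only.

```python
import sqlite3

# Illustrative rows: (entity_id, canonical_name, aliases as JSON text, description)
ROWS = [
    ("e1", "Geoffrey Hinton", '["G. Hinton", "Geoff Hinton"]', "Deep learning researcher"),
    ("e2", "OpenAI", '["Open AI"]', "AI research organization"),
]


def build_fts_index(rows) -> sqlite3.Connection:
    """Create the FTS5 table in memory and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE entities_fts USING fts5("
        "entity_id UNINDEXED, canonical_name, aliases, description)"
    )
    conn.executemany(
        "INSERT INTO entities_fts (entity_id, canonical_name, aliases, description) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def search_entities(conn: sqlite3.Connection, term: str) -> list:
    """Match a term against names, aliases, and descriptions, best match first."""
    # Quote the term so user input is treated as a phrase, not FTS5 query syntax.
    phrase = '"' + term.replace('"', '""') + '"'
    return conn.execute(
        "SELECT entity_id, canonical_name FROM entities_fts "
        "WHERE entities_fts MATCH ? ORDER BY rank",
        (phrase,),
    ).fetchall()
```

Because `aliases` is an indexed column, a search for "geoff" finds the Geoffrey Hinton row even though that token only appears in its alias list. The `entity_id` column is stored but not searchable, since it is declared UNINDEXED.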

Model Installation

The first run will download the spaCy model (~13MB):

# Manual download (optional)
uv run python -m spacy download en_core_web_sm

Model Info:

Name: en_core_web_sm
Size: 13MB
Language: English
Accuracy: ~85% F1 score on OntoNotes 5.0

Configuration

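from pydantic_settings import BaseSettings  # assumes pydantic v2; with pydantic v1, import BaseSettings from pydantic

class Phase5Settings(BaseSettings):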
    entity_batch_size: int = 50
    entity_cron: str = "0 * * * *"  # Every hour
    entity_confidence_threshold: float = 0.7
    spacy_model: str = "en_core_web_sm"

Environment Variables:

PHASE5_ENTITY_BATCH_SIZE=50
PHASE5_ENTITY_CONFIDENCE_THRESHOLD=0.7
PHASE5_SPACY_MODEL=en_core_web_sm
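As a rough illustration of how these values might be consumed, the sketch below reads the three variables with the standard library and applies the confidence threshold to a list of extracted entities. It does not use the project's `Phase5Settings` class. The helper names are hypothetical, and treating the threshold as an inclusive lower bound is an assumption.

```python
import os


def load_entity_settings(env=None) -> dict:
    """Read the PHASE5_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "batch_size": int(env.get("PHASE5_ENTITY_BATCH_SIZE", "50")),
        "confidence_threshold": float(env.get("PHASE5_ENTITY_CONFIDENCE_THRESHOLD", "0.7")),
        "spacy_model": env.get("PHASE5_SPACY_MODEL", "en_core_web_sm"),
    }


def filter_by_confidence(entities: list, threshold: float) -> list:
    """Keep entities whose confidence is at or above the threshold (assumed inclusive)."""
    return [e for e in entities if e["confidence"] >= threshold]
```

With the default threshold of 0.7, an entity extracted at confidence 0.65 would be dropped before storage under this reading.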

Performance

Throughput: ~50 articles/hour (consistent with the default batch size of 50 and the hourly schedule)
Memory: ~200MB (spaCy model loaded)
Storage: ~50 bytes per entity mention

Use Cases

Track Influential Researchers

# Search indexed content for AI researcher mentions (results are ranked by relevance)
ai-web-feeds search query "AI researcher"

Discover Emerging Techniques

# Search indexed content for technique mentions
ai-web-feeds search query "technique"

Build Knowledge Graphs

Connect entities by co-occurrence in articles:

# Articles mentioning both "GPT-4" and "RLHF"
storage.get_articles_mentioning_entities(["GPT-4", "RLHF"])
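Co-occurrence edges for such a graph can be derived from a mapping of article IDs to the entity names mentioned in each article. The sketch below is a minimal, storage-independent illustration. The function name `cooccurrence_counts` and the input shape are assumptions, not part of the project's API.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(article_entities: dict) -> Counter:
    """Count how many articles mention each unordered pair of entities.

    article_entities maps an article ID to the list of canonical entity
    names mentioned in that article.
    """
    counts = Counter()
    for names in article_entities.values():
        # Deduplicate within an article and sort so each pair has one stable key.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


# Illustrative input: three articles and the entities found in each.
SAMPLE = {
    1: ["GPT-4", "RLHF", "OpenAI"],
    2: ["GPT-4", "RLHF", "GPT-4"],
    3: ["LoRA"],
}
EDGES = cooccurrence_counts(SAMPLE)
```

Each key in `EDGES` is an edge between two entities and each value is an edge weight; pairs that co-occur more often end up with a higher weight.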

Troubleshooting

Low Extraction Accuracy

Symptom: Many entities missed or incorrectly classified.

Solutions:

Use a larger spaCy model: en_core_web_md (~40MB) or en_core_web_lg (several hundred MB), which are typically more accurate
Add domain-specific rules for AI terminology
Re-run extraction with ai-web-feeds nlp entities --force
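Domain-specific rules can be as simple as a term list matched against the article text. The sketch below uses only the standard library. The term list, the fixed confidence of 1.0, and the function name `extract_rule_based` are illustrative assumptions. spaCy also ships an EntityRuler pipeline component that may be a better fit inside an existing spaCy pipeline.

```python
import re

# Hypothetical gazetteer of AI terms and their entity types.
AI_TERMS = {
    "RLHF": "technique",
    "LoRA": "technique",
    "ImageNet": "dataset",
    "WikiText-103": "dataset",
}


def extract_rule_based(text: str, terms: dict = AI_TERMS) -> list:
    """Find exact, case-sensitive term matches that are not part of a longer word."""
    results = []
    for term, entity_type in terms.items():
        pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)")
        for match in pattern.finditer(text):
            results.append({
                "text": term,
                "type": entity_type,
                "confidence": 1.0,                  # assumption: rule hits are fully trusted
                "extraction_method": "rule_based",  # matches the schema's allowed values
                "start": match.start(),
            })
    results.sort(key=lambda r: r["start"])  # report matches in document order
    return results
```

Matches produced this way could be stored with `extraction_method` set to `rule_based`, which the entity_mentions schema already allows.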

Duplicate Entities

Symptom: "Geoffrey Hinton" and "Geoff Hinton" as separate entities.

Solution:

# Reprocess entity records after curation data changes
ai-web-feeds nlp entities --force

spaCy Model Not Found

Symptom: OSError: Can't find model 'en_core_web_sm'

Solution:

uv run python -m spacy download en_core_web_sm
