AI Services & NER - User Guide¶
Version 1.6.x | January 2026¶
Table of Contents¶
- Introduction
- Accessing AI Services Settings
- Configuration Options
- Named Entity Recognition (NER)
- AI Summarization
- Spell Checking
- NER Review Dashboard
- Batch Processing
- Troubleshooting
Introduction¶
The AI Services module provides intelligent automation for archival description:
- Named Entity Recognition (NER): Automatically extract people, organizations, places, and dates
- AI Summarization: Generate summaries from PDF documents
- Spell Checking: Identify spelling errors in metadata
Workflow Overview¶
┌─────────────────────────────────────────────────────────────────┐
│                      AI Services Workflow                       │
└─────────────────────────────────────────────────────────────────┘
         ┌─────────────┐
         │   Record    │
         │   Upload    │
         └──────┬──────┘
                │
                ▼
         ┌─────────────┐
         │   Extract   │──── From metadata fields
         │    Text     │──── From attached PDFs
         └──────┬──────┘
                │
        ┌───────┴───────┐
        ▼               ▼
  ┌───────────┐   ┌───────────┐
  │    NER    │   │ Summarize │
  │ Extraction│   │   (PDF)   │
  └─────┬─────┘   └─────┬─────┘
        │               │
        ▼               ▼
  ┌───────────┐   ┌───────────┐
  │ Entities  │   │  Scope &  │
  │  Stored   │   │  Content  │
  └─────┬─────┘   └───────────┘
        │
        ▼
  ┌───────────┐
  │  Review   │
  │ Dashboard │
  └───────────┘
Accessing AI Services Settings¶
- Log in as an administrator
- Navigate to Admin → AHG Settings → AI Services
- Or go directly to: /admin/ahg-settings/ai-services
Configuration Options¶
API Configuration¶
| Setting | Description | Default |
|---|---|---|
| API URL | AI service endpoint | http://localhost:5004/ai/v1 |
| API Key | Authentication key | - |
| Timeout | Request timeout (seconds) | 60 |
| Processing Mode | Hybrid (direct) or Job (Gearman) | Job |
NER Settings¶
| Setting | Description | Default |
|---|---|---|
| Enable NER | Turn NER on/off | ✓ On |
| Extract from PDFs | Extract text from attached PDFs | ✓ On |
| Auto-extract on Upload | Run NER when records are created | Off |
| Require Review | Manual review before linking | ✓ On |
| Entity Types | Types to extract | PERSON, ORG, GPE, DATE |
Summarization Settings¶
| Setting | Description | Default |
|---|---|---|
| Enable Summarization | Turn on/off | ✓ On |
| Target Field | Where to save summaries | Scope and Content |
| Min Length | Minimum characters | 100 |
| Max Length | Maximum characters | 500 |
Spell Check Settings¶
| Setting | Description | Default |
|---|---|---|
| Enable Spell Check | Turn on/off | Off |
| Language | Dictionary language | en_ZA |
| Fields to Check | Metadata fields | title, scopeAndContent |
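The defaults in the settings tables above can be captured in a small structure for sanity checks before a batch run. This is a minimal sketch, not the module's own code: the class and attribute names are hypothetical, only a subset of settings is shown, and the language list is the one given under Supported Languages.

```python
from dataclasses import dataclass

SUPPORTED_LANGUAGES = {"en_ZA", "en_US", "en_GB", "af_ZA"}
KNOWN_ENTITY_TYPES = {"PERSON", "ORG", "GPE", "DATE"}


@dataclass
class AIServicesSettings:
    # Attribute names are hypothetical; default values come from the tables above.
    api_url: str = "http://localhost:5004/ai/v1"
    api_key: str = ""
    timeout_seconds: int = 60
    processing_mode: str = "job"  # "hybrid" (direct) or "job" (Gearman)
    entity_types: tuple = ("PERSON", "ORG", "GPE", "DATE")
    min_length: int = 100
    max_length: int = 500
    spellcheck_language: str = "en_ZA"

    def problems(self) -> list:
        """Return readable problems; an empty list means the values look sane."""
        found = []
        if not self.api_url.startswith(("http://", "https://")):
            found.append("api_url must be an http(s) URL")
        if self.timeout_seconds <= 0:
            found.append("timeout_seconds must be positive")
        if self.processing_mode not in ("hybrid", "job"):
            found.append("processing_mode must be 'hybrid' or 'job'")
        unknown = set(self.entity_types) - KNOWN_ENTITY_TYPES
        if unknown:
            found.append(f"unknown entity types: {sorted(unknown)}")
        if self.min_length > self.max_length:
            found.append("min_length exceeds max_length")
        if self.spellcheck_language not in SUPPORTED_LANGUAGES:
            found.append(f"unsupported language: {self.spellcheck_language}")
        return found
```

Calling `AIServicesSettings().problems()` on the documented defaults returns an empty list; changing a value to something outside the documented options adds one message per problem.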
Named Entity Recognition (NER)¶
What is NER?¶
NER automatically identifies and classifies named entities in text into predefined categories.
Entity Types¶
| Type | Code | Examples |
|---|---|---|
| Person | PERSON | Nelson Mandela, Cheryl Carolus |
| Organization | ORG | ANC, Department of Education |
| Location | GPE | Johannesburg, South Africa |
| Date | DATE | 18 January 1993, 1994 |
How It Works¶
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Archival   │      │     Text     │      │   NER API    │
│    Record    │─────►│  Extraction  │─────►│  Processing  │
└──────────────┘      └──────────────┘      └──────────────┘
                                                   │
                                                   ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│    Review    │◄─────│   Pending    │◄─────│   Entities   │
│  Dashboard   │      │   Entities   │      │    Stored    │
└──────────────┘      └──────────────┘      └──────────────┘
Text Sources¶
NER extracts text from multiple sources:
- Metadata Fields
  - Title
  - Scope and Content
  - Archival History
- Attached PDFs (when "Extract from PDFs" is enabled)
  - Uses pdftotext for text extraction
  - Limited to first 50,000 characters per document
  - PDFs must contain searchable text (not scanned images)
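How these sources might be combined into one NER input can be sketched as follows. This is an illustration under assumptions, not the module's implementation: the function name, the metadata keys, and the join order are hypothetical, and the PDF text is assumed to have been extracted already (for example with pdftotext). Only the 50,000-character limit comes from this guide.

```python
MAX_PDF_CHARS = 50_000  # per-document limit stated in this guide

# Hypothetical keys for the three metadata fields listed above.
METADATA_FIELDS = ("title", "scope_and_content", "archival_history")


def build_ner_input(metadata: dict, pdf_texts: list, extract_from_pdfs: bool = True) -> str:
    """Join non-empty metadata fields and (optionally) truncated PDF text."""
    parts = [(metadata.get(name) or "").strip() for name in METADATA_FIELDS]
    if extract_from_pdfs:
        # Each document is truncated on its own, before joining.
        parts.extend(text[:MAX_PDF_CHARS].strip() for text in pdf_texts)
    return "\n\n".join(part for part in parts if part)
```

With "Extract from PDFs" disabled, only the metadata fields contribute, which is one reason a record with sparse metadata can yield no entities.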
Viewing Extracted Entities¶
- Navigate to a record's view page
- Look for the Entities section in the sidebar
- Entities are grouped by type (People, Organizations, Places, Dates)
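The grouping shown in the sidebar can be pictured with a short sketch. The function and the input shape (pairs of value and type code) are hypothetical; the codes and group labels are the ones used in this guide.

```python
from collections import defaultdict

# Type codes and sidebar labels as used in this guide.
GROUP_LABELS = {"PERSON": "People", "ORG": "Organizations", "GPE": "Places", "DATE": "Dates"}


def group_entities(entities):
    """Group (value, type_code) pairs by sidebar label, dropping duplicates."""
    grouped = defaultdict(list)
    for value, code in entities:
        label = GROUP_LABELS.get(code)
        if label is None:
            continue  # ignore codes outside the four documented types
        if value not in grouped[label]:
            grouped[label].append(value)
    return dict(grouped)
```

For example, two mentions of "Nelson Mandela" tagged PERSON and one "ANC" tagged ORG produce one entry under People and one under Organizations.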
AI Summarization¶
Overview¶
AI Summarization automatically generates concise summaries from PDF documents and saves them to the specified metadata field (typically Scope and Content).
Workflow¶
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│    Record    │      │   Extract    │      │     Text     │
│   with PDF   │─────►│   PDF Text   │─────►│    >= 200    │
└──────────────┘      └──────────────┘      │    chars?    │
                                            └──────┬───────┘
                                                   │ Yes
                                                   ▼
                                            ┌──────────────┐
                                            │  AI Summary  │
                                            │  Generated   │
                                            └──────┬───────┘
                                                   │
                                                   ▼
                                            ┌──────────────┐
                                            │   Saved to   │
                                            │ Scope&Content│
                                            └──────────────┘
Requirements¶
- PDF must contain searchable text (not scanned images)
- Minimum 200 characters of extractable text
- For scanned documents, run OCR first
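The requirements above amount to a simple pre-check on the extracted text. The sketch below is illustrative only: the function name and return strings are hypothetical, and whether the service counts surrounding whitespace toward the 200 characters is an assumption (here it does not).

```python
MIN_SOURCE_CHARS = 200  # minimum extractable text stated in this guide


def summarization_precheck(pdf_text: str) -> str:
    """Classify extracted PDF text before sending it for summarization."""
    usable = pdf_text.strip()  # assumption: leading/trailing whitespace is not counted
    if not usable:
        return "no searchable text"  # likely a scanned PDF; run OCR first
    if len(usable) < MIN_SOURCE_CHARS:
        return "text too short"
    return "ok"
```

A scanned PDF typically yields an empty string from text extraction, so it falls into the first case rather than the "text too short" case.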
Best Practices¶
- Review generated summaries: AI summaries should be reviewed for accuracy
- OCR scanned documents: Run OCR before summarization for best results
- Check historical documents: Names and places may need verification
Spell Checking¶
Overview¶
Spell checking identifies potential spelling errors in metadata fields using language-specific dictionaries.
Supported Languages¶
| Code | Language |
|---|---|
| en_ZA | English (South Africa) |
| en_US | English (United States) |
| en_GB | English (United Kingdom) |
| af_ZA | Afrikaans |
Fields Checked¶
By default, spell checking runs on:
- Title
- Scope and Content
Additional fields can be configured in settings.
Result Status¶
| Status | Description |
|---|---|
| Pending | Not yet reviewed |
| Reviewed | Corrections applied |
| Ignored | Marked as false positive |
NER Review Dashboard¶
Accessing the Dashboard¶
Navigate to: /ner/review or Admin → NER Review
Dashboard Interface¶
┌─────────────────────────────────────────────────────────────────┐
│                      NER Review Dashboard                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Filter: [All Types ▼] [All Status ▼] [Search...    ] [Apply]    │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│ Entity           │ Type   │ Record        │ Status  │ Actions   │
├──────────────────┼────────┼───────────────┼─────────┼───────────┤
│ Nelson Mandela   │ PERSON │ Letter 1993   │ Pending │ [✓] [✗]   │
│ ANC              │ ORG    │ Minutes 1994  │ Pending │ [✓] [✗]   │
│ Johannesburg     │ GPE    │ Report 1992   │ Approved│ [View]    │
│ 18 Jan 1993      │ DATE   │ Memo 1993     │ Pending │ [✓] [✗]   │
└─────────────────────────────────────────────────────────────────┘
Review Actions¶
| Action | Icon | Description |
|---|---|---|
| Approve | ✓ | Confirm entity is correct |
| Reject | ✗ | Mark entity as incorrect |
| Edit | ✎ | Modify value or type |
| Link | 🔗 | Link to existing authority record |
Bulk Actions¶
- Approve Selected: Approve multiple entities at once
- Reject Selected: Reject multiple entities
- Export: Export entities to CSV for external processing
Linking to Authority Records¶
When approving a PERSON or ORG entity, you can link it to an existing authority record:
- Click Approve on the entity
- Search for an existing authority record
- Select a match or create a new record
- The entity is linked for future reference
Batch Processing¶
Overview¶
For large archives, batch processing via CLI is more efficient than processing records individually.
CLI Commands¶
NER Extraction¶
# Extract from all unprocessed records
php symfony ner:extract --all --limit=1000
# Extract from specific repository
php symfony ner:extract --repository=5 --limit=500
# Extract from single record
php symfony ner:extract --object=12345
# Preview only (dry run)
php symfony ner:extract --all --dry-run --limit=10
# Force PDF extraction regardless of setting
php symfony ner:extract --all --with-pdf --limit=100
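When extraction is driven from a script rather than typed by hand, the command line can be assembled programmatically. The helper below is a hypothetical sketch that uses only the options shown in the examples above; it does not run anything by itself, and combining options in ways not shown above is untested.

```python
import shlex


def build_extract_command(limit=None, repository=None, object_id=None,
                          dry_run=False, with_pdf=False):
    """Build an argument list for `php symfony ner:extract`."""
    cmd = ["php", "symfony", "ner:extract"]
    # Scope: a single record, one repository, or all unprocessed records.
    if object_id is not None:
        cmd.append(f"--object={object_id}")
    elif repository is not None:
        cmd.append(f"--repository={repository}")
    else:
        cmd.append("--all")
    if dry_run:
        cmd.append("--dry-run")
    if with_pdf:
        cmd.append("--with-pdf")
    if limit is not None:
        cmd.append(f"--limit={limit}")
    return cmd


# Print the command for review before running it.
print(shlex.join(build_extract_command(limit=10, dry_run=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the application root; starting with `dry_run=True` and a small limit mirrors the preview command above.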
Summarization¶
# Summarize records with empty scope_and_content
php symfony ner:summarize --all-empty --limit=100
# Summarize specific record
php symfony ner:summarize --object=12345
# Specify different target field
php symfony ner:summarize --all-empty --field=abstract --limit=100
Spell Check¶
# Check all records
php symfony ner:spellcheck --all --limit=100
# Check specific repository
php symfony ner:spellcheck --repository=5 --limit=500
# Use different language
php symfony ner:spellcheck --all --language=af_ZA --limit=100
Running Long Batches¶
For large-scale processing, use screen to run in the background:
# Start a screen session
screen -S batch_ner
# Run the batch
php symfony ner:extract --all --limit=100000
# Detach from screen: Ctrl+A, then D
# Reattach later: screen -r batch_ner
Monitoring Progress¶
# Quick status check
mysql -u root database -e "
SELECT COUNT(*) as processed FROM ahg_ner_extraction;
SELECT COUNT(*) as entities FROM ahg_ner_entity;
"
# Detailed progress
mysql -u root database -e "
SELECT
(SELECT COUNT(*) FROM ahg_ner_extraction) as processed,
(SELECT COUNT(*) FROM ahg_ner_entity) as entities,
(SELECT COUNT(*) FROM information_object WHERE id != 1) -
(SELECT COUNT(*) FROM ahg_ner_extraction) as pending;
"
Troubleshooting¶
Common Issues¶
No Entities Found¶
Symptoms: Records processed but no entities extracted
Possible Causes:
1. Empty metadata fields
2. No PDF attached
3. PDF is image-only (not searchable)
4. "Extract from PDFs" disabled
Solutions:
1. Enable "Extract from PDFs" in settings
2. Ensure PDFs contain searchable text
3. Run OCR on scanned documents first
4. Check record has content in metadata fields
"Text too short" Error¶
Symptoms: Summarization fails with "Text too short"
Cause: Document has fewer than 200 characters of extractable text
Solution: This is normal for brief records - summarization is skipped. No action needed.
API Connection Error¶
Symptoms: "API error: HTTP 0" or timeout errors
Solutions:
1. Verify API URL in settings is correct
2. Check AI service is running: curl http://API_URL/health
3. Verify API key is correct
4. Increase timeout for large documents
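The same health check can be scripted so that a failed connection is reported clearly. This is a sketch under assumptions: the function names are hypothetical, and the health endpoint is assumed to sit directly under the configured API URL, which should be confirmed against the AI service itself.

```python
import urllib.error
import urllib.request


def health_url(api_url: str) -> str:
    """Append /health to the configured API URL (assumed endpoint location)."""
    return api_url.rstrip("/") + "/health"


def check_health(api_url: str, timeout: float = 60) -> str:
    """Return a short status string instead of raising on failure."""
    try:
        with urllib.request.urlopen(health_url(api_url), timeout=timeout) as response:
            return f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        return f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections, and timeouts.
        return f"unreachable: {exc}"
```

An "unreachable" result points to the URL or the service not running; an HTTP error status means the service responded, so the API key or the endpoint path is the more likely cause.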
Elasticsearch Errors¶
Symptoms: Errors when saving records after processing
Solutions:
1. Check that Elasticsearch is running: systemctl status elasticsearch
2. Verify Elastica version matches Elasticsearch version
3. Rebuild index: php symfony search:populate
Summary Not Saving¶
Symptoms: "Processed" reported but no summary in record
Possible Cause: Field name mismatch
Solution: Ensure the summary_field setting is scope_and_content (with underscores)
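Field names appear in two spellings in this guide: scopeAndContent in the Spell Check settings and scope_and_content for summary_field. A small helper can convert the camelCase form to the underscore form before it is compared or saved. The function is a hypothetical sketch; whether the module itself accepts both spellings is not stated here.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert camelCase to snake_case; snake_case input is returned unchanged."""
    # Insert an underscore before each capital that follows a lowercase letter or digit.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()
```

For example, `to_snake_case("scopeAndContent")` gives `scope_and_content`, the value this section expects for summary_field.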
Quick Reference Card¶
URLs¶
| Page | URL |
|---|---|
| AI Settings | /admin/ahg-settings/ai-services |
| NER Review | /ner/review |
CLI Commands¶
# NER
php symfony ner:extract --all --limit=N
php symfony ner:extract --object=ID
# Summarize
php symfony ner:summarize --all-empty --limit=N
php symfony ner:summarize --object=ID
# Spell Check
php symfony ner:spellcheck --all --limit=N
Monitor Progress¶
SELECT COUNT(*) FROM ahg_ner_extraction; -- Processed
SELECT COUNT(*) FROM ahg_ner_entity; -- Entities
Entity Types¶
| Type | Description |
|---|---|
| PERSON | Individual names |
| ORG | Organizations |
| GPE | Places/Locations |
| DATE | Dates and periods |
© The Archive and Heritage Group | January 2026