ahgAIPlugin - Technical Documentation¶
Version: 2.2.0 | Category: AI | Dependencies: atom-framework, ahgCorePlugin
Overview¶
ahgAIPlugin is an AI tools plugin for AtoM. It provides Named Entity Recognition (NER), Translation, Summarization, Spellcheck, Handwriting Text Recognition (HTR), and LLM-powered Description Suggestions, and it merges the previously separate NER and Translation plugins into a single suite.
Key Features¶
| Feature | Description | Backend |
|---|---|---|
| NER | Named Entity Recognition (persons, organizations, places, dates) | Python/spaCy |
| Translate | Offline machine translation | Argos Translate |
| Summarize | AI-powered text summarization | Python |
| Spellcheck | Spelling and grammar checking | Python/aspell |
| HTR | Handwriting Text Recognition | Python/TrOCR |
| Suggest Description | LLM-based scope_and_content generation from OCR/metadata | Ollama/OpenAI/Anthropic |
| Job Queue | Batch processing with progress tracking, throttling, retry | Gearman/Cron |
Architecture¶
+-------------------------------------------------------------------------+
| ahgAIPlugin |
+-------------------------------------------------------------------------+
| |
| +--------------------------------------------------------------------+ |
| | Web Interface Layer | |
| | +--------------+ +---------------+ +--------------------+ | |
| | | AI Module | | Components | | Templates | | |
| | | (actions.php)| | (buttons) | | (review modals) | | |
| | +--------------+ +---------------+ +--------------------+ | |
| +--------------------------------------------------------------------+ |
| | |
| v |
| +--------------------------------------------------------------------+ |
| | Service Layer | |
| | +----------------+ +----------------+ +--------------------+ | |
| | | NerService | | NerRepository | | NerTrainingSync | | |
| | | - extract() | | - save/get | | - pushCorrections | | |
| | | - summarize() | | entities | | - exportToFile | | |
| | +----------------+ +----------------+ +--------------------+ | |
| | | |
| | +--------------------+ +--------------------+ +--------------+ | |
| | | DescriptionService | | LlmService | | PromptService| | |
| | | - generateSugg() | | - getProvider() | | - getTempl() | | |
| | | - approve/reject() | | - complete() | | - buildPrmpt | | |
| | | - gatherContext() | | - encrypt/decrypt | +--------------+ | |
| | +--------------------+ +--------------------+ | |
| | | |
| | +--------------------+ | |
| | | JobQueueService | ← Batch processing orchestrator | |
| | | - createBatch() | | |
| | | - processJob() | | |
| | | - checkServerLoad()| | |
| | +--------------------+ | |
| | | | |
| | +-------------------+-------------------+ | |
| | v v v | |
| | +------------------+ +------------------+ +------------------+ | |
| | | OllamaProvider | | OpenAIProvider | | AnthropicProvider| | |
| | | - complete() | | - complete() | | - complete() | | |
| | | - getModels() | | - getModels() | | - getModels() | | |
| | +------------------+ +------------------+ +------------------+ | |
| +--------------------------------------------------------------------+ |
| | |
| v |
| +--------------------------------------------------------------------+ |
| | Background Job Layer | |
| | +------------------+ +------------------+ | |
| | | arNerExtractJob | | ahgMediaTranscr. | | |
| | | - Gearman job | | - Transcription | | |
| | | - NER + Summarize| | job | | |
| | +------------------+ +------------------+ | |
| +--------------------------------------------------------------------+ |
| | |
| v |
| +--------------------------------------------------------------------+ |
| | CLI Task Layer | |
| | +------------+ +------------+ +------------+ +------------+ | |
| | | ai:ner- | | ai:trans- | | ai:summ- | | ai:spell- | | |
| | | extract | | late | | arize | | check | | |
| | +------------+ +------------+ +------------+ +------------+ | |
| | +------------+ +------------+ +------------+ +-------------------+ | |
| | | ai:install | | ai:uninstl | | ai:ner-sync| | ai:suggest-descr. | | |
| | +------------+ +------------+ +------------+ +-------------------+ | |
| +--------------------------------------------------------------------+ |
| | |
| v |
| +--------------------------------------------------------------------+ |
| | External Services | |
| | +------------------+ +------------------+ +------------------+ | |
| | | Python AI API | | Argos Translate | | LLM Providers | | |
| | | (spaCy NER) | | (offline) | | - Ollama (local) | | |
| | | (Summarization) | +------------------+ | - OpenAI API | | |
| | | (HTR models) | | aspell | | - Anthropic API | | |
| | +------------------+ | (spellcheck) | +------------------+ | |
| | +------------------+ | |
| +--------------------------------------------------------------------+ |
| |
+-------------------------------------------------------------------------+
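The diagram lists `checkServerLoad()` on `JobQueueService` but does not show how throttling is decided. As a minimal sketch of one possible approach, assuming the check compares the 1-minute load average per CPU core against a threshold: the function name `ahgShouldThrottle` and the `0.8` default are hypothetical and not taken from the plugin.

```php
<?php
// Hypothetical helper: decide whether a batch worker should pause,
// based on the 1-minute load average normalised by CPU core count.
function ahgShouldThrottle(float $loadAvg1Min, int $cpuCores, float $maxLoadPerCore = 0.8): bool
{
    if ($cpuCores < 1) {
        throw new InvalidArgumentException('cpuCores must be at least 1');
    }

    return ($loadAvg1Min / $cpuCores) > $maxLoadPerCore;
}

// Usage: sys_getloadavg() is a built-in PHP function on Unix-like systems.
// The core count (4) is supplied by the caller in this sketch.
$load = function_exists('sys_getloadavg') ? sys_getloadavg() : false;
$current = is_array($load) ? (float) $load[0] : 0.0;
echo ahgShouldThrottle($current, 4) ? "throttle\n" : "proceed\n";
```

Keeping the decision in a pure function makes the threshold easy to unit-test without depending on the real machine load.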
Database Schema¶
ERD Diagram¶
+-----------------------------+ +----------------------------+
| ahg_ai_settings | | ahg_ai_usage |
+-----------------------------+ +----------------------------+
| PK id INT AUTO_INCREMENT | | PK id BIGINT UNSIGNED |
| feature VARCHAR(50) | | feature VARCHAR(50) |
| setting_key VARCHAR(100) | | user_id INT |
| setting_value TEXT | | api_key VARCHAR(100) |
| updated_at TIMESTAMP | | endpoint VARCHAR(100) |
+-----------------------------+ | request_size INT |
| response_time_ms INT |
| status_code INT |
| ip_address VARCHAR(45) |
| created_at TIMESTAMP |
+----------------------------+
+--------------------------------+
| ahg_ner_extraction |
+--------------------------------+
| PK id BIGINT UNSIGNED |
| object_id INT |----+
| backend_used VARCHAR(50) | |
| status VARCHAR(50) | |
| entity_count INT | |
| extracted_at TIMESTAMP | |
+--------------------------------+ |
| |
| 1:N |
v |
+--------------------------------+ |
| ahg_ner_entity | |
+--------------------------------+ |
| PK id BIGINT UNSIGNED | |
| FK extraction_id BIGINT |----+
| FK object_id INT |-----> information_object
| entity_type VARCHAR(50) |
| entity_value VARCHAR(500) |
| original_value VARCHAR(500) | (for training)
| original_type VARCHAR(50) | (for training)
| correction_type ENUM | (training feedback)
| training_exported TINYINT |
| confidence DECIMAL(5,4) |
| status VARCHAR(50) |
| linked_actor_id INT |-----> actor / term
| reviewed_by INT |-----> user
| reviewed_at TIMESTAMP |
| created_at TIMESTAMP |
+--------------------------------+
|
| 1:N
v
+--------------------------------+
| ahg_ner_entity_link |
+--------------------------------+
| PK id BIGINT UNSIGNED |
| FK entity_id BIGINT |
| FK actor_id INT |-----> actor
| link_type ENUM |
| confidence DECIMAL(5,4) |
| created_by INT |
| created_at TIMESTAMP |
+--------------------------------+
+--------------------------------+
| ahg_translation_queue |
+--------------------------------+
| PK id BIGINT UNSIGNED |
| object_id INT |-----> information_object
| source_culture VARCHAR(10) |
| target_culture VARCHAR(10) |
| fields TEXT (JSON) |
| status ENUM |
| error_message TEXT |
| created_by INT |
| created_at TIMESTAMP |
| processed_at TIMESTAMP |
+--------------------------------+
+--------------------------------+
| ahg_translation_log |
+--------------------------------+
| PK id BIGINT UNSIGNED |
| object_id INT |-----> information_object
| field_name VARCHAR(100) |
| source_culture VARCHAR(10) |
| target_culture VARCHAR(10) |
| source_text TEXT |
| translated_text TEXT |
| translation_engine VARCHAR |
| created_by INT |
| created_at TIMESTAMP |
+--------------------------------+
+--------------------------------+ +--------------------------------+
| ahg_llm_config | | ahg_prompt_template |
+--------------------------------+ +--------------------------------+
| PK id INT UNSIGNED | | PK id INT UNSIGNED |
| provider VARCHAR(50) | | name VARCHAR(100) |
| name VARCHAR(100) UNIQUE | | slug VARCHAR(100) UNIQUE |
| is_active TINYINT(1) | | system_prompt TEXT |
| is_default TINYINT(1) | | user_prompt_template TEXT |
| endpoint_url VARCHAR(500) | | level_of_description VARCHAR|
| api_key_encrypted TEXT | | repository_id INT |
| model VARCHAR(100) | | is_default TINYINT(1) |
| max_tokens INT | | is_active TINYINT(1) |
| temperature DECIMAL(3,2) | | include_ocr TINYINT(1) |
| timeout_seconds INT | | max_ocr_chars INT |
| created_at TIMESTAMP | | created_at TIMESTAMP |
| updated_at TIMESTAMP | +--------------------------------+
+--------------------------------+
|
| 1:N
v
+--------------------------------+
| ahg_description_suggestion |
+--------------------------------+
| PK id BIGINT UNSIGNED |
| object_id INT |-----> information_object
| suggested_text TEXT |
| existing_text TEXT |
| FK prompt_template_id INT |-----> ahg_prompt_template
| FK llm_config_id INT |-----> ahg_llm_config
| source_data JSON |
| status ENUM | (pending/approved/rejected/edited)
| edited_text TEXT |
| reviewed_by INT |-----> user
| reviewed_at TIMESTAMP |
| review_notes TEXT |
| generation_time_ms INT |
| tokens_used INT |
| model_used VARCHAR(100) |
| created_by INT |
| created_at TIMESTAMP |
| expires_at TIMESTAMP |
+--------------------------------+
SQL Schema¶
-- AI Settings table (replaces ahg_ner_settings)
CREATE TABLE IF NOT EXISTS ahg_ai_settings (
id INT AUTO_INCREMENT PRIMARY KEY,
feature VARCHAR(50) NOT NULL DEFAULT 'general',
setting_key VARCHAR(100) NOT NULL,
setting_value TEXT DEFAULT NULL,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY uk_feature_key (feature, setting_key),
INDEX idx_ai_settings_feature (feature)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- API usage tracking table
CREATE TABLE IF NOT EXISTS ahg_ai_usage (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
feature VARCHAR(50) NOT NULL,
user_id INT DEFAULT NULL,
api_key VARCHAR(100) DEFAULT NULL,
endpoint VARCHAR(100) NOT NULL,
request_size INT DEFAULT 0,
response_time_ms INT DEFAULT NULL,
status_code INT DEFAULT 200,
ip_address VARCHAR(45) DEFAULT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_ai_usage_feature (feature),
INDEX idx_ai_usage_user (user_id),
INDEX idx_ai_usage_date (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- NER Extraction jobs table
CREATE TABLE IF NOT EXISTS ahg_ner_extraction (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
object_id INT NOT NULL,
backend_used VARCHAR(50) DEFAULT 'local',
status VARCHAR(50) DEFAULT 'pending',
entity_count INT DEFAULT 0,
extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_ner_extraction_object (object_id),
INDEX idx_ner_extraction_status (status)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Extracted entities table with review workflow
CREATE TABLE IF NOT EXISTS ahg_ner_entity (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
extraction_id BIGINT UNSIGNED NULL,
object_id INT NOT NULL,
entity_type VARCHAR(50) NOT NULL,
entity_value VARCHAR(500) NOT NULL,
original_value VARCHAR(500) DEFAULT NULL,
original_type VARCHAR(50) DEFAULT NULL,
correction_type ENUM('none', 'value_edit', 'type_change', 'both', 'rejected', 'approved') DEFAULT 'none',
training_exported TINYINT(1) DEFAULT 0,
confidence DECIMAL(5,4) DEFAULT 1.0000,
status VARCHAR(50) DEFAULT 'pending',
linked_actor_id INT DEFAULT NULL,
reviewed_by INT DEFAULT NULL,
reviewed_at TIMESTAMP NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_ner_entity_extraction (extraction_id),
INDEX idx_ner_entity_object (object_id),
INDEX idx_ner_entity_status (status),
INDEX idx_ner_entity_type (entity_type),
INDEX idx_ner_entity_correction (correction_type),
INDEX idx_ner_entity_training (training_exported)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Entity linking to AtoM actors
CREATE TABLE IF NOT EXISTS ahg_ner_entity_link (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
entity_id BIGINT UNSIGNED NOT NULL,
actor_id INT NOT NULL,
link_type ENUM('exact', 'fuzzy', 'manual') DEFAULT 'manual',
confidence DECIMAL(5,4) DEFAULT 1.0000,
created_by INT DEFAULT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_ner_link_entity (entity_id),
INDEX idx_ner_link_actor (actor_id),
FOREIGN KEY (entity_id) REFERENCES ahg_ner_entity(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Translation queue
CREATE TABLE IF NOT EXISTS ahg_translation_queue (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
object_id INT NOT NULL,
source_culture VARCHAR(10) NOT NULL,
target_culture VARCHAR(10) NOT NULL,
fields TEXT NOT NULL COMMENT 'JSON array of fields to translate',
status ENUM('pending', 'processing', 'completed', 'failed') DEFAULT 'pending',
error_message TEXT DEFAULT NULL,
created_by INT DEFAULT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
processed_at TIMESTAMP NULL,
INDEX idx_translation_queue_status (status),
INDEX idx_translation_queue_object (object_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Translation log
CREATE TABLE IF NOT EXISTS ahg_translation_log (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
object_id INT NOT NULL,
field_name VARCHAR(100) NOT NULL,
source_culture VARCHAR(10) NOT NULL,
target_culture VARCHAR(10) NOT NULL,
source_text TEXT DEFAULT NULL,
translated_text TEXT DEFAULT NULL,
translation_engine VARCHAR(50) DEFAULT 'argos',
created_by INT DEFAULT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_translation_log_object (object_id),
INDEX idx_translation_log_cultures (source_culture, target_culture)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- LLM Configuration (Ollama, OpenAI, Anthropic)
CREATE TABLE IF NOT EXISTS ahg_llm_config (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
provider VARCHAR(50) NOT NULL,
name VARCHAR(100) NOT NULL UNIQUE,
is_active TINYINT(1) DEFAULT 1,
is_default TINYINT(1) DEFAULT 0,
endpoint_url VARCHAR(500),
api_key_encrypted TEXT,
model VARCHAR(100) NOT NULL,
max_tokens INT DEFAULT 2000,
temperature DECIMAL(3,2) DEFAULT 0.70,
timeout_seconds INT DEFAULT 120,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_llm_config_provider (provider),
INDEX idx_llm_config_default (is_default)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Prompt Templates for LLM Description Suggestions
CREATE TABLE IF NOT EXISTS ahg_prompt_template (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
slug VARCHAR(100) NOT NULL UNIQUE,
system_prompt TEXT NOT NULL,
user_prompt_template TEXT NOT NULL,
level_of_description VARCHAR(50) DEFAULT NULL,
repository_id INT DEFAULT NULL,
is_default TINYINT(1) DEFAULT 0,
is_active TINYINT(1) DEFAULT 1,
include_ocr TINYINT(1) DEFAULT 1,
max_ocr_chars INT DEFAULT 8000,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_prompt_template_level (level_of_description),
INDEX idx_prompt_template_repo (repository_id),
INDEX idx_prompt_template_default (is_default)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Description Suggestions with Review Workflow
CREATE TABLE IF NOT EXISTS ahg_description_suggestion (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
object_id INT NOT NULL,
suggested_text TEXT NOT NULL,
existing_text TEXT,
prompt_template_id INT UNSIGNED,
llm_config_id INT UNSIGNED,
source_data JSON,
status ENUM('pending','approved','rejected','edited') DEFAULT 'pending',
edited_text TEXT,
reviewed_by INT,
reviewed_at TIMESTAMP NULL,
review_notes TEXT,
generation_time_ms INT,
tokens_used INT,
model_used VARCHAR(100),
created_by INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP NULL,
INDEX idx_suggestion_object (object_id),
INDEX idx_suggestion_status (status),
INDEX idx_suggestion_created (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Default Settings¶
| Feature | Setting Key | Default Value | Description |
|---|---|---|---|
| general | api_url | http://192.168.0.112:5004/ai/v1 | AI API base URL |
| general | api_key | ahg_ai_demo_internal_2026 | API authentication key |
| general | api_timeout | 60 | Request timeout (seconds) |
| ner | enabled | 1 | Enable NER feature |
| ner | auto_link_exact | 0 | Auto-link exact matches |
| ner | confidence_threshold | 0.85 | Minimum confidence score |
| ner | enabled_entity_types | ["PERSON","ORG","GPE","DATE"] | Active entity types |
| summarize | enabled | 1 | Enable summarization |
| summarize | max_length | 1000 | Maximum summary length |
| summarize | min_length | 100 | Minimum summary length |
| summarize | target_field | scope_and_content | Field to populate |
| translate | enabled | 1 | Enable translation |
| translate | engine | argos | Translation engine |
| translate | supported_languages | ["en","af","fr","nl","pt","es","de"] | Available languages |
| spellcheck | enabled | 1 | Enable spellcheck |
| spellcheck | language | en | Default language |
| suggest | enabled | 1 | Enable LLM description suggestions |
| suggest | require_review | 1 | Require custodian review before applying |
| suggest | auto_expire_days | 30 | Auto-expire pending suggestions |
| suggest | default_llm_config | 1 | Default LLM configuration ID |
| suggest | default_template | 1 | Default prompt template ID |
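Because `setting_value` is a TEXT column, values such as `confidence_threshold` and `enabled_entity_types` arrive as strings and need decoding before use. The following is a minimal sketch of how they might be applied to a list of extracted entities. The helper name `ahgFilterEntities`, the entity array keys (`type`, `value`, `confidence`), and the sample data are assumptions made for illustration.

```php
<?php
// Hypothetical helper: keep only entities whose type is enabled and whose
// confidence meets the threshold. Both settings are passed as raw strings,
// the way they would be read from a TEXT column.
function ahgFilterEntities(array $entities, string $enabledTypesJson, string $thresholdValue): array
{
    $enabledTypes = json_decode($enabledTypesJson, true);
    if (!is_array($enabledTypes)) {
        $enabledTypes = []; // malformed JSON: treat as "nothing enabled"
    }
    $threshold = (float) $thresholdValue;

    return array_values(array_filter(
        $entities,
        static function (array $e) use ($enabledTypes, $threshold): bool {
            return in_array($e['type'], $enabledTypes, true)
                && $e['confidence'] >= $threshold;
        }
    ));
}

// Made-up sample entities (assumed shape).
$entities = [
    ['type' => 'PERSON', 'value' => 'Example Person', 'confidence' => 0.97],
    ['type' => 'ORG', 'value' => 'Example Org', 'confidence' => 0.62],
    ['type' => 'MONEY', 'value' => '500 pounds', 'confidence' => 0.99],
];
$kept = ahgFilterEntities($entities, '["PERSON","ORG","GPE","DATE"]', '0.85');
// Only the PERSON row remains: ORG is below the threshold, MONEY is not enabled.
```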
Default LLM Configurations¶
| Provider | Name | Model | Endpoint | Default |
|---|---|---|---|---|
| ollama | Ollama Local | llama3.1:8b | http://localhost:11434 | Yes |
| openai | OpenAI | gpt-4o-mini | https://api.openai.com | No |
| anthropic | Anthropic | claude-3-haiku-20240307 | https://api.anthropic.com | No |
Default Prompt Templates¶
| Name | Slug | Use Case |
|---|---|---|
| Standard Archival | standard-archival | General archival descriptions |
| Item-Level OCR | item-level-ocr | Items with OCR text available |
| Photograph | photograph | Photograph descriptions |
Service Methods¶
ahgNerService¶
class ahgNerService
{
/**
* Extract named entities from text
* @param string $text Input text
* @param bool $clean Clean text before processing
* @return array ['success' => bool, 'entities' => ['PERSON' => [], 'ORG' => [], ...]]
*/
public function extract($text, $clean = true)
/**
* Extract entities from PDF file
* @param string $filePath Path to PDF file
* @return array
*/
public function extractFromPdf($filePath)
/**
* Generate summary from text
* @param string $text Input text
* @param int $maxLength Maximum summary length
* @param int $minLength Minimum summary length
* @return array ['success' => bool, 'summary' => string]
*/
public function summarize($text, $maxLength = 1000, $minLength = 100)
/**
* Generate summary from PDF
* @param string $filePath Path to PDF
* @param int $maxLength Maximum length
* @param int $minLength Minimum length
* @return array
*/
public function summarizeFromPdf($filePath, $maxLength = 1000, $minLength = 100)
/**
* Check API health status
* @return array ['status' => string, 'services' => [...]]
*/
public function health()
/**
* Get API usage statistics
* @return array
*/
public function usage()
/**
* Check if summarizer is available
* @return bool
*/
public function isSummarizerAvailable()
}
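The documented return shape of `extract()` groups entity values by type. A short sketch of turning that structure into flat rows (for example, before saving them) could look like the following. The helper name `ahgFlattenNerResult` is hypothetical, the sample response is invented, and the sketch assumes each entity value is a plain string.

```php
<?php
// Hypothetical helper: flatten ['PERSON' => [...], 'ORG' => [...]] into
// one row per distinct entity value. Assumes values are plain strings.
function ahgFlattenNerResult(array $result): array
{
    if (empty($result['success']) || !isset($result['entities']) || !is_array($result['entities'])) {
        return []; // failed or malformed response
    }

    $rows = [];
    foreach ($result['entities'] as $type => $values) {
        foreach (array_unique((array) $values) as $value) {
            $rows[] = ['entity_type' => (string) $type, 'entity_value' => (string) $value];
        }
    }

    return $rows;
}

// Invented sample in the documented shape.
$sample = [
    'success' => true,
    'entities' => ['PERSON' => ['A. Example', 'A. Example'], 'ORG' => [], 'GPE' => ['Pretoria']],
];
$rows = ahgFlattenNerResult($sample);
// Two rows: the duplicate PERSON value is collapsed and the empty ORG list adds nothing.
```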
NerRepository¶
namespace ahgAIPlugin\Repository;
class NerRepository
{
/**
* Save extracted entities
* @param int $objectId Information object ID
* @param array $entities Entities grouped by type
* @param string $backend Backend used for extraction
* @return int Extraction ID
*/
public function saveExtraction(int $objectId, array $entities, string $backend = 'local'): int
/**
* Get pending entities for an object
* @param int $objectId
* @return array
*/
public function getPendingEntities(int $objectId): array
/**
* Get all entities for an object
* @param int $objectId
* @return array
*/
public function getEntities(int $objectId): array
/**
* Update entity status
* @param int $entityId
* @param string $status
* @param int|null $linkedActorId
* @param int|null $reviewedBy
* @return bool
*/
public function updateEntityStatus(int $entityId, string $status, ?int $linkedActorId = null, ?int $reviewedBy = null): bool
/**
* Find matching actors for entity value
* @param string $entityValue
* @param string $entityType
* @return array ['exact' => [], 'partial' => []]
*/
public function findMatchingActors(string $entityValue, string $entityType): array
/**
* Get extraction history for object
* @param int $objectId
* @return array
*/
public function getExtractionHistory(int $objectId): array
/**
* Get count of pending entities
* @return int
*/
public function getPendingCount(): int
}
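`findMatchingActors()` returns exact and partial candidates, and the `ner.auto_link_exact` setting controls whether exact matches are linked automatically. One conservative way to combine the two is sketched below, assuming each match is an array with an `id` key and that auto-linking should only happen when there is exactly one exact match. Both the helper name `ahgPickAutoLink` and that rule are assumptions, not the plugin's documented behaviour.

```php
<?php
// Hypothetical helper: return the actor ID to auto-link, or null to leave
// the entity for manual review. Assumes matches look like ['id' => int, ...].
function ahgPickAutoLink(array $matches, bool $autoLinkExact): ?int
{
    $exact = $matches['exact'] ?? [];
    if (!$autoLinkExact || count($exact) !== 1) {
        return null; // disabled, no match, or ambiguous: needs a human decision
    }

    $only = reset($exact);

    return isset($only['id']) ? (int) $only['id'] : null;
}
```

Refusing to auto-link when several exact matches exist avoids silently attaching an entity to the wrong authority record.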
NerTrainingSync¶
class NerTrainingSync
{
/**
* Get unexported corrections
* @param int $limit Maximum to retrieve
* @return Collection
*/
public function getUnexportedCorrections($limit = 500)
/**
* Get context around entity in source text
* @param int $objectId
* @param string $entityValue
* @param int $contextLength Characters before/after
* @return array|null ['text' => string, 'start' => int, 'end' => int]
*/
public function getEntityContext($objectId, $entityValue, $contextLength = 200)
/**
* Push corrections to central training server
* @return array ['status' => string, 'exported' => int]
*/
public function pushCorrections()
/**
* Export corrections to local file
* @param string|null $filename Custom filename
* @return array ['status' => string, 'file' => string, 'exported' => int]
*/
public function exportToFile($filename = null)
/**
* Get training statistics
* @return Collection
*/
public function getStats()
}
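`getEntityContext()` is documented as returning a text window plus the entity's start and end offsets. The sketch below illustrates that idea on a string passed in directly; the real method takes an object ID and presumably loads the text itself. The function name `ahgEntityContext`, the case-insensitive first-occurrence lookup, and the choice to make offsets relative to the returned window are all assumptions. It relies on the mbstring extension.

```php
<?php
// Hypothetical sketch: find the first (case-insensitive) occurrence of the
// entity and return a window of $contextLength characters on either side.
function ahgEntityContext(string $text, string $entityValue, int $contextLength = 200): ?array
{
    if ($entityValue === '') {
        return null;
    }
    $pos = mb_stripos($text, $entityValue);
    if ($pos === false) {
        return null; // entity not present in the text
    }

    $entityLength = mb_strlen($entityValue);
    $windowStart = max(0, $pos - $contextLength);
    $windowLength = ($pos - $windowStart) + $entityLength + $contextLength;
    $window = mb_substr($text, $windowStart, $windowLength);

    // 'start' and 'end' are character offsets inside the returned window.
    $start = $pos - $windowStart;

    return ['text' => $window, 'start' => $start, 'end' => $start + $entityLength];
}

$ctx = ahgEntityContext('Letter from Jan Smuts to the Prime Minister, 1919.', 'jan smuts', 5);
// $ctx['text'] is 'from Jan Smuts to t', with the entity at offsets 5 to 14.
```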
CLI Commands¶
ai:install¶
Install plugin database tables.
ai:uninstall¶
Remove plugin (optionally keeping data).
ai:ner-extract¶
Extract named entities from records.
php symfony ai:ner-extract [options]
Options:
--object=ID Process specific object ID
--repository=ID Process all objects in repository
--all Process all unprocessed objects
--uploaded-today Process objects uploaded today
--limit=N Maximum to process (default: 100)
--dry-run Show what would be processed
--queue Queue jobs instead of direct processing
--with-pdf Extract text from PDFs
ai:translate¶
Translate records between languages.
php symfony ai:translate [options]
Options:
--from=LANG Source culture (e.g., en) [required]
--to=LANG Target culture (e.g., af) [required]
--object=ID Translate specific object
--repository=ID Translate all in repository
--fields=LIST Fields to translate (comma-separated)
--limit=N Maximum to translate (default: 100)
--dry-run Show what would be translated
--install-package Install language package if missing
ai:summarize¶
Generate summaries for records.
php symfony ai:summarize [options]
Options:
--object=ID Process specific object ID
--repository=ID Process all in repository
--all-empty Process records with empty summary
--limit=N Maximum to process (default: 100)
--dry-run Show what would be processed
--field=NAME Target field (default: scope_and_content)
ai:spellcheck¶
Check spelling in metadata fields.
php symfony ai:spellcheck [options]
Options:
--object=ID Check specific object ID
--repository=ID Check all in repository
--all Check all objects
--limit=N Maximum to check (default: 100)
--dry-run Show what would be checked
--language=CODE Language code (e.g., en_ZA)
ai:ner-sync¶
Sync NER corrections to training server.
php symfony ai:ner-sync [options]
Options:
--export-file Export to file instead of pushing to server
--stats Show training statistics only
ai:process-pending¶
Process the pending AI extraction queue. Use this as a fallback when Gearman is not available.
php symfony ai:process-pending [options]
Options:
--limit=N Maximum items to process (default: 50)
--task-type=TYPE Task type to process: ner, summarize (default: ner)
--dry-run Preview without processing
Examples:
# Process up to 50 pending NER extractions
php symfony ai:process-pending --limit=50
# Preview what would be processed
php symfony ai:process-pending --dry-run
# Process pending summarization tasks
php symfony ai:process-pending --task-type=summarize
ai:suggest-description¶
Generate AI-powered description suggestions using LLM providers.
php symfony ai:suggest-description [options]
Options:
--object=ID Process specific object ID
--repository=ID Process all objects in repository
--level=LEVEL Filter by level of description (e.g., item, file)
--empty-only Only process records with empty scope_and_content
--with-ocr Only process records that have OCR text
--limit=N Maximum to process (default: 50)
--template=ID Prompt template ID to use
--llm-config=ID LLM configuration ID to use
--dry-run Show what would be processed without generating
--delay=MS Delay between API calls in milliseconds (default: 1000)
Examples:
# Preview what would be processed
php symfony ai:suggest-description --repository=5 --empty-only --dry-run
# Generate suggestions for items with OCR text
php symfony ai:suggest-description --with-ocr --limit=20
# Process specific object
php symfony ai:suggest-description --object=12345
# Use specific template and LLM config
php symfony ai:suggest-description --template=2 --llm-config=1 --limit=10
Cron Job Scheduling¶
Recommended cron entries for automated description suggestions:
# Generate suggestions for records with empty scope_and_content (daily at 2am)
0 2 * * * cd /usr/share/nginx/atom && php symfony ai:suggest-description --empty-only --limit=100 >> /var/log/atom/ai-suggest.log 2>&1
# Generate suggestions for records with OCR text (weekly on Sunday at 3am)
0 3 * * 0 cd /usr/share/nginx/atom && php symfony ai:suggest-description --with-ocr --limit=200 >> /var/log/atom/ai-suggest-ocr.log 2>&1
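# Cleanup expired suggestions (monthly on 1st at 4am)
# Note: --cleanup is not in the ai:suggest-description option list above; verify it against your installed version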
0 4 1 * * cd /usr/share/nginx/atom && php symfony ai:suggest-description --cleanup >> /var/log/atom/ai-suggest-cleanup.log 2>&1
Routes¶
Primary Routes¶
| Route | URL | Action |
|---|---|---|
| ahg_ai_ner_extract | /ai/ner/extract/:id | Extract entities from object |
| ahg_ai_ner_review | /ai/ner/review | Review dashboard |
| ahg_ai_ner_entities | /ai/ner/entities/:id | Get entities for object |
| ahg_ai_ner_bulk_save | /ai/ner/bulk-save | Save multiple entity decisions |
| ahg_ai_summarize | /ai/summarize/:id | Generate summary |
| ahg_ai_translate | /ai/translate/:id | Translate record |
| ahg_ai_htr | /ai/htr/:id | Handwriting text recognition |
| ahg_ai_health | /ai/health | API health check |
| ahg_ai_suggest | /ai/suggest/:id | Generate description suggestion |
| ahg_ai_suggest_review | /ai/suggest/review | Suggestion review dashboard |
| ahg_ai_suggest_view | /ai/suggest/:id/view | View specific suggestion |
| ahg_ai_suggest_decision | /ai/suggest/:id/decision | Approve/reject suggestion |
| ahg_ai_suggest_object | /ai/suggest/object/:id | Get suggestions for object |
| ahg_ai_llm_configs | /ai/llm/configs | List LLM configurations |
| ahg_ai_llm_health | /ai/llm/health | Check LLM provider health |
| ahg_ai_templates | /ai/templates | List prompt templates |
Legacy Routes (Backward Compatibility)¶
| Route | URL | Action |
|---|---|---|
| ahg_ner_extract | /ner/extract/:id | NER extraction |
| ahg_ner_review | /ner/review | Review dashboard |
| ahg_ner_entities | /ner/entities/:id | Get entities |
| ahg_ner_bulk_save | /ner/bulk-save | Bulk save |
| ahg_ner_summarize | /ner/summarize/:id | Summarization |
Action Methods¶
aiActions Class¶
class aiActions extends sfActions
{
// NER Actions
public function executeExtract(sfWebRequest $request) // Extract entities
public function executeReview(sfWebRequest $request) // Review dashboard
public function executeGetEntities(sfWebRequest $request) // Get entities JSON
public function executeUpdateEntity(sfWebRequest $request)// Update single entity
public function executeBulkSave(sfWebRequest $request) // Bulk save decisions
public function executeCreateActor(sfWebRequest $request) // Create actor from entity
public function executeCreatePlace(sfWebRequest $request) // Create place term
public function executeCreateSubject(sfWebRequest $request)// Create subject term
// Summarization
public function executeSummarize(sfWebRequest $request) // Generate summary
// Translation
public function executeTranslate(sfWebRequest $request) // Translate record
public function executeTranslateLanguages(sfWebRequest $request) // Get languages
// HTR
public function executeHtr(sfWebRequest $request) // Handwriting recognition
// Health
public function executeHealth(sfWebRequest $request) // API health check
// LLM Description Suggestions
public function executeSuggest(sfWebRequest $request) // Generate suggestion
public function executeSuggestReview(sfWebRequest $request) // Review dashboard
public function executeSuggestView(sfWebRequest $request) // View suggestion
public function executeSuggestDecision(sfWebRequest $request)// Approve/reject
public function executeSuggestObject(sfWebRequest $request) // Get for object
public function executeSuggestPreview(sfWebRequest $request) // Preview context
public function executeLlmConfigs(sfWebRequest $request) // List LLM configs
public function executeLlmHealth(sfWebRequest $request) // Check LLM health
public function executeTemplates(sfWebRequest $request) // List templates
}
LLM Description Suggestion¶
Overview¶
The LLM Description Suggestion feature uses a large language model (via Ollama, OpenAI, or Anthropic) to generate scope_and_content descriptions from OCR text, metadata, and digital object information. With the default setting (suggest.require_review = 1), every suggestion requires custodian review before it is applied.
Workflow¶
1. User clicks "Suggest Description (AI)" button on record
↓
2. System gathers context (title, identifier, dates, OCR text, etc.)
↓
3. System selects appropriate prompt template (by level/repository)
↓
4. System calls LLM via configured provider (Ollama/OpenAI/Anthropic)
↓
5. Modal displays side-by-side comparison:
[Current Description] [AI Suggestion (editable)]
↓
6. Custodian reviews, optionally edits, adds notes
↓
7. Custodian clicks Approve → saves to scope_and_content
OR clicks Reject → suggestion marked rejected
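Steps 6 and 7 of the workflow above map a reviewer's decision onto the `status` ENUM (`pending`, `approved`, `rejected`, `edited`). A minimal sketch of that mapping follows. It assumes `edited` is used when the custodian approves a modified text; the function name `ahgResolveDecision` and the `apply_text` key are hypothetical.

```php
<?php
// Hypothetical sketch: resolve the final status and the text (if any) that
// should be written to scope_and_content after a review decision.
function ahgResolveDecision(string $decision, string $suggestedText, ?string $editedText = null): array
{
    if ($decision === 'reject') {
        return ['status' => 'rejected', 'apply_text' => null];
    }
    if ($decision !== 'approve') {
        throw new InvalidArgumentException("Unknown decision: {$decision}");
    }

    $edited = $editedText !== null ? trim($editedText) : '';
    if ($edited !== '' && $edited !== trim($suggestedText)) {
        // Assumption: an approval with changes is recorded as 'edited'.
        return ['status' => 'edited', 'apply_text' => $edited];
    }

    return ['status' => 'approved', 'apply_text' => $suggestedText];
}
```

Returning `null` for `apply_text` on rejection makes it explicit that the existing description must be left untouched.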
Batch Selection Methods¶
Batch selection supports three input methods:
- Explicit Object IDs - Direct array of object IDs
- Repository Filter - All objects within a specific repository
- Search Query - Substring (SQL LIKE) match on title, scope and content, and identifier
Search-Based Selection:
// Excerpt from the batch-selection logic: the branch that handles a search query.
// Matching uses SQL LIKE (substring match), not a full-text index.
} elseif (!empty($data['search_query'])) {
    $searchTerm = '%' . trim($data['search_query']) . '%';
    $query = DB::table('information_object')
        ->join('information_object_i18n', 'information_object.id', '=', 'information_object_i18n.id')
        ->where('information_object.id', '!=', 1) // skip the root information object
        ->where('information_object_i18n.culture', '=', 'en') // English rows only
        ->where(function ($q) use ($searchTerm) {
            $q->where('information_object_i18n.title', 'LIKE', $searchTerm)
                ->orWhere('information_object_i18n.scope_and_content', 'LIKE', $searchTerm)
                ->orWhere('information_object.identifier', 'LIKE', $searchTerm);
        });
    // Optional repository filter
    if (!empty($data['repository_id'])) {
        $query->where('information_object.repository_id', '=', (int) $data['repository_id']);
    }
    // Optional level of description filter
    if (!empty($data['level_id'])) {
        $query->where('information_object.level_of_description_id', '=', (int) $data['level_id']);
    }
    $objectIds = $query->limit($limit)->pluck('information_object.id')->toArray();
}
Search-based selection lets users target records for batch AI operations by keyword, optionally narrowed by repository and level of description. As written, the query only considers English (`en`) i18n rows and excludes the root object (ID 1).
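Because the search term is wrapped in `%` and passed to `LIKE`, any `%` or `_` typed by the user acts as a wildcard rather than a literal character. If literal matching is wanted, the input can be escaped first. The helper below is a sketch, not part of the plugin: `ahgLikePattern` is a hypothetical name, and it assumes backslash is the LIKE escape character (the MySQL default).

```php
<?php
// Hypothetical helper: escape LIKE metacharacters so user input is matched
// literally, then wrap it in % for a substring search.
function ahgLikePattern(string $userInput): string
{
    // Backslash must be escaped first so the escapes added for % and _
    // are not themselves doubled.
    $escaped = str_replace(
        ['\\', '%', '_'],
        ['\\\\', '\\%', '\\_'],
        trim($userInput)
    );

    return '%' . $escaped . '%';
}

echo ahgLikePattern('100%_done'), "\n"; // prints %100\%\_done%
```

The result would replace the plain `'%' . trim(...) . '%'` expression when building `$searchTerm`.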
Service Classes¶
LlmProviderInterface¶
Provider contract for all LLM backends.
namespace ahgAIPlugin\Services;
interface LlmProviderInterface
{
public function complete(string $systemPrompt, string $userPrompt, array $options = []): array;
public function isAvailable(): bool;
public function getName(): string;
public function getModels(): array;
public function getHealth(): array;
}
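Since every backend implements the same contract, a stub provider can stand in for a real LLM during tests. The sketch below repeats the interface in the global namespace so that it runs on its own. `AhgEchoProvider` is a hypothetical class, and the keys returned by `complete()` and `getHealth()` are assumptions because the real return shapes are not documented here.

```php
<?php
// Copied from the interface listed above (global namespace for self-containment).
interface LlmProviderInterface
{
    public function complete(string $systemPrompt, string $userPrompt, array $options = []): array;
    public function isAvailable(): bool;
    public function getName(): string;
    public function getModels(): array;
    public function getHealth(): array;
}

// Hypothetical test double: echoes the user prompt instead of calling an LLM.
final class AhgEchoProvider implements LlmProviderInterface
{
    public function complete(string $systemPrompt, string $userPrompt, array $options = []): array
    {
        // Assumed result keys; the plugin's actual shape may differ.
        return [
            'success' => true,
            'text' => 'ECHO: ' . $userPrompt,
            'model' => $options['model'] ?? 'echo-1',
        ];
    }

    public function isAvailable(): bool
    {
        return true;
    }

    public function getName(): string
    {
        return 'echo';
    }

    public function getModels(): array
    {
        return ['echo-1'];
    }

    public function getHealth(): array
    {
        return ['status' => 'ok'];
    }
}

$provider = new AhgEchoProvider();
$result = $provider->complete('You are an archivist.', 'Describe this file.');
```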
LlmService¶
Factory/orchestrator for LLM providers.
class LlmService
{
public function getProvider(?int $configId = null): LlmProviderInterface;
public function complete(string $systemPrompt, string $userPrompt, ?int $configId = null, array $options = []): array;
public function getConfigurations(): array;
public function getDefaultConfig(): ?object;
public static function encryptApiKey(string $key): string;
public static function decryptApiKey(string $encrypted): string;
}
PromptService¶
Template management and variable substitution.
class PromptService
{
public function getTemplateForObject(int $objectId, ?int $templateId = null): ?object;
public function getDefaultTemplate(): ?object;
public function buildPrompt(object $template, array $context): array;
public function getTemplates(): array;
}
Template Variables:
- {title} - Record title
- {identifier} - Reference code/identifier
- {level_of_description} - Level (fonds, series, file, item)
- {date_range} - Date expression
- {creator} - Creator name
- {repository} - Repository name
- {ocr_text} - Full text from OCR
- {existing_metadata} - All available metadata
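To make the placeholder mechanism concrete, here is a minimal sketch in Python (the language of the plugin's AI backend). It is not the PHP PromptService::buildPrompt() implementation: the helper name render_template, and the choices to render missing values as empty strings and to leave unknown placeholders untouched, are assumptions made for illustration.

```python
import re

# Placeholder names taken from the "Template Variables" list above.
TEMPLATE_VARIABLES = {
    "title", "identifier", "level_of_description", "date_range",
    "creator", "repository", "ocr_text", "existing_metadata",
}

_PLACEHOLDER = re.compile(r"\{([a-z_]+)\}")


def render_template(template: str, context: dict) -> str:
    """Replace known {placeholders} with context values (hypothetical helper).

    Known variables missing from the context become empty strings;
    placeholders that are not in TEMPLATE_VARIABLES are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        if name not in TEMPLATE_VARIABLES:
            return match.group(0)
        return str(context.get(name, ""))

    return _PLACEHOLDER.sub(substitute, template)


prompt = render_template(
    "Describe '{title}' ({identifier}), level: {level_of_description}. {unknown}",
    {"title": "Minute books", "identifier": "F1-S2", "level_of_description": "series"},
)
print(prompt)
```

Leaving unknown placeholders in place makes template typos visible in the rendered prompt instead of silently dropping them.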
DescriptionService¶
Main orchestrator for generating and managing suggestions.
class DescriptionService
{
public function generateSuggestion(int $objectId, ?int $templateId = null, ?int $llmConfigId = null): array;
public function gatherContext(int $objectId): array;
public function saveSuggestion(int $objectId, array $result, ?object $template, ?object $config, array $context): int;
public function approveSuggestion(int $suggestionId, int $userId, ?string $editedText = null, ?string $notes = null): bool;
public function rejectSuggestion(int $suggestionId, int $userId, ?string $notes = null): bool;
public function getPendingSuggestions(?int $repositoryId = null, int $limit = 50): array;
public function getSuggestion(int $id): ?object;
}
LLM Providers¶
OllamaProvider¶
Local LLM via Ollama server.
// Configuration
$config = [
'provider' => 'ollama',
'endpoint_url' => 'http://localhost:11434',
'model' => 'llama3.1:8b',
'max_tokens' => 2000,
'temperature' => 0.7
];
Ollama Setup:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start server
ollama serve
# Pull model
ollama pull llama3.1:8b
# Verify
curl http://localhost:11434/api/tags
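The verification step can also be scripted. The sketch below, in Python, reads the /api/tags response and checks whether a given model has been pulled. The helper names are hypothetical, and the payload shape (a "models" array whose entries carry a "name") reflects Ollama's tags response as commonly documented; confirm it against your Ollama version.

```python
import json
import urllib.request


def list_model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", []) if "name" in m]


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """Fetch /api/tags from a running Ollama server (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def has_model(tags_payload: dict, model: str) -> bool:
    """Return True if the named model appears in the tags payload."""
    return model in list_model_names(tags_payload)


# Offline example using a payload shaped like the server's response.
sample = {"models": [{"name": "llama3.1:8b"}, {"name": "mistral:latest"}]}
print(has_model(sample, "llama3.1:8b"))
```

Against a live server, has_model(fetch_tags(), "llama3.1:8b") performs the same check as the curl command above.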
OpenAIProvider¶
OpenAI API integration.
// Configuration
$config = [
'provider' => 'openai',
'endpoint_url' => 'https://api.openai.com',
'api_key_encrypted' => '[encrypted key]',
'model' => 'gpt-4o-mini',
'max_tokens' => 2000,
'temperature' => 0.7
];
AnthropicProvider¶
Anthropic Claude API integration.
// Configuration
$config = [
'provider' => 'anthropic',
'endpoint_url' => 'https://api.anthropic.com',
'api_key_encrypted' => '[encrypted key]',
'model' => 'claude-3-haiku-20240307',
'max_tokens' => 2000,
'temperature' => 0.7
];
OCR Integration¶
OCR text is retrieved from the iiif_ocr_text table (from ahgIiifPlugin):
$ocr = DB::table('iiif_ocr_text')
->where('digital_object_id', $digitalObjectId)
->first();
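// first() returns null when no OCR row exists for this digital object
$ocrText = $ocr !== null ? (string) $ocr->full_text : '';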
API Response Format¶
Suggestion Generation:
{
"success": true,
"suggestion_id": 123,
"suggested_text": "This collection contains...",
"model_used": "llama3.1:8b",
"tokens_used": 450,
"generation_time_ms": 2340,
"context": {
"has_ocr": true,
"ocr_length": 5420,
"fields_used": ["title", "identifier", "date", "ocr_text"]
}
}
Suggestion Decision:
{
"success": true,
"message": "Suggestion approved and applied",
"object_id": 12345,
"applied_text": "The edited description text..."
}
Entity Types and Linking¶
Entity Types¶
| Type | Description | Links To |
|---|---|---|
| PERSON | Individual names | Actor (entityType: PERSON) |
| ORG | Organizations | Actor (entityType: CORPORATE_BODY) |
| GPE | Places/Locations | Term (taxonomy: PLACE) |
| DATE | Dates and periods | Term (taxonomy: SUBJECT) or Event |
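The table can be read as a lookup from entity type to link target. A minimal Python sketch follows; the tuple layout and the function name link_target are illustrative assumptions, not the plugin's PHP code.

```python
# Mapping transcribed from the "Entity Types" table above.
# Each value is (target kind, entity type or taxonomy).
LINK_TARGETS = {
    "PERSON": ("actor", "PERSON"),
    "ORG": ("actor", "CORPORATE_BODY"),
    "GPE": ("term", "PLACE"),
    "DATE": ("term", "SUBJECT"),  # the table also allows linking to an Event
}


def link_target(entity_type: str) -> tuple:
    """Return the (kind, qualifier) pair an entity type links to."""
    try:
        return LINK_TARGETS[entity_type.upper()]
    except KeyError:
        raise ValueError(f"Unsupported entity type: {entity_type!r}") from None


print(link_target("ORG"))
```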
Linking Process¶
Entity Extraction
|
v
+---------------------+
| Pending Status |
| (awaiting review) |
+---------------------+
|
v
+---------------------+ +----------------------+
| Review Action |---->| Create New |
| | | - Actor (PERSON/ORG) |
| | | - Place Term |
| | | - Subject Term |
| | +----------------------+
| | |
| | +----------------------+
| |---->| Link to Existing |
| | | - Actor match |
| | | - Term match |
| | +----------------------+
| | |
| | +----------------------+
| |---->| Approve (no link) |
| | +----------------------+
| | |
| | +----------------------+
| +---->| Reject |
+---------------------+ +----------------------+
|
v
+------------------------+
| Status Updated |
| correction_type set |
| (for training) |
+------------------------+
Background Jobs¶
arNerExtractJob¶
Gearman job for background NER extraction and summarization.
class arNerExtractJob extends arBaseJob
{
protected $extraRequiredParameters = ['objectId'];
/**
* Parameters:
* objectId - Information object to process
* runNer - Run NER extraction (default: true)
* runSummarize - Run summarization (default: false)
* runSpellCheck - Run spell check (default: false)
*/
public function runJob($parameters);
}
Auto-trigger NER on Document Upload¶
Overview¶
The plugin can automatically trigger NER extraction when digital objects are uploaded. This feature listens to the QubitDigitalObject::insert event and queues NER extraction jobs for processable document types.
How It Works¶
1. User uploads digital object (PDF, DOCX, etc.)
↓
2. Symfony event dispatcher fires QubitDigitalObject::insert
↓
3. Plugin checks if auto-trigger is enabled
↓
4. Plugin checks if MIME type is processable
↓
5a. If Gearman available → Queue job to arNerExtractJob
5b. If Gearman unavailable → Add to ahg_ai_pending_extraction table
↓
6. Job processed (immediately by Gearman or later by cron)
Processable MIME Types¶
| MIME Type | Description |
|---|---|
| application/pdf | PDF documents |
| text/plain | Plain text files |
| text/html | HTML documents |
| application/msword | Word documents (.doc) |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents (.docx) |
| application/rtf | Rich text format |
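The decision flow and the MIME type table above can be combined into one routing function. This Python sketch illustrates the logic only; the return labels ("skip", "gearman", "pending_table") and the function name are assumptions, and the plugin itself implements this in PHP.

```python
# MIME types transcribed from the "Processable MIME Types" table above.
PROCESSABLE_MIME_TYPES = frozenset({
    "application/pdf",
    "text/plain",
    "text/html",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/rtf",
})


def route_upload(mime_type, auto_trigger_enabled: bool, gearman_available: bool) -> str:
    """Decide what happens to an uploaded digital object (hypothetical labels)."""
    if not auto_trigger_enabled:
        return "skip"  # step 3: feature disabled
    if (mime_type or "").strip().lower() not in PROCESSABLE_MIME_TYPES:
        return "skip"  # step 4: unsupported document type
    # steps 5a / 5b: prefer Gearman, fall back to the database queue
    return "gearman" if gearman_available else "pending_table"


print(route_upload("application/pdf", True, False))
```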
Configuration¶
Auto-trigger is controlled by the setting auto_extract_on_upload in ahg_ner_settings:
| Setting Key | Value | Description |
|---|---|---|
| auto_extract_on_upload | 1 | Enable auto-trigger on document upload |
| auto_extract_on_upload | 0 | Disable auto-trigger (default) |
To enable via SQL:
INSERT INTO ahg_ner_settings (setting_key, setting_value)
VALUES ('auto_extract_on_upload', '1')
ON DUPLICATE KEY UPDATE setting_value = '1';
To enable via UI:
Navigate to Admin > AHG Settings > AI Services > NER and enable "Auto-extract on upload".
Fallback Queue Tables¶
When Gearman is unavailable, jobs are queued in the database:
ahg_ai_pending_extraction¶
| Column | Type | Description |
|---|---|---|
| id | BIGINT | Primary key |
| object_id | INT | Information object ID |
| digital_object_id | INT | Digital object that triggered extraction |
| task_type | VARCHAR(50) | Task type (ner, summarize) |
| status | ENUM | pending, processing, completed, failed |
| attempt_count | INT | Retry attempt counter |
| error_message | TEXT | Last error message |
| created_at | TIMESTAMP | Queue time |
| processed_at | TIMESTAMP | Completion time |
ahg_ai_auto_trigger_log¶
Audit log for auto-triggered extractions:
| Column | Type | Description |
|---|---|---|
| id | BIGINT | Primary key |
| object_id | INT | Information object ID |
| digital_object_id | INT | Digital object ID |
| task_type | VARCHAR(50) | Task type |
| status | VARCHAR(50) | queued or pending |
| created_at | TIMESTAMP | Event time |
Cron Job for Pending Queue¶
If Gearman is not available, run the pending queue processor via cron:
# Process pending NER extractions every 5 minutes
*/5 * * * * cd /usr/share/nginx/atom && php symfony ai:process-pending --limit=20 >> /var/log/atom/ai-pending.log 2>&1
The ai:process-pending command:
- Fetches pending items from ahg_ai_pending_extraction
- Processes each item using the appropriate service
- Automatically retries failed items up to 3 times
- Marks items as completed or failed after processing
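The behaviour listed above can be sketched as a small in-memory loop. This is an illustration in Python, not the ai:process-pending task itself. It assumes that "up to 3 times" means three attempts in total and that a failed item returns to pending until that limit is reached; the handler argument stands in for the appropriate service call.

```python
MAX_ATTEMPTS = 3  # assumption: three attempts in total


def process_pending(items: list, handler, limit: int = 20) -> int:
    """Process up to `limit` pending items; return how many were attempted.

    Each item is a dict with the status, attempt_count and error_message
    columns of ahg_ai_pending_extraction.
    """
    processed = 0
    for item in items:
        if processed >= limit:
            break
        if item["status"] != "pending":
            continue
        item["status"] = "processing"
        item["attempt_count"] += 1
        try:
            handler(item)
        except Exception as exc:
            item["error_message"] = str(exc)
            exhausted = item["attempt_count"] >= MAX_ATTEMPTS
            item["status"] = "failed" if exhausted else "pending"
        else:
            item["status"] = "completed"
            item["error_message"] = None
        processed += 1
    return processed


queue = [{"object_id": 1, "task_type": "ner", "status": "pending",
          "attempt_count": 0, "error_message": None}]
process_pending(queue, handler=lambda item: None)
print(queue[0]["status"])
```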
Implementation Details¶
Event Hook (in ahgAIPluginConfiguration.class.php):
public function initialize()
{
// ... other initialization ...
// Auto-trigger NER on digital object upload (Issue #19)
$this->dispatcher->connect('QubitDigitalObject::insert', [$this, 'onDigitalObjectInsert']);
}
public function onDigitalObjectInsert(sfEvent $event)
{
$digitalObject = $event->getSubject();
$objectId = $digitalObject->objectId;
if (!$this->isAutoTriggerEnabled()) {
return;
}
if (!$this->isProcessableMimeType($digitalObject->mimeType ?? '')) {
return;
}
$this->queueNerExtraction($objectId, $digitalObject->id);
}
Python Integration¶
Translation (Argos Translate)¶
Location: atom-ahg-python/src/atom_ahg/resources/translation.py
# List installed language packages
python translation.py list
# Install language package
python translation.py install --from=en --to=af
# Translate text
python translation.py translate "Hello world" --from=en --to=af
NER (spaCy)¶
Location: atom-ahg-python/src/atom_ahg/resources/ner.py
Uses spaCy models for entity extraction:
- en_core_web_sm (English small)
- en_core_web_lg (English large, better accuracy)
Summarization¶
Location: atom-ahg-python/src/atom_ahg/resources/summarize.py
Uses transformer-based summarization models.
API Endpoints¶
AI API (Python Backend)¶
| Endpoint | Method | Description |
|---|---|---|
| /ai/v1/health | GET | Health check |
| /ai/v1/ner/extract | POST | Extract entities from text |
| /ai/v1/ner/extract-pdf | POST | Extract entities from PDF |
| /ai/v1/summarize | POST | Summarize text |
| /ai/v1/summarize-pdf | POST | Summarize PDF content |
| /ai/v1/translate | POST | Translate text |
| /ai/v1/translate/languages | GET | List available languages |
| /ai/v1/htr | POST | Handwriting text recognition |
Request/Response Examples¶
NER Extract:
// Request
POST /ai/v1/ner/extract
{
"text": "John Smith met with UNESCO representatives in London on 15 January 2024.",
"clean": true
}
// Response
{
"success": true,
"entities": {
"PERSON": ["John Smith"],
"ORG": ["UNESCO"],
"GPE": ["London"],
"DATE": ["15 January 2024"]
},
"entity_count": 4,
"processing_time_ms": 245
}
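A client typically flattens the grouped entities before saving them for review. The Python sketch below does this for the response shown above. The function name is hypothetical, and treating entity_count as the total number of values is an inference from this one sample.

```python
def flatten_entities(response: dict) -> list:
    """Turn {"PERSON": [...], ...} into a list of (type, value) pairs."""
    if not response.get("success"):
        raise ValueError("NER extraction did not succeed")
    return [
        (entity_type, value)
        for entity_type, values in response.get("entities", {}).items()
        for value in values
    ]


# Response body copied from the example above.
sample_response = {
    "success": True,
    "entities": {
        "PERSON": ["John Smith"],
        "ORG": ["UNESCO"],
        "GPE": ["London"],
        "DATE": ["15 January 2024"],
    },
    "entity_count": 4,
    "processing_time_ms": 245,
}

pairs = flatten_entities(sample_response)
print(pairs)
```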
Summarize:
// Request
POST /ai/v1/summarize
{
"text": "[long document text]",
"max_length": 500,
"min_length": 100
}
// Response
{
"success": true,
"summary": "Concise summary of the document...",
"original_length": 15000,
"summary_length": 450,
"processing_time_ms": 1234
}
Translate:
// Request
POST /ai/v1/translate
{
"text": "Hello world",
"source": "en",
"target": "af"
}
// Response
{
"success": true,
"translated": "Hallo wereld",
"source": "en",
"target": "af"
}
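All three examples are JSON POSTs, so they can share one request builder. This Python sketch uses only the standard library. The base URL and port are placeholders, not the plugin's actual configuration, and the X-API-Key header follows the API Security section of this document.

```python
import json
import urllib.request


def build_request(base_url: str, path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an AI API endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (needs a running backend)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Placeholder base URL and key, for illustration only.
req = build_request(
    "http://localhost:5004",
    "/ai/v1/translate",
    {"text": "Hello world", "source": "en", "target": "af"},
    api_key="example-key",
)
print(req.get_method(), req.full_url)
```

The 60-second default in send() mirrors the default API timeout listed under Performance Considerations; PDF and HTR calls would pass timeout=120.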
Training Feedback System¶
Correction Types¶
| Type | Description | Used When |
|---|---|---|
| none | No correction made | Entity accepted as-is and linked |
| value_edit | Value was corrected | User edited entity text |
| type_change | Type was changed | User changed entity type |
| both | Value and type changed | Both edited |
| approved | Approved without link | Marked correct but not linked |
| rejected | Marked incorrect | Entity rejected as not valid |
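One way to derive the correction type from a review decision is sketched below in Python. The decision labels ("link", "approve", "reject") and the precedence rules (rejection wins, edits take priority over the approved/none distinction) are assumptions; the table above does not spell out how combinations are resolved.

```python
def classify_correction(original_value: str, original_type: str,
                        final_value: str, final_type: str,
                        decision: str) -> str:
    """Map a review outcome to a correction_type from the table above.

    decision is one of "link", "approve" or "reject" (hypothetical labels).
    """
    if decision == "reject":
        return "rejected"
    value_changed = original_value != final_value
    type_changed = original_type != final_type
    if value_changed and type_changed:
        return "both"
    if value_changed:
        return "value_edit"
    if type_changed:
        return "type_change"
    # Nothing edited: distinguish "approved without link" from "linked as-is".
    return "approved" if decision == "approve" else "none"


print(classify_correction("J Smith", "PERSON", "John Smith", "PERSON", "link"))
```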
Export Format¶
{
"site_id": "abc123...",
"site_name": "My Archive",
"exported_at": "2026-01-30T14:32:15+00:00",
"total_corrections": 45,
"corrections": [
{
"entity_id": 123,
"original_value": "J Smith",
"corrected_value": "John Smith",
"original_type": "PERSON",
"corrected_type": "PERSON",
"correction_type": "value_edit",
"confidence": 0.85,
"context": {
"text": "...letter from J Smith regarding...",
"start": 12,
"end": 20
},
"reviewed_at": "2026-01-30T10:15:00"
}
]
}
File Structure¶
ahgAIPlugin/
+-- config/
| +-- ahgAIPluginConfiguration.class.php # Plugin configuration
| +-- routing.yml # Route definitions
| +-- settings.yml # Module settings
+-- database/
| +-- install.sql # Database schema
+-- lib/
| +-- job/
| | +-- arNerExtractJob.class.php # Background job
| | +-- ahgMediaTranscriptionJob.class.php
| +-- repository/
| | +-- NerRepository.php # Data access layer
| +-- Services/
| | +-- NerService.php # Core NER service
| | +-- ahgFaceDetectionService.php # Face detection
| | +-- LlmProviderInterface.php # LLM provider contract
| | +-- LlmService.php # LLM factory/orchestrator
| | +-- PromptService.php # Prompt template management
| | +-- DescriptionService.php # Description suggestion orchestrator
| | +-- providers/
| | +-- OllamaProvider.php # Ollama local LLM
| | +-- OpenAIProvider.php # OpenAI API
| | +-- AnthropicProvider.php # Anthropic API
| +-- task/
| | +-- aiInstallTask.class.php # Install CLI
| | +-- aiUninstallTask.class.php # Uninstall CLI
| | +-- aiNerExtractTask.class.php # NER extract CLI
| | +-- aiTranslateTask.class.php # Translate CLI
| | +-- aiSummarizeTask.class.php # Summarize CLI
| | +-- aiSpellcheckTask.class.php # Spellcheck CLI
| | +-- aiNerSyncTask.class.php # Training sync CLI
| | +-- aiSuggestDescriptionTask.class.php # Description suggestion CLI
| +-- NerTrainingSync.class.php # Training data sync
+-- modules/
| +-- ai/
| +-- actions/
| | +-- actions.class.php # Controller actions
| | +-- components.class.php # View components
| +-- config/
| | +-- module.yml
| | +-- security.yml
| +-- templates/
| +-- _aiTools.php # AI tools sidebar
| +-- _extractButton.php # Extract button
| +-- _summarizeButton.php # Summarize button
| +-- _suggestButton.php # Suggest description button
| +-- reviewSuccess.php # NER review dashboard
| +-- suggestReviewSuccess.php # Suggestion review dashboard
+-- extension.json # Plugin metadata
Dependencies¶
PHP Dependencies¶
- atom-framework (Laravel Query Builder)
- ahgCorePlugin (AhgDb initialization)
- php-curl (for LLM API calls)
- openssl (for API key encryption)
System Dependencies¶
- pdftotext (poppler-utils) - PDF text extraction
- aspell - Spellchecking
- python3 - AI backend
Python Dependencies¶
- argostranslate>=1.9.0 - Offline translation
- spacy - NER extraction
- transformers - Summarization models
LLM Providers (Optional - at least one required for suggestions)¶
- Ollama (recommended for local/privacy)
- Install: curl -fsSL https://ollama.com/install.sh | sh
- Models: llama3.1:8b, llama3.1:70b, mistral, mixtral
- OpenAI API (cloud)
- Requires API key from https://platform.openai.com
- Models: gpt-4o-mini, gpt-4o, gpt-4-turbo
- Anthropic API (cloud)
- Requires API key from https://console.anthropic.com
- Models: claude-3-haiku, claude-3-sonnet, claude-3-opus
Security¶
Access Control¶
- All actions require authentication
- Review actions require editor role or higher
- Settings modifications require administrator role
API Security¶
- API key authentication via X-API-Key header
- Rate limiting configured at API level
- Request/response logging for audit
Performance Considerations¶
Batch Processing¶
- CLI commands support --limit option
- Background jobs for large volumes
- Batch entity saving (3 at a time in UI)
Caching¶
- Settings cached in session
- API responses not cached (real-time results)
Timeouts¶
- Default API timeout: 60 seconds
- PDF processing timeout: 120 seconds
- HTR processing timeout: 120 seconds
Migration from Separate Plugins¶
From arNerPlugin¶
-- Settings migration (automatic in install.sql)
INSERT IGNORE INTO ahg_ai_settings (feature, setting_key, setting_value)
SELECT 'ner', setting_key, setting_value
FROM ahg_ner_settings
WHERE setting_key NOT IN ('api_url', 'api_key', 'api_timeout');
Route Compatibility¶
Legacy routes (/ner/) are maintained for backward compatibility and redirect to new routes (/ai/ner/).
AI Job Queue¶
Overview¶
The Job Queue system enables batch processing of AI tasks (NER, Summarize, Suggest, Translate, Spellcheck, OCR) on multiple records with progress tracking, throttling, and retry capability.
Database Tables¶
ahg_ai_batch¶
Batch job container.
| Column | Type | Description |
|---|---|---|
| id | BIGINT | Primary key |
| name | VARCHAR(255) | Batch name |
| description | TEXT | Optional description |
| task_types | JSON | Array of task types to run |
| status | ENUM | pending, running, paused, completed, failed, cancelled |
| priority | TINYINT | 1-10, lower is higher priority |
| total_items | INT | Total jobs in batch |
| completed_items | INT | Successfully completed jobs |
| failed_items | INT | Failed jobs |
| progress_percent | DECIMAL(5,2) | Calculated progress |
| max_concurrent | INT | Max parallel jobs (default: 5) |
| delay_between_ms | INT | Delay between jobs (default: 1000) |
| max_retries | INT | Retry attempts (default: 3) |
| scheduled_at | TIMESTAMP | Optional scheduled start |
| started_at | TIMESTAMP | Actual start time |
| completed_at | TIMESTAMP | Completion time |
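The progress_percent column is described only as "calculated". A plausible formula, sketched in Python, counts both completed and failed jobs as processed and rounds to two decimals to fit DECIMAL(5,2). This is an assumption, not the confirmed behaviour of updateBatchProgress().

```python
def progress_percent(total_items: int, completed_items: int, failed_items: int) -> float:
    """Share of jobs that have finished (successfully or not), as a percentage."""
    if total_items <= 0:
        return 0.0  # empty batch: avoid division by zero
    done = min(completed_items + failed_items, total_items)
    return round(done / total_items * 100, 2)


print(progress_percent(200, 45, 5))
```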
ahg_ai_job¶
Individual job items.
| Column | Type | Description |
|---|---|---|
| id | BIGINT | Primary key |
| batch_id | BIGINT | Foreign key to batch |
| object_id | INT | Information object ID |
| task_type | VARCHAR(50) | ner, summarize, suggest, translate, spellcheck, ocr |
| status | ENUM | pending, queued, running, completed, failed, skipped |
| gearman_handle | VARCHAR(255) | Gearman job handle |
| attempt_count | INT | Current retry count |
| result_data | JSON | Task results |
| error_message | TEXT | Error details if failed |
| processing_time_ms | INT | Execution time |
ahg_ai_job_log¶
Event log for auditing.
| Column | Type | Description |
|---|---|---|
| id | BIGINT | Primary key |
| batch_id | BIGINT | Batch reference |
| job_id | BIGINT | Job reference (optional) |
| event_type | VARCHAR(50) | Event type identifier |
| message | TEXT | Human-readable message |
| details | JSON | Additional context |
| created_at | TIMESTAMP | Event timestamp |
Service: JobQueueService¶
Path: lib/Services/JobQueueService.php
Task Types¶
JobQueueService::TASK_NER // Named entity extraction
JobQueueService::TASK_SUMMARIZE // Text summarization
JobQueueService::TASK_SUGGEST // LLM description suggestion
JobQueueService::TASK_TRANSLATE // Machine translation
JobQueueService::TASK_SPELLCHECK // Spelling check
JobQueueService::TASK_OCR // OCR text extraction
Key Methods¶
// Batch management
createBatch(array $data): int
addItemsToBatch(int $batchId, array $objectIds, array $taskTypes): int
startBatch(int $batchId): bool
pauseBatch(int $batchId): bool
resumeBatch(int $batchId): bool
cancelBatch(int $batchId): bool
deleteBatch(int $batchId): bool
// Job processing
queueJob(int $jobId): bool // Queue to Gearman
processJob(int $jobId): array
retryFailed(int $batchId): int
// Progress tracking
getBatch(int $batchId): ?object
getBatches(array $filters, int $limit): array
getBatchStats(int $batchId): array
getBatchJobs(int $batchId, array $filters, int $limit): array
updateBatchProgress(int $batchId, bool $hasFailed): void
// Server load protection
checkServerLoad(): bool // Checks CPU < 80%
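How checkServerLoad() measures CPU is not documented here. One common approach, sketched in Python, uses the 1-minute load average divided by the core count as a proxy for CPU utilisation. That proxy and the helper names are assumptions; only the 80% threshold comes from the listing above.

```python
import os


def cpu_load_ok(load_1min: float, cpu_count: int, threshold: float = 0.80) -> bool:
    """Return True when the normalised load is below the threshold."""
    return (load_1min / max(cpu_count, 1)) < threshold


def check_server_load() -> bool:
    """Read the current load average (Unix only) and apply the threshold."""
    load_1min, _, _ = os.getloadavg()
    return cpu_load_ok(load_1min, os.cpu_count() or 1)


# Pure check with explicit numbers: load 3.0 on 4 cores is 75% of capacity.
print(cpu_load_ok(3.0, 4))
```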
Gearman Worker¶
Path: lib/job/arAiBatchJob.class.php
The Gearman worker processes jobs asynchronously:
class arAiBatchJob extends arBaseJob
{
protected $extraRequiredParameters = ['jobId'];
public function runJob($parameters)
{
$service = new JobQueueService();
// Check server load before processing
if (!$service->checkServerLoad()) {
sleep(10); // Back off under high load
}
return $service->processJob($parameters['jobId']);
}
}
Web Routes¶
| Route | Method | Action | Description |
|---|---|---|---|
| /ai/batch | GET | batch | Job queue dashboard |
| /ai/batch/create | POST | batchCreate | Create new batch |
| /ai/batch/:id | GET | batchView | View batch details |
| /ai/batch/:id/progress | GET | batchProgress | AJAX progress update |
| /ai/batch/:id/action | POST | batchAction | Start/pause/resume/cancel/retry |
| /ai/batch/:id/process | POST | batchProcess | Process next jobs (cron) |
| /ai/job/:id | GET | jobView | View job details |
UI Components¶
Batch Dashboard (batchSuccess.php)¶
- Stats cards (pending, running, completed, failed)
- Batch list with progress bars
- Create Batch modal
- Task type selection (checkboxes)
- Object selection (by repository or IDs)
- Advanced options (concurrency, delay, retries)
- Auto-start option
Batch View (batchViewSuccess.php)¶
- Progress bar with real-time updates
- Stats breakdown (total, pending, running, completed, failed)
- Job list table with filtering
- Action buttons (start, pause, resume, cancel, retry)
- Activity log
CLI Command¶
php symfony ai:batch [options]
Options:
--create Create a new batch
--name="Batch Name" Batch name (required with --create)
--tasks=ner,summarize Comma-separated task types
--repository=ID Process objects from repository
--object-ids=1,2,3 Specific object IDs
--limit=100 Maximum objects
--start Auto-start after creation
--status Show batch status
--process Process pending jobs (cron mode)
Workflow¶
1. User creates batch via UI or CLI
↓
2. System adds job records for each object × task type
↓
3. User starts batch (or auto-start)
↓
4. Jobs queued to Gearman (or processed by cron)
↓
5. Worker checks server load
↓
6. Worker executes task via appropriate service
↓
7. Progress updated, batch checked for completion
↓
8. Failed jobs can be retried
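Step 2 creates one job per combination of object and task type. The Python sketch below shows that expansion as a Cartesian product. The row fields mirror columns of ahg_ai_job, and the function name is hypothetical.

```python
from itertools import product


def expand_jobs(batch_id: int, object_ids: list, task_types: list) -> list:
    """Build one pending job row per (object, task type) pair."""
    return [
        {
            "batch_id": batch_id,
            "object_id": object_id,
            "task_type": task_type,
            "status": "pending",
            "attempt_count": 0,
        }
        for object_id, task_type in product(object_ids, task_types)
    ]


jobs = expand_jobs(1, [10, 11, 12], ["ner", "summarize"])
print(len(jobs))
```

Three objects with two task types give six jobs, which is the value total_items would hold for that batch.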
Resource Protection¶
- Max Concurrent: Limits parallel jobs (default: 5)
- Delay Between Jobs: Prevents server overload (default: 1000ms)
- Server Load Check: Backs off when CPU exceeds 80% (the Gearman worker sleeps 10 seconds before processing)
- Timeouts: Individual job timeout protection
- Max Retries: Automatic retry with exponential backoff
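Exponential backoff means the wait doubles with each retry. A minimal Python sketch follows; the 1000 ms base and 60-second cap are illustrative assumptions, since the document does not state the backoff parameters.

```python
def retry_delay_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 60000) -> int:
    """Delay before retry number `attempt` (1-based), doubling each time."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(base_ms * 2 ** (attempt - 1), cap_ms)


# Delays for the default max_retries of 3.
print([retry_delay_ms(n) for n in (1, 2, 3)])
```

The cap keeps a long retry chain from producing unbounded waits.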
Example: Create Batch via API¶
fetch('/ai/batch/create', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
name: 'NER for Repository X',
task_types: ['ner', 'summarize'],
repository_id: 123,
limit: 500,
max_concurrent: 3,
delay_between_ms: 2000,
auto_start: true
})
})
.then(r => r.json())
.then(data => {
if (data.success) {
console.log('Batch created:', data.batch_id);
window.location.href = '/ai/batch/' + data.batch_id;
}
});
Part of the AtoM AHG Framework