ahgPreservationPlugin - Technical Documentation¶
Overview¶
The ahgPreservationPlugin provides comprehensive digital preservation capabilities for AtoM, implementing OAIS (Open Archival Information System) and PREMIS (Preservation Metadata Implementation Strategies) standards. The plugin includes PRONOM-based format identification (using Siegfried), virus scanning, format conversion, backup verification, and replication features.
Version: 1.5.0 Category: Preservation Dependencies: atom >= 2.8.0, PHP >= 8.1
Optional Dependencies: - Siegfried (PRONOM format identification) - ClamAV (virus scanning) - ImageMagick (image conversion) - FFmpeg (audio/video conversion) - LibreOffice (document conversion) - Ghostscript (PDF processing)
Architecture¶
Component Diagram¶
+-------------------------------------------------------------------------+
| ahgPreservationPlugin |
+-------------------------------------------------------------------------+
| |
| +-------------------------------------------------------------------+ |
| | PRESENTATION LAYER | |
| +-------------------------------------------------------------------+ |
| | +----------+ +----------+ +----------+ +----------+ +----------+ | |
| | |Dashboard | |Identific | |VirusScan | |Conversion| | Backup | | |
| | | Template | | Template | | Template | | Template | | Template | | |
| | +----------+ +----------+ +----------+ +----------+ +----------+ | |
| +-------------------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------------------+ |
| | CONTROLLER LAYER | |
| +-------------------------------------------------------------------+ |
| | actions.class.php | |
| | +------------+ +------------+ +------------+ +------------+ | |
| | |executeIndex| |executeVirus| |executeConv | |executeBackup| | |
| | +------------+ +------------+ +------------+ +------------+ | |
| +-------------------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------------------+ |
| | SERVICE LAYER | |
| +-------------------------------------------------------------------+ |
| | PreservationService.php | |
| | +---------------------------------------------------------------+ | |
| | | Checksum Operations | Fixity Verification | | | |
| | | generateChecksums() | verifyFixity() | | | |
| | | getChecksums() | runBatchFixityCheck() | | | |
| | +---------------------------------------------------------------+ | |
| | | Format Identification | Virus Scanning | | | |
| | | identifyFormat() | scanForVirus() | | | |
| | | runBatchIdentification | runBatchVirusScan() | | | |
| | | isSiegfriedAvailable() | isClamAvAvailable() | | | |
| | +---------------------------------------------------------------+ | |
| | | Format Conversion | Backup & Replication | | | |
| | | convertFormat() | verifyBackup() | | | |
| | | getConversionTools() | verifyAllBackups() | | | |
| | | selectConversionTool() | getStatistics() | | | |
| | +---------------------------------------------------------------+ | |
| +-------------------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------------------+ |
| | CLI TASKS | |
| +-------------------------------------------------------------------+ |
| | preservationIdentifyTask | preservationFixityTask | |
| | preservationVirusScanTask | preservationConvertTask | |
| | preservationVerifyBackupTask| preservationReplicateTask | |
| +-------------------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------------------+ |
| | DATA LAYER | |
| +-------------------------------------------------------------------+ |
| | Illuminate\Database\Capsule\Manager (Laravel QB) | |
| +-------------------------------------------------------------------+ |
| |
+-------------------------------------------------------------------------+
Database Schema¶
Entity Relationship Diagram¶
+-------------------------------------------------------------------------+
| PRESERVATION DATABASE SCHEMA |
+-------------------------------------------------------------------------+
+----------------------+ +----------------------+
| digital_object | | information_object |
| (AtoM Core Table) | | (AtoM Core Table) |
+----------------------+ +----------------------+
| PK id | | PK id |
| name | | ... |
| path | +----------------------+
| byte_size | ^
| object_id --------+--------------------|
+----------------------+
|
| 1:N
v
+----------------------+ +----------------------+
| preservation_checksum| |preservation_fixity_ |
+----------------------+ | check |
| PK id | +----------------------+
| FK digital_object_id |<----| PK id |
| algorithm | | FK digital_object_id |
| checksum_value | | FK checksum_id |
| file_size | | algorithm |
| generated_at | | expected_value |
| verified_at | | actual_value |
| verification_ | | status |
| status | | error_message |
| created_at | | checked_at |
| updated_at | | checked_by |
+----------------------+ | duration_ms |
| created_at |
+----------------------+
+----------------------+ +----------------------+
| preservation_virus_ | |preservation_format_ |
| scan | | conversion |
+----------------------+ +----------------------+
| PK id | | PK id |
| FK digital_object_id | | FK digital_object_id |
| scan_engine | | source_format |
| engine_version | | source_mime_type |
| signature_version | | target_format |
| status | | target_mime_type |
| threat_name | | conversion_tool |
| file_path | | tool_version |
| file_size | | status |
| scanned_at | | source_path |
| scanned_by | | source_size |
| duration_ms | | source_checksum |
| error_message | | output_path |
| quarantined | | output_size |
| quarantine_path | | output_checksum |
| created_at | | conversion_options|
+----------------------+ | started_at |
| completed_at |
| duration_ms |
| error_message |
| created_by |
| created_at |
+----------------------+
+----------------------+ +----------------------+
|preservation_backup_ | |preservation_replica- |
| verification | | tion_target |
+----------------------+ +----------------------+
| PK id | | PK id |
| backup_type | | name |
| backup_path | | target_type |
| backup_size | | connection_config |
| original_checksum | | description |
| verified_checksum | | is_active |
| status | | last_sync_at |
| verification_ | | created_at |
| method | | updated_at |
| files_checked | +----------------------+
| files_valid | |
| files_invalid | | 1:N
| files_missing | v
| verified_at | +----------------------+
| verified_by | |preservation_replica- |
| duration_ms | | tion_log |
| error_message | +----------------------+
| details (JSON) | | PK id |
| created_at | | FK target_id |
+----------------------+ | FK digital_object_id |
| operation |
+----------------------+ | file_path |
| preservation_event | | file_size |
+----------------------+ | checksum |
| PK id | | status |
| FK digital_object_id | | error_message |
| FK information_ | | started_at |
| object_id | | completed_at |
| event_type | | duration_ms |
| event_datetime | | created_at |
| event_detail | +----------------------+
| event_outcome |
| event_outcome_ |
| detail |
| linking_agent_ |
| type |
| linking_agent_ |
| value |
| created_at |
+----------------------+
+----------------------+ +----------------------+
| preservation_format | |preservation_object_ |
+----------------------+ | format |
| PK id | +----------------------+
| puid |<----| PK id |
| mime_type | | FK digital_object_id |
| format_name | | FK format_id |
| format_version | | puid |
| extension | | mime_type |
| risk_level | | format_name |
| risk_notes | | format_version |
| preservation_ | | identification_ |
| action | | tool |
| FK migration_ | | identification_ |
| target_id | | date |
| is_preservation_ | | confidence |
| format | | basis |
| created_at | | warning |
| updated_at | | created_at |
+----------------------+ +----------------------+
Database Tables¶
preservation_virus_scan¶
Records virus scan results for digital objects.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
digital_object_id |
INT FK | Reference to digital_object |
scan_engine |
VARCHAR(50) | Scanner used (e.g., 'clamav') |
engine_version |
VARCHAR(50) | Scanner version |
signature_version |
VARCHAR(50) | Virus signature version |
status |
ENUM | clean, infected, error, skipped |
threat_name |
VARCHAR(255) | Name of detected threat |
file_path |
VARCHAR(1024) | Path to scanned file |
file_size |
BIGINT UNSIGNED | File size in bytes |
scanned_at |
DATETIME | When scan was performed |
scanned_by |
VARCHAR(100) | User or 'system'/'cron' |
duration_ms |
INT UNSIGNED | Scan duration in milliseconds |
error_message |
TEXT | Error details if failed |
quarantined |
TINYINT(1) | Whether file was quarantined |
quarantine_path |
VARCHAR(1024) | Path in quarantine |
preservation_format_conversion¶
Records format conversion operations.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
digital_object_id |
INT FK | Reference to digital_object |
source_format |
VARCHAR(50) | Original file extension |
source_mime_type |
VARCHAR(100) | Original MIME type |
target_format |
VARCHAR(50) | Target file extension |
target_mime_type |
VARCHAR(100) | Target MIME type |
conversion_tool |
VARCHAR(100) | Tool used (imagemagick, ffmpeg, etc.) |
tool_version |
VARCHAR(50) | Tool version |
status |
ENUM | pending, processing, completed, failed |
source_path |
VARCHAR(1024) | Path to source file |
source_size |
BIGINT UNSIGNED | Source file size |
source_checksum |
VARCHAR(128) | Source file SHA-256 |
output_path |
VARCHAR(1024) | Path to converted file |
output_size |
BIGINT UNSIGNED | Converted file size |
output_checksum |
VARCHAR(128) | Converted file SHA-256 |
conversion_options |
JSON | Options used for conversion |
started_at |
DATETIME | Conversion start time |
completed_at |
DATETIME | Conversion completion time |
duration_ms |
INT UNSIGNED | Conversion duration |
error_message |
TEXT | Error details if failed |
created_by |
VARCHAR(100) | User who initiated |
preservation_backup_verification¶
Records backup verification results.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
backup_type |
ENUM | database, files, full |
backup_path |
VARCHAR(1024) | Path to backup |
backup_size |
BIGINT UNSIGNED | Backup size in bytes |
original_checksum |
VARCHAR(128) | Expected checksum |
verified_checksum |
VARCHAR(128) | Calculated checksum |
status |
ENUM | valid, invalid, corrupted, missing, warning, error |
verification_method |
VARCHAR(50) | Method used (sha256, etc.) |
files_checked |
INT UNSIGNED | Number of files verified |
files_valid |
INT UNSIGNED | Files that passed |
files_invalid |
INT UNSIGNED | Files that failed |
files_missing |
INT UNSIGNED | Missing files |
verified_at |
DATETIME | When verification ran |
verified_by |
VARCHAR(100) | User or 'system' |
duration_ms |
INT UNSIGNED | Verification duration |
error_message |
TEXT | Error details |
details |
JSON | Additional verification details |
preservation_replication_target¶
Defines backup replication destinations.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
name |
VARCHAR(255) | Target name |
target_type |
ENUM | local, sftp, rsync, s3 |
connection_config |
JSON | Connection parameters |
description |
TEXT | Target description |
is_active |
TINYINT(1) | Whether target is enabled |
last_sync_at |
DATETIME | Last successful sync |
created_at |
DATETIME | Record creation time |
updated_at |
DATETIME | Last update time |
Connection Config JSON Structure:
Local:
SFTP/Rsync:
S3:
preservation_replication_log¶
Records replication operations.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
target_id |
BIGINT FK | Reference to replication_target |
digital_object_id |
INT FK | Reference to digital_object (nullable) |
operation |
ENUM | upload, verify, delete |
file_path |
VARCHAR(1024) | Source file path |
file_size |
BIGINT UNSIGNED | File size |
checksum |
VARCHAR(128) | File checksum |
status |
ENUM | started, completed, failed |
error_message |
TEXT | Error details |
started_at |
DATETIME | Operation start |
completed_at |
DATETIME | Operation end |
duration_ms |
INT UNSIGNED | Operation duration |
preservation_package¶
Main table for OAIS packages (SIP/AIP/DIP).
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
uuid |
CHAR(36) UK | Unique package identifier |
name |
VARCHAR(255) | Package name |
description |
TEXT | Package description |
package_type |
ENUM | sip, aip, dip |
status |
ENUM | draft, building, complete, validated, exported, error |
package_format |
ENUM | bagit, zip, tar, directory |
bagit_version |
VARCHAR(10) | BagIt version (default 1.0) |
object_count |
INT UNSIGNED | Number of objects in package |
total_size |
BIGINT UNSIGNED | Total package size in bytes |
manifest_algorithm |
VARCHAR(20) | Checksum algorithm (sha256) |
package_checksum |
VARCHAR(128) | Overall package checksum |
source_path |
VARCHAR(1024) | Path to built package directory |
export_path |
VARCHAR(1024) | Path to exported archive file |
originator |
VARCHAR(255) | Creating organization |
submission_agreement |
VARCHAR(255) | Reference to agreement |
retention_period |
VARCHAR(100) | Retention policy |
parent_package_id |
BIGINT FK | Parent package (for conversions) |
information_object_id |
INT FK | Linked archival description |
created_by |
VARCHAR(100) | Creator user |
created_at |
DATETIME | Creation timestamp |
built_at |
DATETIME | When package was built |
validated_at |
DATETIME | When package was validated |
exported_at |
DATETIME | When package was exported |
metadata |
JSON | Additional metadata |
preservation_package_object¶
Links digital objects to packages.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
package_id |
BIGINT FK | Reference to preservation_package |
digital_object_id |
INT FK | Reference to digital_object |
relative_path |
VARCHAR(1024) | Path within package (e.g., data/file.pdf) |
file_name |
VARCHAR(255) | File name |
file_size |
BIGINT UNSIGNED | File size in bytes |
checksum_algorithm |
VARCHAR(20) | Checksum algorithm used |
checksum_value |
VARCHAR(128) | File checksum |
mime_type |
VARCHAR(100) | MIME type |
puid |
VARCHAR(50) | PRONOM identifier |
object_role |
ENUM | payload, metadata, manifest, tagfile |
sequence |
INT UNSIGNED | Order in package |
added_at |
DATETIME | When object was added |
preservation_package_event¶
Records package lifecycle events.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
package_id |
BIGINT FK | Reference to preservation_package |
event_type |
ENUM | creation, modification, building, validation, export, import, transfer, deletion, error |
event_datetime |
DATETIME | When event occurred |
event_detail |
TEXT | Event description |
event_outcome |
ENUM | success, failure, warning |
event_outcome_detail |
TEXT | Additional outcome details |
agent_type |
VARCHAR(50) | Agent type (software, human) |
agent_value |
VARCHAR(255) | Agent identifier |
created_by |
VARCHAR(100) | User who triggered event |
Service Layer API¶
PreservationService¶
Main service class providing all preservation operations.
Format Identification (Siegfried/PRONOM)¶
/**
* Check if Siegfried is available
*
* @return bool
*/
public function isSiegfriedAvailable(): bool
/**
* Get Siegfried version information
*
* @return array|null Version info including version and signature_date
*/
public function getSiegfriedVersion(): ?array
/**
* Identify format using Siegfried (PRONOM)
*
* @param int $digitalObjectId
* @param bool $updateRegistry Add format to registry if new
* @return array Results with puid, format_name, mime_type, confidence, basis, warning
*/
public function identifyFormat(int $digitalObjectId, bool $updateRegistry = true): array
/**
* Re-identify an already identified object
*
* @param int $digitalObjectId
* @return array Results with updated identification
*/
public function reidentifyFormat(int $digitalObjectId): array
/**
* Batch identification for multiple objects
*
* @param int $limit Maximum objects to identify
* @param bool $unidentifiedOnly Only identify objects without existing identification
* @param bool $updateRegistry Add new formats to registry
* @return array Summary with identified, failed, skipped counts
*/
public function runBatchIdentification(int $limit = 100, bool $unidentifiedOnly = true, bool $updateRegistry = true): array
/**
* Get identification statistics
*
* @return array Stats including total_objects, identified, unidentified, by_confidence, top_formats
*/
public function getIdentificationStatistics(): array
/**
* Get identification history
*
* @param int $limit Maximum records
* @param string|null $confidence Filter by confidence level
* @return array Identification records
*/
public function getIdentificationLog(int $limit = 100, ?string $confidence = null): array
Virus Scanning¶
/**
* Check if ClamAV is available
*
* @return bool
*/
public function isClamAvAvailable(): bool
/**
* Get ClamAV version information
*
* @return array|null Version info or null if not installed
*/
public function getClamAvVersion(): ?array
/**
* Scan a digital object for viruses
*
* @param int $digitalObjectId
* @param bool $quarantine Move infected files to quarantine
* @param string $scannedBy User or system identifier
* @return array Scan results with status, threat_name, quarantined, etc.
*/
public function scanForVirus(int $digitalObjectId, bool $quarantine = true, string $scannedBy = 'system'): array
/**
* Batch virus scan for multiple objects
*
* @param int $limit Maximum objects to scan
* @param bool $newOnly Only scan unscanned objects
* @param string $scannedBy Identifier
* @return array Summary with clean, infected, errors counts
*/
public function runBatchVirusScan(int $limit = 100, bool $newOnly = true, string $scannedBy = 'cron'): array
/**
* Get virus scan history
*
* @param int $limit Maximum records
* @param string|null $status Filter by status
* @return array Scan records
*/
public function getVirusScanLog(int $limit = 100, ?string $status = null): array
Format Conversion¶
/**
* Get available conversion tools
*
* @return array Tool information (imagemagick, ffmpeg, ghostscript, libreoffice)
*/
public function getConversionTools(): array
/**
* Convert a digital object to a different format
*
* @param int $digitalObjectId
* @param string $targetFormat Target format (tiff, pdf, mp4, etc.)
* @param array $options Conversion options (quality, compress, etc.)
* @param string $createdBy User identifier
* @return array Result with success, output_path, output_size, etc.
*/
public function convertFormat(int $digitalObjectId, string $targetFormat, array $options = [], string $createdBy = 'system'): array
/**
* Get conversion history
*
* @param int $limit Maximum records
* @param string|null $status Filter by status
* @return array Conversion records
*/
public function getConversionLog(int $limit = 100, ?string $status = null): array
Backup Verification¶
/**
* Verify backup integrity
*
* @param string $backupPath Path to backup file or directory
* @param string $backupType Type: database, files, full
* @param string|null $expectedChecksum Expected checksum if known
* @param string $verifiedBy User identifier
* @return array Results with status, files_checked, files_valid, etc.
*/
public function verifyBackup(string $backupPath, string $backupType = 'full', ?string $expectedChecksum = null, string $verifiedBy = 'system'): array
/**
* Verify all backups in a directory
*
* @param string|null $backupDir Backup directory (default: uploads/backups)
* @param string $verifiedBy Identifier
* @return array Summary with total, valid, invalid counts
*/
public function verifyAllBackups(string $backupDir = null, string $verifiedBy = 'cron'): array
/**
* Get backup verification history
*
* @param int $limit Maximum records
* @param string|null $status Filter by status
* @return array Verification records
*/
public function getBackupVerificationLog(int $limit = 100, ?string $status = null): array
Extended Statistics¶
/**
* Get extended statistics including new features
*
* @return array Statistics including virus_scans_30d, conversions_30d, etc.
*/
public function getExtendedStatistics(): array
CLI Tasks¶
preservationSchedulerTask¶
Runs scheduled preservation workflows based on their configured schedules.
Location: lib/task/preservationSchedulerTask.class.php
Namespace: preservation:scheduler
Options:
| Option | Default | Description |
|---|---|---|
--status |
- | Show scheduler status and statistics |
--list |
- | List all configured schedules |
--run-id=N |
- | Run a specific schedule by ID |
--dry-run |
- | Show what would be run without executing |
Usage:
# Run all due workflows (intended for cron)
php symfony preservation:scheduler
# Show scheduler status
php symfony preservation:scheduler --status
# List all schedules
php symfony preservation:scheduler --list
# Run specific schedule
php symfony preservation:scheduler --run-id=1
# Preview without running
php symfony preservation:scheduler --dry-run
Cron Setup:
# Run scheduler every minute
* * * * * cd /usr/share/nginx/archive && php symfony preservation:scheduler >> /var/log/atom/scheduler.log 2>&1
preservationIdentifyTask¶
Identifies file formats using Siegfried (PRONOM-based identification).
Location: lib/task/preservationIdentifyTask.class.php
Namespace: preservation:identify
Options:
| Option | Default | Description |
|---|---|---|
--status |
- | Show Siegfried status and statistics |
--dry-run |
- | Preview without identifying |
--limit=N |
100 | Maximum objects to identify |
--object-id=N |
- | Identify specific object |
--all |
false | Identify all objects (including already identified) |
--reidentify |
false | Force re-identification of already identified objects |
Usage:
php symfony preservation:identify --status
php symfony preservation:identify --dry-run
php symfony preservation:identify --limit=500
php symfony preservation:identify --object-id=123
php symfony preservation:identify --all --limit=1000
php symfony preservation:identify --object-id=123 --reidentify
Output includes: - PRONOM Unique Identifier (PUID) - Format name and version - MIME type - Confidence level (certain, high, medium, low) - Identification basis (signature, extension, container, byte match) - Warnings (if any)
preservationVirusScanTask¶
Scans digital objects for viruses using ClamAV.
Location: lib/task/preservationVirusScanTask.class.php
Namespace: preservation:virus-scan
Options:
| Option | Default | Description |
|---|---|---|
--status |
- | Show ClamAV status and statistics |
--dry-run |
- | Preview without scanning |
--limit=N |
100 | Maximum objects to scan |
--object-id=N |
- | Scan specific object |
--all |
false | Scan all objects (including previously scanned) |
--no-quarantine |
false | Don't quarantine infected files |
Usage:
php symfony preservation:virus-scan --status
php symfony preservation:virus-scan --dry-run
php symfony preservation:virus-scan --limit=200
php symfony preservation:virus-scan --object-id=123
php symfony preservation:virus-scan --all --limit=500
preservationConvertTask¶
Converts digital objects to preservation-safe formats.
Location: lib/task/preservationConvertTask.class.php
Namespace: preservation:convert
Options:
| Option | Default | Description |
|---|---|---|
--status |
- | Show conversion tools and statistics |
--dry-run |
- | Preview without converting |
--limit=N |
10 | Maximum objects to convert |
--object-id=N |
- | Convert specific object |
--format=X |
- | Target format (tiff, pdf, wav, etc.) |
--mime-type=X |
- | Filter by source MIME type |
--quality=N |
95 | Conversion quality (1-100) |
Usage:
php symfony preservation:convert --status
php symfony preservation:convert --dry-run
php symfony preservation:convert --limit=50
php symfony preservation:convert --object-id=123 --format=tiff
php symfony preservation:convert --mime-type=image/jpeg --format=tiff --limit=100
Supported Conversions:
| Source | Target | Tool |
|---|---|---|
| image/jpeg, image/png, image/bmp, image/gif | TIFF | ImageMagick |
| audio/mpeg, audio/ogg | WAV | FFmpeg |
| video/* | MKV | FFmpeg |
| application/msword, application/vnd.ms-excel, application/vnd.ms-powerpoint | LibreOffice | |
| application/vnd.openxmlformats-* | LibreOffice | |
| application/pdf | PDF/A | Ghostscript |
preservationVerifyBackupTask¶
Verifies backup file integrity.
Location: lib/task/preservationVerifyBackupTask.class.php
Namespace: preservation:verify-backup
Options:
| Option | Default | Description |
|---|---|---|
--path=X |
- | Path to specific backup file |
--backup-dir=X |
uploads/backups | Backup directory |
--checksum=X |
- | Expected checksum |
--all |
false | Verify all backups in directory |
Usage:
php symfony preservation:verify-backup --path=/backups/backup.tar.gz
php symfony preservation:verify-backup --all --backup-dir=/backups
php symfony preservation:verify-backup --path=/backups/db.sql.gz --checksum=abc123...
preservationReplicateTask¶
Replicates files to backup targets.
Location: lib/task/preservationReplicateTask.class.php
Namespace: preservation:replicate
Options:
| Option | Default | Description |
|---|---|---|
--status |
- | Show targets and statistics |
--dry-run |
- | Preview without replicating |
--limit=N |
100 | Maximum objects to replicate |
--target-id=N |
- | Replicate to specific target |
--force |
false | Re-sync already synced files |
Usage:
php symfony preservation:replicate --status
php symfony preservation:replicate --dry-run
php symfony preservation:replicate --limit=500
php symfony preservation:replicate --target-id=1 --limit=100
php symfony preservation:replicate --force --limit=50
preservationFixityTask¶
Runs fixity verification checks with optional self-healing auto-repair.
Location: lib/task/preservationFixityTask.class.php
Namespace: preservation:fixity
Options:
| Option | Default | Description |
|---|---|---|
--status |
- | Show statistics |
--repair-stats |
- | Show self-healing repair statistics |
--dry-run |
- | Preview without verifying |
--limit=N |
100 | Maximum objects to check |
--stale-days=N |
30 | Check objects not verified in N days |
--all |
false | Check all regardless of age |
--object-id=N |
- | Check specific object |
--auto-repair |
false | Enable self-healing auto-repair from backups |
--failed-only |
false | Only show failed checks in status |
Self-Healing Auto-Repair:
When --auto-repair is enabled, the system will automatically:
1. Detect failed fixity checks (checksum mismatch or missing files)
2. Search configured replication targets for valid backup copies
3. Validate backup integrity against stored checksums
4. Restore corrupted files from verified backups
5. Log all repair events as PREMIS preservation events
Supported Backup Targets: - Local file system paths - Rsync targets - SFTP servers - Amazon S3 buckets (requires AWS SDK) - Azure Blob Storage (requires SDK) - Google Cloud Storage (requires SDK)
Usage:
php symfony preservation:fixity --status
php symfony preservation:fixity --repair-stats
php symfony preservation:fixity --limit=500
php symfony preservation:fixity --object-id=123
php symfony preservation:fixity --auto-repair
php symfony preservation:fixity --stale-days=7 --auto-repair
php symfony preservation:fixity --object-id=123 --auto-repair
preservationPronomSyncTask¶
Synchronizes format registry from UK National Archives PRONOM database.
Location: lib/task/preservationPronomSyncTask.class.php
Namespace: preservation:pronom-sync
Options:
| Option | Default | Description |
|---|---|---|
--status |
- | Show PRONOM sync status |
--puid=X |
- | Sync specific PUID (e.g., fmt/18) |
--lookup=X |
- | Look up PUID without syncing |
--unregistered |
- | Sync only unregistered PUIDs |
--common |
- | Sync common archival format PUIDs |
--all |
- | Sync all known PUIDs |
PRONOM Data Retrieved: - Official format names and versions - MIME types and file extensions - Binary signature availability - Format risk information - Preservation recommendations
Usage:
php symfony preservation:pronom-sync --status
php symfony preservation:pronom-sync --puid=fmt/18
php symfony preservation:pronom-sync --lookup=fmt/43
php symfony preservation:pronom-sync --unregistered
php symfony preservation:pronom-sync --common
php symfony preservation:pronom-sync --all
preservationPackageTask¶
Manages OAIS packages (SIP/AIP/DIP) using BagIt format.
Location: lib/task/preservationPackageTask.class.php
Namespace: preservation:package
Actions:
| Action | Description |
|---|---|
list |
List all packages |
create |
Create a new package |
show |
Show package details |
add-objects |
Add digital objects to package |
build |
Build the BagIt package |
validate |
Validate package checksums |
export |
Export to ZIP/TAR format |
convert |
Convert SIP to AIP or AIP to DIP |
Options:
| Option | Default | Description |
|---|---|---|
--id=N |
- | Package ID |
--uuid=X |
- | Package UUID |
--type=X |
- | Package type (sip, aip, dip) |
--status=X |
- | Filter by status |
--name=X |
- | Package name |
--description=X |
- | Package description |
--format=X |
zip | Export format (zip, tar, tar.gz) |
--output=X |
- | Output path |
--objects=X |
- | Comma-separated object IDs |
--query=X |
- | Object query (e.g., "mime_type:application/pdf") |
--originator=X |
- | Creating organization |
--limit=N |
20 | Limit for list operations |
Usage:
# List packages
php symfony preservation:package list
php symfony preservation:package list --type=sip --status=draft
# Create package
php symfony preservation:package create --type=sip --name="My Collection SIP"
# Show package details
php symfony preservation:package show --id=1
# Add objects
php symfony preservation:package add-objects --id=1 --objects=100,101,102
php symfony preservation:package add-objects --id=1 --query="mime_type:application/pdf"
# Build package
php symfony preservation:package build --id=1
# Validate
php symfony preservation:package validate --id=1
# Export
php symfony preservation:package export --id=1 --format=zip
# Convert SIP to AIP
php symfony preservation:package convert --id=1 --type=aip
Process Flows¶
Virus Scan Flow¶
+-------------------------------------------------------------------------+
| VIRUS SCAN SEQUENCE |
+-------------------------------------------------------------------------+
+----------+ +----------------+ +--------------+
| Client | |PreservationSvc | | ClamAV |
+----+-----+ +-------+--------+ +------+-------+
| | |
| scanForVirus | |
| (objectId) | |
|------------------>| |
| | |
| | isClamAvAvailable() |
| |-------------------->|
| |<--------------------|
| | |
| | Get file path |
| | Verify file exists |
| | |
| | clamscan/clamdscan |
| |-------------------->|
| |<--------------------|
| | (return code) |
| | |
| | Parse results |
| | 0=clean, 1=infected |
| | |
| | If infected: |
| | Quarantine file |
| | |
| | INSERT virus_scan |
| | INSERT event |
| | |
| return results | |
|<------------------| |
Format Conversion Flow¶
+-------------------------------------------------------------------------+
| FORMAT CONVERSION SEQUENCE |
+-------------------------------------------------------------------------+
+----------+ +----------------+ +--------------+
| Client | |PreservationSvc | | Tool (IM/FF) |
+----+-----+ +-------+--------+ +------+-------+
| | |
| convertFormat | |
| (objectId, fmt) | |
|------------------>| |
| | |
| | Get digital object |
| | Get file path |
| | |
| | selectConversionTool()
| | (based on MIME type)|
| | |
| | INSERT conversion |
| | status=processing |
| | |
| | executeConversion() |
| |-------------------->|
| |<--------------------|
| | (output file) |
| | |
| | Verify output exists|
| | Generate checksum |
| | |
| | UPDATE conversion |
| | status=completed |
| | |
| | INSERT event |
| | |
| return results | |
|<------------------| |
Replication Flow¶
+-------------------------------------------------------------------------+
| REPLICATION SEQUENCE |
+-------------------------------------------------------------------------+
+----------+ +----------------+ +--------------+
| CLI | |ReplicateTask | | Target |
+----+-----+ +-------+--------+ +------+-------+
| | |
| Execute task | |
|------------------>| |
| | |
| | Get active targets |
| | Get objects to sync |
| | |
| | For each object: |
| | Get file path |
| | Generate checksum |
| | |
| | INSERT log |
| | status=started |
| | |
| | Transfer file |
| | (based on type) |
| |-------------------->|
| |<--------------------|
| | |
| | UPDATE log |
| | status=completed |
| | |
| | UPDATE target |
| | last_sync_at |
| | |
| print summary | |
|<------------------| |
Routes¶
Defined in modules/preservation/config/module.yml:
| Route | URL | Action |
|---|---|---|
preservation_index |
/preservation |
Dashboard |
preservation_identification |
/preservation/identification |
Format identification UI |
preservation_scheduler |
/preservation/scheduler |
Workflow scheduler UI |
preservation_schedule_edit |
/preservation/scheduleEdit |
Create/edit schedule |
preservation_virus_scan |
/preservation/virus-scan |
Virus scan UI |
preservation_conversion |
/preservation/conversion |
Conversion UI |
preservation_backup |
/preservation/backup |
Backup verification UI |
preservation_extended |
/preservation/extended |
Extended stats |
Settings routes in ahgThemeB5Plugin:
| Route | URL | Action |
|---|---|---|
ahgSettings_preservation |
/ahgSettings/preservation |
Replication target management |
Configuration¶
External Tool Requirements¶
Siegfried (PRONOM Format Identification):
# Install
curl -sL "https://github.com/richardlehane/siegfried/releases/download/v1.11.1/siegfried_1.11.1-1_amd64.deb" -o /tmp/sf.deb
sudo dpkg -i /tmp/sf.deb
# Verify installation
sf -version
# Update signatures (optional)
sf -update
ClamAV:
# Install
sudo apt install clamav clamav-daemon
# Update signatures
sudo freshclam
# Start daemon (optional, for faster scans)
sudo systemctl enable clamav-daemon
sudo systemctl start clamav-daemon
ImageMagick:
FFmpeg:
LibreOffice:
Ghostscript:
Quarantine Directory¶
Infected files are moved to: {sf_upload_dir}/quarantine/
Ensure this directory exists and has appropriate permissions:
mkdir -p /usr/share/nginx/archive/uploads/quarantine
chmod 750 /usr/share/nginx/archive/uploads/quarantine
chown www-data:www-data /usr/share/nginx/archive/uploads/quarantine
Conversion Output Directory¶
Converted files are stored in: {sf_upload_dir}/conversions/
mkdir -p /usr/share/nginx/archive/uploads/conversions
chmod 755 /usr/share/nginx/archive/uploads/conversions
chown www-data:www-data /usr/share/nginx/archive/uploads/conversions
Security Considerations¶
- Access Control: All preservation actions require administrator role
- File Access: Service only reads files within configured upload directory
- Audit Trail: All operations logged as PREMIS events
- Quarantine: Infected files isolated with restricted permissions
- Input Sanitization: All file paths escaped for shell commands
- Error Messages: Binary data sanitized before database storage
Performance Considerations¶
- Batch Processing: Use batch operations for large collections
- Scheduling: Run intensive operations during off-peak hours
- ClamAV Daemon: Use clamdscan (daemon) instead of clamscan for faster scans
- Conversion Queue: Process conversions in batches with limits
- Indexing: All foreign keys and frequent query columns indexed
Cron Configuration¶
Recommended cron schedule:
# Daily format identification at 1am
0 1 * * * cd /usr/share/nginx/archive && php symfony preservation:identify --limit=500 >> /var/log/atom/identify.log 2>&1
# Daily fixity check at 2am
0 2 * * * cd /usr/share/nginx/archive && php symfony preservation:fixity --limit=500 >> /var/log/atom/fixity.log 2>&1
# Daily virus scan at 3am (new files only)
0 3 * * * cd /usr/share/nginx/archive && php symfony preservation:virus-scan --limit=200 >> /var/log/atom/virus-scan.log 2>&1
# Weekly format conversion on Sunday at 4am
0 4 * * 0 cd /usr/share/nginx/archive && php symfony preservation:convert --limit=100 >> /var/log/atom/conversion.log 2>&1
# Daily replication at 5am
0 5 * * * cd /usr/share/nginx/archive && php symfony preservation:replicate --limit=500 >> /var/log/atom/replication.log 2>&1
# Weekly backup verification on Saturday at 6am
0 6 * * 6 cd /usr/share/nginx/archive && php symfony preservation:verify-backup --all >> /var/log/atom/backup-verify.log 2>&1
Error Handling¶
| Error | Cause | Resolution |
|---|---|---|
Siegfried not installed |
sf command not in PATH | Install Siegfried (see Configuration) |
ClamAV not installed |
clamscan not in PATH | Install: apt install clamav |
No conversion tool available |
Missing tool for format | Install required tool |
File not found |
Missing physical file | Check storage, restore backup |
Format identification failed |
Siegfried error | Check Siegfried installation, update signatures |
PUID showing as UNKNOWN |
Format not in PRONOM | File may have non-standard format |
Conversion failed |
Tool error | Check tool logs, verify format |
Replication failed |
Network/permission error | Check target connectivity |
Incorrect string value |
Binary in error message | Handled by sanitization |
Monitoring¶
Key Metrics¶
- Virus scans per day (clean vs infected)
- Conversions per day (completed vs failed)
- Replication success rate
- Backup verification status
- Fixity check pass rate
Alerting Thresholds¶
| Metric | Warning | Critical |
|---|---|---|
| Infected Files (30d) | > 0 | > 5 |
| Conversion Failures (30d) | > 10 | > 50 |
| Replication Failures (30d) | > 5 | > 20 |
| Backup Verification Failures | > 0 | > 0 |
| Fixity Failures (30d) | > 0 | > 10 |
Format Migration¶
Overview¶
The format migration subsystem enables proactive digital preservation through: - Migration Pathways: Define recommended conversion routes between formats - Migration Plans: Batch migration planning with tracking - Obsolescence Analysis: Identify at-risk formats in the repository - Automated Recommendations: Suggest target formats based on risk levels
Migration Pathway Tables¶
+-------------------------------------------------------------------------+
| FORMAT MIGRATION DATABASE SCHEMA |
+-------------------------------------------------------------------------+
+---------------------------+ +---------------------------+
| preservation_format | | preservation_migration_ |
| (existing table) | | pathway |
+---------------------------+ +---------------------------+
| PK id |<----------| PK id |
| puid | FK | FK source_format_id |
| format_name | | FK target_format_id |--------+
| risk_level | +------| FK preferred_tool_id | |
| ... | | | pathway_type | |
+---------------------------+ | | confidence_level | |
| | complexity | |
+---------------------------+ | | data_loss_risk | |
| preservation_conversion_ | | | quality_impact | |
| tool |<---+ | is_recommended | |
+---------------------------+ | is_automated | |
| PK id | | notes | |
| tool_name | | created_at | |
| tool_version | | updated_at | |
| ... | +---------------------------+ |
+---------------------------+ | |
| 1:N |
v |
+---------------------------+ |
| preservation_migration_ | |
| pathway_tool | |
+---------------------------+ |
| PK id | |
| FK pathway_id | |
| FK tool_id | |
| priority | |
| conversion_options | |
+---------------------------+ |
|
+---------------------------+ +---------------------------+ |
| preservation_migration_ | | preservation_migration_ | |
| plan | | plan_item | |
+---------------------------+ +---------------------------+ |
| PK id |<----------| PK id | |
| name | FK | FK plan_id | |
| description | | FK digital_object_id | |
| status | | FK pathway_id |--------+
| created_by | | source_format |
| approved_by | | target_format |
| approved_at | | priority |
| started_at | | status |
| completed_at | | started_at |
| total_items | | completed_at |
| items_completed | | conversion_log |
| items_failed | | error_message |
| notes | | created_at |
| created_at | +---------------------------+
| updated_at |
+---------------------------+
Migration Database Tables¶
preservation_migration_pathway¶
Defines conversion routes between format types.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
source_format_id |
BIGINT FK | Reference to preservation_format |
target_format_id |
BIGINT FK | Target format reference |
preferred_tool_id |
BIGINT FK | Preferred conversion tool |
pathway_type |
ENUM | normalization, migration, emulation |
confidence_level |
ENUM | high, medium, low |
complexity |
ENUM | simple, moderate, complex |
data_loss_risk |
ENUM | none, minimal, moderate, significant |
quality_impact |
ENUM | lossless, minimal_loss, noticeable_loss |
is_recommended |
TINYINT(1) | Official recommendation |
is_automated |
TINYINT(1) | Can be automated |
notes |
TEXT | Additional guidance |
preservation_migration_plan¶
Batch migration execution plans.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
name |
VARCHAR(255) | Plan name |
description |
TEXT | Plan description |
status |
ENUM | draft, approved, in_progress, completed, cancelled |
created_by |
VARCHAR(100) | Creator |
approved_by |
VARCHAR(100) | Approver |
approved_at |
DATETIME | Approval timestamp |
started_at |
DATETIME | Execution start |
completed_at |
DATETIME | Execution completion |
total_items |
INT UNSIGNED | Total items in plan |
items_completed |
INT UNSIGNED | Successfully migrated |
items_failed |
INT UNSIGNED | Failed migrations |
preservation_migration_plan_item¶
Individual items within a migration plan.
| Column | Type | Description |
|---|---|---|
id |
BIGINT PK | Auto-increment primary key |
plan_id |
BIGINT FK | Reference to migration_plan |
digital_object_id |
INT FK | Reference to digital_object |
pathway_id |
BIGINT FK | Reference to migration_pathway |
source_format |
VARCHAR(100) | Source PUID or MIME type |
target_format |
VARCHAR(100) | Target format |
priority |
INT | Processing priority |
status |
ENUM | pending, processing, completed, failed, skipped |
started_at |
DATETIME | Item processing start |
completed_at |
DATETIME | Item processing end |
conversion_log |
TEXT | Conversion output log |
error_message |
TEXT | Error details if failed |
Migration Service API¶
MigrationPathwayService¶
/**
* Get available migration pathways for a source format.
*
* @param string $sourceFormat Source PUID or MIME type
* @return array Available pathways with recommendations
*/
public function getPathwaysForFormat(string $sourceFormat): array
/**
* Get recommended target format for a source.
*
* @param string $sourceFormat Source PUID or MIME type
* @return array|null Best pathway with target format
*/
public function getRecommendedTarget(string $sourceFormat): ?array
/**
* Get obsolescence report for repository.
*
* @param string $riskLevel Filter by risk (critical, high, medium, low)
* @return array Formats at risk with object counts
*/
public function getObsolescenceReport(?string $riskLevel = null): array
/**
* Create a new migration pathway.
*
* @param array $data Pathway configuration
* @return int New pathway ID
*/
public function createPathway(array $data): int
MigrationPlanService¶
/**
* Create a migration plan.
*
* @param string $name Plan name
* @param string $description Plan description
* @param string $createdBy Creator user
* @return int New plan ID
*/
public function createPlan(string $name, string $description, string $createdBy): int
/**
* Add items to a migration plan.
*
* @param int $planId Plan ID
* @param array $items Array of [digital_object_id, pathway_id, priority]
* @return int Number of items added
*/
public function addPlanItems(int $planId, array $items): int
/**
* Add items by format criteria.
*
* @param int $planId Plan ID
* @param string $sourceFormat Source format to match
* @param int $pathwayId Pathway to use
* @param int $limit Maximum items to add
* @return int Number of items added
*/
public function addItemsByFormat(int $planId, string $sourceFormat, int $pathwayId, int $limit = 1000): int
/**
* Execute a migration plan.
*
* @param int $planId Plan ID
* @param int $limit Items per batch
* @return array Execution results
*/
public function executePlan(int $planId, int $limit = 100): array
/**
* Get plan execution status.
*
* @param int $planId Plan ID
* @return array Status with progress metrics
*/
public function getPlanStatus(int $planId): array
preservationMigrationTask¶
CLI task for format migration operations.
Location: lib/task/preservationMigrationTask.class.php
Namespace: preservation:migration
Actions:
| Action | Description |
|---|---|
pathways |
List available migration pathways |
obsolescence |
Generate obsolescence report |
recommend |
Get recommendations for a format |
plan-list |
List migration plans |
plan-create |
Create a new plan |
plan-add |
Add items to a plan |
plan-execute |
Execute a plan |
plan-status |
Show plan status |
Options:
| Option | Default | Description |
|---|---|---|
--format=X |
- | Source format (PUID or MIME) |
--risk=X |
- | Risk level filter |
--plan-id=N |
- | Plan ID |
--name=X |
- | Plan name |
--pathway-id=N |
- | Pathway ID |
--limit=N |
100 | Items limit |
Usage:
# List pathways for a format
php symfony preservation:migration pathways --format=fmt/18
# Generate obsolescence report
php symfony preservation:migration obsolescence
php symfony preservation:migration obsolescence --risk=critical
# Get recommendations
php symfony preservation:migration recommend --format=fmt/18
# Create migration plan
php symfony preservation:migration plan-create --name="TIFF Migration 2026"
# Add items to plan
php symfony preservation:migration plan-add --plan-id=1 --format=fmt/353 --pathway-id=5
# Execute plan
php symfony preservation:migration plan-execute --plan-id=1 --limit=50
# Check status
php symfony preservation:migration plan-status --plan-id=1
Migration Process Flow¶
+-------------------------------------------------------------------------+
| FORMAT MIGRATION SEQUENCE |
+-------------------------------------------------------------------------+
+----------+ +----------------+ +----------------+
| Admin | |MigrationPlanSvc| |PathwaySvc |
+----+-----+ +-------+--------+ +-------+--------+
| | |
| obsolescence | |
| report | |
|------------------>| |
| | getObsolescence |
| | Report() |
| |--------------------->|
| |<---------------------|
| at-risk formats | |
|<------------------| |
| | |
| create plan | |
|------------------>| |
| | createPlan() |
| | |
| plan_id | |
|<------------------| |
| | |
| add items by | |
| format | |
|------------------>| |
| | addItemsByFormat() |
| | getRecommendedTarget |
| |--------------------->|
| |<---------------------|
| | |
| | INSERT plan_items |
| | |
| items added | |
|<------------------| |
| | |
| approve & execute | |
|------------------>| |
| | executePlan() |
| | |
| | For each item: |
| | convertFormat() |
| | UPDATE status |
| | |
| results | |
|<------------------| |
Version History¶
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-01 | Initial release with checksums, fixity, events |
| 1.1.0 | 2026-01 | Added virus scanning (ClamAV), format conversion (ImageMagick/FFmpeg/LibreOffice/Ghostscript), backup verification, replication targets, CLI tasks, settings UI |
| 1.2.0 | 2026-01 | Added Siegfried/PRONOM format identification with PUID tracking, confidence levels, batch identification CLI task, identification UI dashboard, auto-population of format registry |
| 1.3.0 | 2026-01 | Added Workflow Scheduler UI for configuring and monitoring automated preservation tasks, preservationSchedulerTask CLI task, schedule management service methods, run history tracking |
| 1.4.0 | 2026-01 | Added Format Migration subsystem: migration pathways, migration plans, obsolescence reporting, MigrationPathwayService, MigrationPlanService, preservationMigrationTask CLI |
Technical Documentation - Last Updated: January 2026 Plugin Version: 1.4.0