Backup & Recovery
Comprehensive data protection strategies including backup procedures, disaster recovery planning, and data restoration for business continuity.
Overview
Data protection is critical for maintaining business continuity and protecting against data loss. SysManage backup and recovery procedures ensure that your monitoring infrastructure and historical data remain protected and can be quickly restored in case of system failures, disasters, or data corruption.
Backup Strategy Components
- Database Backups: Regular backups of all monitoring data and configurations
- Configuration Backups: System configuration files and settings
- Certificate Backups: SSL/TLS certificates and private keys
- Log Backups: Critical log files and audit trails
- Code and Scripts: Custom scripts and automation code
- Documentation: System documentation and procedures
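The scripts in this guide assume a backup layout along the following lines; the tree below is illustrative and can be adjusted to match your installation.
/opt/sysmanage/
├── backups/              # full database dumps (backup_database.sh)
│   ├── config/           # configuration archives (backup_config.sh)
│   └── certificates/     # encrypted certificate archives
├── wal_archive/          # archived WAL segments for point-in-time recovery
├── scripts/              # backup, cleanup, and failover scripts run from cron
└── ssl/                  # live certificates and private keys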
Database Backup Procedures
PostgreSQL Database Backup
SysManage uses PostgreSQL as its primary database. Regular backups are essential for data protection.
Full Database Backup
#!/bin/bash
# Full database backup script
BACKUP_DIR="/opt/sysmanage/backups"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="sysmanage"
DB_USER="sysmanage"
# Create backup directory
mkdir -p $BACKUP_DIR
# Perform full backup
pg_dump -h localhost -U $DB_USER -d $DB_NAME \
--verbose --clean --create --if-exists \
--format=custom \
--file="$BACKUP_DIR/sysmanage_full_$DATE.backup"
# Verify backup integrity before compressing (pg_restore cannot read a .gz archive directly)
pg_restore --list "$BACKUP_DIR/sysmanage_full_$DATE.backup" > /dev/null
if [ $? -eq 0 ]; then
# Compress the verified backup
gzip "$BACKUP_DIR/sysmanage_full_$DATE.backup"
echo "Backup completed successfully: sysmanage_full_$DATE.backup.gz"
else
echo "Backup verification failed!"
exit 1
fi
Incremental Backup with WAL Archiving
# PostgreSQL configuration for WAL archiving
# Add to postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'cp %p /opt/sysmanage/wal_archive/%f'
max_wal_senders = 3
wal_keep_segments = 64    # use wal_keep_size instead in PostgreSQL 13 and later
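The archive_command above assumes the archive directory already exists and is writable by the postgres user. A minimal preparation sketch (the service name may differ on your distribution):
# Create the WAL archive directory and make it writable by PostgreSQL
sudo mkdir -p /opt/sysmanage/wal_archive
sudo chown postgres:postgres /opt/sysmanage/wal_archive
sudo chmod 700 /opt/sysmanage/wal_archive
# Changing wal_level or archive_mode requires a server restart
sudo systemctl restart postgresql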
Point-in-Time Recovery Setup
#!/bin/bash
# Base backup for PITR
BACKUP_DIR="/opt/sysmanage/backups"
DATE=$(date +%Y%m%d_%H%M%S)
# Create base backup
pg_basebackup -h localhost -U postgres \
--pgdata="$BACKUP_DIR/basebackup_$DATE" \
--format=tar \
--wal-method=stream \
--checkpoint=fast \
--progress \
--verbose
echo "Base backup completed: basebackup_$DATE"
Automated Backup Scheduling
Cron Schedule Example
# Add to crontab for automated backups
# Full backup daily at 2 AM
0 2 * * * /opt/sysmanage/scripts/backup_database.sh
# Configuration backup weekly on Sunday at 3 AM
0 3 * * 0 /opt/sysmanage/scripts/backup_config.sh
# Log rotation and backup daily at 4 AM
0 4 * * * /opt/sysmanage/scripts/backup_logs.sh
# Cleanup old backups weekly on Monday at 5 AM
0 5 * * 1 /opt/sysmanage/scripts/cleanup_backups.sh
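The cleanup script referenced above is not shown elsewhere in this guide; the following is a minimal sketch assuming a 30-day retention for database dumps and 90 days for configuration and certificate archives (adjust to your retention policy):
#!/bin/bash
# Remove backups that have aged out of the retention window
BACKUP_DIR="/opt/sysmanage/backups"
# Database dumps: keep 30 days
find "$BACKUP_DIR" -maxdepth 1 -name 'sysmanage_full_*.backup.gz' -mtime +30 -delete
# Configuration and certificate archives: keep 90 days
find "$BACKUP_DIR/config" -name 'sysmanage_config_*.tar.gz' -mtime +90 -delete
find "$BACKUP_DIR/certificates" -name 'certificates_*.tar.gz.enc' -mtime +90 -delete
# WAL segments older than the newest base backup can be pruned with pg_archivecleanup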
Backup Verification Script
#!/bin/bash
# Backup verification and notification
BACKUP_DIR="/opt/sysmanage/backups"
TODAY=$(date +%Y%m%d)
# Check if today's backup exists (a glob cannot be tested directly with -f)
BACKUP_FILE=$(ls "$BACKUP_DIR"/sysmanage_full_"${TODAY}"_*.backup.gz 2>/dev/null | head -n1)
if [ -n "$BACKUP_FILE" ]; then
echo "✓ Database backup found for today"
# Test restore capability (decompress first; pg_restore cannot read a .gz archive directly)
TMP_BACKUP=$(mktemp)
gunzip -c "$BACKUP_FILE" > "$TMP_BACKUP"
pg_restore --list "$TMP_BACKUP" > /dev/null
if [ $? -eq 0 ]; then
echo "✓ Backup integrity verified"
else
echo "✗ Backup integrity check failed"
# Send alert
mail -s "SysManage Backup Integrity Alert" admin@example.com <<< \
"Backup integrity check failed for $BACKUP_FILE"
fi
rm -f "$TMP_BACKUP"
else
echo "✗ No backup found for today"
# Send alert
mail -s "SysManage Backup Missing Alert" admin@example.com <<< \
"No database backup found for $TODAY in $BACKUP_DIR"
fi
Configuration Backup
System Configuration Files
Critical Configuration Files
- Application Config: /opt/sysmanage/config/
- Nginx Configuration: /etc/nginx/sites-available/sysmanage
- PostgreSQL Config: /etc/postgresql/*/main/
- SSL Certificates: /opt/sysmanage/ssl/
- Systemd Services: /etc/systemd/system/sysmanage*
- Log Configuration: /opt/sysmanage/logs/
Configuration Backup Script
#!/bin/bash
# Configuration backup script
BACKUP_DIR="/opt/sysmanage/backups/config"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="sysmanage_config_$DATE.tar.gz"
mkdir -p $BACKUP_DIR
# Create configuration backup
tar -czf "$BACKUP_DIR/$BACKUP_FILE" \
/opt/sysmanage/config/ \
/etc/nginx/sites-available/sysmanage \
/etc/postgresql/*/main/ \
/opt/sysmanage/ssl/ \
/etc/systemd/system/sysmanage* \
--exclude="*.log" \
--exclude="*.pid"
# Verify backup
if [ -f "$BACKUP_DIR/$BACKUP_FILE" ]; then
echo "Configuration backup completed: $BACKUP_FILE"
# Test archive integrity
tar -tzf "$BACKUP_DIR/$BACKUP_FILE" > /dev/null
if [ $? -eq 0 ]; then
echo "Configuration backup verified successfully"
else
echo "Configuration backup verification failed"
exit 1
fi
else
echo "Configuration backup failed"
exit 1
fi
Certificate Management
SSL Certificate Backup
#!/bin/bash
# SSL certificate backup and monitoring
CERT_DIR="/opt/sysmanage/ssl"
BACKUP_DIR="/opt/sysmanage/backups/certificates"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
# Backup certificates with encryption (keep the passphrase in a root-only key file
# rather than hard-coding it in the script)
tar -czf - -C $CERT_DIR . | \
openssl enc -aes-256-cbc -salt -pbkdf2 -pass file:/path/to/encryption-key-file > \
"$BACKUP_DIR/certificates_$DATE.tar.gz.enc"
# Check certificate expiration
for cert in $CERT_DIR/*.crt $CERT_DIR/*.pem; do
if [ -f "$cert" ]; then
expiry=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
expiry_epoch=$(date -d "$expiry" +%s)
current_epoch=$(date +%s)
days_until_expiry=$(( (expiry_epoch - current_epoch) / 86400 ))
if [ $days_until_expiry -lt 30 ]; then
echo "WARNING: Certificate $cert expires in $days_until_expiry days"
# Send notification
mail -s "Certificate Expiration Warning" admin@example.com <<< \
"Certificate $cert expires in $days_until_expiry days on $expiry"
fi
fi
done
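To restore certificates from the encrypted archive, reverse the pipeline. A sketch assuming the same passphrase file that was used for encryption (replace <DATE> with the timestamp of the archive you are restoring):
# Decrypt the certificate archive and extract it into the SSL directory
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:/path/to/encryption-key-file \
-in /opt/sysmanage/backups/certificates/certificates_<DATE>.tar.gz.enc | \
tar -xzf - -C /opt/sysmanage/ssl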
Disaster Recovery Planning
Recovery Strategy
Recovery Time Objectives (RTO)
- Critical Systems: 2 hours maximum downtime
- Standard Systems: 8 hours maximum downtime
- Development Systems: 24 hours maximum downtime
- Archive Systems: 72 hours maximum downtime
Recovery Point Objectives (RPO)
- Critical Data: 15 minutes maximum data loss (see the archive_timeout sketch after this list)
- Standard Data: 4 hours maximum data loss
- Archive Data: 24 hours maximum data loss
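With WAL archiving, the data-loss window for critical data is bounded by how often completed WAL segments reach the archive. Forcing a segment switch at least every 15 minutes keeps the archive within the critical-data RPO above; the value below is illustrative.
# postgresql.conf: force a WAL segment switch (and archive) at least every 15 minutes
archive_timeout = 900
# Tighter objectives are normally met with streaming replication rather than archiving alone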
Disaster Scenarios
- Hardware Failure: Server or storage system failure
- Software Corruption: Application or database corruption
- Natural Disasters: Fire, flood, earthquake, power outage
- Cyber Attacks: Ransomware, data breaches, DDoS attacks
- Human Error: Accidental deletion or misconfiguration
High Availability Setup
Database High Availability
# PostgreSQL streaming replication setup
# Master server configuration (postgresql.conf)
listen_addresses = '*'
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 64
synchronous_commit = on
synchronous_standby_names = 'standby1'
# Standby server configuration (PostgreSQL 12 and later: recovery.conf is no longer
# read; put these settings in postgresql.conf and create an empty standby.signal
# file in the data directory to enable standby mode)
primary_conninfo = 'host=master-server port=5432 user=replicator'
promote_trigger_file = '/opt/sysmanage/failover_trigger'
restore_command = 'cp /opt/sysmanage/wal_archive/%f %p'
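The standby is typically seeded from the primary with pg_basebackup. A minimal sketch, assuming the replicator role already exists on the primary, pg_hba.conf permits replication connections from the standby, and the standby's data directory is empty:
# Run as the postgres user on the standby host with PostgreSQL stopped
pg_basebackup -h master-server -p 5432 -U replicator \
--pgdata=/var/lib/postgresql/12/main \
--wal-method=stream \
--write-recovery-conf \
--progress
# --write-recovery-conf creates standby.signal and writes primary_conninfo
# into postgresql.auto.conf (PostgreSQL 12 and later); then start the server:
sudo systemctl start postgresql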
Application Load Balancing
# Nginx load balancer configuration
upstream sysmanage_backend {
server 192.168.1.10:8443 weight=3 max_fails=3 fail_timeout=30s;
server 192.168.1.11:8443 weight=2 max_fails=3 fail_timeout=30s backup;
}
server {
listen 443 ssl http2;
server_name sysmanage.example.com;
# Certificates from the SSL directory backed up above (file names are examples)
ssl_certificate /opt/sysmanage/ssl/sysmanage.crt;
ssl_certificate_key /opt/sysmanage/ssl/sysmanage.key;
location / {
proxy_pass https://sysmanage_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Health check
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 10s;
}
}
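After changing the load balancer configuration, validate and apply it without dropping active connections:
# Check configuration syntax, then reload
nginx -t && systemctl reload nginx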
Failover Procedures
Automatic Failover Script
#!/bin/bash
# Automatic failover script for database
MASTER_HOST="192.168.1.10"
STANDBY_HOST="192.168.1.11"
TRIGGER_FILE="/opt/sysmanage/failover_trigger"
LOG_FILE="/var/log/sysmanage_failover.log"
# Check master database availability
pg_isready -h $MASTER_HOST -p 5432 -U sysmanage
if [ $? -ne 0 ]; then
echo "$(date): Master database not responding, initiating failover" >> $LOG_FILE
# Trigger failover on standby
ssh postgres@$STANDBY_HOST "touch $TRIGGER_FILE"
if [ $? -eq 0 ]; then
echo "$(date): Failover triggered successfully" >> $LOG_FILE
# Update DNS or load balancer to point to standby
# Update application configuration
# Send notifications
mail -s "SysManage Failover Initiated" admin@example.com <<< \
"Database failover has been initiated. Standby server is now primary."
else
echo "$(date): Failed to trigger failover" >> $LOG_FILE
fi
fi
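The script above performs a single check per run, so it needs to be executed on an interval, for example from cron on a monitoring host. The script name and one-minute interval below are assumptions; in practice, require several consecutive failures before promoting the standby to reduce the risk of split-brain.
# Run the failover check every minute
* * * * * /opt/sysmanage/scripts/failover_check.sh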
Manual Failover Checklist
- Assess the situation and confirm need for failover
- Notify stakeholders of planned failover
- Stop application services on primary server
- Verify data synchronization on standby
- Promote standby to primary role (promotion commands are sketched after this list)
- Update DNS records or load balancer configuration
- Start application services on new primary
- Verify system functionality
- Document the failover process and issues
- Plan for recovery of original primary server
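The promotion step can also be performed directly on the standby when the trigger-file mechanism is not used. A sketch using standard PostgreSQL commands (the data directory path matches the one used elsewhere in this guide; pg_ctl may live under the PostgreSQL bin directory on some distributions):
# On the standby host: promote it to accept writes
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/12/main
# Alternatively, from a superuser session (PostgreSQL 12 and later):
# SELECT pg_promote();
# Confirm the server has left recovery mode (should return 'f')
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"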
Data Restoration Procedures
Database Restoration
Full Database Restore
#!/bin/bash
# Full database restore script
BACKUP_FILE="$1"
DB_NAME="sysmanage"
DB_USER="sysmanage"
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 "
exit 1
fi
# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
echo "Backup file not found: $BACKUP_FILE"
exit 1
fi
# Decompress gzip-compressed backups (pg_restore cannot read a .gz archive directly)
if [[ "$BACKUP_FILE" == *.gz ]]; then
gunzip -k "$BACKUP_FILE"
BACKUP_FILE="${BACKUP_FILE%.gz}"
fi
# Stop SysManage services
systemctl stop sysmanage-backend
systemctl stop sysmanage-worker
# Drop existing database (be careful!)
read -p "Are you sure you want to drop the existing database? (yes/no): " confirm
if [ "$confirm" = "yes" ]; then
dropdb -h localhost -U postgres $DB_NAME
# Restore database
pg_restore -h localhost -U postgres \
--create --verbose \
--dbname=postgres \
"$BACKUP_FILE"
if [ $? -eq 0 ]; then
echo "Database restored successfully"
# Start services
systemctl start sysmanage-backend
systemctl start sysmanage-worker
echo "SysManage services restarted"
else
echo "Database restore failed"
exit 1
fi
else
echo "Database restore cancelled"
exit 1
fi
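Example invocation, assuming the script above is saved as restore_database.sh and run as root:
sudo ./restore_database.sh /opt/sysmanage/backups/sysmanage_full_20240315_020000.backup.gz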
Point-in-Time Recovery
#!/bin/bash
# Point-in-time recovery script
BACKUP_DIR="$1"
RECOVERY_TIME="$2"
DATA_DIR="/var/lib/postgresql/12/main"
if [ -z "$BACKUP_DIR" ] || [ -z "$RECOVERY_TIME" ]; then
echo "Usage: $0 "
echo "Example: $0 /opt/backups/basebackup_20240315 '2024-03-15 14:30:00'"
exit 1
fi
# Stop PostgreSQL
systemctl stop postgresql
# Remove current data directory
rm -rf $DATA_DIR/*
# Extract base backup (pg_basebackup --format=tar produces base.tar and pg_wal.tar)
tar -xf "$BACKUP_DIR/base.tar" -C $DATA_DIR
tar -xf "$BACKUP_DIR/pg_wal.tar" -C $DATA_DIR/pg_wal
# Create recovery configuration (PostgreSQL 12 and later: recovery settings go in
# postgresql.auto.conf and an empty recovery.signal file replaces recovery.conf)
cat >> $DATA_DIR/postgresql.auto.conf << EOF
restore_command = 'cp /opt/sysmanage/wal_archive/%f %p'
recovery_target_time = '$RECOVERY_TIME'
recovery_target_action = 'promote'
EOF
touch $DATA_DIR/recovery.signal
# Set permissions
chown -R postgres:postgres $DATA_DIR
chmod 700 $DATA_DIR
# Start PostgreSQL
systemctl start postgresql
echo "Point-in-time recovery initiated to: $RECOVERY_TIME"
Configuration Restoration
System Configuration Restore
#!/bin/bash
# Configuration restore script
CONFIG_BACKUP="$1"
RESTORE_DIR="/opt/sysmanage/restore"
if [ -z "$CONFIG_BACKUP" ]; then
echo "Usage: $0 "
exit 1
fi
# Create restore directory
mkdir -p $RESTORE_DIR
# Extract configuration backup
tar -xzf "$CONFIG_BACKUP" -C $RESTORE_DIR
# Stop services
systemctl stop sysmanage-backend
systemctl stop sysmanage-worker
systemctl stop nginx
# Backup current configuration
cp -r /opt/sysmanage/config /opt/sysmanage/config.backup.$(date +%Y%m%d_%H%M%S)
# Restore configuration files
cp -r $RESTORE_DIR/opt/sysmanage/config/* /opt/sysmanage/config/
cp $RESTORE_DIR/etc/nginx/sites-available/sysmanage /etc/nginx/sites-available/
cp -r $RESTORE_DIR/opt/sysmanage/ssl/* /opt/sysmanage/ssl/
# Set proper permissions
chown -R sysmanage:sysmanage /opt/sysmanage/config
chown -R sysmanage:sysmanage /opt/sysmanage/ssl
chmod 600 /opt/sysmanage/ssl/*.key
# Reload systemd and restart services
systemctl daemon-reload
systemctl start sysmanage-backend
systemctl start sysmanage-worker
systemctl start nginx
echo "Configuration restored successfully"
Backup Testing and Validation
Regular Restore Testing
Monthly Restore Test Procedure
- Select a recent backup for testing
- Set up an isolated test environment
- Perform full system restore on test environment
- Verify data integrity and completeness
- Test application functionality
- Document any issues or improvements needed
- Update restore procedures if necessary
Automated Backup Validation
#!/bin/bash
# Automated backup validation script
BACKUP_DIR="/opt/sysmanage/backups"
TEST_DB="sysmanage_test"
VALIDATION_LOG="/var/log/backup_validation.log"
echo "$(date): Starting backup validation" >> $VALIDATION_LOG
# Find latest backup
LATEST_BACKUP=$(ls -t $BACKUP_DIR/sysmanage_full_*.backup.gz | head -n1)
if [ -z "$LATEST_BACKUP" ]; then
echo "$(date): No backup files found" >> $VALIDATION_LOG
exit 1
fi
# Create test database
createdb -U postgres $TEST_DB
# Restore backup to test database (decompress first; pg_restore cannot read a .gz archive directly)
TMP_RESTORE=$(mktemp)
gunzip -c "$LATEST_BACKUP" > "$TMP_RESTORE"
pg_restore -U postgres -d $TEST_DB "$TMP_RESTORE"
RESTORE_STATUS=$?
rm -f "$TMP_RESTORE"
if [ $RESTORE_STATUS -eq 0 ]; then
echo "$(date): Backup restore successful: $LATEST_BACKUP" >> $VALIDATION_LOG
# Basic data validation
HOST_COUNT=$(psql -U postgres -d $TEST_DB -t -c "SELECT COUNT(*) FROM hosts;")
USER_COUNT=$(psql -U postgres -d $TEST_DB -t -c "SELECT COUNT(*) FROM users;")
echo "$(date): Hosts: $HOST_COUNT, Users: $USER_COUNT" >> $VALIDATION_LOG
# Cleanup test database
dropdb -U postgres $TEST_DB
echo "$(date): Backup validation completed successfully" >> $VALIDATION_LOG
else
echo "$(date): Backup validation failed for: $LATEST_BACKUP" >> $VALIDATION_LOG
# Send alert
mail -s "Backup Validation Failed" admin@example.com < $VALIDATION_LOG
fi
Disaster Recovery Testing
Annual DR Test Schedule
- Q1: Database failover test
- Q2: Complete system recovery test
- Q3: Network isolation recovery test
- Q4: Full disaster simulation
DR Test Checklist
- □ Backup integrity verification
- □ Network connectivity testing
- □ Database recovery testing
- □ Application functionality testing
- □ Performance baseline verification
- □ User access testing
- □ Monitoring system verification
- □ Documentation updates
Backup Best Practices
Strategy Best Practices
- 3-2-1 Rule: 3 copies of data, 2 different media types, 1 offsite copy (an offsite sync is sketched after this list)
- Regular Testing: Test backups regularly to ensure recoverability
- Automation: Automate backup processes to reduce human error
- Monitoring: Monitor backup jobs and alert on failures
- Documentation: Document all procedures and keep them current
- Security: Encrypt backups and secure storage locations
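One simple way to maintain the offsite copy from the 3-2-1 rule is to sync the backup directory to a remote host after each backup run. A sketch assuming SSH key authentication to a hypothetical offsite server:
# Mirror local backups to an offsite host (run after the nightly backup job)
rsync -az /opt/sysmanage/backups/ backup@offsite.example.com:/srv/sysmanage-backups/
# Apply a separate retention policy on the offsite host rather than mirroring deletions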
Operational Best Practices
- Retention Policies: Define and enforce appropriate retention periods
- Storage Management: Monitor storage capacity and plan for growth
- Performance Impact: Schedule backups during low-activity periods
- Version Control: Maintain multiple backup versions for different recovery scenarios
- Access Control: Limit access to backup systems and data
Security Best Practices
- Encryption: Encrypt backups both in transit and at rest
- Access Logging: Log all access to backup systems
- Separation: Keep backup systems separate from production
- Verification: Verify backup integrity and authenticity
- Incident Response: Include backup systems in incident response plans