Backup & Recovery
Comprehensive data protection strategies including backup procedures, disaster recovery planning, and data restoration for business continuity.
Overview
Data protection is critical for maintaining business continuity and protecting against data loss. SysManage backup and recovery procedures ensure that your monitoring infrastructure and historical data remain protected and can be quickly restored in case of system failures, disasters, or data corruption.
Backup Strategy Components
- Database Backups: Regular backups of all monitoring data and configurations
- Configuration Backups: System configuration files and settings
- Certificate Backups: SSL/TLS certificates and private keys
- Log Backups: Critical log files and audit trails
- Code and Scripts: Custom scripts and automation code
- Documentation: System documentation and procedures
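The scripts in this guide assume a backup layout along the following lines; the tree below is illustrative and can be adjusted to match your installation.
/opt/sysmanage/
├── backups/              # full database dumps (backup_database.sh)
│   ├── config/           # configuration archives (backup_config.sh)
│   └── certificates/     # encrypted certificate archives
├── wal_archive/          # archived WAL segments for point-in-time recovery
├── scripts/              # backup, cleanup, and failover scripts run from cron
└── ssl/                  # live certificates and private keys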
Database Backup Procedures
PostgreSQL Database Backup
SysManage uses PostgreSQL as its primary database. Regular backups are essential for data protection.
Full Database Backup
#!/bin/bash
# Full database backup script
BACKUP_DIR="/opt/sysmanage/backups"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="sysmanage"
DB_USER="sysmanage"
# Create backup directory
mkdir -p $BACKUP_DIR
# Perform full backup
pg_dump -h localhost -U $DB_USER -d $DB_NAME \
--verbose --clean --create --if-exists \
--format=custom \
--file="$BACKUP_DIR/sysmanage_full_$DATE.backup"
# Verify backup integrity before compressing (pg_restore cannot read a .gz archive directly)
pg_restore --list "$BACKUP_DIR/sysmanage_full_$DATE.backup" > /dev/null
if [ $? -eq 0 ]; then
# Compress the verified backup
gzip "$BACKUP_DIR/sysmanage_full_$DATE.backup"
echo "Backup completed successfully: sysmanage_full_$DATE.backup.gz"
else
echo "Backup verification failed!"
exit 1
fi
Incremental Backup with WAL Archiving
# PostgreSQL configuration for WAL archiving
# Add to postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'cp %p /opt/sysmanage/wal_archive/%f'
max_wal_senders = 3
wal_keep_segments = 64    # use wal_keep_size instead in PostgreSQL 13 and later
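The archive_command above assumes the archive directory already exists and is writable by the postgres user. A minimal preparation sketch (the service name may differ on your distribution):
# Create the WAL archive directory and make it writable by PostgreSQL
sudo mkdir -p /opt/sysmanage/wal_archive
sudo chown postgres:postgres /opt/sysmanage/wal_archive
sudo chmod 700 /opt/sysmanage/wal_archive
# Changing wal_level or archive_mode requires a server restart
sudo systemctl restart postgresql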
Point-in-Time Recovery Setup
#!/bin/bash
# Base backup for PITR
BACKUP_DIR="/opt/sysmanage/backups"
DATE=$(date +%Y%m%d_%H%M%S)
# Create base backup
pg_basebackup -h localhost -U postgres \
--pgdata="$BACKUP_DIR/basebackup_$DATE" \
--format=tar \
--wal-method=stream \
--checkpoint=fast \
--progress \
--verbose
echo "Base backup completed: basebackup_$DATE"
Automated Backup Scheduling
Cron Schedule Example
# Add to crontab for automated backups
# Full backup daily at 2 AM
0 2 * * * /opt/sysmanage/scripts/backup_database.sh
# Configuration backup weekly on Sunday at 3 AM
0 3 * * 0 /opt/sysmanage/scripts/backup_config.sh
# Log rotation and backup daily at 4 AM
0 4 * * * /opt/sysmanage/scripts/backup_logs.sh
# Cleanup old backups weekly on Monday at 5 AM
0 5 * * 1 /opt/sysmanage/scripts/cleanup_backups.sh
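The cleanup script referenced above is not shown elsewhere in this guide; the following is a minimal sketch assuming a 30-day retention for database dumps and 90 days for configuration and certificate archives (adjust to your retention policy):
#!/bin/bash
# Remove backups that have aged out of the retention window
BACKUP_DIR="/opt/sysmanage/backups"
# Database dumps: keep 30 days
find "$BACKUP_DIR" -maxdepth 1 -name 'sysmanage_full_*.backup.gz' -mtime +30 -delete
# Configuration and certificate archives: keep 90 days
find "$BACKUP_DIR/config" -name 'sysmanage_config_*.tar.gz' -mtime +90 -delete
find "$BACKUP_DIR/certificates" -name 'certificates_*.tar.gz.enc' -mtime +90 -delete
# WAL segments older than the newest base backup can be pruned with pg_archivecleanup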
Backup Verification Script
#!/bin/bash
# Backup verification and notification
BACKUP_DIR="/opt/sysmanage/backups"
TODAY=$(date +%Y%m%d)
# Check if today's backup exists (a glob cannot be tested directly with -f)
BACKUP_FILE=$(ls "$BACKUP_DIR"/sysmanage_full_"${TODAY}"_*.backup.gz 2>/dev/null | head -n1)
if [ -n "$BACKUP_FILE" ]; then
echo "✓ Database backup found for today"
# Test restore capability (decompress first; pg_restore cannot read a .gz archive directly)
TMP_BACKUP=$(mktemp)
gunzip -c "$BACKUP_FILE" > "$TMP_BACKUP"
pg_restore --list "$TMP_BACKUP" > /dev/null
if [ $? -eq 0 ]; then
echo "✓ Backup integrity verified"
else
echo "✗ Backup integrity check failed"
# Send alert
mail -s "SysManage Backup Integrity Alert" admin@example.com <<< \
"Backup integrity check failed for $BACKUP_FILE"
fi
rm -f "$TMP_BACKUP"
else
echo "✗ No backup found for today"
# Send alert
mail -s "SysManage Backup Missing Alert" admin@example.com <<< \
"No database backup found for $TODAY in $BACKUP_DIR"
fi
Configuration Backup
System Configuration Files
Critical Configuration Files
- Application Config: /opt/sysmanage/config/
- Nginx Configuration: /etc/nginx/sites-available/sysmanage
- PostgreSQL Config: /etc/postgresql/*/main/
- SSL Certificates: /opt/sysmanage/ssl/
- Systemd Services: /etc/systemd/system/sysmanage*
- Log Configuration: /opt/sysmanage/logs/
Configuration Backup Script
#!/bin/bash
# Configuration backup script
BACKUP_DIR="/opt/sysmanage/backups/config"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="sysmanage_config_$DATE.tar.gz"
mkdir -p $BACKUP_DIR
# Create configuration backup
tar -czf "$BACKUP_DIR/$BACKUP_FILE" \
/opt/sysmanage/config/ \
/etc/nginx/sites-available/sysmanage \
/etc/postgresql/*/main/ \
/opt/sysmanage/ssl/ \
/etc/systemd/system/sysmanage* \
--exclude="*.log" \
--exclude="*.pid"
# Verify backup
if [ -f "$BACKUP_DIR/$BACKUP_FILE" ]; then
echo "Configuration backup completed: $BACKUP_FILE"
# Test archive integrity
tar -tzf "$BACKUP_DIR/$BACKUP_FILE" > /dev/null
if [ $? -eq 0 ]; then
echo "Configuration backup verified successfully"
else
echo "Configuration backup verification failed"
exit 1
fi
else
echo "Configuration backup failed"
exit 1
fi
Certificate Management
SSL Certificate Backup
#!/bin/bash
# SSL certificate backup and monitoring
CERT_DIR="/opt/sysmanage/ssl"
BACKUP_DIR="/opt/sysmanage/backups/certificates"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
# Backup certificates with encryption (keep the passphrase in a root-only key file
# rather than hard-coding it in the script)
tar -czf - -C $CERT_DIR . | \
openssl enc -aes-256-cbc -salt -pbkdf2 -pass file:/path/to/encryption-key-file > \
"$BACKUP_DIR/certificates_$DATE.tar.gz.enc"
# Check certificate expiration
for cert in $CERT_DIR/*.crt $CERT_DIR/*.pem; do
if [ -f "$cert" ]; then
expiry=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
expiry_epoch=$(date -d "$expiry" +%s)
current_epoch=$(date +%s)
days_until_expiry=$(( (expiry_epoch - current_epoch) / 86400 ))
if [ $days_until_expiry -lt 30 ]; then
echo "WARNING: Certificate $cert expires in $days_until_expiry days"
# Send notification
mail -s "Certificate Expiration Warning" admin@example.com <<< \
"Certificate $cert expires in $days_until_expiry days on $expiry"
fi
fi
done
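To restore certificates from the encrypted archive, reverse the pipeline. A sketch assuming the same passphrase file that was used for encryption (replace <DATE> with the timestamp of the archive you are restoring):
# Decrypt the certificate archive and extract it into the SSL directory
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:/path/to/encryption-key-file \
-in /opt/sysmanage/backups/certificates/certificates_<DATE>.tar.gz.enc | \
tar -xzf - -C /opt/sysmanage/ssl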
Disaster Recovery Planning
Recovery Strategy
Recovery Time Objectives (RTO)
- Critical Systems: 2 hours maximum downtime
- Standard Systems: 8 hours maximum downtime
- Development Systems: 24 hours maximum downtime
- Archive Systems: 72 hours maximum downtime
Recovery Point Objectives (RPO)
- Critical Data: 15 minutes maximum data loss (see the archive_timeout sketch after this list)
- Standard Data: 4 hours maximum data loss
- Archive Data: 24 hours maximum data loss
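With WAL archiving, the data-loss window for critical data is bounded by how often completed WAL segments reach the archive. Forcing a segment switch at least every 15 minutes keeps the archive within the critical-data RPO above; the value below is illustrative.
# postgresql.conf: force a WAL segment switch (and archive) at least every 15 minutes
archive_timeout = 900
# Tighter objectives are normally met with streaming replication rather than archiving alone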
Disaster Scenarios
- Hardware Failure: Server or storage system failure
- Software Corruption: Application or database corruption
- Natural Disasters: Fire, flood, earthquake, power outage
- Cyber Attacks: Ransomware, data breaches, DDoS attacks
- Human Error: Accidental deletion or misconfiguration
High Availability Setup
Database High Availability
# PostgreSQL streaming replication setup
# Master server configuration (postgresql.conf)
listen_addresses = '*'
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 64
synchronous_commit = on
synchronous_standby_names = 'standby1'
# Standby server configuration (PostgreSQL 12 and later: recovery.conf is no longer
# read; put these settings in postgresql.conf and create an empty standby.signal
# file in the data directory to enable standby mode)
primary_conninfo = 'host=master-server port=5432 user=replicator'
promote_trigger_file = '/opt/sysmanage/failover_trigger'
restore_command = 'cp /opt/sysmanage/wal_archive/%f %p'
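The standby is typically seeded from the primary with pg_basebackup. A minimal sketch, assuming the replicator role already exists on the primary, pg_hba.conf permits replication connections from the standby, and the standby's data directory is empty:
# Run as the postgres user on the standby host with PostgreSQL stopped
pg_basebackup -h master-server -p 5432 -U replicator \
--pgdata=/var/lib/postgresql/12/main \
--wal-method=stream \
--write-recovery-conf \
--progress
# --write-recovery-conf creates standby.signal and writes primary_conninfo
# into postgresql.auto.conf (PostgreSQL 12 and later); then start the server:
sudo systemctl start postgresql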
Application Load Balancing
# Nginx load balancer configuration
upstream sysmanage_backend {
server 192.168.1.10:8443 weight=3 max_fails=3 fail_timeout=30s;
server 192.168.1.11:8443 weight=2 max_fails=3 fail_timeout=30s backup;
}
server {
listen 443 ssl http2;
server_name sysmanage.example.com;
# Certificates from the SSL directory backed up above (file names are examples)
ssl_certificate /opt/sysmanage/ssl/sysmanage.crt;
ssl_certificate_key /opt/sysmanage/ssl/sysmanage.key;
location / {
proxy_pass https://sysmanage_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Health check
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 10s;
}
}
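After changing the load balancer configuration, validate and apply it without dropping active connections:
# Check configuration syntax, then reload
nginx -t && systemctl reload nginx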
Failover Procedures
Automatic Failover Script
#!/bin/bash
# Automatic failover script for database
MASTER_HOST="192.168.1.10"
STANDBY_HOST="192.168.1.11"
TRIGGER_FILE="/opt/sysmanage/failover_trigger"
LOG_FILE="/var/log/sysmanage_failover.log"
# Check master database availability
pg_isready -h $MASTER_HOST -p 5432 -U sysmanage
if [ $? -ne 0 ]; then
echo "$(date): Master database not responding, initiating failover" >> $LOG_FILE
# Trigger failover on standby
ssh postgres@$STANDBY_HOST "touch $TRIGGER_FILE"
if [ $? -eq 0 ]; then
echo "$(date): Failover triggered successfully" >> $LOG_FILE
# Update DNS or load balancer to point to standby
# Update application configuration
# Send notifications
mail -s "SysManage Failover Initiated" admin@example.com <<< \
"Database failover has been initiated. Standby server is now primary."
else
echo "$(date): Failed to trigger failover" >> $LOG_FILE
fi
fi
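The script above performs a single check per run, so it needs to be executed on an interval, for example from cron on a monitoring host. The script name and one-minute interval below are assumptions; in practice, require several consecutive failures before promoting the standby to reduce the risk of split-brain.
# Run the failover check every minute
* * * * * /opt/sysmanage/scripts/failover_check.sh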
Manual Failover Checklist
- Assess the situation and confirm need for failover
- Notify stakeholders of planned failover
- Stop application services on primary server
- Verify data synchronization on standby
- Promote standby to primary role (promotion commands are sketched after this list)
- Update DNS records or load balancer configuration
- Start application services on new primary
- Verify system functionality
- Document the failover process and issues
- Plan for recovery of original primary server
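The promotion step can also be performed directly on the standby when the trigger-file mechanism is not used. A sketch using standard PostgreSQL commands (the data directory path matches the one used elsewhere in this guide; pg_ctl may live under the PostgreSQL bin directory on some distributions):
# On the standby host: promote it to accept writes
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/12/main
# Alternatively, from a superuser session (PostgreSQL 12 and later):
# SELECT pg_promote();
# Confirm the server has left recovery mode (should return 'f')
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"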
Data Restoration Procedures
Database Restoration
Full Database Restore
#!/bin/bash
# Full database restore script
BACKUP_FILE="$1"
DB_NAME="sysmanage"
DB_USER="sysmanage"
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 "
exit 1
fi
# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
echo "Backup file not found: $BACKUP_FILE"
exit 1
fi
# Decompress gzip-compressed backups (pg_restore cannot read a .gz archive directly)
if [[ "$BACKUP_FILE" == *.gz ]]; then
gunzip -k "$BACKUP_FILE"
BACKUP_FILE="${BACKUP_FILE%.gz}"
fi
# Stop SysManage services
systemctl stop sysmanage-backend
systemctl stop sysmanage-worker
# Drop existing database (be careful!)
read -p "Are you sure you want to drop the existing database? (yes/no): " confirm
if [ "$confirm" = "yes" ]; then
dropdb -h localhost -U postgres $DB_NAME
# Restore database
pg_restore -h localhost -U postgres \
--create --verbose \
--dbname=postgres \
"$BACKUP_FILE"
if [ $? -eq 0 ]; then
echo "Database restored successfully"
# Start services
systemctl start sysmanage-backend
systemctl start sysmanage-worker
echo "SysManage services restarted"
else
echo "Database restore failed"
exit 1
fi
else
echo "Database restore cancelled"
exit 1
fi
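Example invocation, assuming the script above is saved as restore_database.sh and run as root:
sudo ./restore_database.sh /opt/sysmanage/backups/sysmanage_full_20240315_020000.backup.gz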
Point-in-Time Recovery
#!/bin/bash
# Point-in-time recovery script
BACKUP_DIR="$1"
RECOVERY_TIME="$2"
DATA_DIR="/var/lib/postgresql/12/main"
if [ -z "$BACKUP_DIR" ] || [ -z "$RECOVERY_TIME" ]; then
echo "Usage: $0 "
echo "Example: $0 /opt/backups/basebackup_20240315 '2024-03-15 14:30:00'"
exit 1
fi
# Stop PostgreSQL
systemctl stop postgresql
# Remove current data directory
rm -rf $DATA_DIR/*
# Extract base backup (pg_basebackup --format=tar produces base.tar and pg_wal.tar)
tar -xf "$BACKUP_DIR/base.tar" -C $DATA_DIR
tar -xf "$BACKUP_DIR/pg_wal.tar" -C $DATA_DIR/pg_wal
# Create recovery configuration (PostgreSQL 12 and later: recovery settings go in
# postgresql.auto.conf and an empty recovery.signal file replaces recovery.conf)
cat >> $DATA_DIR/postgresql.auto.conf << EOF
restore_command = 'cp /opt/sysmanage/wal_archive/%f %p'
recovery_target_time = '$RECOVERY_TIME'
recovery_target_action = 'promote'
EOF
touch $DATA_DIR/recovery.signal
# Set permissions
chown -R postgres:postgres $DATA_DIR
chmod 700 $DATA_DIR
# Start PostgreSQL
systemctl start postgresql
echo "Point-in-time recovery initiated to: $RECOVERY_TIME"
Configuration Restoration
System Configuration Restore
#!/bin/bash
# Configuration restore script
CONFIG_BACKUP="$1"
RESTORE_DIR="/opt/sysmanage/restore"
if [ -z "$CONFIG_BACKUP" ]; then
echo "Usage: $0 "
exit 1
fi
# Create restore directory
mkdir -p $RESTORE_DIR
# Extract configuration backup
tar -xzf "$CONFIG_BACKUP" -C $RESTORE_DIR
# Stop services
systemctl stop sysmanage-backend
systemctl stop sysmanage-worker
systemctl stop nginx
# Backup current configuration
cp -r /opt/sysmanage/config /opt/sysmanage/config.backup.$(date +%Y%m%d_%H%M%S)
# Restore configuration files
cp -r $RESTORE_DIR/opt/sysmanage/config/* /opt/sysmanage/config/
cp $RESTORE_DIR/etc/nginx/sites-available/sysmanage /etc/nginx/sites-available/
cp -r $RESTORE_DIR/opt/sysmanage/ssl/* /opt/sysmanage/ssl/
# Set proper permissions
chown -R sysmanage:sysmanage /opt/sysmanage/config
chown -R sysmanage:sysmanage /opt/sysmanage/ssl
chmod 600 /opt/sysmanage/ssl/*.key
# Reload systemd and restart services
systemctl daemon-reload
systemctl start sysmanage-backend
systemctl start sysmanage-worker
systemctl start nginx
echo "Configuration restored successfully"
Backup Testing and Validation
Regular Restore Testing
Monthly Restore Test Procedure
- Select a recent backup for testing
- Set up an isolated test environment
- Perform full system restore on test environment
- Verify data integrity and completeness
- Test application functionality
- Document any issues or improvements needed
- Update restore procedures if necessary
Automated Backup Validation
#!/bin/bash
# Automated backup validation script
BACKUP_DIR="/opt/sysmanage/backups"
TEST_DB="sysmanage_test"
VALIDATION_LOG="/var/log/backup_validation.log"
echo "$(date): Starting backup validation" >> $VALIDATION_LOG
# Find latest backup
LATEST_BACKUP=$(ls -t $BACKUP_DIR/sysmanage_full_*.backup.gz | head -n1)
if [ -z "$LATEST_BACKUP" ]; then
echo "$(date): No backup files found" >> $VALIDATION_LOG
exit 1
fi
# Create test database
createdb -U postgres $TEST_DB
# Restore backup to test database (decompress first; pg_restore cannot read a .gz archive directly)
TMP_RESTORE=$(mktemp)
gunzip -c "$LATEST_BACKUP" > "$TMP_RESTORE"
pg_restore -U postgres -d $TEST_DB "$TMP_RESTORE"
RESTORE_STATUS=$?
rm -f "$TMP_RESTORE"
if [ $RESTORE_STATUS -eq 0 ]; then
echo "$(date): Backup restore successful: $LATEST_BACKUP" >> $VALIDATION_LOG
# Basic data validation
HOST_COUNT=$(psql -U postgres -d $TEST_DB -t -c "SELECT COUNT(*) FROM hosts;")
USER_COUNT=$(psql -U postgres -d $TEST_DB -t -c "SELECT COUNT(*) FROM users;")
echo "$(date): Hosts: $HOST_COUNT, Users: $USER_COUNT" >> $VALIDATION_LOG
# Cleanup test database
dropdb -U postgres $TEST_DB
echo "$(date): Backup validation completed successfully" >> $VALIDATION_LOG
else
echo "$(date): Backup validation failed for: $LATEST_BACKUP" >> $VALIDATION_LOG
# Send alert
mail -s "Backup Validation Failed" admin@example.com < $VALIDATION_LOG
fi
Disaster Recovery Testing
Annual DR Test Schedule
- Q1: Database failover test
- Q2: Complete system recovery test
- Q3: Network isolation recovery test
- Q4: Full disaster simulation
DR Test Checklist
- □ Backup integrity verification
- □ Network connectivity testing
- □ Database recovery testing
- □ Application functionality testing
- □ Performance baseline verification
- □ User access testing
- □ Monitoring system verification
- □ Documentation updates
Backup Best Practices
Strategy Best Practices
- 3-2-1 Rule: 3 copies of data, 2 different media types, 1 offsite copy (an offsite sync is sketched after this list)
- Regular Testing: Test backups regularly to ensure recoverability
- Automation: Automate backup processes to reduce human error
- Monitoring: Monitor backup jobs and alert on failures
- Documentation: Document all procedures and keep them current
- Security: Encrypt backups and secure storage locations
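One simple way to maintain the offsite copy from the 3-2-1 rule is to sync the backup directory to a remote host after each backup run. A sketch assuming SSH key authentication to a hypothetical offsite server:
# Mirror local backups to an offsite host (run after the nightly backup job)
rsync -az /opt/sysmanage/backups/ backup@offsite.example.com:/srv/sysmanage-backups/
# Apply a separate retention policy on the offsite host rather than mirroring deletions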
Operational Best Practices
- Retention Policies: Define and enforce appropriate retention periods
- Storage Management: Monitor storage capacity and plan for growth
- Performance Impact: Schedule backups during low-activity periods
- Version Control: Maintain multiple backup versions for different recovery scenarios
- Access Control: Limit access to backup systems and data
Security Best Practices
- Encryption: Encrypt backups both in transit and at rest
- Access Logging: Log all access to backup systems
- Separation: Keep backup systems separate from production
- Verification: Verify backup integrity and authenticity
- Incident Response: Include backup systems in incident response plans