Documentation > Administration > Maintenance

Maintenance

Regular maintenance procedures, system updates, performance optimization, and capacity planning to ensure optimal SysManage performance and reliability.

Overview

Regular maintenance is essential for keeping SysManage running optimally and preventing issues before they impact operations. This includes routine system updates, performance monitoring, database maintenance, log management, and capacity planning to ensure the system scales effectively with your infrastructure growth.

Types of Maintenance

  • Preventive Maintenance: Scheduled tasks to prevent problems
  • Corrective Maintenance: Fixes for identified issues
  • Adaptive Maintenance: Updates for changing requirements
  • Perfective Maintenance: Performance and efficiency improvements
  • Emergency Maintenance: Critical issues requiring immediate attention

Maintenance Schedules

Daily Maintenance Tasks

Automated Daily Tasks

  • System Health Check: Verify all services are running
  • Database Backup: Automated daily backup verification
  • Log Rotation: Rotate and compress log files
  • Disk Space Monitoring: Check available disk space
  • Certificate Monitoring: Check SSL certificate expiration
  • Performance Metrics: Collect and analyze performance data

Daily Health Check Script

#!/bin/bash
# Daily SysManage health check

LOG_FILE="/var/log/sysmanage_health_check.log"
DATE=$(date)

echo "=== SysManage Health Check - $DATE ===" >> $LOG_FILE

# Check service status
services=("sysmanage-backend" "sysmanage-worker" "postgresql" "nginx")
for service in "${services[@]}"; do
    if systemctl is-active --quiet $service; then
        echo "✓ $service is running" >> $LOG_FILE
    else
        echo "✗ $service is not running" >> $LOG_FILE
        # Send alert
        mail -s "Service Down: $service" admin@example.com <<< "Service $service is not running on $(hostname)"
    fi
done

# Check database connectivity
if pg_isready -h localhost -p 5432 -U sysmanage; then
    echo "✓ Database is accessible" >> $LOG_FILE
else
    echo "✗ Database is not accessible" >> $LOG_FILE
    mail -s "Database Connection Failed" admin@example.com <<< "Database connection failed on $(hostname)"
fi

# Check disk space
df -h | awk '$5 > 85 {print "✗ Disk space warning: " $0}' >> $LOG_FILE

# Check memory usage
memory_usage=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
if (( $(echo "$memory_usage > 90" | bc -l) )); then
    echo "✗ Memory usage high: ${memory_usage}%" >> $LOG_FILE
fi

echo "=== Health Check Complete ===" >> $LOG_FILE

Weekly Maintenance Tasks

Scheduled Weekly Tasks

  • System Updates: Apply security patches and updates
  • Database Maintenance: VACUUM, ANALYZE, and index optimization
  • Log Analysis: Review system and application logs
  • Performance Review: Analyze performance trends
  • Backup Verification: Test backup integrity
  • Security Audit: Review security logs and access patterns

Database Maintenance Script

#!/bin/bash
# Weekly database maintenance

DB_NAME="sysmanage"
MAINTENANCE_LOG="/var/log/db_maintenance.log"

echo "$(date): Starting database maintenance" >> $MAINTENANCE_LOG

# Vacuum and analyze all tables
psql -U sysmanage -d $DB_NAME -c "VACUUM ANALYZE;" >> $MAINTENANCE_LOG 2>&1

# Reindex tables if needed
psql -U sysmanage -d $DB_NAME -c "REINDEX DATABASE $DB_NAME;" >> $MAINTENANCE_LOG 2>&1

# Update table statistics
psql -U sysmanage -d $DB_NAME -c "ANALYZE;" >> $MAINTENANCE_LOG 2>&1

# Check for bloated tables
psql -U sysmanage -d $DB_NAME -c "
    SELECT schemaname, tablename,
           pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
    LIMIT 10;
" >> $MAINTENANCE_LOG 2>&1

echo "$(date): Database maintenance completed" >> $MAINTENANCE_LOG

Monthly Maintenance Tasks

Comprehensive Monthly Tasks

  • System Audit: Complete system configuration review
  • Capacity Planning: Analyze growth trends and capacity needs
  • Performance Optimization: Identify and address performance bottlenecks
  • Security Review: Comprehensive security assessment
  • Documentation Update: Update system documentation
  • Disaster Recovery Test: Test backup and recovery procedures
  • User Access Review: Audit user accounts and permissions

Monthly Report Generation

#!/bin/bash
# Generate monthly maintenance report

REPORT_DIR="/opt/sysmanage/reports"
MONTH=$(date +%Y-%m)
REPORT_FILE="$REPORT_DIR/monthly_report_$MONTH.html"

mkdir -p $REPORT_DIR

cat > $REPORT_FILE << EOF



    SysManage Monthly Report - $MONTH
    


    

SysManage Monthly Report - $MONTH

Generated on: $(date)

System Overview

Total Hosts: $(psql -U sysmanage -d sysmanage -t -c "SELECT COUNT(*) FROM hosts WHERE active=true;")
Active Users: $(psql -U sysmanage -d sysmanage -t -c "SELECT COUNT(*) FROM users WHERE active=true;")
Database Size: $(psql -U sysmanage -d sysmanage -t -c "SELECT pg_size_pretty(pg_database_size('sysmanage'));")

Performance Metrics

EOF # Add performance data to report echo "
MetricAveragePeakStatus
" >> $REPORT_FILE echo "" >> $REPORT_FILE echo "Monthly report generated: $REPORT_FILE"

System Updates and Patches

Update Strategy

Update Categories

  • Security Updates: Critical security patches (immediate)
  • Bug Fixes: Application and system bug fixes (weekly)
  • Feature Updates: New features and improvements (monthly)
  • Major Releases: Major version upgrades (quarterly)

Update Process

  1. Review available updates and release notes
  2. Test updates in staging environment
  3. Schedule maintenance window
  4. Create system backup before updates
  5. Apply updates in controlled manner
  6. Verify system functionality post-update
  7. Monitor for any issues or regressions
  8. Document update process and results

Update Automation Script

#!/bin/bash
# System update automation

UPDATE_LOG="/var/log/system_updates.log"
BACKUP_DIR="/opt/sysmanage/backups/pre_update"

echo "$(date): Starting system updates" >> $UPDATE_LOG

# Create pre-update backup
mkdir -p $BACKUP_DIR
DATE=$(date +%Y%m%d_%H%M%S)

# Backup configuration
tar -czf "$BACKUP_DIR/config_backup_$DATE.tar.gz" /opt/sysmanage/config/

# Update system packages
apt update >> $UPDATE_LOG 2>&1
DEBIAN_FRONTEND=noninteractive apt upgrade -y >> $UPDATE_LOG 2>&1

# Update SysManage application
if [ -f "/opt/sysmanage/scripts/update.sh" ]; then
    /opt/sysmanage/scripts/update.sh >> $UPDATE_LOG 2>&1
fi

# Restart services if needed
if [ -f "/var/run/reboot-required" ]; then
    echo "$(date): System reboot required after updates" >> $UPDATE_LOG
    # Schedule reboot or notify administrators
    mail -s "Reboot Required After Updates" admin@example.com <<< "System reboot required on $(hostname) after applying updates"
fi

echo "$(date): System updates completed" >> $UPDATE_LOG

Rollback Procedures

Application Rollback

#!/bin/bash
# Application rollback script

BACKUP_VERSION="$1"
APP_DIR="/opt/sysmanage"
BACKUP_DIR="/opt/sysmanage/backups"

if [ -z "$BACKUP_VERSION" ]; then
    echo "Usage: $0 "
    echo "Available backups:"
    ls -la $BACKUP_DIR/app_backup_*
    exit 1
fi

# Stop services
systemctl stop sysmanage-backend
systemctl stop sysmanage-worker

# Backup current version
cp -r $APP_DIR "$BACKUP_DIR/rollback_backup_$(date +%Y%m%d_%H%M%S)"

# Restore previous version
tar -xzf "$BACKUP_DIR/app_backup_$BACKUP_VERSION.tar.gz" -C /opt/

# Restore configuration if needed
if [ -f "$BACKUP_DIR/config_backup_$BACKUP_VERSION.tar.gz" ]; then
    tar -xzf "$BACKUP_DIR/config_backup_$BACKUP_VERSION.tar.gz" -C /
fi

# Restart services
systemctl start sysmanage-backend
systemctl start sysmanage-worker

echo "Rollback to version $BACKUP_VERSION completed"

Performance Optimization

Database Performance Optimization

Query Performance Analysis

-- Identify slow queries
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Find missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND n_distinct > 100
  AND correlation < 0.1;

-- Check table bloat
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
       pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as table_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Index Optimization

-- Create indexes for common queries
CREATE INDEX CONCURRENTLY idx_metrics_timestamp
ON metrics (timestamp)
WHERE timestamp > NOW() - INTERVAL '30 days';

CREATE INDEX CONCURRENTLY idx_hosts_active
ON hosts (active, hostname)
WHERE active = true;

CREATE INDEX CONCURRENTLY idx_alerts_severity
ON alerts (severity, created_at)
WHERE resolved_at IS NULL;

Connection Pool Optimization

# PostgreSQL connection settings (postgresql.conf)

max_connections = 200
shared_buffers = 256MB
effective_cache_size = 1GB
work_mem = 4MB
maintenance_work_mem = 64MB

# Connection pooling with pgbouncer
[databases]
sysmanage = host=localhost port=5432 dbname=sysmanage

[pgbouncer]
listen_port = 6432
listen_addr = 127.0.0.1
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 25
max_client_conn = 200
reserve_pool_size = 5

Application Performance Optimization

Caching Strategy

# Redis configuration for caching
maxmemory 512mb
maxmemory-policy allkeys-lru

# Application caching configuration
CACHE_BACKEND = 'redis://localhost:6379/0'
CACHE_TIMEOUT = 300  # 5 minutes
CACHE_KEY_PREFIX = 'sysmanage:'

# Cache frequently accessed data
- Host information: 15 minutes
- User sessions: 1 hour
- System metrics: 5 minutes
- Configuration data: 30 minutes

Resource Optimization

  • Memory Management: Monitor memory usage and implement pagination
  • CPU Optimization: Profile CPU usage and optimize algorithms
  • I/O Optimization: Minimize disk I/O and use asynchronous operations
  • Network Optimization: Implement compression and connection pooling

Monitoring System Optimization

Data Collection Optimization

  • Collection Intervals: Adjust based on criticality and change frequency
  • Metric Granularity: Balance detail with storage requirements
  • Batch Processing: Collect multiple metrics in single operations
  • Data Compression: Compress data before storage and transmission

Storage Optimization

# Data retention configuration
retention_policies:
  high_resolution:    # 1-minute data
    duration: 24h
  medium_resolution:  # 5-minute data
    duration: 7d
  low_resolution:     # 1-hour data
    duration: 90d
  archive:           # Daily summaries
    duration: 1y

# Compression settings
compression:
  algorithm: lz4
  level: 1
  batch_size: 1000

Capacity Planning

Growth Trend Analysis

Capacity Metrics

  • Host Growth: Rate of new host additions
  • Data Volume: Database and storage growth rates
  • User Growth: Number of active users and sessions
  • Performance Trends: Response times and throughput changes
  • Resource Utilization: CPU, memory, and storage trends

Capacity Analysis Script

#!/bin/bash
# Capacity analysis and reporting

REPORT_FILE="/opt/sysmanage/reports/capacity_analysis.txt"

echo "=== SysManage Capacity Analysis - $(date) ===" > $REPORT_FILE

# Database size growth
echo "Database Growth:" >> $REPORT_FILE
psql -U sysmanage -d sysmanage -c "
    SELECT
        date_trunc('month', created_at) as month,
        COUNT(*) as new_hosts,
        pg_size_pretty(SUM(pg_column_size(ROW(*)))) as data_size
    FROM hosts
    WHERE created_at > NOW() - INTERVAL '12 months'
    GROUP BY date_trunc('month', created_at)
    ORDER BY month;
" >> $REPORT_FILE

# Storage utilization
echo -e "\nStorage Utilization:" >> $REPORT_FILE
df -h /opt/sysmanage >> $REPORT_FILE

# Memory usage trends
echo -e "\nMemory Usage:" >> $REPORT_FILE
free -h >> $REPORT_FILE

# Performance metrics
echo -e "\nPerformance Metrics:" >> $REPORT_FILE
psql -U sysmanage -d sysmanage -c "
    SELECT
        AVG(response_time) as avg_response_time,
        MAX(response_time) as max_response_time,
        COUNT(*) as total_requests
    FROM api_logs
    WHERE timestamp > NOW() - INTERVAL '24 hours';
" >> $REPORT_FILE

echo "Capacity analysis completed: $REPORT_FILE"

Scaling Strategies

Vertical Scaling (Scale Up)

  • CPU Upgrade: Increase CPU cores for better processing
  • Memory Expansion: Add RAM for better caching and performance
  • Storage Upgrade: Faster SSD storage for database performance
  • Network Upgrade: Higher bandwidth for large deployments

Horizontal Scaling (Scale Out)

  • Load Balancing: Distribute traffic across multiple servers
  • Database Clustering: PostgreSQL read replicas and clustering
  • Microservices: Split application into smaller, scalable services
  • Caching Layer: Distributed caching with Redis cluster

Scaling Thresholds

  • Host Count: Scale when approaching 1000+ hosts per server
  • CPU Usage: Scale when average CPU > 70% for extended periods
  • Memory Usage: Scale when memory > 80% consistently
  • Storage: Scale when storage > 85% full
  • Response Time: Scale when response times > 2 seconds

Log Management

Log Rotation and Archival

Log Rotation Configuration

# /etc/logrotate.d/sysmanage

/opt/sysmanage/logs/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 sysmanage sysmanage
    postrotate
        systemctl reload sysmanage-backend
    endscript
}

/var/log/postgresql/*.log {
    weekly
    missingok
    rotate 12
    compress
    delaycompress
    notifempty
    create 640 postgres postgres
}

Log Cleanup Script

#!/bin/bash
# Log cleanup and archival

LOG_DIR="/opt/sysmanage/logs"
ARCHIVE_DIR="/opt/sysmanage/archive"
RETENTION_DAYS=90

# Create archive directory
mkdir -p $ARCHIVE_DIR

# Archive logs older than 30 days
find $LOG_DIR -name "*.log.*" -mtime +30 -exec mv {} $ARCHIVE_DIR/ \;

# Compress archived logs
find $ARCHIVE_DIR -name "*.log.*" ! -name "*.gz" -exec gzip {} \;

# Remove logs older than retention period
find $ARCHIVE_DIR -name "*.gz" -mtime +$RETENTION_DAYS -delete

# Clean up empty directories
find $ARCHIVE_DIR -type d -empty -delete

echo "Log cleanup completed"

Log Analysis and Monitoring

Error Detection Script

#!/bin/bash
# Automated log analysis for errors

LOG_FILE="/opt/sysmanage/logs/backend.log"
ERROR_REPORT="/tmp/error_report.txt"
YESTERDAY=$(date -d "yesterday" +%Y-%m-%d)

# Search for errors in yesterday's logs
grep "$YESTERDAY" $LOG_FILE | grep -i "error\|exception\|critical" > $ERROR_REPORT

if [ -s $ERROR_REPORT ]; then
    ERROR_COUNT=$(wc -l < $ERROR_REPORT)
    echo "Found $ERROR_COUNT errors in yesterday's logs"

    # Send error report
    mail -s "SysManage Error Report - $YESTERDAY" admin@example.com < $ERROR_REPORT
fi

# Clean up
rm -f $ERROR_REPORT

Troubleshooting Procedures

Common Issues and Solutions

High CPU Usage

Diagnosis:
# Check top processes
top -p $(pgrep -d',' sysmanage)

# Check CPU usage by process
ps aux | grep sysmanage | sort -k3 -nr

# Monitor system load
uptime
iostat 1 5
Solutions:
  • Optimize database queries
  • Reduce data collection frequency
  • Scale up CPU resources
  • Implement caching

Memory Issues

Diagnosis:
# Check memory usage
free -h
cat /proc/meminfo

# Check for memory leaks
ps aux --sort=-%mem | head -20

# Monitor memory over time
sar -r 1 10
Solutions:
  • Restart affected services
  • Optimize data structures
  • Implement pagination
  • Add more RAM

Database Performance Issues

Diagnosis:
-- Check slow queries
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC;

-- Check connection count
SELECT count(*) FROM pg_stat_activity;

-- Check locks
SELECT * FROM pg_locks
WHERE NOT granted;
Solutions:
  • Optimize slow queries
  • Add missing indexes
  • Run VACUUM ANALYZE
  • Increase connection pool size

Diagnostic Tools and Commands

System Diagnostics

# System information
uname -a
lscpu
free -h
df -h

# Process monitoring
ps aux | grep sysmanage
netstat -tulpn | grep :8443
ss -tulpn | grep :5432

# Performance monitoring
vmstat 1 5
iostat -x 1 5
sar -u 1 5

Application Diagnostics

# Check service status
systemctl status sysmanage-backend
systemctl status sysmanage-worker
systemctl status postgresql

# View recent logs
journalctl -u sysmanage-backend -f
tail -f /opt/sysmanage/logs/backend.log

# Test connectivity
curl -k https://localhost:8443/api/health
pg_isready -h localhost -p 5432

Maintenance Best Practices

Planning and Scheduling

  • Maintenance Windows: Schedule during low-traffic periods
  • Change Management: Document all changes and approvals
  • Communication: Notify stakeholders of maintenance schedules
  • Rollback Plans: Always have rollback procedures ready
  • Testing: Test changes in staging before production

Execution Best Practices

  • Backup First: Always backup before making changes
  • Incremental Changes: Make small, manageable changes
  • Monitoring: Monitor systems during and after maintenance
  • Documentation: Document all procedures and results
  • Verification: Verify changes work as expected

Automation Best Practices

  • Script Testing: Test automation scripts thoroughly
  • Error Handling: Implement proper error handling and logging
  • Monitoring: Monitor automated processes
  • Alerting: Alert on automation failures
  • Regular Review: Review and update automation regularly