Maintenance
Regular maintenance procedures, system updates, performance optimization, and capacity planning to ensure optimal SysManage performance and reliability.
Overview
Regular maintenance is essential for keeping SysManage running optimally and preventing issues before they impact operations. This includes routine system updates, performance monitoring, database maintenance, log management, and capacity planning to ensure the system scales effectively with your infrastructure growth.
Types of Maintenance
- Preventive Maintenance: Scheduled tasks to prevent problems
- Corrective Maintenance: Fixes for identified issues
- Adaptive Maintenance: Updates for changing requirements
- Perfective Maintenance: Performance and efficiency improvements
- Emergency Maintenance: Critical issues requiring immediate attention
Maintenance Schedules
Daily Maintenance Tasks
Automated Daily Tasks
- System Health Check: Verify all services are running
- Database Backup: Automated daily backup verification
- Log Rotation: Rotate and compress log files
- Disk Space Monitoring: Check available disk space
- Certificate Monitoring: Check SSL certificate expiration
- Performance Metrics: Collect and analyze performance data
Daily Health Check Script
#!/bin/bash
# Daily SysManage health check
LOG_FILE="/var/log/sysmanage_health_check.log"
DATE=$(date)
echo "=== SysManage Health Check - $DATE ===" >> $LOG_FILE
# Check service status
services=("sysmanage-backend" "sysmanage-worker" "postgresql" "nginx")
for service in "${services[@]}"; do
    if systemctl is-active --quiet "$service"; then
        echo "✓ $service is running" >> $LOG_FILE
    else
        echo "✗ $service is not running" >> $LOG_FILE
        # Send alert
        mail -s "Service Down: $service" admin@example.com <<< "Service $service is not running on $(hostname)"
    fi
done
# Check database connectivity
if pg_isready -h localhost -p 5432 -U sysmanage; then
    echo "✓ Database is accessible" >> $LOG_FILE
else
    echo "✗ Database is not accessible" >> $LOG_FILE
    mail -s "Database Connection Failed" admin@example.com <<< "Database connection failed on $(hostname)"
fi
# Check disk space (warn when any filesystem is over 85% used)
df -hP | awk 'NR>1 && int($5) > 85 {print "✗ Disk space warning: " $0}' >> $LOG_FILE
# Check memory usage
memory_usage=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
if (( $(echo "$memory_usage > 90" | bc -l) )); then
    echo "✗ Memory usage high: ${memory_usage}%" >> $LOG_FILE
fi
echo "=== Health Check Complete ===" >> $LOG_FILE
Weekly Maintenance Tasks
Scheduled Weekly Tasks
- System Updates: Apply security patches and updates
- Database Maintenance: VACUUM, ANALYZE, and index optimization
- Log Analysis: Review system and application logs
- Performance Review: Analyze performance trends
- Backup Verification: Test backup integrity
- Security Audit: Review security logs and access patterns
Database Maintenance Script
#!/bin/bash
# Weekly database maintenance
DB_NAME="sysmanage"
MAINTENANCE_LOG="/var/log/db_maintenance.log"
echo "$(date): Starting database maintenance" >> $MAINTENANCE_LOG
# Vacuum and analyze all tables
psql -U sysmanage -d $DB_NAME -c "VACUUM ANALYZE;" >> $MAINTENANCE_LOG 2>&1
# Reindex tables if needed
psql -U sysmanage -d $DB_NAME -c "REINDEX DATABASE $DB_NAME;" >> $MAINTENANCE_LOG 2>&1
# Update table statistics
psql -U sysmanage -d $DB_NAME -c "ANALYZE;" >> $MAINTENANCE_LOG 2>&1
# Check for bloated tables
psql -U sysmanage -d $DB_NAME -c "
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
" >> $MAINTENANCE_LOG 2>&1
echo "$(date): Database maintenance completed" >> $MAINTENANCE_LOG
Monthly Maintenance Tasks
Comprehensive Monthly Tasks
- System Audit: Complete system configuration review
- Capacity Planning: Analyze growth trends and capacity needs
- Performance Optimization: Identify and address performance bottlenecks
- Security Review: Comprehensive security assessment
- Documentation Update: Update system documentation
- Disaster Recovery Test: Test backup and recovery procedures
- User Access Review: Audit user accounts and permissions
Monthly Report Generation
#!/bin/bash
# Generate monthly maintenance report
REPORT_DIR="/opt/sysmanage/reports"
MONTH=$(date +%Y-%m)
REPORT_FILE="$REPORT_DIR/monthly_report_$MONTH.html"
mkdir -p $REPORT_DIR
cat > $REPORT_FILE << EOF
<html>
<head><title>SysManage Monthly Report - $MONTH</title></head>
<body>
<h1>SysManage Monthly Report - $MONTH</h1>
<p>Generated on: $(date)</p>
<h2>System Overview</h2>
<ul>
<li>Total Hosts: $(psql -U sysmanage -d sysmanage -t -c "SELECT COUNT(*) FROM hosts WHERE active=true;")</li>
<li>Active Users: $(psql -U sysmanage -d sysmanage -t -c "SELECT COUNT(*) FROM users WHERE active=true;")</li>
<li>Database Size: $(psql -U sysmanage -d sysmanage -t -c "SELECT pg_size_pretty(pg_database_size('sysmanage'));")</li>
</ul>
<h2>Performance Metrics</h2>
<table border="1">
<tr><th>Metric</th><th>Average</th><th>Peak</th><th>Status</th></tr>
EOF
# Add performance data rows to the table here, then close out the document
echo "</table>" >> $REPORT_FILE
echo "</body></html>" >> $REPORT_FILE
echo "Monthly report generated: $REPORT_FILE"
System Updates and Patches
Update Strategy
Update Categories
- Security Updates: Critical security patches (immediate)
- Bug Fixes: Application and system bug fixes (weekly)
- Feature Updates: New features and improvements (monthly)
- Major Releases: Major version upgrades (quarterly)
Update Process
- Review available updates and release notes
- Test updates in staging environment
- Schedule maintenance window
- Create system backup before updates
- Apply updates in controlled manner
- Verify system functionality post-update
- Monitor for any issues or regressions
- Document update process and results
Update Automation Script
#!/bin/bash
# System update automation
UPDATE_LOG="/var/log/system_updates.log"
BACKUP_DIR="/opt/sysmanage/backups/pre_update"
echo "$(date): Starting system updates" >> $UPDATE_LOG
# Create pre-update backup
mkdir -p $BACKUP_DIR
DATE=$(date +%Y%m%d_%H%M%S)
# Backup configuration
tar -czf "$BACKUP_DIR/config_backup_$DATE.tar.gz" /opt/sysmanage/config/
# Update system packages
apt-get update >> $UPDATE_LOG 2>&1
DEBIAN_FRONTEND=noninteractive apt-get upgrade -y >> $UPDATE_LOG 2>&1
# Update SysManage application
if [ -f "/opt/sysmanage/scripts/update.sh" ]; then
/opt/sysmanage/scripts/update.sh >> $UPDATE_LOG 2>&1
fi
# Restart services if needed
if [ -f "/var/run/reboot-required" ]; then
echo "$(date): System reboot required after updates" >> $UPDATE_LOG
# Schedule reboot or notify administrators
mail -s "Reboot Required After Updates" admin@example.com <<< "System reboot required on $(hostname) after applying updates"
fi
echo "$(date): System updates completed" >> $UPDATE_LOG
Rollback Procedures
Application Rollback
#!/bin/bash
# Application rollback script
BACKUP_VERSION="$1"
APP_DIR="/opt/sysmanage"
BACKUP_DIR="/opt/sysmanage/backups"
if [ -z "$BACKUP_VERSION" ]; then
echo "Usage: $0 "
echo "Available backups:"
ls -la $BACKUP_DIR/app_backup_*
exit 1
fi
# Stop services
systemctl stop sysmanage-backend
systemctl stop sysmanage-worker
# Backup current version (excluding the backups directory so it is not copied into itself)
tar --exclude='sysmanage/backups' -czf "$BACKUP_DIR/rollback_backup_$(date +%Y%m%d_%H%M%S).tar.gz" -C /opt sysmanage
# Restore previous version
tar -xzf "$BACKUP_DIR/app_backup_$BACKUP_VERSION.tar.gz" -C /opt/
# Restore configuration if needed
if [ -f "$BACKUP_DIR/config_backup_$BACKUP_VERSION.tar.gz" ]; then
tar -xzf "$BACKUP_DIR/config_backup_$BACKUP_VERSION.tar.gz" -C /
fi
# Restart services
systemctl start sysmanage-backend
systemctl start sysmanage-worker
echo "Rollback to version $BACKUP_VERSION completed"
Performance Optimization
Database Performance Optimization
Query Performance Analysis
-- Identify slow queries
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Find missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100
AND correlation < 0.1;
-- Check table bloat
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as table_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
Index Optimization
-- Create indexes for common queries
-- Partial index predicates must use immutable expressions, so NOW() cannot
-- appear here; index the full column and let the planner use it for recent ranges
CREATE INDEX CONCURRENTLY idx_metrics_timestamp
ON metrics (timestamp);
CREATE INDEX CONCURRENTLY idx_hosts_active
ON hosts (active, hostname)
WHERE active = true;
CREATE INDEX CONCURRENTLY idx_alerts_severity
ON alerts (severity, created_at)
WHERE resolved_at IS NULL;
Connection Pool Optimization
# PostgreSQL connection settings (postgresql.conf)
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 1GB
work_mem = 4MB
maintenance_work_mem = 64MB
# Connection pooling with pgbouncer
[databases]
sysmanage = host=localhost port=5432 dbname=sysmanage
[pgbouncer]
listen_port = 6432
listen_addr = 127.0.0.1
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 25
max_client_conn = 200
reserve_pool_size = 5
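With pgbouncer in place, the application connects to port 6432 instead of 5432. A quick way to confirm the pooler is answering, assuming the settings above and a sysmanage entry in userlist.txt (querying the virtual pgbouncer database additionally requires the user to be listed in stats_users or admin_users):
# Connect through the pooler instead of PostgreSQL directly
psql -h 127.0.0.1 -p 6432 -U sysmanage -d sysmanage -c "SELECT 1;"
# Inspect pool usage through pgbouncer's admin console
psql -h 127.0.0.1 -p 6432 -U sysmanage -d pgbouncer -c "SHOW POOLS;"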
Application Performance Optimization
Caching Strategy
# Redis configuration for caching
maxmemory 512mb
maxmemory-policy allkeys-lru
# Application caching configuration
CACHE_BACKEND = 'redis://localhost:6379/0'
CACHE_TIMEOUT = 300 # 5 minutes
CACHE_KEY_PREFIX = 'sysmanage:'
# Cache frequently accessed data
- Host information: 15 minutes
- User sessions: 1 hour
- System metrics: 5 minutes
- Configuration data: 30 minutes
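Cache behavior can be spot-checked directly with redis-cli. The keys below are illustrative rather than SysManage's actual cache keys; they simply follow the sysmanage: prefix and the 15-minute host-information TTL listed above.
# Write an entry with a 15-minute TTL, matching the host-information policy above
redis-cli SET "sysmanage:host:web01" '{"status": "up"}' EX 900
# Inspect the remaining TTL, or evict the entry manually
redis-cli TTL "sysmanage:host:web01"
redis-cli DEL "sysmanage:host:web01"
# Watch overall cache effectiveness (keyspace_hits vs keyspace_misses)
redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'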
Resource Optimization
- Memory Management: Monitor memory usage and implement pagination
- CPU Optimization: Profile CPU usage and optimize algorithms
- I/O Optimization: Minimize disk I/O and use asynchronous operations
- Network Optimization: Implement compression and connection pooling
Monitoring System Optimization
Data Collection Optimization
- Collection Intervals: Adjust based on criticality and change frequency
- Metric Granularity: Balance detail with storage requirements
- Batch Processing: Collect multiple metrics in single operations
- Data Compression: Compress data before storage and transmission
Storage Optimization
# Data retention configuration
retention_policies:
  high_resolution:      # 1-minute data
    duration: 24h
  medium_resolution:    # 5-minute data
    duration: 7d
  low_resolution:       # 1-hour data
    duration: 90d
  archive:              # Daily summaries
    duration: 1y
# Compression settings
compression:
  algorithm: lz4
  level: 1
  batch_size: 1000
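The retention policy has to be enforced by a scheduled job. If metrics live in plain PostgreSQL tables, a sketch along these lines prunes expired raw data; the metrics table and timestamp column are assumptions carried over from the index examples earlier on this page, and downsampling into the medium- and low-resolution tiers is a separate step.
#!/bin/bash
# Prune raw metric data past its retention window (sketch)
PRUNE_LOG="/var/log/metrics_retention.log"
echo "$(date): Pruning expired metrics" >> $PRUNE_LOG
# Keep 90 days of data, matching the low_resolution policy above
psql -U sysmanage -d sysmanage -c "
DELETE FROM metrics
WHERE timestamp < NOW() - INTERVAL '90 days';
" >> $PRUNE_LOG 2>&1
# Reclaim space from the deleted rows
psql -U sysmanage -d sysmanage -c "VACUUM metrics;" >> $PRUNE_LOG 2>&1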
Capacity Planning
Growth Trend Analysis
Capacity Metrics
- Host Growth: Rate of new host additions
- Data Volume: Database and storage growth rates
- User Growth: Number of active users and sessions
- Performance Trends: Response times and throughput changes
- Resource Utilization: CPU, memory, and storage trends
Capacity Analysis Script
#!/bin/bash
# Capacity analysis and reporting
REPORT_FILE="/opt/sysmanage/reports/capacity_analysis.txt"
echo "=== SysManage Capacity Analysis - $(date) ===" > $REPORT_FILE
# Database size growth
echo "Database Growth:" >> $REPORT_FILE
psql -U sysmanage -d sysmanage -c "
SELECT
    date_trunc('month', created_at) as month,
    COUNT(*) as new_hosts,
    pg_size_pretty(SUM(pg_column_size(hosts.*))) as data_size
FROM hosts
WHERE created_at > NOW() - INTERVAL '12 months'
GROUP BY date_trunc('month', created_at)
ORDER BY month;
" >> $REPORT_FILE
# Storage utilization
echo -e "\nStorage Utilization:" >> $REPORT_FILE
df -h /opt/sysmanage >> $REPORT_FILE
# Memory usage trends
echo -e "\nMemory Usage:" >> $REPORT_FILE
free -h >> $REPORT_FILE
# Performance metrics
echo -e "\nPerformance Metrics:" >> $REPORT_FILE
psql -U sysmanage -d sysmanage -c "
SELECT
AVG(response_time) as avg_response_time,
MAX(response_time) as max_response_time,
COUNT(*) as total_requests
FROM api_logs
WHERE timestamp > NOW() - INTERVAL '24 hours';
" >> $REPORT_FILE
echo "Capacity analysis completed: $REPORT_FILE"
Scaling Strategies
Vertical Scaling (Scale Up)
- CPU Upgrade: Increase CPU cores for better processing
- Memory Expansion: Add RAM for better caching and performance
- Storage Upgrade: Faster SSD storage for database performance
- Network Upgrade: Higher bandwidth for large deployments
Horizontal Scaling (Scale Out)
- Load Balancing: Distribute traffic across multiple servers
- Database Clustering: PostgreSQL read replicas and clustering
- Microservices: Split application into smaller, scalable services
- Caching Layer: Distributed caching with Redis cluster
Scaling Thresholds
- Host Count: Scale when approaching 1000+ hosts per server
- CPU Usage: Scale when average CPU > 70% for extended periods
- Memory Usage: Scale when memory > 80% consistently
- Storage: Scale when storage > 85% full
- Response Time: Scale when response times > 2 seconds
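These thresholds can be watched automatically. The sketch below checks sustained CPU load, storage, and host count against the values listed above; the alert recipient and paths mirror the examples used elsewhere in this guide and should be adapted to your environment.
#!/bin/bash
# Alert when scaling thresholds are crossed (sketch)
ADMIN_EMAIL="admin@example.com"
# 15-minute load average as a rough proxy for sustained CPU usage
cores=$(nproc)
load15=$(awk '{print $3}' /proc/loadavg)
cpu_pct=$(echo "$load15 * 100 / $cores" | bc -l)
if (( $(echo "$cpu_pct > 70" | bc -l) )); then
    mail -s "Scaling threshold: CPU" $ADMIN_EMAIL <<< "15-minute load is ${cpu_pct}% of CPU capacity on $(hostname)"
fi
# Storage threshold (85% full)
usage=$(df -P /opt/sysmanage | awk 'NR==2 {print int($5)}')
if [ "$usage" -gt 85 ]; then
    mail -s "Scaling threshold: storage" $ADMIN_EMAIL <<< "/opt/sysmanage is ${usage}% full on $(hostname)"
fi
# Host count threshold (1000+ hosts per server)
host_count=$(psql -U sysmanage -d sysmanage -t -A -c "SELECT COUNT(*) FROM hosts WHERE active=true;")
if [ "$host_count" -gt 1000 ]; then
    mail -s "Scaling threshold: host count" $ADMIN_EMAIL <<< "$host_count active hosts on $(hostname); consider scaling out"
fi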
Log Management
Log Rotation and Archival
Log Rotation Configuration
# /etc/logrotate.d/sysmanage
/opt/sysmanage/logs/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 sysmanage sysmanage
    postrotate
        systemctl reload sysmanage-backend
    endscript
}
/var/log/postgresql/*.log {
    weekly
    missingok
    rotate 12
    compress
    delaycompress
    notifempty
    create 640 postgres postgres
}
Log Cleanup Script
#!/bin/bash
# Log cleanup and archival
LOG_DIR="/opt/sysmanage/logs"
ARCHIVE_DIR="/opt/sysmanage/archive"
RETENTION_DAYS=90
# Create archive directory
mkdir -p $ARCHIVE_DIR
# Archive logs older than 30 days
find $LOG_DIR -name "*.log.*" -mtime +30 -exec mv {} $ARCHIVE_DIR/ \;
# Compress archived logs
find $ARCHIVE_DIR -name "*.log.*" ! -name "*.gz" -exec gzip {} \;
# Remove logs older than retention period
find $ARCHIVE_DIR -name "*.gz" -mtime +$RETENTION_DAYS -delete
# Clean up empty directories
find $ARCHIVE_DIR -type d -empty -delete
echo "Log cleanup completed"
Log Analysis and Monitoring
Error Detection Script
#!/bin/bash
# Automated log analysis for errors
LOG_FILE="/opt/sysmanage/logs/backend.log"
ERROR_REPORT="/tmp/error_report.txt"
YESTERDAY=$(date -d "yesterday" +%Y-%m-%d)
# Search for errors in yesterday's logs
grep "$YESTERDAY" $LOG_FILE | grep -i "error\|exception\|critical" > $ERROR_REPORT
if [ -s $ERROR_REPORT ]; then
    ERROR_COUNT=$(wc -l < $ERROR_REPORT)
    echo "Found $ERROR_COUNT errors in yesterday's logs"
    # Send error report
    mail -s "SysManage Error Report - $YESTERDAY" admin@example.com < $ERROR_REPORT
fi
# Clean up
rm -f $ERROR_REPORT
Troubleshooting Procedures
Common Issues and Solutions
High CPU Usage
Diagnosis:
# Check top processes
top -p $(pgrep -d',' sysmanage)
# Check CPU usage by process
ps aux | grep sysmanage | sort -k3 -nr
# Monitor system load
uptime
iostat 1 5
Solutions:
- Optimize database queries
- Reduce data collection frequency
- Scale up CPU resources
- Implement caching
Memory Issues
Diagnosis:
# Check memory usage
free -h
cat /proc/meminfo
# Check for memory leaks
ps aux --sort=-%mem | head -20
# Monitor memory over time
sar -r 1 10
Solutions:
- Restart affected services
- Optimize data structures
- Implement pagination
- Add more RAM
Database Performance Issues
Diagnosis:
-- Check slow queries
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC;
-- Check connection count
SELECT count(*) FROM pg_stat_activity;
-- Check locks
SELECT * FROM pg_locks
WHERE NOT granted;
Solutions:
- Optimize slow queries
- Add missing indexes
- Run VACUUM ANALYZE
- Increase connection pool size
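When a specific long-running query is the immediate bottleneck, it can be located and cancelled from the shell. This is a sketch: the five-minute cutoff is arbitrary, the pid shown is a placeholder, and pg_cancel_backend requires superuser rights or membership in pg_signal_backend (pg_terminate_backend is the more forceful fallback).
# Find queries that have been running for more than five minutes
psql -U sysmanage -d sysmanage -c "
SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - INTERVAL '5 minutes'
ORDER BY runtime DESC;
"
# Cancel a specific offender by pid (12345 is a placeholder)
psql -U sysmanage -d sysmanage -c "SELECT pg_cancel_backend(12345);"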
Diagnostic Tools and Commands
System Diagnostics
# System information
uname -a
lscpu
free -h
df -h
# Process monitoring
ps aux | grep sysmanage
netstat -tulpn | grep :8443
ss -tulpn | grep :5432
# Performance monitoring
vmstat 1 5
iostat -x 1 5
sar -u 1 5
Application Diagnostics
# Check service status
systemctl status sysmanage-backend
systemctl status sysmanage-worker
systemctl status postgresql
# View recent logs
journalctl -u sysmanage-backend -f
tail -f /opt/sysmanage/logs/backend.log
# Test connectivity
curl -k https://localhost:8443/api/health
pg_isready -h localhost -p 5432
Maintenance Best Practices
Planning and Scheduling
- Maintenance Windows: Schedule during low-traffic periods
- Change Management: Document all changes and approvals
- Communication: Notify stakeholders of maintenance schedules
- Rollback Plans: Always have rollback procedures ready
- Testing: Test changes in staging before production
Execution Best Practices
- Backup First: Always backup before making changes
- Incremental Changes: Make small, manageable changes
- Monitoring: Monitor systems during and after maintenance
- Documentation: Document all procedures and results
- Verification: Verify changes work as expected
Automation Best Practices
- Script Testing: Test automation scripts thoroughly
- Error Handling: Implement proper error handling and logging
- Monitoring: Monitor automated processes
- Alerting: Alert on automation failures
- Regular Review: Review and update automation regularly
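As a starting point for the error-handling and alerting practices above, the skeleton below can wrap any of the maintenance scripts on this page; the log path and mail recipient are placeholders.
#!/bin/bash
# Skeleton for maintenance automation with basic error handling and alerting
set -euo pipefail
SCRIPT_NAME=$(basename "$0")
LOG_FILE="/var/log/${SCRIPT_NAME%.sh}.log"
ADMIN_EMAIL="admin@example.com"
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOG_FILE"
}
# Alert on any failure and record where it happened
on_error() {
    log "ERROR: $SCRIPT_NAME failed at line $1"
    mail -s "Automation failure: $SCRIPT_NAME" $ADMIN_EMAIL <<< "$SCRIPT_NAME failed at line $1 on $(hostname); see $LOG_FILE"
}
trap 'on_error $LINENO' ERR
log "Starting $SCRIPT_NAME"
# ... maintenance steps go here ...
log "$SCRIPT_NAME completed"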