Monitoring & Alerts
Comprehensive monitoring setup including alerting, performance metrics, and system health tracking for proactive infrastructure management.
Overview
SysManage provides comprehensive monitoring capabilities that enable proactive infrastructure management through real-time metrics collection, intelligent alerting, and customizable dashboards. The monitoring system is designed to scale with your infrastructure while maintaining performance and reliability.
Monitoring Capabilities
- OpenTelemetry Integration: Industry-standard observability with distributed tracing and metrics
- Prometheus Metrics: Time-series metrics storage and powerful querying with PromQL
- Grafana Integration: Professional visualization and dashboarding platform
- Real-time Metrics: Continuous collection of system performance data
- Intelligent Alerting: Smart threshold-based and anomaly detection alerts
- Custom Dashboards: Configurable views for different teams and use cases
- Historical Analysis: Long-term trend analysis and capacity planning
- Multi-channel Notifications: Email, Slack, webhooks, and more
- Escalation Management: Automated escalation based on severity and response time
OpenTelemetry & Prometheus Stack
SysManage includes built-in support for industry-standard observability tools, providing comprehensive application and infrastructure monitoring.
Telemetry Architecture
┌─────────────┐     OTLP     ┌──────────────────┐    Scrape    ┌────────────┐
│  SysManage  │─────────────>│  OpenTelemetry   │─────────────>│ Prometheus │
│   Backend   │              │    Collector     │              │            │
└─────────────┘              └──────────────────┘              └────────────┘
       │                              │                              │
       │ Prometheus                   │ Logs/Traces                  │
       │ Metrics                      │                              │
       └──────────────────────────────┴──────────────────────────────┘
                                      │
                                      v
                               ┌────────────┐
                               │  Grafana   │
                               │ (optional) │
                               └────────────┘
Quick Start
1. Install Telemetry Stack
The telemetry stack is automatically installed during development setup:
make install-dev
This installs:
- OpenTelemetry Collector Contrib
- Prometheus time-series database
- Python OpenTelemetry instrumentation libraries
2. Start Services
Telemetry services start automatically with SysManage:
make start
This starts the following services, in order:
- OpenBAO (secrets management)
- OpenTelemetry Collector (ports 4317, 4318, 8888, 55679)
- Prometheus (port 9091)
- SysManage Backend (with telemetry enabled)
3. Enable Telemetry
Set the environment variable to enable OpenTelemetry instrumentation:
export OTEL_ENABLED=true
./run.sh
Or add to your shell configuration (.bashrc, .zshrc, etc.):
export OTEL_ENABLED=true
4. Verify Telemetry
Check that services are running:
make status-telemetry
Access web interfaces:
- Prometheus UI: http://localhost:9091
- OpenTelemetry Collector zPages: http://localhost:55679/debug/tracez
- SysManage Metrics Endpoint: http://localhost:9090/metrics
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OTEL_ENABLED | false | Enable/disable OpenTelemetry instrumentation |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP gRPC endpoint for collector |
| OTEL_PROMETHEUS_PORT | 9090 | Port for Prometheus metrics from SysManage |
Configuration Files
- config/otel-collector-config.yml - OpenTelemetry Collector configuration
- config/prometheus.yml - Prometheus scrape configuration
Customizing OpenTelemetry Collector
Edit config/otel-collector-config.yml to:
- Add custom processors for data transformation
- Configure additional exporters (Jaeger, Zipkin, etc.)
- Adjust resource attributes and metadata
- Change sampling rates and batch sizes
- Add filters and data enrichment
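As an illustrative sketch of a couple of these customizations (pipeline names, endpoints, and keys here are assumptions, not the shipped defaults), a batch processor and an additional OTLP exporter for a Jaeger-compatible backend might be added like this:

```yaml
# Illustrative fragment for config/otel-collector-config.yml
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger.example.com:4317   # hypothetical trace backend
    tls:
      insecure: true

service:
  pipelines:
    traces:
      processors: [batch]
      exporters: [otlp/jaeger]
```

Merge such fragments into the existing pipelines rather than replacing them, since the shipped config already wires receivers and the Prometheus exporter.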
Customizing Prometheus
Edit config/prometheus.yml to:
- Add additional scrape targets
- Configure alerting rules
- Adjust retention settings
- Set up remote write for long-term storage
- Define custom labels and relabeling
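For example, a sketch of an added scrape target (job name and target address are assumptions; the port matches the SysManage metrics endpoint listed above):

```yaml
# Illustrative fragment for config/prometheus.yml
scrape_configs:
  - job_name: sysmanage-backend
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']   # SysManage /metrics endpoint
```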
Available Metrics
SysManage exposes comprehensive metrics through OpenTelemetry:
HTTP Metrics
- http_server_duration - HTTP request duration histogram (P50, P95, P99)
- http_server_active_requests - Active HTTP requests gauge
- http_server_request_size - Request body size histogram
- http_server_response_size - Response body size histogram
- http_server_requests_total - Total HTTP request counter
Database Metrics
- db_client_operation_duration - Database operation duration
- db_client_connections_usage - Connection pool usage
- db_client_connections_idle - Idle connections in pool
- db_client_connections_max - Maximum pool size
- db_client_operations_total - Total database operations
Application Metrics
- process_cpu_seconds_total - Total CPU time consumed
- process_resident_memory_bytes - Resident memory size
- process_virtual_memory_bytes - Virtual memory size
- process_open_fds - Number of open file descriptors
- process_start_time_seconds - Process start time
Custom Business Metrics
Add custom metrics using the OpenTelemetry SDK:
from backend.telemetry.otel_config import get_meter

meter = get_meter(__name__)

counter = meter.create_counter(
    "hosts_managed_total",
    description="Total number of managed hosts",
    unit="1",
)

counter.add(1, {"status": "active"})
Distributed Tracing
OpenTelemetry automatically traces:
- FastAPI HTTP requests and responses
- SQLAlchemy database queries
- External HTTP requests (via the requests library)
- Background task execution
Custom Spans
Add custom spans for detailed tracing:
from backend.telemetry.otel_config import get_tracer

tracer = get_tracer(__name__)

with tracer.start_as_current_span("process_agent_data") as span:
    span.set_attribute("agent.id", agent_id)
    span.set_attribute("data.size", len(data))

    # Your processing code here
    result = process_data(data)

    span.set_attribute("result.status", "success")
    return result
Viewing Traces
Access trace data through:
- OpenTelemetry Collector zPages: http://localhost:55679/debug/tracez
- Export to Jaeger: Configure OTLP exporter to send to Jaeger
- Export to Zipkin: Configure OTLP exporter to send to Zipkin
- Grafana Tempo: Use Tempo for distributed tracing in Grafana
Grafana Integration
SysManage includes native Grafana integration for professional visualization and dashboarding.
Grafana Setup
Creating API Key in Grafana
Important: The API key must have Admin role to automatically configure Prometheus data sources. Editor role is insufficient for data source management.
- Log into Grafana as an admin user
- Navigate to Configuration > API Keys (or Administration > Service Accounts > API Keys in newer versions)
- Click "Add API key" or "New API Key"
- Configure the key:
  - Name: SysManage Integration
  - Role: Admin (required for automatic data source configuration)
  - Time to live: Set expiration as needed (or leave blank for no expiration)
- Click "Add" or "Create"
- Copy the generated API key immediately (it will only be shown once)
- Store the key securely - SysManage will store it encrypted in OpenBAO vault
Configuring Grafana Integration in SysManage
- Navigate to Settings > Integrations in SysManage web UI
- Scroll to the Grafana Integration card
- Toggle "Enable Grafana Integration" to ON
- Choose your configuration method:
  - Use Managed Server: Select a Grafana server managed by SysManage (requires "Monitoring Server" role with package "grafana")
  - Manual Configuration: Toggle off "Use Managed Server" and enter your Grafana URL (e.g., http://grafana.example.com:3000)
- Paste the API key into the "API Key" field
- Click "Save"
- SysManage will automatically:
- Securely store the API key in OpenBAO vault
- Verify connectivity to Grafana
- Create "SysManage Prometheus" data source in Grafana
- Configure the data source to point to your Prometheus instance
- Click "Check Health" to verify the connection
Troubleshooting API Key Issues
- 401 Unauthorized: API key is invalid or has been deleted in Grafana. Generate a new key.
- 403 Permission Denied: API key has insufficient permissions. Ensure it has Admin role, not just Editor.
- Connection timeout: Check that the Grafana URL is correct and the server is reachable from SysManage.
- Bad gateway: If using "localhost" in Grafana URL, change to the actual hostname or IP address that Grafana can reach.
Verifying Prometheus Data Source
Note: When you configure Grafana integration with an Admin-level API key, SysManage automatically creates and configures the Prometheus data source. Manual configuration is only needed if automatic setup fails.
To verify the automatic configuration:
- In Grafana, navigate to Configuration > Data Sources
- Look for "SysManage Prometheus" in the list
- Click on it to view the configuration:
  - Name: SysManage Prometheus
  - URL: http://your-sysmanage-host:9091
  - Access: Server (proxy)
  - Default: Set as default data source
- Click "Save & Test" to verify connectivity
- You should see a green success message: "Data source is working"
Manual Prometheus Data Source Setup
If automatic configuration fails, you can manually add the data source:
- In Grafana, navigate to Configuration > Data Sources
- Click "Add data source"
- Select "Prometheus"
- Configure the data source:
  - Name: SysManage Prometheus
  - URL: http://your-sysmanage-host:9091 (use actual hostname, not "localhost")
  - Access: Server (default)
  - Scrape interval: 15s
  - HTTP Method: POST
- Click "Save & Test"
Recommended Dashboard Panels
Application Performance Dashboard
- Request Rate: rate(http_server_requests_total[5m])
- Response Time (P95): histogram_quantile(0.95, http_server_duration_bucket)
- Error Rate: rate(http_server_requests_total{status_code=~"5.."}[5m])
- Active Requests: http_server_active_requests
Database Performance Dashboard
- Query Duration: histogram_quantile(0.95, db_client_operation_duration_bucket)
- Connection Pool Usage: db_client_connections_usage / db_client_connections_max
- Idle Connections: db_client_connections_idle
- Operations Rate: rate(db_client_operations_total[5m])
System Resources Dashboard
- CPU Usage: rate(process_cpu_seconds_total[1m])
- Memory Usage: process_resident_memory_bytes
- Open File Descriptors: process_open_fds
- Uptime: time() - process_start_time_seconds
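The same panel queries can be run programmatically against the Prometheus HTTP API. A minimal sketch using only the standard library (the port follows the default listed earlier; the flatten helper assumes instant-query vector results, which is the standard /api/v1/query response shape):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9091"  # default Prometheus port from this guide

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the raw result vector."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": promql})
    with urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

def flatten(result: list) -> dict:
    """Map each series to its value, keyed by the 'instance' label."""
    out = {}
    for series in result:
        label = series["metric"].get("instance", "unknown")
        out[label] = float(series["value"][1])  # value is [timestamp, "string"]
    return out
```

For example, `flatten(instant_query("http_server_active_requests"))` would return a dict of active-request counts per scraped instance.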
Health Check
SysManage provides an API endpoint to verify Grafana connectivity:
GET /api/grafana/health
This endpoint checks:
- Grafana server reachability
- API key validity
- Grafana version information
- Build information and features
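A script can treat an HTTP 200 from this endpoint as healthy. A minimal sketch (the base URL is an assumption taken from the API example later in this page; the response body schema is not relied on):

```python
from urllib.request import Request, urlopen

def grafana_health_ok(base_url: str = "http://localhost:8443") -> bool:
    """Return True if the Grafana health endpoint answers with HTTP 200."""
    try:
        with urlopen(Request(f"{base_url}/api/grafana/health"), timeout=10) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, DNS failure, timeout, or non-2xx all count as unhealthy
        return False
```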
Metrics Collection
System Performance Metrics
SysManage automatically collects comprehensive system metrics:
CPU Metrics
- Utilization: Overall CPU usage percentage
- Load Average: 1, 5, and 15-minute load averages
- Per-Core Usage: Individual CPU core utilization
- Process Breakdown: Top CPU-consuming processes
- Context Switches: System context switch rates
- Interrupts: Hardware and software interrupt rates
Memory Metrics
- Total/Used/Free: Memory utilization breakdown
- Swap Usage: Swap space utilization and activity
- Buffer/Cache: System buffer and cache usage
- Memory Pressure: Memory pressure indicators
- Process Memory: Per-process memory consumption
- Memory Leaks: Detection of potential memory leaks
Storage Metrics
- Disk Usage: File system capacity and utilization
- I/O Performance: Read/write throughput and latency
- IOPS: Input/output operations per second
- Queue Depth: Storage queue metrics
- Disk Health: SMART data and disk health indicators
- Mount Points: All mounted file systems status
Network Metrics
- Interface Statistics: Traffic, errors, and drops per interface
- Bandwidth Utilization: Network throughput and capacity
- Connection Statistics: Active connections and states
- Protocol Statistics: TCP, UDP, and other protocol metrics
- DNS Performance: DNS resolution times and failures
- Network Latency: Network round-trip times
Application and Service Metrics
Service Monitoring
- Service Status: Running, stopped, failed service states
- Process Monitoring: Critical process health and performance
- Port Monitoring: Service availability on specific ports
- Log Analysis: Application log patterns and errors
- Resource Usage: Per-service resource consumption
- Response Times: Application response time metrics
Database Monitoring
- Connection Pools: Database connection utilization
- Query Performance: Slow query detection and analysis
- Lock Statistics: Database lock contention metrics
- Replication Status: Database replication health
- Storage Growth: Database size and growth trends
Web Server Monitoring
- Request Rates: HTTP request volume and patterns
- Response Codes: HTTP status code distribution
- Response Times: Request processing latency
- Connection Metrics: Active connections and limits
- SSL Certificate: Certificate expiration monitoring
Custom Metrics
Extend monitoring with custom application metrics:
Custom Script Monitoring
- Script Execution: Run custom monitoring scripts
- Exit Code Monitoring: Alert on script failures
- Output Parsing: Extract metrics from script output
- Scheduled Execution: Define custom execution schedules
- Timeout Handling: Handle long-running or stuck scripts
API Integration
- REST API Calls: Monitor external API availability
- Response Validation: Validate API response content
- Authentication: Support for various authentication methods
- Rate Limiting: Respect API rate limits
- Custom Headers: Include custom headers in requests
Alerting System
Alert Configuration
Alert Types
- Threshold Alerts: Trigger when metrics exceed defined thresholds
- Anomaly Detection: Machine learning-based anomaly detection
- Service Alerts: Alert on service availability and health
- Log Pattern Alerts: Trigger on specific log patterns
- Composite Alerts: Multi-condition alerts with logical operators
- Predictive Alerts: Early warning based on trend analysis
Threshold Configuration
Configure alert thresholds for optimal balance between noise and coverage:
CPU Utilization Example
- Warning: CPU > 70% for 5 minutes
- Critical: CPU > 90% for 2 minutes
- Recovery: CPU < 60% for 3 minutes
Memory Usage Example
- Warning: Memory > 80% for 10 minutes
- Critical: Memory > 95% for 1 minute
- Recovery: Memory < 75% for 5 minutes
Disk Space Example
- Warning: Disk > 85% usage
- Critical: Disk > 95% usage
- Recovery: Disk < 80% usage
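The warning/critical/recovery pattern above is hysteresis with a hold time: a state only changes after the condition has persisted. A simplified sketch, assuming one sample per minute so hold times are expressed as consecutive-sample counts (the class and defaults mirror the CPU example but are illustrative, not SysManage's implementation):

```python
class ThresholdAlert:
    """Track a metric and change state only after a condition holds long enough."""

    def __init__(self, warn=70, crit=90, recover=60,
                 warn_hold=5, crit_hold=2, recover_hold=3):
        self.warn, self.crit, self.recover = warn, crit, recover
        self.warn_hold, self.crit_hold, self.recover_hold = warn_hold, crit_hold, recover_hold
        self.state = "ok"
        self._pending = None   # candidate state waiting out its hold time
        self._streak = 0       # consecutive samples supporting the candidate

    def update(self, value: float) -> str:
        """Feed one sample (e.g. one per minute) and return the current state."""
        if value > self.crit:
            target, hold = "critical", self.crit_hold
        elif value > self.warn:
            target, hold = "warning", self.warn_hold
        elif value < self.recover:
            target, hold = "ok", self.recover_hold
        else:
            target, hold = self.state, 1  # dead band: keep current state

        if target == self.state:
            self._pending, self._streak = None, 0
        elif target == self._pending:
            self._streak += 1
        else:
            self._pending, self._streak = target, 1

        if self._pending is not None and self._streak >= hold:
            self.state = self._pending
            self._pending, self._streak = None, 0
        return self.state
```

The recovery threshold being lower than the warning threshold is what prevents flapping when a metric hovers near a single cutoff.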
Notification Channels
Supported Channels
- Email: SMTP-based email notifications
- Slack: Team collaboration via Slack channels
- Microsoft Teams: Integration with Teams channels
- Discord: Community and team notifications
- Webhooks: Custom HTTP POST notifications
- SMS: Text message alerts via third-party services
- PagerDuty: Professional incident management
- OpsGenie: Enterprise alerting and escalation
Channel Configuration
Email Configuration
SMTP Settings:
Server: mail.example.com
Port: 587
Security: STARTTLS
Authentication: username/password
Slack Integration
Webhook URL: https://hooks.slack.com/services/...
Channel: #infrastructure-alerts
Username: SysManage
Icon: :warning:
Webhook Configuration
URL: https://api.example.com/alerts
Method: POST
Headers:
Content-Type: application/json
Authorization: Bearer <token>
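A webhook delivery matching the configuration above can be sketched with the standard library (the payload fields are illustrative; SysManage's actual webhook body schema may differ):

```python
import json
from urllib.request import Request, urlopen

def build_alert_payload(alert_name: str, severity: str, host: str, value: float) -> bytes:
    """Serialize an alert into the JSON body POSTed to the webhook (fields illustrative)."""
    return json.dumps({
        "alert": alert_name,
        "severity": severity,
        "host": host,
        "value": value,
    }).encode()

def send_webhook(url: str, token: str, payload: bytes) -> int:
    """POST the payload with the Content-Type and Authorization headers shown above."""
    req = Request(url, data=payload, method="POST", headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    })
    with urlopen(req, timeout=10) as resp:
        return resp.status
```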
Escalation Management
Escalation Policies
Define escalation paths for different alert severities:
Level 1 Escalation (0-15 minutes)
- Primary on-call engineer
- Slack channel notification
- Email to team distribution list
Level 2 Escalation (15-30 minutes)
- Secondary on-call engineer
- Team lead notification
- SMS to primary on-call
Level 3 Escalation (30+ minutes)
- Manager notification
- Emergency contact activation
- Multiple communication channels
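The three levels above reduce to a lookup on time since the alert fired. A sketch (target identifiers are illustrative placeholders, not SysManage configuration keys):

```python
def escalation_targets(elapsed_minutes: float) -> list[str]:
    """Map time since the alert fired to the notification targets for that level."""
    if elapsed_minutes < 15:
        return ["primary-oncall", "slack:#infrastructure-alerts", "email:team"]
    if elapsed_minutes < 30:
        return ["secondary-oncall", "team-lead", "sms:primary-oncall"]
    return ["manager", "emergency-contact", "all-channels"]
```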
Alert Suppression
- Maintenance Windows: Suppress alerts during planned maintenance
- Dependency Mapping: Suppress child alerts when parent systems fail
- Frequency Limiting: Limit notification frequency for noisy alerts
- Business Hours: Different escalation during business vs. off hours
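Two of these rules, maintenance windows and dependency mapping, can be sketched as a single suppression check (a simplification; real dependency mapping walks a service graph rather than taking a boolean):

```python
from datetime import datetime

def should_suppress(now: datetime,
                    maintenance_windows: list[tuple[datetime, datetime]],
                    parent_down: bool = False) -> bool:
    """Suppress an alert during a maintenance window or when its parent system is down."""
    if parent_down:
        return True  # child alert is redundant while the parent failure is active
    return any(start <= now <= end for start, end in maintenance_windows)
```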
Dashboard Configuration
Dashboard Types
Executive Dashboards
- High-level KPIs: Overall system health and availability
- Business Metrics: Service uptime and performance
- Trend Analysis: Month-over-month performance trends
- Cost Optimization: Resource utilization and efficiency
Operations Dashboards
- Real-time Status: Current system status and alerts
- Performance Metrics: Detailed performance indicators
- Capacity Planning: Resource utilization and forecasting
- Incident Tracking: Active incidents and resolution status
Application Dashboards
- Application Performance: Response times and throughput
- Error Tracking: Application errors and failure rates
- User Experience: End-user performance metrics
- Dependency Mapping: Service dependencies and health
Infrastructure Dashboards
- System Health: CPU, memory, disk, and network status
- Hardware Monitoring: Physical hardware health and performance
- Network Overview: Network topology and performance
- Security Monitoring: Security events and compliance status
Dashboard Customization
Widget Types
- Time Series Graphs: Line charts showing metrics over time
- Gauge Widgets: Current value displays with thresholds
- Status Indicators: Health status and availability displays
- Top Lists: Ranked lists of hosts or metrics
- Heat Maps: Visual representation of metric distributions
- Alert Summaries: Current and recent alert status
Layout Options
- Grid Layout: Flexible grid-based widget arrangement
- Responsive Design: Adapts to different screen sizes
- Custom Sizing: Adjustable widget dimensions
- Multi-page Dashboards: Organize widgets across multiple pages
- Auto-refresh: Configurable refresh intervals
Access Control
- Role-based Access: Different dashboards for different roles
- Team Dashboards: Team-specific metric views
- Public Dashboards: Read-only dashboards for stakeholders
- Edit Permissions: Control who can modify dashboards
Health Check Configuration
System Health Checks
Core System Checks
- Agent Connectivity: Verify all agents are connected and responsive
- Database Health: Check database connectivity and performance
- Queue Status: Monitor message queue health and processing
- Certificate Validity: Check SSL/TLS certificate expiration
- Disk Space: Verify adequate disk space on server
- Memory Usage: Monitor server memory utilization
Service Health Checks
- Web Server: HTTP response and performance checks
- API Endpoints: Verify API functionality and response times
- Background Services: Check background job processing
- External Dependencies: Monitor external service availability
- Load Balancer: Verify load balancer health and distribution
Custom Health Checks
Script-based Checks
#!/bin/bash
# Custom health check example

check_application_health() {
    response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
    if [ "$response" -eq 200 ]; then
        echo "Application healthy"
        exit 0
    else
        echo "Application unhealthy - HTTP $response"
        exit 1
    fi
}
Database Connectivity Check
#!/bin/bash
# Database health check

check_database() {
    if pg_isready -h localhost -p 5432 -U sysmanage; then
        echo "Database accessible"
        exit 0
    else
        echo "Database not accessible"
        exit 1
    fi
}
API Health Check
#!/usr/bin/env python3
import requests
import sys

def check_api_health():
    try:
        response = requests.get('http://localhost:8443/api/health', timeout=10)
        if response.status_code == 200:
            print("API healthy")
            sys.exit(0)
        else:
            print(f"API unhealthy - Status: {response.status_code}")
            sys.exit(1)
    except Exception as e:
        print(f"API check failed: {e}")
        sys.exit(1)
Monitoring Performance Optimization
Data Retention Policies
Retention Strategies
- High-resolution data: 24 hours of 1-minute intervals
- Medium-resolution data: 7 days of 5-minute intervals
- Low-resolution data: 90 days of 1-hour intervals
- Archive data: 1 year of daily summaries
- Long-term trends: Multi-year monthly summaries
Storage Optimization
- Data Compression: Compress historical data
- Archival: Move old data to cold storage
- Purging: Automatically delete expired data
- Indexing: Optimize database indexes for queries
Collection Optimization
Collection Intervals
- Critical Metrics: 30-second intervals
- Standard Metrics: 1-minute intervals
- Resource Intensive: 5-minute intervals
- Static Data: Hourly or daily collection
Agent Performance
- Batch Collection: Collect multiple metrics efficiently
- Compression: Compress data before transmission
- Buffering: Buffer data during network outages
- Priority Queues: Prioritize critical metrics
Monitoring Best Practices
Alert Design Best Practices
- Actionable Alerts: Only alert on conditions requiring action
- Appropriate Thresholds: Set thresholds based on historical data
- Context Information: Include relevant context in alerts
- Clear Recovery: Define clear recovery conditions
- Avoid Alert Fatigue: Minimize false positives
- Business Impact: Correlate alerts with business impact
Dashboard Design Best Practices
- Audience-specific: Design dashboards for specific audiences
- Information Hierarchy: Most critical information first
- Visual Clarity: Use clear, readable visualizations
- Performance Focus: Optimize dashboard loading times
- Regular Review: Regularly review and update dashboards
Operational Best Practices
- Regular Maintenance: Schedule regular monitoring system maintenance
- Capacity Planning: Monitor and plan for system growth
- Documentation: Document alert procedures and escalation paths
- Training: Train team members on monitoring tools and procedures
- Continuous Improvement: Regularly review and improve monitoring setup