Monitoring & Alerts

Comprehensive monitoring setup including alerting, performance metrics, and system health tracking for proactive infrastructure management.

Overview

SysManage provides comprehensive monitoring capabilities that enable proactive infrastructure management through real-time metrics collection, intelligent alerting, and customizable dashboards. The monitoring system is designed to scale with your infrastructure while maintaining performance and reliability.

Monitoring Capabilities

  • OpenTelemetry Integration: Industry-standard observability with distributed tracing and metrics
  • Prometheus Metrics: Time-series metrics storage and powerful querying with PromQL
  • Grafana Integration: Professional visualization and dashboarding platform
  • Real-time Metrics: Continuous collection of system performance data
  • Intelligent Alerting: Smart threshold-based and anomaly detection alerts
  • Custom Dashboards: Configurable views for different teams and use cases
  • Historical Analysis: Long-term trend analysis and capacity planning
  • Multi-channel Notifications: Email, Slack, webhooks, and more
  • Escalation Management: Automated escalation based on severity and response time

OpenTelemetry & Prometheus Stack

SysManage includes built-in support for industry-standard observability tools, providing comprehensive application and infrastructure monitoring.

Telemetry Architecture

┌─────────────┐     OTLP      ┌──────────────────┐     Scrape    ┌────────────┐
│  SysManage  │──────────────>│  OpenTelemetry   │──────────────>│ Prometheus │
│   Backend   │               │    Collector     │               │            │
└─────────────┘               └──────────────────┘               └────────────┘
      │                              │                                   │
      │ Prometheus                   │ Logs/Traces                       │
      │ Metrics                      │                                   │
      └──────────────────────────────┴───────────────────────────────────┘
                                     │
                                     v
                              ┌────────────┐
                              │  Grafana   │
                              │ (optional) │
                              └────────────┘

Quick Start

1. Install Telemetry Stack

The telemetry stack is automatically installed during development setup:

make install-dev

This installs:

  • OpenTelemetry Collector Contrib
  • Prometheus time-series database
  • Python OpenTelemetry instrumentation libraries

2. Start Services

Telemetry services start automatically with SysManage:

make start

This starts in order:

  • OpenBAO (secrets management)
  • OpenTelemetry Collector (ports 4317, 4318, 8888, 55679)
  • Prometheus (port 9091)
  • SysManage Backend (with telemetry enabled)

3. Enable Telemetry

Set the environment variable to enable OpenTelemetry instrumentation:

export OTEL_ENABLED=true
./run.sh

Or add to your shell configuration (.bashrc, .zshrc, etc.):

export OTEL_ENABLED=true

4. Verify Telemetry

Check that services are running:

make status-telemetry

Access the web interfaces:

  • Prometheus: http://localhost:9091
  • OpenTelemetry Collector zPages: http://localhost:55679/debug/tracez

Configuration

Environment Variables

Variable                    | Default               | Description
OTEL_ENABLED                | false                 | Enable/disable OpenTelemetry instrumentation
OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP gRPC endpoint for the collector
OTEL_PROMETHEUS_PORT        | 9090                  | Port on which SysManage exposes Prometheus metrics

Configuration Files

  • config/otel-collector-config.yml - OpenTelemetry Collector configuration
  • config/prometheus.yml - Prometheus scrape configuration

Customizing OpenTelemetry Collector

Edit config/otel-collector-config.yml to:

  • Add custom processors for data transformation
  • Configure additional exporters (Jaeger, Zipkin, etc.)
  • Adjust resource attributes and metadata
  • Change sampling rates and batch sizes
  • Add filters and data enrichment

Customizing Prometheus

Edit config/prometheus.yml to:

  • Add additional scrape targets
  • Configure alerting rules
  • Adjust retention settings
  • Set up remote write for long-term storage
  • Define custom labels and relabeling
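As a sketch of the kinds of changes listed above, the fragment below adds a scrape target and references an alerting rules file. The job name, targets, and file name are illustrative, not part of the shipped configuration:

```yaml
# Hypothetical additions to config/prometheus.yml
scrape_configs:
  - job_name: "node-exporter"          # example extra scrape target
    scrape_interval: 15s
    static_configs:
      - targets: ["node1.example.com:9100", "node2.example.com:9100"]

rule_files:
  - "alert_rules.yml"                  # example alerting rules file
```

After editing, restart Prometheus (or send it a SIGHUP) so the new configuration is loaded.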

Available Metrics

SysManage exposes comprehensive metrics through OpenTelemetry:

HTTP Metrics

  • http_server_duration - HTTP request duration histogram (P50, P95, P99)
  • http_server_active_requests - Active HTTP requests gauge
  • http_server_request_size - Request body size histogram
  • http_server_response_size - Response body size histogram
  • http_server_requests_total - Total HTTP request counter
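These metrics can be queried in Prometheus with PromQL. The exact metric and label names depend on your collector's naming conventions, so treat the queries below as illustrative:

```promql
# P95 request latency over the last 5 minutes, from the duration histogram
histogram_quantile(0.95, sum(rate(http_server_duration_bucket[5m])) by (le))

# Request rate broken down by status code (label name may vary by exporter)
sum(rate(http_server_requests_total[5m])) by (status)
```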

Database Metrics

  • db_client_operation_duration - Database operation duration
  • db_client_connections_usage - Connection pool usage
  • db_client_connections_idle - Idle connections in pool
  • db_client_connections_max - Maximum pool size
  • db_client_operations_total - Total database operations

Application Metrics

  • process_cpu_seconds_total - Total CPU time consumed
  • process_resident_memory_bytes - Resident memory size
  • process_virtual_memory_bytes - Virtual memory size
  • process_open_fds - Number of open file descriptors
  • process_start_time_seconds - Process start time

Custom Business Metrics

Add custom metrics using the OpenTelemetry SDK:

from backend.telemetry.otel_config import get_meter

meter = get_meter(__name__)
counter = meter.create_counter(
    "hosts_managed_total",
    description="Total number of managed hosts",
    unit="1"
)
counter.add(1, {"status": "active"})

Distributed Tracing

OpenTelemetry automatically traces:

  • FastAPI HTTP requests and responses
  • SQLAlchemy database queries
  • External HTTP requests (via requests library)
  • Background task execution

Custom Spans

Add custom spans for detailed tracing:

from backend.telemetry.otel_config import get_tracer

tracer = get_tracer(__name__)

def handle_agent_data(agent_id, data):
    with tracer.start_as_current_span("process_agent_data") as span:
        span.set_attribute("agent.id", agent_id)
        span.set_attribute("data.size", len(data))
        # Your processing code here
        result = process_data(data)
        span.set_attribute("result.status", "success")
        return result

Viewing Traces

Access trace data through:

  • OpenTelemetry Collector zPages: http://localhost:55679/debug/tracez
  • Export to Jaeger: Configure OTLP exporter to send to Jaeger
  • Export to Zipkin: Configure OTLP exporter to send to Zipkin
  • Grafana Tempo: Use Tempo for distributed tracing in Grafana

Grafana Integration

SysManage includes native Grafana integration for professional visualization and dashboarding.

Integration Options

Option 1: Managed Grafana Server

Use a Grafana server managed by SysManage:

  1. Install Grafana on a managed host via SysManage
  2. Assign the "Monitoring Server" role with package "grafana"
  3. Navigate to Settings > Grafana Integration
  4. Enable integration and select your Grafana server
  5. Provide a Grafana API key (stored securely in OpenBAO vault)

Option 2: External Grafana Server

Use an existing external Grafana installation:

  1. Navigate to Settings > Grafana Integration
  2. Enable integration
  3. Select "Manual Configuration"
  4. Enter your Grafana URL (e.g., https://grafana.example.com)
  5. Provide a Grafana API key (stored securely in OpenBAO vault)

Grafana Setup

Creating API Key in Grafana

Important: The API key must have Admin role to automatically configure Prometheus data sources. Editor role is insufficient for data source management.

  1. Log into Grafana as an admin user
  2. Navigate to Configuration > API Keys (or Administration > Service Accounts > API Keys in newer versions)
  3. Click "Add API key" or "New API Key"
  4. Configure the key:
    • Name: SysManage Integration
    • Role: Admin (required for automatic data source configuration)
    • Time to live: Set expiration as needed (or leave blank for no expiration)
  5. Click "Add" or "Create"
  6. Copy the generated API key immediately (it will only be shown once)
  7. Store the key securely - SysManage will store it encrypted in OpenBAO vault

Configuring Grafana Integration in SysManage

  1. Navigate to Settings > Integrations in SysManage web UI
  2. Scroll to the Grafana Integration card
  3. Toggle "Enable Grafana Integration" to ON
  4. Choose your configuration method:
    • Use Managed Server: Select a Grafana server managed by SysManage (requires "Monitoring Server" role with package "grafana")
    • Manual Configuration: Toggle off "Use Managed Server" and enter your Grafana URL (e.g., http://grafana.example.com:3000)
  5. Paste the API key into the "API Key" field
  6. Click "Save"
  7. SysManage will automatically:
    • Securely store the API key in OpenBAO vault
    • Verify connectivity to Grafana
    • Create "SysManage Prometheus" data source in Grafana
    • Configure the data source to point to your Prometheus instance
  8. Click "Check Health" to verify the connection

Troubleshooting API Key Issues

  • 401 Unauthorized: API key is invalid or has been deleted in Grafana. Generate a new key.
  • 403 Permission Denied: API key has insufficient permissions. Ensure it has Admin role, not just Editor.
  • Connection timeout: Check that the Grafana URL is correct and the server is reachable from SysManage.
  • Bad gateway: If the Grafana URL contains "localhost", replace it with a hostname or IP address that is reachable from the SysManage server.

Verifying Prometheus Data Source

Note: When you configure Grafana integration with an Admin-level API key, SysManage automatically creates and configures the Prometheus data source. Manual configuration is only needed if automatic setup fails.

To verify the automatic configuration:

  1. In Grafana, navigate to Configuration > Data Sources
  2. Look for "SysManage Prometheus" in the list
  3. Click on it to view the configuration:
    • Name: SysManage Prometheus
    • URL: http://your-sysmanage-host:9091
    • Access: Server (proxy)
    • Default: Set as default data source
  4. Click "Save & Test" to verify connectivity
  5. You should see a green success message: "Data source is working"

Manual Prometheus Data Source Setup

If automatic configuration fails, you can manually add the data source:

  1. In Grafana, navigate to Configuration > Data Sources
  2. Click "Add data source"
  3. Select "Prometheus"
  4. Configure the data source:
    • Name: SysManage Prometheus
    • URL: http://your-sysmanage-host:9091 (use actual hostname, not "localhost")
    • Access: Server (default)
    • Scrape interval: 15s
    • HTTP Method: POST
  5. Click "Save & Test"

Health Check

SysManage provides an API endpoint to verify Grafana connectivity:

GET /api/grafana/health

This endpoint checks:

  • Grafana server reachability
  • API key validity
  • Grafana version information
  • Build information and features

Metrics Collection

System Performance Metrics

SysManage automatically collects comprehensive system metrics:

CPU Metrics

  • Utilization: Overall CPU usage percentage
  • Load Average: 1, 5, and 15-minute load averages
  • Per-Core Usage: Individual CPU core utilization
  • Process Breakdown: Top CPU-consuming processes
  • Context Switches: System context switch rates
  • Interrupts: Hardware and software interrupt rates

Memory Metrics

  • Total/Used/Free: Memory utilization breakdown
  • Swap Usage: Swap space utilization and activity
  • Buffer/Cache: System buffer and cache usage
  • Memory Pressure: Memory pressure indicators
  • Process Memory: Per-process memory consumption
  • Memory Leaks: Detection of potential memory leaks

Storage Metrics

  • Disk Usage: File system capacity and utilization
  • I/O Performance: Read/write throughput and latency
  • IOPS: Input/output operations per second
  • Queue Depth: Storage queue metrics
  • Disk Health: SMART data and disk health indicators
  • Mount Points: All mounted file systems status

Network Metrics

  • Interface Statistics: Traffic, errors, and drops per interface
  • Bandwidth Utilization: Network throughput and capacity
  • Connection Statistics: Active connections and states
  • Protocol Statistics: TCP, UDP, and other protocol metrics
  • DNS Performance: DNS resolution times and failures
  • Network Latency: Network round-trip times

Application and Service Metrics

Service Monitoring

  • Service Status: Running, stopped, failed service states
  • Process Monitoring: Critical process health and performance
  • Port Monitoring: Service availability on specific ports
  • Log Analysis: Application log patterns and errors
  • Resource Usage: Per-service resource consumption
  • Response Times: Application response time metrics

Database Monitoring

  • Connection Pools: Database connection utilization
  • Query Performance: Slow query detection and analysis
  • Lock Statistics: Database lock contention metrics
  • Replication Status: Database replication health
  • Storage Growth: Database size and growth trends

Web Server Monitoring

  • Request Rates: HTTP request volume and patterns
  • Response Codes: HTTP status code distribution
  • Response Times: Request processing latency
  • Connection Metrics: Active connections and limits
  • SSL Certificate: Certificate expiration monitoring

Custom Metrics

Extend monitoring with custom application metrics:

Custom Script Monitoring

  • Script Execution: Run custom monitoring scripts
  • Exit Code Monitoring: Alert on script failures
  • Output Parsing: Extract metrics from script output
  • Scheduled Execution: Define custom execution schedules
  • Timeout Handling: Handle long-running or stuck scripts
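The pattern above can be sketched in Python: run a script with a timeout, treat a non-zero or timed-out exit as a failure, and extract `name=value` pairs from its output. The output format and helper name are assumptions for illustration, not a SysManage API:

```python
import re
import subprocess

def run_check(command, timeout=30):
    """Run a monitoring script, capture its exit code, and parse
    'name=value' pairs from its stdout into a metrics dict.

    Returns (exit_code, metrics); exit_code -1 means the script hung
    past the timeout.
    """
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return -1, {}  # stuck script: treat as failed, no metrics
    metrics = {
        m.group(1): float(m.group(2))
        for m in re.finditer(r"(\w+)=([0-9.]+)", proc.stdout)
    }
    return proc.returncode, metrics
```

For example, a script printing `queue_depth=12 latency_ms=3.5` yields exit code 0 and two parsed metrics; a non-zero exit code would feed the exit-code alerting described above.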

API Integration

  • REST API Calls: Monitor external API availability
  • Response Validation: Validate API response content
  • Authentication: Support for various authentication methods
  • Rate Limiting: Respect API rate limits
  • Custom Headers: Include custom headers in requests

Alerting System

Alert Configuration

Alert Types

  • Threshold Alerts: Trigger when metrics exceed defined thresholds
  • Anomaly Detection: Machine learning-based anomaly detection
  • Service Alerts: Alert on service availability and health
  • Log Pattern Alerts: Trigger on specific log patterns
  • Composite Alerts: Multi-condition alerts with logical operators
  • Predictive Alerts: Early warning based on trend analysis

Threshold Configuration

Configure alert thresholds for optimal balance between noise and coverage:

CPU Utilization Example

  • Warning: CPU > 70% for 5 minutes
  • Critical: CPU > 90% for 2 minutes
  • Recovery: CPU < 60% for 3 minutes

Memory Usage Example

  • Warning: Memory > 80% for 10 minutes
  • Critical: Memory > 95% for 1 minute
  • Recovery: Memory < 75% for 5 minutes

Disk Space Example

  • Warning: Disk > 85% usage
  • Critical: Disk > 95% usage
  • Recovery: Disk < 80% usage
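The sustained-duration logic behind these examples can be sketched as a small evaluator: a level fires only when every sample in its window exceeds the threshold, and the alert clears only after a sustained recovery. The function name and defaults mirror the CPU example and are illustrative, not SysManage internals:

```python
def evaluate_threshold(samples, warn=70, crit=90, recover=60,
                       warn_for=5, crit_for=2, recover_for=3):
    """Classify a metric from per-minute samples (oldest first).

    Durations are expressed as sample counts; a state is returned only
    when its full window is satisfied, otherwise 'pending'.
    """
    def sustained(window, pred):
        # True only if we have enough samples and all recent ones match
        return len(samples) >= window and all(pred(v) for v in samples[-window:])

    if sustained(crit_for, lambda v: v > crit):
        return "critical"
    if sustained(warn_for, lambda v: v > warn):
        return "warning"
    if sustained(recover_for, lambda v: v < recover):
        return "ok"
    return "pending"
```

The gap between the warning threshold (70%) and the recovery threshold (60%) is deliberate hysteresis: it prevents an alert from flapping when the metric hovers near a single cutoff.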

Notification Channels

Supported Channels

  • Email: SMTP-based email notifications
  • Slack: Team collaboration via Slack channels
  • Microsoft Teams: Integration with Teams channels
  • Discord: Community and team notifications
  • Webhooks: Custom HTTP POST notifications
  • SMS: Text message alerts via third-party services
  • PagerDuty: Professional incident management
  • OpsGenie: Enterprise alerting and escalation

Channel Configuration

Email Configuration

SMTP Settings:
  Server: mail.example.com
  Port: 587
  Security: STARTTLS
  Authentication: username/password

Slack Integration

Webhook URL: https://hooks.slack.com/services/...
Channel: #infrastructure-alerts
Username: SysManage
Icon: :warning:

Webhook Configuration

URL: https://api.example.com/alerts
Method: POST
Headers:
  Content-Type: application/json
  Authorization: Bearer <token>
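A webhook notification like the one configured above boils down to assembling headers and a JSON body. The payload shape below is illustrative; adapt it to whatever the receiving endpoint expects:

```python
import json

def build_webhook_request(alert, token):
    """Assemble the pieces of a webhook POST: HTTP method, headers
    matching the channel configuration above, and a JSON body.

    The alert dict keys used here are an assumed example schema.
    """
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    body = json.dumps({
        "alert": alert["name"],
        "severity": alert["severity"],
        "host": alert["host"],
        "message": alert["message"],
    })
    return "POST", headers, body
```

Send the result with any HTTP client (e.g. `requests.post(url, headers=headers, data=body)`), and treat non-2xx responses as delivery failures so they can be retried.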

Escalation Management

Escalation Policies

Define escalation paths for different alert severities:

Level 1 Escalation (0-15 minutes)

  • Primary on-call engineer
  • Slack channel notification
  • Email to team distribution list

Level 2 Escalation (15-30 minutes)

  • Secondary on-call engineer
  • Team lead notification
  • SMS to primary on-call

Level 3 Escalation (30+ minutes)

  • Manager notification
  • Emergency contact activation
  • Multiple communication channels
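The example policy reduces to a small time-based lookup; the boundaries below follow the three levels described and are configurable in practice:

```python
def escalation_level(minutes_unacknowledged):
    """Map how long an alert has gone unacknowledged to an
    escalation level, using the example policy's boundaries."""
    if minutes_unacknowledged < 15:
        return 1   # primary on-call, Slack, team email
    if minutes_unacknowledged < 30:
        return 2   # secondary on-call, team lead, SMS
    return 3       # manager, emergency contacts, all channels
```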

Alert Suppression

  • Maintenance Windows: Suppress alerts during planned maintenance
  • Dependency Mapping: Suppress child alerts when parent systems fail
  • Frequency Limiting: Limit notification frequency for noisy alerts
  • Business Hours: Different escalation during business vs. off hours

Dashboard Configuration

Dashboard Types

Executive Dashboards

  • High-level KPIs: Overall system health and availability
  • Business Metrics: Service uptime and performance
  • Trend Analysis: Month-over-month performance trends
  • Cost Optimization: Resource utilization and efficiency

Operations Dashboards

  • Real-time Status: Current system status and alerts
  • Performance Metrics: Detailed performance indicators
  • Capacity Planning: Resource utilization and forecasting
  • Incident Tracking: Active incidents and resolution status

Application Dashboards

  • Application Performance: Response times and throughput
  • Error Tracking: Application errors and failure rates
  • User Experience: End-user performance metrics
  • Dependency Mapping: Service dependencies and health

Infrastructure Dashboards

  • System Health: CPU, memory, disk, and network status
  • Hardware Monitoring: Physical hardware health and performance
  • Network Overview: Network topology and performance
  • Security Monitoring: Security events and compliance status

Dashboard Customization

Widget Types

  • Time Series Graphs: Line charts showing metrics over time
  • Gauge Widgets: Current value displays with thresholds
  • Status Indicators: Health status and availability displays
  • Top Lists: Ranked lists of hosts or metrics
  • Heat Maps: Visual representation of metric distributions
  • Alert Summaries: Current and recent alert status

Layout Options

  • Grid Layout: Flexible grid-based widget arrangement
  • Responsive Design: Adapts to different screen sizes
  • Custom Sizing: Adjustable widget dimensions
  • Multi-page Dashboards: Organize widgets across multiple pages
  • Auto-refresh: Configurable refresh intervals

Access Control

  • Role-based Access: Different dashboards for different roles
  • Team Dashboards: Team-specific metric views
  • Public Dashboards: Read-only dashboards for stakeholders
  • Edit Permissions: Control who can modify dashboards

Health Check Configuration

System Health Checks

Core System Checks

  • Agent Connectivity: Verify all agents are connected and responsive
  • Database Health: Check database connectivity and performance
  • Queue Status: Monitor message queue health and processing
  • Certificate Validity: Check SSL/TLS certificate expiration
  • Disk Space: Verify adequate disk space on server
  • Memory Usage: Monitor server memory utilization

Service Health Checks

  • Web Server: HTTP response and performance checks
  • API Endpoints: Verify API functionality and response times
  • Background Services: Check background job processing
  • External Dependencies: Monitor external service availability
  • Load Balancer: Verify load balancer health and distribution

Custom Health Checks

Script-based Checks

#!/bin/bash
# Custom health check example
check_application_health() {
    response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
    if [ "$response" -eq 200 ]; then
        echo "Application healthy"
        exit 0
    else
        echo "Application unhealthy - HTTP $response"
        exit 1
    fi
}

check_application_health

Database Connectivity Check

#!/bin/bash
# Database health check
check_database() {
    if pg_isready -h localhost -p 5432 -U sysmanage; then
        echo "Database accessible"
        exit 0
    else
        echo "Database not accessible"
        exit 1
    fi
}

check_database

API Health Check

#!/usr/bin/env python3
import requests
import sys

def check_api_health():
    try:
        response = requests.get('http://localhost:8443/api/health', timeout=10)
        if response.status_code == 200:
            print("API healthy")
            sys.exit(0)
        else:
            print(f"API unhealthy - Status: {response.status_code}")
            sys.exit(1)
    except requests.RequestException as e:
        print(f"API check failed: {e}")
        sys.exit(1)

if __name__ == "__main__":
    check_api_health()

Monitoring Performance Optimization

Data Retention Policies

Retention Strategies

  • High-resolution data: 24 hours of 1-minute intervals
  • Medium-resolution data: 7 days of 5-minute intervals
  • Low-resolution data: 90 days of 1-hour intervals
  • Archive data: 1 year of daily summaries
  • Long-term trends: Multi-year monthly summaries
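This tiered scheme amounts to choosing a resolution by sample age. The function below encodes the example tiers (returning the interval in minutes); the cutoffs are the illustrative values above, not fixed SysManage defaults:

```python
def retention_tier(age_hours):
    """Pick the storage resolution (interval in minutes) for a sample
    by its age, following the example retention strategy."""
    if age_hours <= 24:
        return 1           # high-resolution: 1-minute intervals
    if age_hours <= 7 * 24:
        return 5           # medium-resolution: 5-minute intervals
    if age_hours <= 90 * 24:
        return 60          # low-resolution: 1-hour intervals
    if age_hours <= 365 * 24:
        return 24 * 60     # archive: daily summaries
    return 30 * 24 * 60    # long-term: monthly summaries
```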

Storage Optimization

  • Data Compression: Compress historical data
  • Archival: Move old data to cold storage
  • Purging: Automatically delete expired data
  • Indexing: Optimize database indexes for queries

Collection Optimization

Collection Intervals

  • Critical Metrics: 30-second intervals
  • Standard Metrics: 1-minute intervals
  • Resource Intensive: 5-minute intervals
  • Static Data: Hourly or daily collection

Agent Performance

  • Batch Collection: Collect multiple metrics efficiently
  • Compression: Compress data before transmission
  • Buffering: Buffer data during network outages
  • Priority Queues: Prioritize critical metrics
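Batching and outage buffering combine naturally into one structure. The class below is a sketch under assumed names, not the agent's actual implementation: metrics accumulate in memory, flush in fixed-size batches, and survive a failed send by re-buffering:

```python
class MetricBuffer:
    """Buffer metrics and flush them in batches, retaining data
    across network outages (oldest entries are dropped if the
    outage outlasts the buffer capacity)."""

    def __init__(self, batch_size=100, max_buffered=10_000):
        self.batch_size = batch_size
        self.max_buffered = max_buffered
        self.buffer = []

    def add(self, metric):
        self.buffer.append(metric)
        if len(self.buffer) > self.max_buffered:
            self.buffer.pop(0)  # drop the oldest metric

    def flush(self, send):
        """Send buffered metrics in batches via send(batch) -> bool.
        On a failed send, re-buffer the batch and stop."""
        while self.buffer:
            batch = self.buffer[:self.batch_size]
            self.buffer = self.buffer[self.batch_size:]
            if not send(batch):              # network failure
                self.buffer = batch + self.buffer
                return False
        return True
```

In a real agent, `send` would compress the batch before transmission and `flush` would run on a timer, so critical metrics could also be routed through a separate, higher-priority buffer.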

Monitoring Best Practices

Alert Design Best Practices

  • Actionable Alerts: Only alert on conditions requiring action
  • Appropriate Thresholds: Set thresholds based on historical data
  • Context Information: Include relevant context in alerts
  • Clear Recovery: Define clear recovery conditions
  • Avoid Alert Fatigue: Minimize false positives
  • Business Impact: Correlate alerts with business impact

Dashboard Design Best Practices

  • Audience-specific: Design dashboards for specific audiences
  • Information Hierarchy: Most critical information first
  • Visual Clarity: Use clear, readable visualizations
  • Performance Focus: Optimize dashboard loading times
  • Regular Review: Regularly review and update dashboards

Operational Best Practices

  • Regular Maintenance: Schedule regular monitoring system maintenance
  • Capacity Planning: Monitor and plan for system growth
  • Documentation: Document alert procedures and escalation paths
  • Training: Train team members on monitoring tools and procedures
  • Continuous Improvement: Regularly review and improve monitoring setup