Monitoring & Alerts
Comprehensive monitoring setup including alerting, performance metrics, and system health tracking for proactive infrastructure management.
Overview
SysManage provides comprehensive monitoring capabilities that enable proactive infrastructure management through real-time metrics collection, intelligent alerting, and customizable dashboards. The monitoring system is designed to scale with your infrastructure while maintaining performance and reliability.
Monitoring Capabilities
- OpenTelemetry Integration: Industry-standard observability with distributed tracing and metrics
- Prometheus Metrics: Time-series metrics storage and powerful querying with PromQL
- Grafana Integration: Professional visualization and dashboarding platform
- Real-time Metrics: Continuous collection of system performance data
- Intelligent Alerting: Smart threshold-based and anomaly detection alerts
- Custom Dashboards: Configurable views for different teams and use cases
- Historical Analysis: Long-term trend analysis and capacity planning
- Multi-channel Notifications: Email, Slack, webhooks, and more
- Escalation Management: Automated escalation based on severity and response time
OpenTelemetry & Prometheus Stack
SysManage includes built-in support for industry-standard observability tools, providing comprehensive application and infrastructure monitoring.
Telemetry Architecture
┌─────────────┐     OTLP     ┌──────────────────┐    Scrape    ┌────────────┐
│  SysManage  │─────────────>│  OpenTelemetry   │─────────────>│ Prometheus │
│   Backend   │              │    Collector     │              │            │
└─────────────┘              └──────────────────┘              └────────────┘
       │                              │                              │
       │ Prometheus                   │ Logs/Traces                  │
       │ Metrics                      │                              │
       └──────────────────────────────┴──────────────────────────────┘
                                      │
                                      v
                               ┌────────────┐
                               │  Grafana   │
                               │ (optional) │
                               └────────────┘
Quick Start
1. Install Telemetry Stack
The telemetry stack is automatically installed during development setup:
make install-dev
This installs:
- OpenTelemetry Collector Contrib
- Prometheus time-series database
- Python OpenTelemetry instrumentation libraries
2. Start Services
Telemetry services start automatically with SysManage:
make start
This starts the following services, in order:
- OpenBAO (secrets management)
- OpenTelemetry Collector (ports 4317, 4318, 8888, 55679)
- Prometheus (port 9091)
- SysManage Backend (with telemetry enabled)
3. Enable Telemetry
Set the environment variable to enable OpenTelemetry instrumentation:
export OTEL_ENABLED=true
./run.sh
Or add to your shell configuration (.bashrc, .zshrc, etc.):
export OTEL_ENABLED=true
4. Verify Telemetry
Check that services are running:
make status-telemetry
Access web interfaces:
- Prometheus UI: http://localhost:9091
- OpenTelemetry Collector zPages: http://localhost:55679/debug/tracez
- SysManage Metrics Endpoint: http://localhost:9090/metrics
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OTEL_ENABLED | false | Enable/disable OpenTelemetry instrumentation |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP gRPC endpoint for collector |
| OTEL_PROMETHEUS_PORT | 9090 | Port for Prometheus metrics from SysManage |
Configuration Files
- config/otel-collector-config.yml - OpenTelemetry Collector configuration
- config/prometheus.yml - Prometheus scrape configuration
Customizing OpenTelemetry Collector
Edit config/otel-collector-config.yml to:
- Add custom processors for data transformation
- Configure additional exporters (Jaeger, Zipkin, etc.)
- Adjust resource attributes and metadata
- Change sampling rates and batch sizes
- Add filters and data enrichment
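As an illustrative sketch of a couple of these customizations (pipeline names, endpoints, and keys here are assumptions, not the shipped defaults), a batch processor and an additional OTLP exporter for a Jaeger-compatible backend might be added like this:

```yaml
# Illustrative fragment for config/otel-collector-config.yml
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger.example.com:4317   # hypothetical trace backend
    tls:
      insecure: true

service:
  pipelines:
    traces:
      processors: [batch]
      exporters: [otlp/jaeger]
```

Merge such fragments into the existing pipelines rather than replacing them, since the shipped config already wires receivers and the Prometheus exporter.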
Customizing Prometheus
Edit config/prometheus.yml to:
- Add additional scrape targets
- Configure alerting rules
- Adjust retention settings
- Set up remote write for long-term storage
- Define custom labels and relabeling
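For example, a sketch of an added scrape target (job name and target address are assumptions; the port matches the SysManage metrics endpoint listed above):

```yaml
# Illustrative fragment for config/prometheus.yml
scrape_configs:
  - job_name: sysmanage-backend
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']   # SysManage /metrics endpoint
```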
Available Metrics
SysManage exposes comprehensive metrics through OpenTelemetry:
HTTP Metrics
- http_server_duration - HTTP request duration histogram (P50, P95, P99)
- http_server_active_requests - Active HTTP requests gauge
- http_server_request_size - Request body size histogram
- http_server_response_size - Response body size histogram
- http_server_requests_total - Total HTTP request counter
Database Metrics
- db_client_operation_duration - Database operation duration
- db_client_connections_usage - Connection pool usage
- db_client_connections_idle - Idle connections in pool
- db_client_connections_max - Maximum pool size
- db_client_operations_total - Total database operations
Application Metrics
- process_cpu_seconds_total - Total CPU time consumed
- process_resident_memory_bytes - Resident memory size
- process_virtual_memory_bytes - Virtual memory size
- process_open_fds - Number of open file descriptors
- process_start_time_seconds - Process start time
Custom Business Metrics
Add custom metrics using the OpenTelemetry SDK:
from backend.telemetry.otel_config import get_meter

meter = get_meter(__name__)

counter = meter.create_counter(
    "hosts_managed_total",
    description="Total number of managed hosts",
    unit="1",
)

counter.add(1, {"status": "active"})
Distributed Tracing
OpenTelemetry automatically traces:
- FastAPI HTTP requests and responses
- SQLAlchemy database queries
- External HTTP requests (via the requests library)
- Background task execution
Custom Spans
Add custom spans for detailed tracing:
from backend.telemetry.otel_config import get_tracer

tracer = get_tracer(__name__)

with tracer.start_as_current_span("process_agent_data") as span:
    span.set_attribute("agent.id", agent_id)
    span.set_attribute("data.size", len(data))

    # Your processing code here
    result = process_data(data)

    span.set_attribute("result.status", "success")
    return result
Viewing Traces
Access trace data through:
- OpenTelemetry Collector zPages: http://localhost:55679/debug/tracez
- Export to Jaeger: Configure OTLP exporter to send to Jaeger
- Export to Zipkin: Configure OTLP exporter to send to Zipkin
- Grafana Tempo: Use Tempo for distributed tracing in Grafana
Grafana Integration
SysManage includes native Grafana integration for professional visualization and dashboarding.
Grafana Setup
Creating API Key in Grafana
Important: The API key must have Admin role to automatically configure Prometheus data sources. Editor role is insufficient for data source management.
- Log into Grafana as an admin user
- Navigate to Configuration > API Keys (or Administration > Service Accounts > API Keys in newer versions)
- Click "Add API key" or "New API Key"
- Configure the key:
  - Name: SysManage Integration
  - Role: Admin (required for automatic data source configuration)
  - Time to live: Set expiration as needed (or leave blank for no expiration)
- Click "Add" or "Create"
- Copy the generated API key immediately (it will only be shown once)
- Store the key securely - SysManage will store it encrypted in OpenBAO vault
Configuring Grafana Integration in SysManage
- Navigate to Settings > Integrations in SysManage web UI
- Scroll to the Grafana Integration card
- Toggle "Enable Grafana Integration" to ON
- Choose your configuration method:
  - Use Managed Server: Select a Grafana server managed by SysManage (requires "Monitoring Server" role with package "grafana")
  - Manual Configuration: Toggle off "Use Managed Server" and enter your Grafana URL (e.g., http://grafana.example.com:3000)
- Paste the API key into the "API Key" field
- Click "Save"
- SysManage will automatically:
- Securely store the API key in OpenBAO vault
- Verify connectivity to Grafana
- Create "SysManage Prometheus" data source in Grafana
- Configure the data source to point to your Prometheus instance
- Click "Check Health" to verify the connection
Troubleshooting API Key Issues
- 401 Unauthorized: API key is invalid or has been deleted in Grafana. Generate a new key.
- 403 Permission Denied: API key has insufficient permissions. Ensure it has Admin role, not just Editor.
- Connection timeout: Check that the Grafana URL is correct and the server is reachable from SysManage.
- Bad gateway: If using "localhost" in Grafana URL, change to the actual hostname or IP address that Grafana can reach.
Verifying Prometheus Data Source
Note: When you configure Grafana integration with an Admin-level API key, SysManage automatically creates and configures the Prometheus data source. Manual configuration is only needed if automatic setup fails.
To verify the automatic configuration:
- In Grafana, navigate to Configuration > Data Sources
- Look for "SysManage Prometheus" in the list
- Click on it to view the configuration:
  - Name: SysManage Prometheus
  - URL: http://your-sysmanage-host:9091
  - Access: Server (proxy)
  - Default: Set as default data source
- Click "Save & Test" to verify connectivity
- You should see a green success message: "Data source is working"
Manual Prometheus Data Source Setup
If automatic configuration fails, you can manually add the data source:
- In Grafana, navigate to Configuration > Data Sources
- Click "Add data source"
- Select "Prometheus"
- Configure the data source:
  - Name: SysManage Prometheus
  - URL: http://your-sysmanage-host:9091 (use actual hostname, not "localhost")
  - Access: Server (default)
  - Scrape interval: 15s
  - HTTP Method: POST
- Click "Save & Test"
Recommended Dashboard Panels
Application Performance Dashboard
- Request Rate: rate(http_server_requests_total[5m])
- Response Time (P95): histogram_quantile(0.95, http_server_duration_bucket)
- Error Rate: rate(http_server_requests_total{status_code=~"5.."}[5m])
- Active Requests: http_server_active_requests
Database Performance Dashboard
- Query Duration: histogram_quantile(0.95, db_client_operation_duration_bucket)
- Connection Pool Usage: db_client_connections_usage / db_client_connections_max
- Idle Connections: db_client_connections_idle
- Operations Rate: rate(db_client_operations_total[5m])
System Resources Dashboard
- CPU Usage: rate(process_cpu_seconds_total[1m])
- Memory Usage: process_resident_memory_bytes
- Open File Descriptors: process_open_fds
- Uptime: time() - process_start_time_seconds
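The same panel queries can be run programmatically against the Prometheus HTTP API. A minimal sketch using only the standard library (the port follows the default listed earlier; the flatten helper assumes instant-query vector results, which is the standard /api/v1/query response shape):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9091"  # default Prometheus port from this guide

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the raw result vector."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": promql})
    with urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

def flatten(result: list) -> dict:
    """Map each series to its value, keyed by the 'instance' label."""
    out = {}
    for series in result:
        label = series["metric"].get("instance", "unknown")
        out[label] = float(series["value"][1])  # value is [timestamp, "string"]
    return out
```

For example, `flatten(instant_query("http_server_active_requests"))` would return a dict of active-request counts per scraped instance.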
Health Check
SysManage provides an API endpoint to verify Grafana connectivity:
GET /api/grafana/health
This endpoint checks:
- Grafana server reachability
- API key validity
- Grafana version information
- Build information and features
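A script can treat an HTTP 200 from this endpoint as healthy. A minimal sketch (the base URL is an assumption taken from the API example later in this page; the response body schema is not relied on):

```python
from urllib.request import Request, urlopen

def grafana_health_ok(base_url: str = "http://localhost:8443") -> bool:
    """Return True if the Grafana health endpoint answers with HTTP 200."""
    try:
        with urlopen(Request(f"{base_url}/api/grafana/health"), timeout=10) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, DNS failure, timeout, or non-2xx all count as unhealthy
        return False
```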
Metrics Collection
System Performance Metrics
SysManage automatically collects comprehensive system metrics:
CPU Metrics
- Utilization: Overall CPU usage percentage
- Load Average: 1, 5, and 15-minute load averages
- Per-Core Usage: Individual CPU core utilization
- Process Breakdown: Top CPU-consuming processes
- Context Switches: System context switch rates
- Interrupts: Hardware and software interrupt rates
Memory Metrics
- Total/Used/Free: Memory utilization breakdown
- Swap Usage: Swap space utilization and activity
- Buffer/Cache: System buffer and cache usage
- Memory Pressure: Memory pressure indicators
- Process Memory: Per-process memory consumption
- Memory Leaks: Detection of potential memory leaks
Storage Metrics
- Disk Usage: File system capacity and utilization
- I/O Performance: Read/write throughput and latency
- IOPS: Input/output operations per second
- Queue Depth: Storage queue metrics
- Disk Health: SMART data and disk health indicators
- Mount Points: All mounted file systems status
Network Metrics
- Interface Statistics: Traffic, errors, and drops per interface
- Bandwidth Utilization: Network throughput and capacity
- Connection Statistics: Active connections and states
- Protocol Statistics: TCP, UDP, and other protocol metrics
- DNS Performance: DNS resolution times and failures
- Network Latency: Network round-trip times
Application and Service Metrics
Service Monitoring
- Service Status: Running, stopped, failed service states
- Process Monitoring: Critical process health and performance
- Port Monitoring: Service availability on specific ports
- Log Analysis: Application log patterns and errors
- Resource Usage: Per-service resource consumption
- Response Times: Application response time metrics
Database Monitoring
- Connection Pools: Database connection utilization
- Query Performance: Slow query detection and analysis
- Lock Statistics: Database lock contention metrics
- Replication Status: Database replication health
- Storage Growth: Database size and growth trends
Web Server Monitoring
- Request Rates: HTTP request volume and patterns
- Response Codes: HTTP status code distribution
- Response Times: Request processing latency
- Connection Metrics: Active connections and limits
- SSL Certificate: Certificate expiration monitoring
Custom Metrics
Extend monitoring with custom application metrics:
Custom Script Monitoring
- Script Execution: Run custom monitoring scripts
- Exit Code Monitoring: Alert on script failures
- Output Parsing: Extract metrics from script output
- Scheduled Execution: Define custom execution schedules
- Timeout Handling: Handle long-running or stuck scripts
API Integration
- REST API Calls: Monitor external API availability
- Response Validation: Validate API response content
- Authentication: Support for various authentication methods
- Rate Limiting: Respect API rate limits
- Custom Headers: Include custom headers in requests
Alerting System
Alert Configuration
Alert Types
- Threshold Alerts: Trigger when metrics exceed defined thresholds
- Anomaly Detection: Machine learning-based anomaly detection
- Service Alerts: Alert on service availability and health
- Log Pattern Alerts: Trigger on specific log patterns
- Composite Alerts: Multi-condition alerts with logical operators
- Predictive Alerts: Early warning based on trend analysis
Threshold Configuration
Configure alert thresholds for optimal balance between noise and coverage:
CPU Utilization Example
- Warning: CPU > 70% for 5 minutes
- Critical: CPU > 90% for 2 minutes
- Recovery: CPU < 60% for 3 minutes
Memory Usage Example
- Warning: Memory > 80% for 10 minutes
- Critical: Memory > 95% for 1 minute
- Recovery: Memory < 75% for 5 minutes
Disk Space Example
- Warning: Disk > 85% usage
- Critical: Disk > 95% usage
- Recovery: Disk < 80% usage
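The warning/critical/recovery pattern above is hysteresis with a hold time: a state only changes after the condition has persisted. A simplified sketch, assuming one sample per minute so hold times are expressed as consecutive-sample counts (the class and defaults mirror the CPU example but are illustrative, not SysManage's implementation):

```python
class ThresholdAlert:
    """Track a metric and change state only after a condition holds long enough."""

    def __init__(self, warn=70, crit=90, recover=60,
                 warn_hold=5, crit_hold=2, recover_hold=3):
        self.warn, self.crit, self.recover = warn, crit, recover
        self.warn_hold, self.crit_hold, self.recover_hold = warn_hold, crit_hold, recover_hold
        self.state = "ok"
        self._pending = None   # candidate state waiting out its hold time
        self._streak = 0       # consecutive samples supporting the candidate

    def update(self, value: float) -> str:
        """Feed one sample (e.g. one per minute) and return the current state."""
        if value > self.crit:
            target, hold = "critical", self.crit_hold
        elif value > self.warn:
            target, hold = "warning", self.warn_hold
        elif value < self.recover:
            target, hold = "ok", self.recover_hold
        else:
            target, hold = self.state, 1  # dead band: keep current state

        if target == self.state:
            self._pending, self._streak = None, 0
        elif target == self._pending:
            self._streak += 1
        else:
            self._pending, self._streak = target, 1

        if self._pending is not None and self._streak >= hold:
            self.state = self._pending
            self._pending, self._streak = None, 0
        return self.state
```

The recovery threshold being lower than the warning threshold is what prevents flapping when a metric hovers near a single cutoff.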
Notification Channels
Supported Channels
- Email: SMTP-based email notifications
- Slack: Team collaboration via Slack channels
- Microsoft Teams: Integration with Teams channels
- Discord: Community and team notifications
- Webhooks: Custom HTTP POST notifications
- SMS: Text message alerts via third-party services
- PagerDuty: Professional incident management
- OpsGenie: Enterprise alerting and escalation
Channel Configuration
Email Configuration
SMTP Settings:
Server: mail.example.com
Port: 587
Security: STARTTLS
Authentication: username/password
Slack Integration
Webhook URL: https://hooks.slack.com/services/...
Channel: #infrastructure-alerts
Username: SysManage
Icon: :warning:
Webhook Configuration
URL: https://api.example.com/alerts
Method: POST
Headers:
Content-Type: application/json
Authorization: Bearer <token>
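A webhook delivery matching the configuration above can be sketched with the standard library (the payload fields are illustrative; SysManage's actual webhook body schema may differ):

```python
import json
from urllib.request import Request, urlopen

def build_alert_payload(alert_name: str, severity: str, host: str, value: float) -> bytes:
    """Serialize an alert into the JSON body POSTed to the webhook (fields illustrative)."""
    return json.dumps({
        "alert": alert_name,
        "severity": severity,
        "host": host,
        "value": value,
    }).encode()

def send_webhook(url: str, token: str, payload: bytes) -> int:
    """POST the payload with the Content-Type and Authorization headers shown above."""
    req = Request(url, data=payload, method="POST", headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    })
    with urlopen(req, timeout=10) as resp:
        return resp.status
```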
Escalation Management
Escalation Policies
Define escalation paths for different alert severities:
Level 1 Escalation (0-15 minutes)
- Primary on-call engineer
- Slack channel notification
- Email to team distribution list
Level 2 Escalation (15-30 minutes)
- Secondary on-call engineer
- Team lead notification
- SMS to primary on-call
Level 3 Escalation (30+ minutes)
- Manager notification
- Emergency contact activation
- Multiple communication channels
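The three levels above reduce to a lookup on time since the alert fired. A sketch (target identifiers are illustrative placeholders, not SysManage configuration keys):

```python
def escalation_targets(elapsed_minutes: float) -> list[str]:
    """Map time since the alert fired to the notification targets for that level."""
    if elapsed_minutes < 15:
        return ["primary-oncall", "slack:#infrastructure-alerts", "email:team"]
    if elapsed_minutes < 30:
        return ["secondary-oncall", "team-lead", "sms:primary-oncall"]
    return ["manager", "emergency-contact", "all-channels"]
```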
Alert Suppression
- Maintenance Windows: Suppress alerts during planned maintenance
- Dependency Mapping: Suppress child alerts when parent systems fail
- Frequency Limiting: Limit notification frequency for noisy alerts
- Business Hours: Different escalation during business vs. off hours
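Two of these rules, maintenance windows and dependency mapping, can be sketched as a single suppression check (a simplification; real dependency mapping walks a service graph rather than taking a boolean):

```python
from datetime import datetime

def should_suppress(now: datetime,
                    maintenance_windows: list[tuple[datetime, datetime]],
                    parent_down: bool = False) -> bool:
    """Suppress an alert during a maintenance window or when its parent system is down."""
    if parent_down:
        return True  # child alert is redundant while the parent failure is active
    return any(start <= now <= end for start, end in maintenance_windows)
```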
Dashboard Configuration
Dashboard Types
Executive Dashboards
- High-level KPIs: Overall system health and availability
- Business Metrics: Service uptime and performance
- Trend Analysis: Month-over-month performance trends
- Cost Optimization: Resource utilization and efficiency
Operations Dashboards
- Real-time Status: Current system status and alerts
- Performance Metrics: Detailed performance indicators
- Capacity Planning: Resource utilization and forecasting
- Incident Tracking: Active incidents and resolution status
Application Dashboards
- Application Performance: Response times and throughput
- Error Tracking: Application errors and failure rates
- User Experience: End-user performance metrics
- Dependency Mapping: Service dependencies and health
Infrastructure Dashboards
- System Health: CPU, memory, disk, and network status
- Hardware Monitoring: Physical hardware health and performance
- Network Overview: Network topology and performance
- Security Monitoring: Security events and compliance status
Dashboard Customization
Widget Types
- Time Series Graphs: Line charts showing metrics over time
- Gauge Widgets: Current value displays with thresholds
- Status Indicators: Health status and availability displays
- Top Lists: Ranked lists of hosts or metrics
- Heat Maps: Visual representation of metric distributions
- Alert Summaries: Current and recent alert status
Layout Options
- Grid Layout: Flexible grid-based widget arrangement
- Responsive Design: Adapts to different screen sizes
- Custom Sizing: Adjustable widget dimensions
- Multi-page Dashboards: Organize widgets across multiple pages
- Auto-refresh: Configurable refresh intervals
Access Control
- Role-based Access: Different dashboards for different roles
- Team Dashboards: Team-specific metric views
- Public Dashboards: Read-only dashboards for stakeholders
- Edit Permissions: Control who can modify dashboards
Health Check Configuration
System Health Checks
Core System Checks
- Agent Connectivity: Verify all agents are connected and responsive
- Database Health: Check database connectivity and performance
- Queue Status: Monitor message queue health and processing
- Certificate Validity: Check SSL/TLS certificate expiration
- Disk Space: Verify adequate disk space on server
- Memory Usage: Monitor server memory utilization
Service Health Checks
- Web Server: HTTP response and performance checks
- API Endpoints: Verify API functionality and response times
- Background Services: Check background job processing
- External Dependencies: Monitor external service availability
- Load Balancer: Verify load balancer health and distribution
Custom Health Checks
Script-based Checks
#!/bin/bash
# Custom health check example

check_application_health() {
    response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
    if [ "$response" -eq 200 ]; then
        echo "Application healthy"
        exit 0
    else
        echo "Application unhealthy - HTTP $response"
        exit 1
    fi
}
Database Connectivity Check
#!/bin/bash
# Database health check

check_database() {
    if pg_isready -h localhost -p 5432 -U sysmanage; then
        echo "Database accessible"
        exit 0
    else
        echo "Database not accessible"
        exit 1
    fi
}
API Health Check
#!/usr/bin/env python3
import requests
import sys

def check_api_health():
    try:
        response = requests.get('http://localhost:8443/api/health', timeout=10)
        if response.status_code == 200:
            print("API healthy")
            sys.exit(0)
        else:
            print(f"API unhealthy - Status: {response.status_code}")
            sys.exit(1)
    except Exception as e:
        print(f"API check failed: {e}")
        sys.exit(1)
Monitoring Performance Optimization
Data Retention Policies
Retention Strategies
- High-resolution data: 24 hours of 1-minute intervals
- Medium-resolution data: 7 days of 5-minute intervals
- Low-resolution data: 90 days of 1-hour intervals
- Archive data: 1 year of daily summaries
- Long-term trends: Multi-year monthly summaries
Storage Optimization
- Data Compression: Compress historical data
- Archival: Move old data to cold storage
- Purging: Automatically delete expired data
- Indexing: Optimize database indexes for queries
Collection Optimization
Collection Intervals
- Critical Metrics: 30-second intervals
- Standard Metrics: 1-minute intervals
- Resource Intensive: 5-minute intervals
- Static Data: Hourly or daily collection
Agent Performance
- Batch Collection: Collect multiple metrics efficiently
- Compression: Compress data before transmission
- Buffering: Buffer data during network outages
- Priority Queues: Prioritize critical metrics
Monitoring Best Practices
Alert Design Best Practices
- Actionable Alerts: Only alert on conditions requiring action
- Appropriate Thresholds: Set thresholds based on historical data
- Context Information: Include relevant context in alerts
- Clear Recovery: Define clear recovery conditions
- Avoid Alert Fatigue: Minimize false positives
- Business Impact: Correlate alerts with business impact
Dashboard Design Best Practices
- Audience-specific: Design dashboards for specific audiences
- Information Hierarchy: Most critical information first
- Visual Clarity: Use clear, readable visualizations
- Performance Focus: Optimize dashboard loading times
- Regular Review: Regularly review and update dashboards
Operational Best Practices
- Regular Maintenance: Schedule regular monitoring system maintenance
- Capacity Planning: Monitor and plan for system growth
- Documentation: Document alert procedures and escalation paths
- Training: Train team members on monitoring tools and procedures
- Continuous Improvement: Regularly review and improve monitoring setup