Monitoring API

System monitoring, metrics collection, and alerting through programmatic interfaces.

Overview

The Monitoring API provides real-time metrics collection, health checks, alert management, and historical data analysis across your infrastructure.

Diagnostics Collection

POST /api/v1/host/{host_id}/collect-diagnostics

Start a diagnostic data collection for a specific host. The response returns a diagnostic_id that identifies the collection in the status, detail, and delete endpoints.

🔒 Authentication Required

Path Parameters

  • host_id (string) - UUID of the target host

Request Body

{
  "include_logs": true,
  "include_processes": true,
  "include_network": true,
  "include_storage": true,
  "log_lines": 1000,
  "custom_commands": [
    "systemctl status",
    "df -h",
    "free -m"
  ]
}

Response (200 OK)

{
  "diagnostic_id": "uuid",
  "status": "initiated",
  "estimated_completion": "2024-01-01T12:05:00Z",
  "collection_size_estimate": "2.5 MB"
}

GET /api/v1/host/{host_id}/diagnostics

Get a list of diagnostic collections for a specific host.

🔒 Authentication Required

Response (200 OK)

[
  {
    "diagnostic_id": "uuid",
    "collected_at": "2024-01-01T12:00:00Z",
    "status": "completed",
    "size": "2.3 MB",
    "components": [
      "system_info",
      "processes",
      "network",
      "storage",
      "logs"
    ]
  }
]

GET /api/v1/diagnostic/{diagnostic_id}

Get detailed diagnostic data for a specific collection.

🔒 Authentication Required

Response (200 OK)

{
  "diagnostic_id": "uuid",
  "host_id": "uuid",
  "collected_at": "2024-01-01T12:00:00Z",
  "status": "completed",
  "data": {
    "system_info": {
      "hostname": "web-01",
      "uptime": "15 days, 3:22:45",
      "load_average": [1.2, 1.5, 1.8],
      "cpu_cores": 4,
      "memory_total": "8 GB",
      "disk_usage": [
        {
          "mount": "/",
          "used": "45%",
          "available": "55%"
        }
      ]
    },
    "processes": [
      {
        "pid": 1234,
        "name": "nginx",
        "cpu_percent": 2.3,
        "memory_percent": 1.8,
        "status": "running"
      }
    ],
    "network": {
      "interfaces": [
        {
          "name": "eth0",
          "ip": "192.168.1.100",
          "status": "up",
          "rx_bytes": 1234567890,
          "tx_bytes": 987654321
        }
      ]
    }
  }
}

GET /api/v1/diagnostic/{diagnostic_id}/status

Get the status of a diagnostic collection operation.

🔒 Authentication Required

Response (200 OK)

{
  "diagnostic_id": "uuid",
  "status": "in_progress",
  "progress": 65,
  "current_step": "collecting_logs",
  "estimated_completion": "2024-01-01T12:03:00Z",
  "error": null
}

DELETE /api/v1/diagnostic/{diagnostic_id}

Delete a diagnostic collection and its data.

🔒 Authentication Required

Response (200 OK)

{
  "message": "Diagnostic collection deleted successfully",
  "diagnostic_id": "uuid"
}
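
The endpoints above suggest an initiate-then-poll workflow. The following is a minimal Python sketch of it, not an official client. BASE_URL, TOKEN, the bearer-token header, and all function names are assumptions made for illustration; the source only states that authentication is required. Treating "completed" as the only successful terminal status is also an assumption based on the sample responses.

```python
import json
import time
import urllib.request

BASE_URL = "https://sysmanage.example.com"  # assumption: placeholder server address
TOKEN = "YOUR_TOKEN"  # assumption: bearer-token auth; the scheme is not documented here


def build_collect_request(host_id, log_lines=1000, custom_commands=None):
    """Build (but do not send) the POST request that starts a collection."""
    body = {
        "include_logs": True,
        "include_processes": True,
        "include_network": True,
        "include_storage": True,
        "log_lines": log_lines,
        "custom_commands": custom_commands or [],
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/host/{host_id}/collect-diagnostics",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {TOKEN}",
        },
        method="POST",
    )


def fetch_json(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_diagnostic(diagnostic_id, fetch=fetch_json, poll_seconds=5.0, max_polls=60):
    """Poll the status endpoint until the collection reports "completed"."""
    url = f"{BASE_URL}/api/v1/diagnostic/{diagnostic_id}/status"
    for _ in range(max_polls):
        request = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
        status = fetch(request)
        if status.get("error"):
            raise RuntimeError(f"collection failed: {status['error']}")
        if status["status"] == "completed":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"diagnostic {diagnostic_id} did not complete in time")
```

A caller would send build_collect_request(host_id) with fetch_json, read diagnostic_id from the response, and pass it to wait_for_diagnostic before requesting the detailed data. The fetch parameter exists so the polling loop can be exercised without a live server.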

Queue Monitoring

GET /api/v1/queue/failed

Get a list of failed queue messages for monitoring and debugging.

🔒 Authentication Required

Query Parameters

  • limit (integer, optional) - Maximum results (default: 50)
  • offset (integer, optional) - Results offset (default: 0)

Response (200 OK)

[
  {
    "message_id": "uuid",
    "queue_name": "host_commands",
    "message_type": "command_execution",
    "failed_at": "2024-01-01T12:00:00Z",
    "retry_count": 3,
    "error": "Connection timeout to host",
    "payload": {
      "host_id": "uuid",
      "command": "systemctl restart nginx"
    }
  }
]

GET /api/v1/queue/failed/{message_id}

Get detailed information about a specific failed message.

🔒 Authentication Required

Response (200 OK)

{
  "message_id": "uuid",
  "queue_name": "host_commands",
  "message_type": "command_execution",
  "created_at": "2024-01-01T11:55:00Z",
  "failed_at": "2024-01-01T12:00:00Z",
  "retry_count": 3,
  "max_retries": 3,
  "error": "Connection timeout to host web-01",
  "error_stack": "...",
  "payload": {
    "host_id": "uuid",
    "command": "systemctl restart nginx",
    "timeout": 30
  },
  "retry_history": [
    {
      "attempt": 1,
      "failed_at": "2024-01-01T11:56:00Z",
      "error": "Connection timeout"
    }
  ]
}

DELETE /api/v1/queue/failed

Clear all failed messages from the queue, or only those that match the optional request body filters.

🔒 Authentication Required

Request Body (Optional)

{
  "older_than": "2024-01-01T00:00:00Z",
  "queue_name": "host_commands"
}

Response (200 OK)

{
  "message": "Failed messages cleared",
  "count": 25
}
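
The limit and offset parameters of GET /api/v1/queue/failed can be combined to walk the full list of failed messages. The sketch below shows one way to do that in Python. The function names and the fetch_page callback are hypothetical, and stopping when a page is shorter than limit is an assumption, since the documented response does not include a total count.

```python
from urllib.parse import urlencode


def failed_messages_url(base_url, limit=50, offset=0):
    """Build the list URL; the defaults mirror the documented ones."""
    query = urlencode({"limit": limit, "offset": offset})
    return f"{base_url}/api/v1/queue/failed?{query}"


def iter_failed_messages(fetch_page, limit=50):
    """Yield every failed message, one page at a time.

    fetch_page(limit, offset) is a caller-supplied function that returns the
    decoded JSON list for one page.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:  # assumption: a short page marks the end
            return
        offset += limit


def count_by_queue(messages):
    """Count failed messages per queue_name."""
    counts = {}
    for message in messages:
        name = message["queue_name"]
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Grouping the results with count_by_queue gives a quick view of which queue is producing failures before any messages are cleared.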

Security Monitoring

GET /api/v1/security/default-credentials-status

Check whether the system or any of its users is still using default credentials.

🔒 Authentication Required

Response (200 OK)

{
  "using_default_credentials": false,
  "last_password_change": "2024-01-01T10:00:00Z",
  "admin_users_count": 3,
  "users_with_default_passwords": [],
  "security_score": 95,
  "recommendations": [
    "Enable two-factor authentication",
    "Rotate API keys monthly"
  ]
}
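
A monitoring job could turn this response into a list of findings. The Python sketch below is one possible approach. The function name and the min_score threshold are hypothetical, and it assumes that users_with_default_passwords holds user names as strings, which the empty sample list does not confirm.

```python
def credential_findings(status, min_score=80):
    """Return human-readable findings; an empty list means nothing to report."""
    findings = []
    if status.get("using_default_credentials"):
        findings.append("system is using default credentials")
    for user in status.get("users_with_default_passwords", []):
        findings.append(f"user {user} still has a default password")
    score = status.get("security_score")
    if score is not None and score < min_score:  # min_score is an illustrative cutoff
        findings.append(f"security score {score} is below {min_score}")
    return findings
```

Applied to the sample response above, the function returns an empty list.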

Email Configuration Monitoring

GET /api/v1/email/config

Get email configuration status for monitoring alert delivery.

🔒 Authentication Required

Response (200 OK)

{
  "configured": true,
  "smtp_server": "smtp.example.com",
  "smtp_port": 587,
  "use_tls": true,
  "from_address": "alerts@example.com",
  "last_test": "2024-01-01T10:00:00Z",
  "test_status": "success"
}

POST /api/v1/email/test

Send a test email to verify the email configuration.

🔒 Authentication Required

Request Body

{
  "to_address": "admin@example.com",
  "subject": "SysManage Email Test"
}

Response (200 OK)

{
  "success": true,
  "message": "Test email sent successfully",
  "sent_at": "2024-01-01T12:00:00Z"
}
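
Because alert notifications depend on working email, it can help to check the configuration response before relying on alerts. The Python sketch below inspects the fields of GET /api/v1/email/config and builds the body for POST /api/v1/email/test. Both function names are hypothetical, and treating any test_status other than "success" as a problem is an assumption.

```python
import json


def email_config_problems(config):
    """List reasons why alert email may not be delivered."""
    problems = []
    if not config.get("configured"):
        problems.append("email is not configured")
    elif config.get("test_status") != "success":
        problems.append(f"last email test status was {config.get('test_status')!r}")
    return problems


def email_test_body(to_address, subject="SysManage Email Test"):
    """Serialize the request body for the test endpoint."""
    return json.dumps({"to_address": to_address, "subject": subject})
```

An empty result from email_config_problems means the configuration is present and its last recorded test succeeded.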

System Health Checks

GET /api/v1/health

Get overall system health status.

Response (200 OK)

{
  "status": "healthy",
  "timestamp": "2024-01-01T12:00:00Z",
  "version": "1.0.0",
  "components": {
    "database": {
      "status": "healthy",
      "response_time_ms": 12
    },
    "websocket": {
      "status": "healthy",
      "active_connections": 25
    },
    "queue": {
      "status": "healthy",
      "pending_messages": 3,
      "failed_messages": 0
    },
    "agents": {
      "status": "degraded",
      "online": 148,
      "offline": 2,
      "total": 150
    }
  }
}

GET /api/v1/health/database

Get detailed database health information.

🔒 Authentication Required

Response (200 OK)

{
  "status": "healthy",
  "connection_pool": {
    "active": 5,
    "idle": 10,
    "max": 20
  },
  "query_performance": {
    "avg_response_time_ms": 8.5,
    "slow_queries": 0
  },
  "disk_usage": {
    "size": "1.2 GB",
    "growth_rate": "50 MB/day"
  },
  "last_backup": "2024-01-01T02:00:00Z"
}
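
The sample health response reports an overall status of "healthy" while one component is "degraded", so a consumer may want to inspect components individually. The Python sketch below shows one way to do that and to derive two ratios from the documented fields. The function names are hypothetical, and treating every status other than "healthy" as noteworthy is an assumption.

```python
def degraded_components(health):
    """Map component name to status for every component that is not "healthy"."""
    return {
        name: info.get("status")
        for name, info in health.get("components", {}).items()
        if info.get("status") != "healthy"
    }


def agent_online_ratio(health):
    """Fraction of agents reported online (0.0 when no agents are registered)."""
    agents = health["components"]["agents"]
    return agents["online"] / agents["total"] if agents["total"] else 0.0


def pool_utilization(db_health):
    """Fraction of the database connection pool that is currently active."""
    pool = db_health["connection_pool"]
    return pool["active"] / pool["max"]
```

With the sample responses above, degraded_components returns only the agents entry, and pool_utilization gives 5 active connections out of a maximum of 20.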

Metrics and Analytics

GET /api/v1/metrics/summary

Get aggregated metrics summary across the infrastructure.

🔒 Authentication Required

Query Parameters

  • timeframe (string) - Time period (hour, day, week, month)
  • metrics (array) - Specific metrics to include

Response (200 OK)

{
  "timeframe": "24h",
  "summary": {
    "total_hosts": 150,
    "avg_cpu_usage": 23.5,
    "avg_memory_usage": 67.2,
    "avg_disk_usage": 45.8,
    "total_commands_executed": 1247,
    "total_packages_managed": 356,
    "alerts_generated": 12,
    "uptime_percentage": 99.8
  },
  "trends": {
    "cpu_usage": "stable",
    "memory_usage": "increasing",
    "disk_usage": "stable",
    "network_traffic": "decreasing"
  }
}

GET /api/v1/metrics/hosts/{host_id}/history

Get historical metrics data for a specific host.

🔒 Authentication Required

Query Parameters

  • start_time (string) - Start timestamp (ISO 8601)
  • end_time (string) - End timestamp (ISO 8601)
  • interval (string) - Data interval (5m, 1h, 1d)

Response (200 OK)

{
  "host_id": "uuid",
  "interval": "1h",
  "data_points": [
    {
      "timestamp": "2024-01-01T11:00:00Z",
      "cpu_usage": 25.3,
      "memory_usage": 65.1,
      "disk_usage": 45.8,
      "network_rx": 1024,
      "network_tx": 2048
    }
  ]
}
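
The history endpoint takes ISO 8601 timestamps and an interval as query parameters. The Python sketch below builds such a URL and averages one metric over the returned data points. The function names are hypothetical, and treating 5m, 1h, and 1d as the complete set of valid intervals is an assumption based on the parameter description.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

INTERVALS = ("5m", "1h", "1d")  # assumption: the listed values are the full set


def to_iso8601(moment):
    """Format a timezone-aware datetime as a UTC ISO 8601 timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def history_url(base_url, host_id, start, end, interval="1h"):
    """Build the history URL for a host and time range."""
    if interval not in INTERVALS:
        raise ValueError(f"unsupported interval: {interval}")
    if start >= end:
        raise ValueError("start must be earlier than end")
    query = urlencode({
        "start_time": to_iso8601(start),
        "end_time": to_iso8601(end),
        "interval": interval,
    })
    return f"{base_url}/api/v1/metrics/hosts/{host_id}/history?{query}"


def average_metric(history, metric):
    """Mean of one metric across data_points, or None if it never appears."""
    values = [point[metric] for point in history["data_points"] if metric in point]
    return sum(values) / len(values) if values else None
```

Passing timezone-aware datetime values avoids ambiguity, because astimezone interprets naive values as local time.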

Alert Management

GET /api/v1/alerts

Get a list of alerts, optionally filtered by severity, status, or host.

🔒 Authentication Required

Query Parameters

  • severity (string) - Filter by severity (info, warning, critical)
  • status (string) - Filter by status (active, acknowledged, resolved)
  • host_id (string) - Filter by specific host

Response (200 OK)

[
  {
    "alert_id": "uuid",
    "severity": "warning",
    "title": "High disk usage",
    "description": "Disk usage on /var is 85%",
    "host_id": "uuid",
    "hostname": "web-01",
    "triggered_at": "2024-01-01T12:00:00Z",
    "status": "active",
    "metric": "disk_usage",
    "threshold": 80,
    "current_value": 85
  }
]

POST /api/v1/alerts/{alert_id}/acknowledge

Acknowledge an alert.

🔒 Authentication Required

Request Body

{
  "message": "Investigating high disk usage",
  "acknowledged_by": "admin"
}

Response (200 OK)

{
  "alert_id": "uuid",
  "status": "acknowledged",
  "acknowledged_at": "2024-01-01T12:05:00Z",
  "acknowledged_by": "admin"
}
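
The alert filters and the acknowledge request can be combined into a simple triage step. The Python sketch below builds a filtered list URL, orders alerts by severity, and serializes an acknowledgement. All function names are hypothetical, and ranking severities in the listed order (info, warning, critical) is an assumption.

```python
import json
from urllib.parse import urlencode

SEVERITIES = ("info", "warning", "critical")  # assumed to run from least to most severe
STATUSES = ("active", "acknowledged", "resolved")


def alerts_url(base_url, severity=None, status=None, host_id=None):
    """Build the alert list URL, including only the filters that were given."""
    if severity is not None and severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if status is not None and status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    filters = (("severity", severity), ("status", status), ("host_id", host_id))
    params = {key: value for key, value in filters if value is not None}
    url = f"{base_url}/api/v1/alerts"
    return f"{url}?{urlencode(params)}" if params else url


def most_urgent_first(alerts):
    """Sort alerts so that the highest severity comes first."""
    return sorted(alerts, key=lambda alert: SEVERITIES.index(alert["severity"]), reverse=True)


def acknowledge_body(message, acknowledged_by):
    """Serialize the request body for the acknowledge endpoint."""
    return json.dumps({"message": message, "acknowledged_by": acknowledged_by})
```

A triage script could fetch alerts_url(base_url, status="active"), sort the result with most_urgent_first, and post acknowledge_body for each alert it starts to investigate.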

Important Notes

  • Diagnostic collections can be resource-intensive; schedule them during low-usage periods.
  • Failed queue messages indicate potential system issues that require attention.
  • Regular health checks help identify issues before they become critical.
  • Historical metrics data is retained for 90 days by default.
  • Alert thresholds can be customized through configuration.
  • Email configuration is required for alert notifications.