Monitoring API

System monitoring, metrics collection, and alerting through programmatic interfaces.

Overview

The Monitoring API provides comprehensive system monitoring capabilities including real-time metrics collection, health checks, alert management, and historical data analysis across your entire infrastructure.

Diagnostics Collection

POST /api/v1/host/{host_id}/collect-diagnostics

Initiate comprehensive diagnostic data collection for a specific host.

🔒 Authentication Required

Path Parameters

host_id (string) - UUID of the target host

Request Body

{
  "include_logs": true,
  "include_processes": true,
  "include_network": true,
  "include_storage": true,
  "log_lines": 1000,
  "custom_commands": [
    "systemctl status",
    "df -h",
    "free -m"
  ]
}

Response (200 OK)

{
  "diagnostic_id": "uuid",
  "status": "initiated",
  "estimated_completion": "2024-01-01T12:05:00Z",
  "collection_size_estimate": "2.5 MB"
}
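
As a rough illustration of the request above, the following Python sketch builds the POST request without sending it. The base URL, the bearer-token header, and the helper name build_collect_request are assumptions made for this example; the documentation only states that authentication is required, not which scheme is used.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address of your own server.
BASE_URL = "https://sysmanage.example.com"


def build_collect_request(host_id: str, token: str, log_lines: int = 1000) -> urllib.request.Request:
    """Build (but do not send) the request that starts a diagnostic collection."""
    body = {
        "include_logs": True,
        "include_processes": True,
        "include_network": True,
        "include_storage": True,
        "log_lines": log_lines,
        "custom_commands": ["df -h", "free -m"],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/host/{host_id}/collect-diagnostics",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme: the docs do not name the header format.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_collect_request("123e4567-e89b-12d3-a456-426614174000", "example-token")
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; keeping construction separate makes the payload easy to inspect or test first.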

GET /api/v1/host/{host_id}/diagnostics

Get the list of diagnostic collections for a specific host.

🔒 Authentication Required

Response (200 OK)

[
  {
    "diagnostic_id": "uuid",
    "collected_at": "2024-01-01T12:00:00Z",
    "status": "completed",
    "size": "2.3 MB",
    "components": [
      "system_info",
      "processes",
      "network",
      "storage",
      "logs"
    ]
  }
]

GET /api/v1/diagnostic/{diagnostic_id}

Get detailed diagnostic data for a specific collection.

🔒 Authentication Required

Response (200 OK)

{
  "diagnostic_id": "uuid",
  "host_id": "uuid",
  "collected_at": "2024-01-01T12:00:00Z",
  "status": "completed",
  "data": {
    "system_info": {
      "hostname": "web-01",
      "uptime": "15 days, 3:22:45",
      "load_average": [1.2, 1.5, 1.8],
      "cpu_cores": 4,
      "memory_total": "8 GB",
      "disk_usage": [
        {
          "mount": "/",
          "used": "45%",
          "available": "55%"
        }
      ]
    },
    "processes": [
      {
        "pid": 1234,
        "name": "nginx",
        "cpu_percent": 2.3,
        "memory_percent": 1.8,
        "status": "running"
      }
    ],
    "network": {
      "interfaces": [
        {
          "name": "eth0",
          "ip": "192.168.1.100",
          "status": "up",
          "rx_bytes": 1234567890,
          "tx_bytes": 987654321
        }
      ]
    }
  }
}
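
Because disk usage is reported as percentage strings such as "45%", a client has to parse them before comparing against a limit. The sketch below shows one way to do that; the function name mounts_over_threshold, the 80% default, and the second mount in the sample data are illustrative assumptions, not part of the API.

```python
def mounts_over_threshold(diagnostic: dict, threshold_percent: float = 80.0) -> list[str]:
    """Return mount points whose 'used' percentage meets or exceeds the threshold."""
    disks = diagnostic.get("data", {}).get("system_info", {}).get("disk_usage", [])
    flagged = []
    for disk in disks:
        # The API reports usage as a string like "45%", so strip the sign first.
        used = float(disk["used"].rstrip("%"))
        if used >= threshold_percent:
            flagged.append(disk["mount"])
    return flagged


# Sample shaped like the response above; the "/var" entry is made up for illustration.
sample = {
    "data": {
        "system_info": {
            "disk_usage": [
                {"mount": "/", "used": "45%", "available": "55%"},
                {"mount": "/var", "used": "85%", "available": "15%"},
            ]
        }
    }
}
print(mounts_over_threshold(sample))  # prints ['/var']
```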

GET /api/v1/diagnostic/{diagnostic_id}/status

Get status of a diagnostic collection operation.

🔒 Authentication Required

Response (200 OK)

{
  "diagnostic_id": "uuid",
  "status": "in_progress",
  "progress": 65,
  "current_step": "collecting_logs",
  "estimated_completion": "2024-01-01T12:03:00Z",
  "error": null
}
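
Since collection runs asynchronously, a client typically polls this endpoint until the collection finishes. The following is a minimal polling sketch under stated assumptions: fetch_status is a hypothetical callable that performs the GET and returns the parsed JSON, "completed" is treated as the terminal status (as seen in the other responses), and a non-null error field is treated as failure.

```python
import time
from typing import Callable


def wait_for_diagnostic(
    fetch_status: Callable[[str], dict],
    diagnostic_id: str,
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> dict:
    """Poll the status endpoint until the collection completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(diagnostic_id)
        # Assumption: a non-null "error" field means the collection failed.
        if status.get("error"):
            raise RuntimeError(f"collection failed: {status['error']}")
        if status["status"] == "completed":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"diagnostic {diagnostic_id} did not complete in time")
        time.sleep(poll_interval)


# Stand-in for real HTTP calls: two canned responses served in order.
_fake_responses = iter([
    {"diagnostic_id": "d1", "status": "in_progress", "progress": 65, "error": None},
    {"diagnostic_id": "d1", "status": "completed", "error": None},
])
result = wait_for_diagnostic(lambda _id: next(_fake_responses), "d1", poll_interval=0.0)
print(result["status"])  # prints completed
```

Injecting the fetch function keeps the loop independent of any particular HTTP client and makes it testable without a server.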

DELETE /api/v1/diagnostic/{diagnostic_id}

Delete a diagnostic collection and its data.

🔒 Authentication Required

Response (200 OK)

{
  "message": "Diagnostic collection deleted successfully",
  "diagnostic_id": "uuid"
}

Queue Monitoring

GET /api/v1/queue/failed

Get the list of failed queue messages for monitoring and debugging.

🔒 Authentication Required

Query Parameters

limit (integer, optional) - Maximum results (default: 50)
offset (integer, optional) - Results offset (default: 0)

Response (200 OK)

[
  {
    "message_id": "uuid",
    "queue_name": "host_commands",
    "message_type": "command_execution",
    "failed_at": "2024-01-01T12:00:00Z",
    "retry_count": 3,
    "error": "Connection timeout to host",
    "payload": {
      "host_id": "uuid",
      "command": "systemctl restart nginx"
    }
  }
]
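
The limit and offset parameters can be combined to walk through every failed message. The sketch below assumes that a page shorter than limit marks the end of the results, which the documentation does not state explicitly; fetch_page is a hypothetical callable that issues the GET and returns the parsed JSON list.

```python
from typing import Callable, Iterator


def iter_failed_messages(
    fetch_page: Callable[[int, int], list],
    limit: int = 50,
) -> Iterator[dict]:
    """Yield failed messages page by page using limit/offset pagination."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        # Assumption: a short (or empty) page means there is nothing left.
        if len(page) < limit:
            return
        offset += limit


# Stand-in data source: 120 fake messages served in slices.
_all_messages = [{"message_id": f"m{i}"} for i in range(120)]
messages = list(iter_failed_messages(lambda limit, offset: _all_messages[offset:offset + limit]))
print(len(messages))  # prints 120
```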

GET /api/v1/queue/failed/{message_id}

Get detailed information about a specific failed message.

🔒 Authentication Required

Response (200 OK)

{
  "message_id": "uuid",
  "queue_name": "host_commands",
  "message_type": "command_execution",
  "created_at": "2024-01-01T11:55:00Z",
  "failed_at": "2024-01-01T12:00:00Z",
  "retry_count": 3,
  "max_retries": 3,
  "error": "Connection timeout to host web-01",
  "error_stack": "...",
  "payload": {
    "host_id": "uuid",
    "command": "systemctl restart nginx",
    "timeout": 30
  },
  "retry_history": [
    {
      "attempt": 1,
      "failed_at": "2024-01-01T11:56:00Z",
      "error": "Connection timeout"
    }
  ]
}

DELETE /api/v1/queue/failed

Clear failed messages from the queue. The optional request body narrows the deletion by age (older_than) and queue (queue_name).

🔒 Authentication Required

Request Body (Optional)

{
  "older_than": "2024-01-01T00:00:00Z",
  "queue_name": "host_commands"
}

Response (200 OK)

{
  "message": "Failed messages cleared",
  "count": 25
}
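
A common use of the optional filters is to purge only stale failures. This sketch builds such a body for messages older than a given number of days; the helper name build_clear_body and the idea of a day-based cutoff are illustrative assumptions, while the field names and timestamp format follow the example above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_clear_body(days: int, queue_name: Optional[str] = None,
                     now: Optional[datetime] = None) -> dict:
    """Build a request body that targets failed messages older than `days` days."""
    now = now or datetime.now(timezone.utc)
    # Normalize to UTC so the literal "Z" suffix is accurate.
    cutoff = (now - timedelta(days=days)).astimezone(timezone.utc)
    body = {"older_than": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")}
    if queue_name:
        body["queue_name"] = queue_name
    return body


fixed_now = datetime(2024, 1, 8, tzinfo=timezone.utc)
print(build_clear_body(7, "host_commands", now=fixed_now))
# prints {'older_than': '2024-01-01T00:00:00Z', 'queue_name': 'host_commands'}
```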

Security Monitoring

GET /api/v1/security/default-credentials-status

Check whether the system is still using default credentials.

🔒 Authentication Required

Response (200 OK)

{
  "using_default_credentials": false,
  "last_password_change": "2024-01-01T10:00:00Z",
  "admin_users_count": 3,
  "users_with_default_passwords": [],
  "security_score": 95,
  "recommendations": [
    "Enable two-factor authentication",
    "Rotate API keys monthly"
  ]
}

Email Configuration Monitoring

GET /api/v1/email/config

Get email configuration status for monitoring alert delivery.

🔒 Authentication Required

Response (200 OK)

{
  "configured": true,
  "smtp_server": "smtp.example.com",
  "smtp_port": 587,
  "use_tls": true,
  "from_address": "alerts@example.com",
  "last_test": "2024-01-01T10:00:00Z",
  "test_status": "success"
}

POST /api/v1/email/test

Test email configuration by sending a test email.

🔒 Authentication Required

Request Body

{
  "to_address": "admin@example.com",
  "subject": "SysManage Email Test"
}

Response (200 OK)

{
  "success": true,
  "message": "Test email sent successfully",
  "sent_at": "2024-01-01T12:00:00Z"
}

System Health Checks

GET /api/v1/health

Get overall system health status.

Response (200 OK)

{
  "status": "healthy",
  "timestamp": "2024-01-01T12:00:00Z",
  "version": "1.0.0",
  "components": {
    "database": {
      "status": "healthy",
      "response_time_ms": 12
    },
    "websocket": {
      "status": "healthy",
      "active_connections": 25
    },
    "queue": {
      "status": "healthy",
      "pending_messages": 3,
      "failed_messages": 0
    },
    "agents": {
      "status": "degraded",
      "online": 148,
      "offline": 2,
      "total": 150
    }
  }
}
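
Note that in the example above the overall status is "healthy" even though the agents component is "degraded", so a monitoring client may want to inspect each component rather than rely on the top-level field alone. A minimal sketch of that check, with the helper name unhealthy_components being an assumption of this example:

```python
def unhealthy_components(health: dict) -> dict[str, str]:
    """Map each component whose status is not 'healthy' to its reported status."""
    return {
        name: info.get("status", "unknown")
        for name, info in health.get("components", {}).items()
        if info.get("status") != "healthy"
    }


# Trimmed copy of the example response above.
health = {
    "status": "healthy",
    "components": {
        "database": {"status": "healthy", "response_time_ms": 12},
        "websocket": {"status": "healthy", "active_connections": 25},
        "queue": {"status": "healthy", "pending_messages": 3, "failed_messages": 0},
        "agents": {"status": "degraded", "online": 148, "offline": 2, "total": 150},
    },
}
print(unhealthy_components(health))  # prints {'agents': 'degraded'}
```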

GET /api/v1/health/database

Get detailed database health information.

🔒 Authentication Required

Response (200 OK)

{
  "status": "healthy",
  "connection_pool": {
    "active": 5,
    "idle": 10,
    "max": 20
  },
  "query_performance": {
    "avg_response_time_ms": 8.5,
    "slow_queries": 0
  },
  "disk_usage": {
    "size": "1.2 GB",
    "growth_rate": "50 MB/day"
  },
  "last_backup": "2024-01-01T02:00:00Z"
}

Metrics and Analytics

GET /api/v1/metrics/summary

Get aggregated metrics summary across the infrastructure.

🔒 Authentication Required

Query Parameters

timeframe (string) - Time period (hour, day, week, month)
metrics (array) - Specific metrics to include

Response (200 OK)

{
  "timeframe": "24h",
  "summary": {
    "total_hosts": 150,
    "avg_cpu_usage": 23.5,
    "avg_memory_usage": 67.2,
    "avg_disk_usage": 45.8,
    "total_commands_executed": 1247,
    "total_packages_managed": 356,
    "alerts_generated": 12,
    "uptime_percentage": 99.8
  },
  "trends": {
    "cpu_usage": "stable",
    "memory_usage": "increasing",
    "disk_usage": "stable",
    "network_traffic": "decreasing"
  }
}

GET /api/v1/metrics/hosts/{host_id}/history

Get historical metrics data for a specific host.

🔒 Authentication Required

Query Parameters

start_time (string) - Start timestamp (ISO 8601)
end_time (string) - End timestamp (ISO 8601)
interval (string) - Data interval (5m, 1h, 1d)

Response (200 OK)

{
  "host_id": "uuid",
  "interval": "1h",
  "data_points": [
    {
      "timestamp": "2024-01-01T11:00:00Z",
      "cpu_usage": 25.3,
      "memory_usage": 65.1,
      "disk_usage": 45.8,
      "network_rx": 1024,
      "network_tx": 2048
    }
  ]
}
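
The query parameters and the data_points array can be handled with the standard library alone. The sketch below builds the request path and averages one metric over the returned points; the helper names and the second data point are illustrative assumptions, and the ISO 8601 timestamps are percent-encoded by urlencode.

```python
from statistics import mean
from urllib.parse import urlencode


def history_path(host_id: str, start_time: str, end_time: str, interval: str = "1h") -> str:
    """Build the path and query string for the host history endpoint."""
    query = urlencode({"start_time": start_time, "end_time": end_time, "interval": interval})
    return f"/api/v1/metrics/hosts/{host_id}/history?{query}"


def average_metric(history: dict, metric: str) -> float:
    """Average a single metric (for example 'cpu_usage') across all data points."""
    return mean(point[metric] for point in history["data_points"])


path = history_path("host-1", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z")

# First point mirrors the example response; the second is made up for illustration.
history = {
    "host_id": "host-1",
    "interval": "1h",
    "data_points": [
        {"timestamp": "2024-01-01T11:00:00Z", "cpu_usage": 25.3},
        {"timestamp": "2024-01-01T12:00:00Z", "cpu_usage": 30.7},
    ],
}
print(round(average_metric(history, "cpu_usage"), 1))  # prints 28.0
```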

Alert Management

GET /api/v1/alerts

Get a list of alerts, optionally filtered by severity, status, or host.

🔒 Authentication Required

Query Parameters

severity (string) - Filter by severity (info, warning, critical)
status (string) - Filter by status (active, acknowledged, resolved)
host_id (string) - Filter by specific host

Response (200 OK)

[
  {
    "alert_id": "uuid",
    "severity": "warning",
    "title": "High disk usage",
    "description": "Disk usage on /var is 85%",
    "host_id": "uuid",
    "hostname": "web-01",
    "triggered_at": "2024-01-01T12:00:00Z",
    "status": "active",
    "metric": "disk_usage",
    "threshold": 80,
    "current_value": 85
  }
]
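
Once alerts are fetched, a client may want to order them for triage. The following sketch keeps only active alerts and sorts them by severity, then by trigger time. The ranking of the three severities and the helper name triage are assumptions of this example; the second and third alerts in the sample are made up.

```python
# Assumed ordering: critical first, then warning, then info.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def triage(alerts: list[dict]) -> list[dict]:
    """Return active alerts, most severe first, oldest first within a severity."""
    active = [alert for alert in alerts if alert["status"] == "active"]
    return sorted(
        active,
        key=lambda alert: (
            # Unknown severities sort after the known ones.
            SEVERITY_RANK.get(alert["severity"], len(SEVERITY_RANK)),
            # Same-format UTC ISO 8601 strings sort chronologically as text.
            alert["triggered_at"],
        ),
    )


alerts = [
    {"alert_id": "a1", "severity": "warning", "status": "active",
     "triggered_at": "2024-01-01T12:00:00Z"},
    {"alert_id": "a2", "severity": "critical", "status": "active",
     "triggered_at": "2024-01-01T12:30:00Z"},
    {"alert_id": "a3", "severity": "info", "status": "resolved",
     "triggered_at": "2024-01-01T11:00:00Z"},
]
print([alert["alert_id"] for alert in triage(alerts)])  # prints ['a2', 'a1']
```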

POST /api/v1/alerts/{alert_id}/acknowledge

Acknowledge an alert.

🔒 Authentication Required

Request Body

{
  "message": "Investigating high disk usage",
  "acknowledged_by": "admin"
}

Response (200 OK)

{
  "alert_id": "uuid",
  "status": "acknowledged",
  "acknowledged_at": "2024-01-01T12:05:00Z",
  "acknowledged_by": "admin"
}

Important Notes

Diagnostic collections can be resource-intensive; schedule them during low-usage periods
Failed queue messages indicate potential system issues requiring attention
Regular health checks help identify issues before they become critical
Historical metrics data is retained for 90 days by default
Alert thresholds can be customized through configuration
Email configuration is required for alert notifications
