Monitoring API
System monitoring, metrics collection, and alerting through programmatic interfaces.
Overview
The Monitoring API provides comprehensive system monitoring capabilities including real-time metrics collection, health checks, alert management, and historical data analysis across your entire infrastructure.
Diagnostics Collection
/api/v1/host/{host_id}/collect-diagnostics
Initiate comprehensive diagnostic data collection for a specific host.
Path Parameters
host_id
(string) - UUID of the target host
Request Body
{
"include_logs": true,
"include_processes": true,
"include_network": true,
"include_storage": true,
"log_lines": 1000,
"custom_commands": [
"systemctl status",
"df -h",
"free -m"
]
}
Response (200 OK)
{
"diagnostic_id": "uuid",
"status": "initiated",
"estimated_completion": "2024-01-01T12:05:00Z",
"collection_size_estimate": "2.5 MB"
}
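The request above could be issued from Python roughly as follows. This is a minimal sketch under assumptions: the base URL, the helper name build_collect_request, and the use of POST are not stated in this reference, and authentication is omitted.

```python
import json
import urllib.request

BASE_URL = "https://sysmanage.example.com"  # hypothetical server address


def build_collect_request(host_id, log_lines=1000, custom_commands=None):
    """Build (but do not send) a diagnostics collection request for one host."""
    body = {
        "include_logs": True,
        "include_processes": True,
        "include_network": True,
        "include_storage": True,
        "log_lines": log_lines,
        "custom_commands": custom_commands or [],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/host/{host_id}/collect-diagnostics",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed: the reference does not state the HTTP method
    )


request = build_collect_request("host-uuid", custom_commands=["df -h", "free -m"])
# Sending it would be urllib.request.urlopen(request), plus whatever
# authentication your deployment requires.
```

Building the request separately from sending it makes the payload easy to inspect before anything reaches a host.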
/api/v1/host/{host_id}/diagnostics
Get a list of diagnostic collections for a specific host.
Response (200 OK)
[
{
"diagnostic_id": "uuid",
"collected_at": "2024-01-01T12:00:00Z",
"status": "completed",
"size": "2.3 MB",
"components": [
"system_info",
"processes",
"network",
"storage",
"logs"
]
}
]
/api/v1/diagnostic/{diagnostic_id}
Get detailed diagnostic data for a specific collection.
Response (200 OK)
{
"diagnostic_id": "uuid",
"host_id": "uuid",
"collected_at": "2024-01-01T12:00:00Z",
"status": "completed",
"data": {
"system_info": {
"hostname": "web-01",
"uptime": "15 days, 3:22:45",
"load_average": [1.2, 1.5, 1.8],
"cpu_cores": 4,
"memory_total": "8 GB",
"disk_usage": [
{
"mount": "/",
"used": "45%",
"available": "55%"
}
]
},
"processes": [
{
"pid": 1234,
"name": "nginx",
"cpu_percent": 2.3,
"memory_percent": 1.8,
"status": "running"
}
],
"network": {
"interfaces": [
{
"name": "eth0",
"ip": "192.168.1.100",
"status": "up",
"rx_bytes": 1234567890,
"tx_bytes": 987654321
}
]
}
}
}
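Once a collection completes, its data section can be scanned for problems. The sketch below flags mount points above a usage threshold. It assumes "used" is always a percentage string such as "45%", as in the sample response; the function name, the 80% default, and the "/var" entry in the sample data are made up for illustration.

```python
def mounts_over_threshold(diagnostic, threshold_percent=80.0):
    """Return mount points whose "used" percentage exceeds the threshold."""
    disks = diagnostic.get("data", {}).get("system_info", {}).get("disk_usage", [])
    return [
        entry["mount"]
        for entry in disks
        if float(entry["used"].rstrip("%")) > threshold_percent
    ]


# Illustrative payload: the "/" entry mirrors the sample response,
# the "/var" entry is invented to show a hit.
sample = {
    "data": {
        "system_info": {
            "disk_usage": [
                {"mount": "/", "used": "45%", "available": "55%"},
                {"mount": "/var", "used": "91%", "available": "9%"},
            ]
        }
    }
}
print(mounts_over_threshold(sample))  # prints ['/var']
```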
/api/v1/diagnostic/{diagnostic_id}/status
Get the status of a diagnostic collection operation.
Response (200 OK)
{
"diagnostic_id": "uuid",
"status": "in_progress",
"progress": 65,
"current_step": "collecting_logs",
"estimated_completion": "2024-01-01T12:03:00Z",
"error": null
}
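Because collection runs asynchronously, a client will usually poll this endpoint until the work finishes. The following is a sketch of one way to do that. It assumes "initiated" and "in_progress" are the only non-terminal states (the reference does not list all status values), and it takes the HTTP call as a caller-supplied function so that transport and authentication stay out of the example.

```python
import time


def wait_for_diagnostic(fetch_status, diagnostic_id, poll_seconds=5.0, max_polls=60):
    """Poll until the collection leaves the "initiated"/"in_progress" states.

    fetch_status is any callable that takes a diagnostic_id and returns the
    decoded JSON body of the status endpoint as a dict.
    """
    for _ in range(max_polls):
        status = fetch_status(diagnostic_id)
        if status.get("error"):
            raise RuntimeError(f"Diagnostic {diagnostic_id} failed: {status['error']}")
        if status["status"] not in ("initiated", "in_progress"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Diagnostic {diagnostic_id} did not finish in time")
```

The estimated_completion field could be used to choose a smarter polling interval; a fixed delay keeps the sketch short.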
/api/v1/diagnostic/{diagnostic_id}
Delete a diagnostic collection and its data.
Response (200 OK)
{
"message": "Diagnostic collection deleted successfully",
"diagnostic_id": "uuid"
}
Queue Monitoring
/api/v1/queue/failed
Get a list of failed queue messages for monitoring and debugging.
Query Parameters
limit
(integer, optional) - Maximum results (default: 50)
offset
(integer, optional) - Results offset (default: 0)
Response (200 OK)
[
{
"message_id": "uuid",
"queue_name": "host_commands",
"message_type": "command_execution",
"failed_at": "2024-01-01T12:00:00Z",
"retry_count": 3,
"error": "Connection timeout to host",
"payload": {
"host_id": "uuid",
"command": "systemctl restart nginx"
}
}
]
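The limit and offset parameters allow a client to walk the full list page by page. A minimal sketch follows, assuming that a page shorter than limit marks the end of the results (the reference does not say how the last page is signalled). The function names are illustrative, and the HTTP call is again supplied by the caller.

```python
from urllib.parse import urlencode


def failed_messages_url(limit=50, offset=0):
    """Return the path and query string for one page of failed messages."""
    return "/api/v1/queue/failed?" + urlencode({"limit": limit, "offset": offset})


def iter_failed_messages(fetch_page, limit=50):
    """Yield every failed message, requesting pages until a short page arrives.

    fetch_page is any callable that takes a URL path and returns the decoded
    JSON list for that page.
    """
    offset = 0
    while True:
        page = fetch_page(failed_messages_url(limit, offset))
        yield from page
        if len(page) < limit:
            break
        offset += limit
```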
/api/v1/queue/failed/{message_id}
Get detailed information about a specific failed message.
Response (200 OK)
{
"message_id": "uuid",
"queue_name": "host_commands",
"message_type": "command_execution",
"created_at": "2024-01-01T11:55:00Z",
"failed_at": "2024-01-01T12:00:00Z",
"retry_count": 3,
"max_retries": 3,
"error": "Connection timeout to host web-01",
"error_stack": "...",
"payload": {
"host_id": "uuid",
"command": "systemctl restart nginx",
"timeout": 30
},
"retry_history": [
{
"attempt": 1,
"failed_at": "2024-01-01T11:56:00Z",
"error": "Connection timeout"
}
]
}
/api/v1/queue/failed
Clear all failed messages from the queue.
Request Body (Optional)
{
"older_than": "2024-01-01T00:00:00Z",
"queue_name": "host_commands"
}
Response (200 OK)
{
"message": "Failed messages cleared",
"count": 25
}
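Clearing is destructive, so it is usually safer to send a body that limits what gets removed. The sketch below builds such a body from a retention period. It assumes older_than means "clear only messages that failed before this timestamp", which is inferred from the field name rather than stated here; the helper name is illustrative.

```python
import json
from datetime import datetime, timedelta, timezone


def clear_failed_body(now, max_age_days, queue_name=None):
    """Return a JSON body that targets messages older than max_age_days.

    now should be a timezone-aware datetime; it is converted to UTC.
    """
    cutoff = now.astimezone(timezone.utc) - timedelta(days=max_age_days)
    body = {"older_than": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")}
    if queue_name is not None:
        body["queue_name"] = queue_name
    return json.dumps(body)


print(clear_failed_body(datetime(2024, 1, 8, tzinfo=timezone.utc), 7, "host_commands"))
```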
Security Monitoring
/api/v1/security/default-credentials-status
Check whether the system is still using default credentials.
Response (200 OK)
{
"using_default_credentials": false,
"last_password_change": "2024-01-01T10:00:00Z",
"admin_users_count": 3,
"users_with_default_passwords": [],
"security_score": 95,
"recommendations": [
"Enable two-factor authentication",
"Rotate API keys monthly"
]
}
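A monitoring job might turn this response into a list of findings to report. The sketch below is one way to do so. The sample shows users_with_default_passwords only as an empty list, so treating its entries as user names is an assumption, as is the function name.

```python
def credential_findings(status):
    """Summarize a default-credentials status payload as a list of findings."""
    findings = []
    if status.get("using_default_credentials"):
        findings.append("system is using default credentials")
    for user in status.get("users_with_default_passwords", []):
        findings.append(f"user {user!r} still has a default password")
    return findings
```

An empty list means nothing needs attention; the recommendations field can be surfaced separately since it is advisory.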
Email Configuration Monitoring
/api/v1/email/config
Get email configuration status for monitoring alert delivery.
Response (200 OK)
{
"configured": true,
"smtp_server": "smtp.example.com",
"smtp_port": 587,
"use_tls": true,
"from_address": "alerts@example.com",
"last_test": "2024-01-01T10:00:00Z",
"test_status": "success"
}
/api/v1/email/test
Test email configuration by sending a test email.
Request Body
{
"to_address": "admin@example.com",
"subject": "SysManage Email Test"
}
Response (200 OK)
{
"success": true,
"message": "Test email sent successfully",
"sent_at": "2024-01-01T12:00:00Z"
}
System Health Checks
/api/v1/health
Get overall system health status.
Response (200 OK)
{
"status": "healthy",
"timestamp": "2024-01-01T12:00:00Z",
"version": "1.0.0",
"components": {
"database": {
"status": "healthy",
"response_time_ms": 12
},
"websocket": {
"status": "healthy",
"active_connections": 25
},
"queue": {
"status": "healthy",
"pending_messages": 3,
"failed_messages": 0
},
"agents": {
"status": "degraded",
"online": 148,
"offline": 2,
"total": 150
}
}
}
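In the sample above the top-level status is "healthy" even though the agents component reports "degraded", so a client may want to inspect each component rather than rely on the overall status alone. A small sketch, with an illustrative function name:

```python
def unhealthy_components(health):
    """Return {component_name: status} for every component not reporting "healthy"."""
    return {
        name: info.get("status", "unknown")
        for name, info in health.get("components", {}).items()
        if info.get("status", "unknown") != "healthy"
    }


sample = {
    "status": "healthy",
    "components": {
        "database": {"status": "healthy", "response_time_ms": 12},
        "agents": {"status": "degraded", "online": 148, "offline": 2, "total": 150},
    },
}
print(unhealthy_components(sample))  # prints {'agents': 'degraded'}
```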
/api/v1/health/database
Get detailed database health information.
Response (200 OK)
{
"status": "healthy",
"connection_pool": {
"active": 5,
"idle": 10,
"max": 20
},
"query_performance": {
"avg_response_time_ms": 8.5,
"slow_queries": 0
},
"disk_usage": {
"size": "1.2 GB",
"growth_rate": "50 MB/day"
},
"last_backup": "2024-01-01T02:00:00Z"
}
Metrics and Analytics
/api/v1/metrics/summary
Get aggregated metrics summary across the infrastructure.
Query Parameters
timeframe
(string) - Time period (hour, day, week, month)
metrics
(array) - Specific metrics to include
Response (200 OK)
{
"timeframe": "24h",
"summary": {
"total_hosts": 150,
"avg_cpu_usage": 23.5,
"avg_memory_usage": 67.2,
"avg_disk_usage": 45.8,
"total_commands_executed": 1247,
"total_packages_managed": 356,
"alerts_generated": 12,
"uptime_percentage": 99.8
},
"trends": {
"cpu_usage": "stable",
"memory_usage": "increasing",
"disk_usage": "stable",
"network_traffic": "decreasing"
}
}
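The trends object lends itself to a quick "what is changing" check. The sketch below keeps only metrics whose trend is not "stable"; it assumes "stable" is the only value that means no change, based on the sample, and the function name is illustrative.

```python
def changing_trends(metrics_summary):
    """Return {metric: direction} for every trend that is not "stable"."""
    return {
        metric: direction
        for metric, direction in metrics_summary.get("trends", {}).items()
        if direction != "stable"
    }


sample = {
    "trends": {
        "cpu_usage": "stable",
        "memory_usage": "increasing",
        "disk_usage": "stable",
        "network_traffic": "decreasing",
    }
}
print(changing_trends(sample))
```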
/api/v1/metrics/hosts/{host_id}/history
Get historical metrics data for a specific host.
Query Parameters
start_time
(string) - Start timestamp (ISO 8601)
end_time
(string) - End timestamp (ISO 8601)
interval
(string) - Data interval (5m, 1h, 1d)
Response (200 OK)
{
"host_id": "uuid",
"interval": "1h",
"data_points": [
{
"timestamp": "2024-01-01T11:00:00Z",
"cpu_usage": 25.3,
"memory_usage": 65.1,
"disk_usage": 45.8,
"network_rx": 1024,
"network_tx": 2048
}
]
}
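As a sketch of how a client might query and then reduce this data, the helpers below build the history URL with properly encoded timestamps and average one metric across the returned data points. The helper names are illustrative and not part of the API.

```python
from statistics import fmean
from urllib.parse import urlencode


def history_url(host_id, start_time, end_time, interval="1h"):
    """Return the path and query string for a host's metrics history."""
    query = urlencode(
        {"start_time": start_time, "end_time": end_time, "interval": interval}
    )
    return f"/api/v1/metrics/hosts/{host_id}/history?{query}"


def average_metric(history, metric):
    """Average one metric across all data points, or None if there are none."""
    values = [
        point[metric] for point in history.get("data_points", []) if metric in point
    ]
    return fmean(values) if values else None
```

urlencode percent-encodes the colons in ISO 8601 timestamps, which avoids ambiguity in the query string.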
Alert Management
/api/v1/alerts
Get a list of alerts, optionally filtered by severity, status, or host.
Query Parameters
severity
(string) - Filter by severity (info, warning, critical)
status
(string) - Filter by status (active, acknowledged, resolved)
host_id
(string) - Filter by specific host
Response (200 OK)
[
{
"alert_id": "uuid",
"severity": "warning",
"title": "High disk usage",
"description": "Disk usage on /var is 85%",
"host_id": "uuid",
"hostname": "web-01",
"triggered_at": "2024-01-01T12:00:00Z",
"status": "active",
"metric": "disk_usage",
"threshold": 80,
"current_value": 85
}
]
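When many alerts come back at once, a client may want to order them for triage. The sketch below sorts by severity and then by how far current_value exceeds threshold. The ranking and the function name are illustrative choices, and the sketch assumes both fields are numeric, as in the sample.

```python
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def triage_order(alerts):
    """Sort alerts by severity, then by how far the value exceeds its threshold."""
    return sorted(
        alerts,
        key=lambda alert: (
            # Unknown severities sort after all known ones.
            SEVERITY_RANK.get(alert["severity"], len(SEVERITY_RANK)),
            # Larger overshoot first within the same severity.
            -(alert["current_value"] - alert["threshold"]),
        ),
    )
```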
/api/v1/alerts/{alert_id}/acknowledge
Acknowledge an alert.
Request Body
{
"message": "Investigating high disk usage",
"acknowledged_by": "admin"
}
Response (200 OK)
{
"alert_id": "uuid",
"status": "acknowledged",
"acknowledged_at": "2024-01-01T12:05:00Z",
"acknowledged_by": "admin"
}
Important Notes
- Diagnostic collections can be resource-intensive, so schedule them during low-usage periods
- Failed queue messages indicate potential system issues requiring attention
- Regular health checks help identify issues before they become critical
- Historical metrics data is retained for 90 days by default
- Alert thresholds can be customized through configuration
- Email configuration is required for alert notifications