Observability Engine
Fleet-wide telemetry: OpenTelemetry collector deployment + lifecycle, Graylog log-forwarder sidecar attachment, and Grafana provisioning. Generates per-platform deployment plans (systemd / launchd / sysvinit / Windows services / FreeBSD rc.d) so a single server-side configuration rolls out consistently across mixed OS fleets.
Overview
The Observability Engine owns the deployment, configuration, and routing of three telemetry surfaces: the OpenTelemetry collector running on each agent (metrics + traces), the Graylog log-forwarder sidecar (syslog / file / journald shipping), and Grafana (dashboards + datasources auto-provisioned for SysManage's Postgres data). Per-host plans are generated from server-side templates so the agent never has to know how to configure these services.
Tier & Licensing
- Enterprise tier: full observability deployment.
- Community Edition: read-only telemetry status (is OTEL running? is Graylog attached?). No deploy / configure.
Telemetry Surfaces
OpenTelemetry Collector
A per-host OTEL collector that scrapes node-level metrics (CPU, memory, disk, network), forwards traces from instrumented apps, and ships them to a configurable backend. The engine generates platform-specific install bundles — binary + config + service unit — for Linux (systemd), FreeBSD (rc.d), OpenBSD (rc.d), NetBSD (rc.d), macOS (launchd), and Windows (Windows service).
Graylog Log Forwarder
A lightweight log-shipper sidecar that attaches to a Graylog input. Sources are configurable per host: syslog files, journald, application log files, or Windows Event Log. Hosts can be detached and re-attached without restarting the SysManage agent.
Grafana Provisioning
Connect a Grafana instance and the engine auto-provisions its Postgres datasource (read-only credentials), an admin folder, and a starter dashboard pack: host inventory, package compliance, vulnerability status, agent connectivity, and update / OS-upgrade activity. Custom dashboards added through Grafana's UI are not touched by re-provisioning.
Open Source vs Professional+
- Community Edition: agent reports OTEL service status and Graylog attach state via existing status messages so the host details page shows the current connectivity.
- Professional+: deploy, start, stop, restart, and uninstall the OTEL collector across the fleet from one server-side configuration.
- Professional+: attach hosts to a Graylog input with one click; configure source paths and TLS per host.
- Professional+: connect Grafana, auto-provision the SysManage datasource + dashboards.
OTEL Deploy Flow
- User selects target hosts (singly or via fleet selector).
- User picks the backend (e.g. OTLP HTTP endpoint, OTLP gRPC, Prometheus remote-write).
- Engine generates the per-platform deployment plan: download the otelcol binary, drop the rendered config.yml in /etc/otelcol/ (or platform equivalent), install the service unit, enable + start.
- Agent runs the plan via apply_deployment_plan; per-step result rolls up into deployment status.
- Status sweep runs every minute and refreshes the per-host OTEL state; the host details page renders it.
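The roll-up of per-step results into a single deployment status can be sketched as follows. This is a minimal illustration, not the engine's code: the step identifiers, the StepResult type, and the status strings are all hypothetical names chosen to mirror the flow above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical step identifiers mirroring the flow above; the engine's
# real step names are not given in this document.
DEPLOY_STEPS = (
    "download_binary",
    "write_config",
    "install_service_unit",
    "enable_and_start",
)


@dataclass
class StepResult:
    step: str
    ok: Optional[bool] = None  # None: the agent has not reported this step yet


def new_plan_results() -> List[StepResult]:
    """Create one pending result per step of the plan."""
    return [StepResult(step) for step in DEPLOY_STEPS]


def rollup_status(results: List[StepResult]) -> str:
    """Collapse per-step results into one deployment status (assumed labels)."""
    if any(r.ok is False for r in results):
        return "failed"
    if all(r.ok is True for r in results):
        return "succeeded"
    return "in_progress"
```

A failed step dominates the roll-up, so a deployment that fails while writing the config is reported as failed even if later steps never ran.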
Using the UI
From a host's Host Details page:
- Deploy OpenTelemetry action triggers the per-host install plan; Start / Restart / Stop OpenTelemetry Service drive lifecycle.
- Connect Host to Graylog opens the Graylog attach dialog; the engine plan installs and configures the log forwarder.
- Enable Grafana Integration on Settings configures the datasource + provisions dashboards.
API Endpoints
All endpoints live under the /api/v1/observability prefix. Auth: bearer JWT. Each engine-driven endpoint also requires the corresponding Pro+ feature flag to be present in the active license; without it the route returns HTTP 402.
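A client call against this prefix might be assembled as in the sketch below. The base URL, host id, and token are placeholders, and the helper names build_request and describe_status are hypothetical; only the prefix, the bearer-JWT scheme, and the 402 behaviour come from the text above.

```python
import json
import urllib.request

API_PREFIX = "/api/v1/observability"


def build_request(base_url, path, token, body=None):
    """Build (but do not send) a POST request with bearer-JWT auth."""
    data = json.dumps(body or {}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{API_PREFIX}{path}",
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def describe_status(code):
    """Interpret the HTTP status of an engine-driven endpoint."""
    if code == 402:
        return "active license lacks the required Pro+ feature flag"
    if 200 <= code < 300:
        return "accepted"
    return f"unexpected HTTP {code}"
```

Passing the request to urllib.request.urlopen would send it; a 402 surfaces there as urllib.error.HTTPError, whose code attribute can be fed to describe_status.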
OpenTelemetry collector
- POST /api/v1/observability/otel/{host_id}/status: queue a status-check plan against the host's OTEL collector.
- POST /api/v1/observability/otel/{host_id}/deploy: deploy / re-deploy the collector config (assumes the binary is already installed on the host).
- POST /api/v1/observability/otel/{host_id}/remove: stop, disable, and remove the OTEL config on a host.
- POST /api/v1/observability/otel/{host_id}/multiplatform-deploy: full install + configure + start for any of seven platforms (linux_apt, linux_dnf, freebsd, openbsd, netbsd, macos, windows). Body: {platform, grafana_url}. FreeBSD installs Grafana Alloy (River-DSL config) instead of otelcol-contrib; every other platform installs the upstream otelcol-contrib via its native package manager.
- POST /api/v1/observability/otel/{host_id}/{platform}/multiplatform-remove: symmetric: stop service, uninstall package, remove config dir (Windows skips the explicit config-dir cleanup because Chocolatey owns it).
- Service control: POST /hosts/{host_id}/opentelemetry/{start,stop,restart} are the OSS endpoints. They route through try_engine_otel_service_control in backend/services/observability_shim.py first; when the Pro+ engine is loaded each emits a build_otel_service_control_plan(platform, action) sequence (systemctl / service / rcctl / /etc/rc.d / brew services / sc.exe depending on platform). The engine plan goes through the agent's apply_deployment_plan path. Falls back to the legacy generic_command WS dispatch when the engine isn't loaded.
- Grafana connect / disconnect: POST /hosts/{host_id}/opentelemetry/{connect,disconnect}. Engine-routed via build_otel_grafana_connection_plan(platform, action, grafana_url) which emits a restart-only plan (the OTEL config was pinned to the target Grafana at deploy time; connect/disconnect just triggers a service restart so any out-of-band edits take effect and the audit trail records the operator's intent).
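The per-platform command selection behind the service-control endpoints might look roughly like the sketch below. It is not the engine's build_otel_service_control_plan: the function name, the pairing of each platform token with a tool, and the default service name are assumptions inferred from the tool list above (FreeBSD, for instance, would need the Alloy service name passed in).

```python
VALID_ACTIONS = ("start", "stop", "restart")


def build_service_control_commands(platform, action, service="otelcol-contrib"):
    """Return the shell commands for one lifecycle action (illustrative only).

    The platform-to-tool pairing and the default service name are assumptions.
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if platform in ("linux_apt", "linux_dnf"):
        return [f"systemctl {action} {service}"]
    if platform == "freebsd":
        return [f"service {service} {action}"]
    if platform == "openbsd":
        return [f"rcctl {action} {service}"]
    if platform == "netbsd":
        return [f"/etc/rc.d/{service} {action}"]
    if platform == "macos":
        return [f"brew services {action} {service}"]
    if platform == "windows":
        # sc.exe has no restart verb, so a restart is a stop followed by a start.
        # A real plan would also wait for the stop to complete before starting.
        if action == "restart":
            return [f"sc.exe stop {service}", f"sc.exe start {service}"]
        return [f"sc.exe {action} {service}"]
    raise ValueError(f"unsupported platform: {platform!r}")
```

Returning a list rather than a single string keeps the Windows restart case, which needs two commands, on the same code path as every other platform.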
Graylog deployment
- POST /api/v1/observability/graylog/{host_id}/deploy: deploy the Graylog Sidecar config (Linux + Windows).
- POST /api/v1/observability/graylog/{host_id}/status: queue a sidecar status check.
- POST /api/v1/observability/graylog/{host_id}/{platform}/remove: tear the sidecar down.
- POST /api/v1/observability/graylog/{host_id}/rsyslog-deploy: configure rsyslog to forward to Graylog. Body: {graylog_server, port, mechanism} where mechanism is syslog_tcp, syslog_udp, or gelf_tcp. Writes /etc/rsyslog.d/60-graylog.conf and restarts the daemon.
- POST /api/v1/observability/graylog/{host_id}/syslog-ng-deploy: configure syslog-ng with the same three forwarder mechanisms. Writes /etc/syslog-ng/conf.d/60-graylog.conf.
- POST /api/v1/observability/graylog/{host_id}/bsd-syslog-deploy: configure BSD stock syslogd. Body adds {bsd_variant, existing_config}: caller fetches the host's current /etc/syslog.conf and passes it as existing_config; the engine performs an idempotent merge that preserves unrelated forwarding rules. bsd_variant selects the restart command (service syslogd restart on FreeBSD, rcctl restart syslogd on OpenBSD/NetBSD). GELF is rejected at the validator (stock BSD syslogd can't speak it).
- POST /api/v1/observability/graylog/{host_id}/sidecar-install: Windows-only. Downloads the upstream graylog-sidecar-installer-{amd64|386}.exe from Graylog's GitHub releases and runs it silently (/S). Pair with the deploy endpoint above to drop sidecar.yml once the binary is present.
- Linux autodetect: build_graylog_linux_autodetect_plan stages both rsyslog and syslog-ng configs to /tmp/sysmanage-graylog-{rsyslog,syslog-ng}.conf and emits per-daemon conditional shell branches gated on systemctl is-active --quiet <daemon>. The active daemon gets its config copied into place and restarted; the other branch is a no-op. Removes the need for the OSS server to know which forwarder the host runs.
- BSD execute-time merge: build_graylog_bsd_syslog_append_plan does the /etc/syslog.conf merge at agent execute-time via shell (sed -i.bak strip + printf append + variant-specific restart). Sidesteps the need for an OSS-side file-fetch primitive: the BSD endpoint dispatches off host.platform alone, no existing_config required.
- Windows sidecar no-token install: build_graylog_sidecar_no_token_plan handles the OSS POST /host/{id}/attach_to_graylog path on Windows. The OSS payload only carries {mechanism, graylog_server, port} (no api_token / node_id), so the plan writes sidecar.yml with an empty server_api_token (operator fills it in via the Sidecar admin UI afterwards). PowerShell-driven: detects ARM64 hosts and refuses (upstream has no ARM64 build), checks for an existing install via Test-Path, downloads the AMD64 installer from GitHub if missing, runs /S, then registers + starts the Windows service.
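The forwarder request validation and the simplest rsyslog output can be sketched as below. The function names are hypothetical, and the sketch covers only the two plain-syslog mechanisms using rsyslog's classic selector syntax (@@ for TCP, @ for UDP); how the engine renders the gelf_tcp case is not described here, so it is left out rather than guessed.

```python
MECHANISMS = ("syslog_tcp", "syslog_udp", "gelf_tcp")

# Restart commands per BSD variant, as listed for the bsd-syslog-deploy endpoint.
BSD_RESTART = {
    "freebsd": "service syslogd restart",
    "openbsd": "rcctl restart syslogd",
    "netbsd": "rcctl restart syslogd",
}


def validate_forwarder(mechanism, port, bsd=False):
    """Reject unknown mechanisms, bad ports, and GELF on stock BSD syslogd."""
    if mechanism not in MECHANISMS:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    if not 1 <= int(port) <= 65535:
        raise ValueError(f"port out of range: {port!r}")
    if bsd and mechanism == "gelf_tcp":
        raise ValueError("stock BSD syslogd cannot forward GELF")


def rsyslog_forward_rule(graylog_server, port, mechanism):
    """Render a forward-everything rule for rsyslog (plain syslog only)."""
    validate_forwarder(mechanism, port)
    if mechanism == "syslog_tcp":
        return f"*.* @@{graylog_server}:{port}"
    if mechanism == "syslog_udp":
        return f"*.* @{graylog_server}:{port}"
    # GELF output needs a message template, which this sketch does not attempt.
    raise NotImplementedError("gelf_tcp rendering is not covered by this sketch")
```

For example, rsyslog_forward_rule("graylog.example.com", 1514, "syslog_tcp") yields the single line that would go into /etc/rsyslog.d/60-graylog.conf in this simplified model.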
Grafana / telemetry routing
- POST /api/v1/observability/grafana/{host_id}/provision: upload datasources + dashboards via the Grafana HTTP API on behalf of the server.
- POST /api/v1/observability/routing/{host_id}/apply: apply per-host telemetry routing rules; merges with the host's base OTEL pipelines.
Dispatch model
OSS observability endpoints (deploy, remove, start/stop/restart, Grafana connect/disconnect, Graylog attach) follow a consistent engine-first / legacy-fallback contract implemented in backend/services/observability_shim.py:
- Endpoint calls a try_engine_* helper.
- Helper checks the loaded engine has the expected plan-builder attribute (defensive against version mismatches).
- Helper resolves the platform token via _detect_otel_platform (for OTEL paths) or directly from host.platform (for Graylog).
- On any miss (engine not loaded, missing builder, undetectable platform, exception during plan build or enqueue) the helper returns None; the endpoint queues the legacy generic_command WS message exactly as before.
- Every endpoint records the chosen path (engine_plan vs legacy_ws_command) in its AuditService.log entry under details.dispatch_path, so operators can audit production traffic.
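The engine-first / legacy-fallback contract in the steps above can be sketched as follows. This is an illustration of the pattern, not the contents of observability_shim.py: the helper signatures, the list standing in for AuditService.log, and the shape of the returned plan are assumptions.

```python
def try_engine_plan(engine, builder_name, platform, *args):
    """Return an engine-built plan, or None on any miss."""
    if engine is None or platform is None:
        return None  # engine not loaded, or platform undetectable
    builder = getattr(engine, builder_name, None)
    if builder is None:
        return None  # engine version lacks the expected plan builder
    try:
        return builder(platform, *args)
    except Exception:
        return None  # a failing plan build also falls back


def dispatch(engine, builder_name, platform, action, audit_log):
    """Pick a dispatch path and record it under details.dispatch_path."""
    plan = try_engine_plan(engine, builder_name, platform, action)
    path = "engine_plan" if plan is not None else "legacy_ws_command"
    # audit_log is a plain list standing in for the real audit service.
    audit_log.append({"action": action, "details": {"dispatch_path": path}})
    return path, plan
```

Because every miss collapses to None, the endpoint needs only one branch to decide between queueing the engine plan and sending the legacy generic_command message.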
On Linux the platform-detection function consults two signals before falling back to a default: (1) the host's installed-package inventory (most authoritative — one apt/dnf/yum entry decides it); (2) the OS_INFO collection's platform_release string, which lands within ~1s of first connect (closes the fresh-host race for typical distros — Ubuntu, Debian, RHEL, Rocky, Fedora, etc. all match via substring keyword). When neither signal is available the function defaults to linux_apt with a WARNING log line; a dnf-family host hitting that default will see a clear apt-get: command not found error rather than a silent legacy-fallback to soon-to-be-deleted agent code.
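The two-signal detection with a logged default might be sketched like this. The function name, the argument shapes, and the keyword lists are illustrative assumptions; the real _detect_otel_platform may match differently.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative keyword lists; the document names the distros but not the
# exact substrings that are matched.
APT_KEYWORDS = ("ubuntu", "debian")
DNF_KEYWORDS = ("rhel", "red hat", "rocky", "fedora")


def detect_linux_platform(installed_packages, platform_release):
    """Resolve linux_apt vs linux_dnf from inventory, then the release string."""
    names = set(installed_packages or ())
    # Signal 1: the installed-package inventory is the most authoritative.
    if "apt" in names:
        return "linux_apt"
    if names & {"dnf", "yum"}:
        return "linux_dnf"
    # Signal 2: substring match on the OS_INFO platform_release string.
    release = (platform_release or "").lower()
    if any(keyword in release for keyword in APT_KEYWORDS):
        return "linux_apt"
    if any(keyword in release for keyword in DNF_KEYWORDS):
        return "linux_dnf"
    # Neither signal is available: default loudly rather than silently.
    logger.warning("platform undetected; defaulting to linux_apt")
    return "linux_apt"
```

Checking the inventory first means a host whose release string is ambiguous or missing is still classified correctly once its package list has been collected.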
Required Permissions
- Deploy OpenTelemetry, Start OpenTelemetry Service, Stop OpenTelemetry Service, Restart OpenTelemetry Service
- Connect Host to Graylog, Enable Graylog Integration
- Enable Grafana Integration
Troubleshooting
- A host can show OTEL as not running for up to ~60s after a successful deploy, because the engine's status sweep runs once a minute; to force a re-check, use the host details page's refresh action.
- Graylog over TLS requires the agent host's CA bundle to trust the Graylog certificate; the attach dialog accepts a custom CA file path which the engine deploys to the agent.
- Grafana provisioning uses the read-only Postgres credentials configured under observability.grafana.datasource in /etc/sysmanage.yaml; rotate via the Grafana datasource UI rather than re-running the connect flow.