⭐ PRO+

Observability Engine

Fleet-wide telemetry: OpenTelemetry collector deployment + lifecycle, Graylog log-forwarder sidecar attachment, and Grafana provisioning. Generates per-platform deployment plans (systemd / launchd / sysvinit / Windows services / FreeBSD rc.d) so a single server-side configuration rolls out consistently across mixed OS fleets.

Overview

The Observability Engine owns the deployment, configuration, and routing of three telemetry surfaces: the OpenTelemetry collector running on each agent (metrics + traces), the Graylog log-forwarder sidecar (syslog / file / journald shipping), and Grafana (dashboards + datasources auto-provisioned for SysManage's Postgres data). Per-host plans are generated from server-side templates so the agent never has to know how to configure these services.

Tier & Licensing

  • Enterprise tier: full observability deployment.
  • Community Edition: read-only telemetry status (is OTEL running? is Graylog attached?). No deploy / configure.

Telemetry Surfaces

OpenTelemetry Collector

A per-host OTEL collector that scrapes node-level metrics (CPU, memory, disk, network), forwards traces from instrumented apps, and ships them to a configurable backend. The engine generates platform-specific install bundles — binary + config + service unit — for Linux (systemd), FreeBSD (rc.d), OpenBSD (rc.d), NetBSD (rc.d), macOS (launchd), and Windows (Windows service).

Graylog Log Forwarder

A lightweight log-shipper sidecar that attaches to a Graylog input. Sources are configurable per host: syslog files, journald, application log files, or Windows Event Log. A host can be detached and re-attached without restarting the SysManage agent.

Grafana Provisioning

Connect a Grafana instance and the engine auto-provisions its Postgres datasource (read-only credentials), an admin folder, and a starter dashboard pack: host inventory, package compliance, vulnerability status, agent connectivity, and update / OS-upgrade activity. Custom dashboards added through Grafana's UI are not touched by re-provisioning.

Open Source vs Professional+

  • Community Edition: agent reports OTEL service status and Graylog attach state via existing status messages so the host details page shows the current connectivity.
  • Professional+: deploy, start, stop, restart, and uninstall the OTEL collector across the fleet from one server-side configuration.
  • Professional+: attach hosts to a Graylog input with one click; configure source paths and TLS per host.
  • Professional+: connect Grafana, auto-provision the SysManage datasource + dashboards.

OTEL Deploy Flow

  1. User selects target hosts (singly or via fleet selector).
  2. User picks the backend (e.g. OTLP HTTP endpoint, OTLP gRPC, Prometheus remote-write).
  3. Engine generates the per-platform deployment plan: download the otelcol binary, drop the rendered config.yml in /etc/otelcol/ (or platform equivalent), install the service unit, enable + start.
  4. Agent runs the plan via apply_deployment_plan; per-step result rolls up into deployment status.
  5. Status sweep runs every minute and refreshes the per-host OTEL state; the host details page renders it.
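
The flow above can be pictured as an ordered list of steps whose individual results collapse into one deployment status. The following is a minimal sketch under stated assumptions, not the engine's actual plan format: the PlanStep fields, the action names (fetch_file, write_file, service_enable, service_start), the unit-file path, and the status strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    name: str    # short label used when reporting the per-step result
    action: str  # hypothetical action kind the agent would understand
    target: str  # URL, file path, or service name the action applies to


def build_systemd_otel_plan(binary_url: str) -> list[PlanStep]:
    """Mirror step 3 of the deploy flow for a Linux (systemd) host."""
    return [
        PlanStep("download", "fetch_file", binary_url),
        PlanStep("config", "write_file", "/etc/otelcol/config.yml"),
        # Assumed unit path; the real bundle may place the unit elsewhere.
        PlanStep("unit", "write_file", "/etc/systemd/system/otelcol.service"),
        PlanStep("enable", "service_enable", "otelcol"),
        PlanStep("start", "service_start", "otelcol"),
    ]


def roll_up(plan: list[PlanStep], results: dict[str, bool]) -> str:
    """Collapse per-step results into a single deployment status."""
    for step in plan:
        if step.name not in results:
            return "in_progress"          # this step has not reported yet
        if not results[step.name]:
            return f"failed:{step.name}"  # first failing step decides the status
    return "succeeded"
```

Walking the steps in order means a failure is attributed to the earliest step that went wrong, which is usually the one an operator needs to look at first.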

Using the UI

From a host's Host Details page:

  • The Deploy OpenTelemetry action triggers the per-host install plan; the Start / Restart / Stop OpenTelemetry Service actions drive the service lifecycle.
  • Connect Host to Graylog opens the Graylog attach dialog; the engine plan installs and configures the log forwarder.
  • Enable Grafana Integration, under Settings, configures the datasource and provisions dashboards.

API Endpoints

All endpoints live under the /api/v1/observability prefix. Auth: bearer JWT. Each engine-driven endpoint also requires the corresponding Pro+ feature flag to be present in the active license; without it the route returns HTTP 402.
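
As an illustration of the prefix, bearer-token auth, and the 402 behaviour described above, here is a minimal client-side sketch using only the Python standard library. The base URL, token, host identifier, and the helper names build_request and explain_status are assumptions made for the example; only the prefix, the bearer scheme, and the meaning of HTTP 402 come from this page.

```python
import json
import urllib.request

API_PREFIX = "/api/v1/observability"


def build_request(base_url: str, path: str, token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for an observability endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{API_PREFIX}{path}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # bearer JWT
            "Content-Type": "application/json",
        },
    )


def explain_status(status: int) -> str:
    """Translate a response status into an operator-facing hint."""
    if status == 402:
        return "active license lacks the required Pro+ feature flag"
    if 200 <= status < 300:
        return "accepted"
    return f"unexpected status {status}"
```

A caller would pass the request to urllib.request.urlopen and feed the resulting status code (or the code carried by urllib.error.HTTPError) to explain_status.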

OpenTelemetry collector

  • POST /api/v1/observability/otel/{host_id}/status — queue a status-check plan against the host's OTEL collector.
  • POST /api/v1/observability/otel/{host_id}/deploy — deploy / re-deploy the collector config (assumes the binary is already installed on the host).
  • POST /api/v1/observability/otel/{host_id}/remove — stop, disable, and remove the OTEL config on a host.
  • POST /api/v1/observability/otel/{host_id}/multiplatform-deploy — full install + configure + start for any of seven platforms (linux_apt, linux_dnf, freebsd, openbsd, netbsd, macos, windows). Body: {platform, grafana_url}. FreeBSD installs Grafana Alloy (River-DSL config) instead of otelcol-contrib; every other platform installs the upstream otelcol-contrib via its native package manager.
  • POST /api/v1/observability/otel/{host_id}/{platform}/multiplatform-remove — symmetric: stop service, uninstall package, remove config dir (Windows skips the explicit config-dir cleanup because Chocolatey owns it).
  • Service control: POST /hosts/{host_id}/opentelemetry/{start,stop,restart} are the OSS endpoints. They route through try_engine_otel_service_control in backend/services/observability_shim.py first; when the Pro+ engine is loaded, each emits a build_otel_service_control_plan(platform, action) sequence (systemctl / service / rcctl / /etc/rc.d / brew services / sc.exe, depending on platform) that goes through the agent's apply_deployment_plan path. When the engine isn't loaded, they fall back to the legacy generic_command WS dispatch.
  • Grafana connect / disconnect: POST /hosts/{host_id}/opentelemetry/{connect,disconnect}. Engine-routed via build_otel_grafana_connection_plan(platform, action, grafana_url), which emits a restart-only plan: the OTEL config was pinned to the target Grafana at deploy time, so connect/disconnect just triggers a service restart so that any out-of-band edits take effect and the audit trail records the operator's intent.
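
To make the per-platform service-control sequence more concrete, the sketch below maps a platform token and an action to the commands such a plan might contain. It is an assumption-laden illustration, not the output of build_otel_service_control_plan: the pairing of each platform with a specific tool, the default service name otelcol-contrib, and the function name are all guesses based on the tools listed above.

```python
VALID_ACTIONS = ("start", "stop", "restart")


def service_control_commands(
    platform: str, action: str, service: str = "otelcol-contrib"
) -> list[list[str]]:
    """Return the argv lists an agent could run, in order, for one action."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    if platform in ("linux_apt", "linux_dnf"):
        return [["systemctl", action, service]]
    if platform == "freebsd":
        return [["service", service, action]]
    if platform == "openbsd":
        return [["rcctl", action, service]]
    if platform == "netbsd":
        return [[f"/etc/rc.d/{service}", action]]
    if platform == "macos":
        return [["brew", "services", action, service]]
    if platform == "windows":
        if action == "restart":
            # sc.exe has no restart verb, so restart becomes stop + start.
            # A real plan would likely wait for the stop to finish in between.
            return [["sc.exe", "stop", service], ["sc.exe", "start", service]]
        return [["sc.exe", action, service]]
    raise ValueError(f"unsupported platform: {platform}")
```

Returning argv lists rather than shell strings avoids quoting problems if a service name ever contains unusual characters.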

Graylog deployment

  • POST /api/v1/observability/graylog/{host_id}/deploy — deploy the Graylog Sidecar config (Linux + Windows).
  • POST /api/v1/observability/graylog/{host_id}/status — queue a sidecar status check.
  • POST /api/v1/observability/graylog/{host_id}/{platform}/remove — tear the sidecar down.
  • POST /api/v1/observability/graylog/{host_id}/rsyslog-deploy — configure rsyslog to forward to Graylog. Body: {graylog_server, port, mechanism} where mechanism is syslog_tcp, syslog_udp, or gelf_tcp. Writes /etc/rsyslog.d/60-graylog.conf and restarts the daemon.
  • POST /api/v1/observability/graylog/{host_id}/syslog-ng-deploy — configure syslog-ng with the same three forwarder mechanisms. Writes /etc/syslog-ng/conf.d/60-graylog.conf.
  • POST /api/v1/observability/graylog/{host_id}/bsd-syslog-deploy — configure BSD stock syslogd. Body adds {bsd_variant, existing_config}: caller fetches the host's current /etc/syslog.conf and passes it as existing_config; the engine performs an idempotent merge that preserves unrelated forwarding rules. bsd_variant selects the restart command (service syslogd restart on FreeBSD, rcctl restart syslogd on OpenBSD/NetBSD). GELF is rejected at the validator (stock BSD syslogd can't speak it).
  • POST /api/v1/observability/graylog/{host_id}/sidecar-install — Windows-only. Downloads the upstream graylog-sidecar-installer-{amd64|386}.exe from Graylog's GitHub releases and runs it silently (/S). Pair with the deploy endpoint above to drop sidecar.yml once the binary is present.
  • Linux autodetect: build_graylog_linux_autodetect_plan stages both rsyslog and syslog-ng configs to /tmp/sysmanage-graylog-{rsyslog,syslog-ng}.conf and emits per-daemon conditional shell branches gated on systemctl is-active --quiet <daemon>. The active daemon gets its config copied into place and is restarted; the other branch is a no-op. This removes the need for the OSS server to know which forwarder the host runs.
  • BSD execute-time merge: build_graylog_bsd_syslog_append_plan does the /etc/syslog.conf merge at agent execute-time via shell (sed -i.bak strip + printf append + variant-specific restart). This sidesteps the need for an OSS-side file-fetch primitive: the BSD endpoint dispatches off host.platform alone, with no existing_config required.
  • Windows sidecar no-token install: build_graylog_sidecar_no_token_plan handles the OSS POST /host/{id}/attach_to_graylog path on Windows. The OSS payload only carries {mechanism, graylog_server, port}, with no api_token or node_id, so the plan writes sidecar.yml with an empty server_api_token (the operator fills it in via the Sidecar admin UI afterwards). The plan is PowerShell-driven: it detects ARM64 hosts and refuses (upstream has no ARM64 build), checks for an existing install via Test-Path, downloads the AMD64 installer from GitHub if missing, runs it with /S, then registers and starts the Windows service.
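
The idempotent /etc/syslog.conf merge described for the BSD endpoints can be sketched as "strip the previously managed block, then append a fresh one". The version below does this in Python on the server side. The marker comments, function names, and the example forwarding rule are hypothetical; the engine's real markers and rule syntax are not documented here.

```python
BEGIN_MARK = "# BEGIN sysmanage-graylog"  # hypothetical marker comments
END_MARK = "# END sysmanage-graylog"


def validate_bsd_mechanism(mechanism: str) -> None:
    """Reject GELF up front: stock BSD syslogd cannot speak it."""
    if mechanism == "gelf_tcp":
        raise ValueError("gelf_tcp is not supported by stock BSD syslogd")
    if mechanism not in ("syslog_tcp", "syslog_udp"):
        raise ValueError(f"unknown mechanism: {mechanism}")


def merge_managed_block(existing_config: str, rules: list[str]) -> str:
    """Replace any earlier managed block with `rules`, keeping all other lines."""
    kept: list[str] = []
    inside_block = False
    for line in existing_config.splitlines():
        if line.strip() == BEGIN_MARK:
            inside_block = True
            continue
        if line.strip() == END_MARK:
            inside_block = False
            continue
        if not inside_block:
            kept.append(line)  # unrelated rules pass through untouched
    # Drop trailing blank lines so repeated merges do not grow the file.
    while kept and kept[-1] == "":
        kept.pop()
    return "\n".join(kept + [BEGIN_MARK, *rules, END_MARK]) + "\n"
```

Because the managed block is always removed before it is re-added, running the merge twice with the same rules yields the same file, and rules that other tools or operators wrote outside the markers are preserved.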

Grafana / telemetry routing

  • POST /api/v1/observability/grafana/{host_id}/provision — upload datasources + dashboards via the Grafana HTTP API on behalf of the server.
  • POST /api/v1/observability/routing/{host_id}/apply — apply per-host telemetry routing rules; merges with the host's base OTEL pipelines.

Dispatch model

OSS observability endpoints (deploy, remove, start/stop/restart, Grafana connect/disconnect, Graylog attach) follow a consistent engine-first / legacy-fallback contract implemented in backend/services/observability_shim.py:

  1. Endpoint calls a try_engine_* helper.
  2. Helper checks the loaded engine has the expected plan-builder attribute (defensive against version mismatches).
  3. Helper resolves the platform token via _detect_otel_platform (for OTEL paths) or directly from host.platform (for Graylog).
  4. On any miss (engine not loaded, missing builder, undetectable platform, exception during plan build or enqueue) the helper returns None; the endpoint queues the legacy generic_command WS message exactly as before.
  5. Every endpoint records the chosen path (engine_plan vs legacy_ws_command) in its AuditService.log entry under details.dispatch_path, so operators can audit production traffic.
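
The contract above can be sketched in a few lines. This is an illustrative reduction, not the code in observability_shim.py: try_engine_plan and dispatch_service_control are hypothetical names, the enqueue and send_legacy callables stand in for the real queueing and WebSocket layers, and the legacy message shape is assumed.

```python
from typing import Any, Callable, Optional


def try_engine_plan(
    engine: Any,
    builder_name: str,
    platform: Optional[str],
    enqueue: Callable[[Any], None],
    *builder_args: Any,
) -> Optional[Any]:
    """Return the enqueued plan, or None on any miss so the caller falls back."""
    if engine is None or platform is None:
        return None  # engine not loaded, or platform could not be detected
    builder = getattr(engine, builder_name, None)
    if builder is None:
        return None  # engine version lacks the expected plan builder
    try:
        plan = builder(platform, *builder_args)
        enqueue(plan)
        return plan
    except Exception:
        return None  # failure during plan build or enqueue


def dispatch_service_control(
    engine: Any,
    platform: Optional[str],
    action: str,
    enqueue: Callable[[Any], None],
    send_legacy: Callable[[dict], None],
) -> dict:
    """Engine first, legacy fallback; returns the audit details to log."""
    plan = try_engine_plan(
        engine, "build_otel_service_control_plan", platform, enqueue, action
    )
    if plan is not None:
        path = "engine_plan"
    else:
        # Assumed message shape for the legacy WS command.
        send_legacy({"command_type": "generic_command", "action": action})
        path = "legacy_ws_command"
    return {"dispatch_path": path}
```

Returning None for every kind of miss keeps the endpoint code to a single branch, at the cost of hiding the specific reason; a fuller version would probably log it before returning.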

On Linux the platform-detection function consults two signals before falling back to a default: (1) the host's installed-package inventory (most authoritative — one apt/dnf/yum entry decides it); (2) the OS_INFO collection's platform_release string, which lands within ~1s of first connect (closes the fresh-host race for typical distros — Ubuntu, Debian, RHEL, Rocky, Fedora, etc. all match via substring keyword). When neither signal is available the function defaults to linux_apt with a WARNING log line; a dnf-family host hitting that default will see a clear apt-get: command not found error rather than a silent legacy-fallback to soon-to-be-deleted agent code.
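
The two-signal lookup with a logged default might look roughly like the sketch below. It is not the actual _detect_otel_platform implementation: the function name, the shape of the inputs (a list of package-manager names taken from the inventory, plus the raw platform_release string), and the keyword tuples are assumptions drawn from the distributions named above.

```python
import logging
from typing import Iterable, Optional

log = logging.getLogger(__name__)

# Assumed keyword lists; the real function may match more distributions.
APT_KEYWORDS = ("ubuntu", "debian")
DNF_KEYWORDS = ("rhel", "rocky", "fedora")


def detect_linux_platform(
    package_managers: Iterable[str], platform_release: Optional[str]
) -> str:
    """Pick linux_apt or linux_dnf from inventory first, then release string."""
    # Signal 1: installed-package inventory; a single entry decides it.
    for manager in package_managers:
        name = manager.lower()
        if name == "apt":
            return "linux_apt"
        if name in ("dnf", "yum"):
            return "linux_dnf"
    # Signal 2: substring match on the OS_INFO platform_release string.
    release = (platform_release or "").lower()
    if any(keyword in release for keyword in APT_KEYWORDS):
        return "linux_apt"
    if any(keyword in release for keyword in DNF_KEYWORDS):
        return "linux_dnf"
    # Neither signal available: default loudly rather than fall back silently.
    log.warning("could not detect Linux package family; defaulting to linux_apt")
    return "linux_apt"
```

Checking the inventory before the release string reflects the ordering described above: the inventory is the more authoritative signal, and the release string only covers the window before the first inventory arrives.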

Required Permissions

  • Deploy OpenTelemetry, Start OpenTelemetry Service, Stop OpenTelemetry Service, Restart OpenTelemetry Service
  • Connect Host to Graylog, Enable Graylog Integration
  • Enable Grafana Integration

Troubleshooting

  • If the host shows OTEL as not running after a successful deploy, the status may simply be stale: the engine's status sweep runs once a minute, so the state can take up to ~60s to refresh. To skip the wait, force a re-check via the host details page's refresh action.
  • Graylog over TLS requires the agent host's CA bundle to trust the Graylog certificate; the attach dialog accepts a custom CA file path which the engine deploys to the agent.
  • Grafana provisioning uses the read-only Postgres credentials configured under observability.grafana.datasource in /etc/sysmanage.yaml; rotate via the Grafana datasource UI rather than re-running the connect flow.