Health Monitoring rev 04

From Open-E Wiki
Revision as of 08:27, 24 April 2026 by Ai-B (talk | contribs) (Lowercase generic "Storage Server"; rewrite Reboot behavior bullet to match current per-service 120s stability-window clearer (was "three consecutive healthy checks").)

Health Monitoring integrates the Checkmk monitoring stack into the storage server as a self-contained LXC container. It collects metrics from the local storage server out of the box and can be extended to monitor additional storage servers over the network. Notifications for state changes are delivered by email using the storage server's existing SMTP settings.

Note: Health Monitoring is delivered as an optional Small Update (the xc-checkmk module). This article describes revision 04. If your system was updated to a newer revision, refer to the matching Extension:Health_Monitoring_rev_NN article.

What's new in revision 04

Revision 04 is a major refresh of the container and changes behavior in several user-visible areas. Highlights:

  • Platform — container rebased on Debian 12 (bookworm). Checkmk version unchanged (2.1.0p49).
  • Login uses the storage server admin password — the admin user is authenticated via PAM against the same credentials you use on the storage server GUI. Changing the admin password on the storage server immediately applies here; no separate password management. The legacy cmkadmin user is preserved for backward compatibility.
  • Email uses the storage server's SMTP settings — notifications are sent through the same email gateway configured under System Settings → Administration → Email notifications. The To address is applied automatically via a placeholder (admin@localhost) that is rewritten at send time, so changes to the storage server's email settings take effect immediately without touching Checkmk.
  • Automatic service discovery — newly imported pools and their datasets appear in monitoring automatically (typically within 2 hours). Exported pools and removed plugins are cleaned up the same way. No manual Activate Changes click is required for discovered services.
  • Zero spurious notifications around reboots — Checkmk is placed in scheduled maintenance before the container stops, and the maintenance state is cleared at boot once services have settled. Pool-specific downtimes are cleared event-driven when each pool mounts.
  • No "flapping" emails — brief outages during reboots, pool imports, or network blips no longer generate flap notifications. Flap state is still visible in the GUI.

Accessing the Checkmk GUI

Once the storage server boots with the Health Monitoring Small Update installed, the Checkmk GUI is available at:

 https://<storage-server-ip>:4080/dssmonitor
  • Login: admin (same credentials as the storage server GUI)
  • Password: your storage server administrator password

The Checkmk web interface uses the same authentication as the storage server GUI — when you change your admin password on the storage server, it automatically applies to Checkmk as well. No separate password management is needed.
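
As a quick reachability check from a workstation, the GUI address can be assembled and probed with curl. This is a minimal sketch: checkmk_gui_url and checkmk_gui_status are hypothetical helper names, and the -k flag assumes the storage server presents a self-signed certificate. Only the port (4080) and the default site name (dssmonitor) are taken from this article.

```shell
#!/bin/sh
# Hypothetical helpers; only port 4080 and the default site name
# (dssmonitor) come from this article.

# Print the Checkmk GUI URL for a storage server and an optional site name.
checkmk_gui_url() {
    host=$1
    site=${2:-dssmonitor}
    printf 'https://%s:4080/%s\n' "$host" "$site"
}

# Print the HTTP status code returned by the GUI URL.
# -k skips certificate verification (assumes a self-signed certificate).
checkmk_gui_status() {
    curl -k -s -o /dev/null -w '%{http_code}\n' "$(checkmk_gui_url "$@")"
}
```

For example, checkmk_gui_status 192.0.2.10 prints the status code for the default site. curl reports 000 when no HTTP response was received, so any other code indicates that the web server in the container is answering.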

Note: a legacy cmkadmin user exists in the system for internal compatibility with Checkmk's management tooling (omd, cmk-passwd, automation). It has a random password and is not intended for interactive login. If you need to use it (e.g. for scripting), reset its password from the console: sudo su - dssmonitor -c "cmk-passwd cmkadmin".

Local storage server configuration

The local storage server is pre-configured in the Checkmk container as local-storage-server (in the storage-servers folder). It receives monitoring data from the local Checkmk agent.

Pre-configured monitoring rules

The default monitoring rules are located at:

 Setup → Agents → Other integrations → Individual program call instead of agent access

Two rules are pre-configured:

Rule 1: Local storage server (enabled)

  • Purpose: monitor the local storage server via a shared agent output file
  • Status: enabled by default
  • Explicit Hosts entry: ~local-storage-server — the tilde (~) makes the rule apply to every host name beginning with local-storage-server.
  • Customization: if you prefer a more descriptive host name (e.g. local-storage-server-220), create a host with the new name and remove the default local-storage-server entry. The new name must still begin with local-storage-server for this rule to apply.

Rule 2: Remote storage server via REST API (disabled)

  • Purpose: monitor additional storage servers via the REST API
  • Status: disabled by default
  • Explicit Hosts entry: ~storage-server
  • Default credentials:
 rest_api_user=admin; rest_api_pswd=admin; rest_api_port=82;

Prerequisites on each remote storage server:

  1. Enable REST API under System Settings → Administration → REST API access.
  2. Update the command-line credentials above if the remote password differs from admin.
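
When the remote storage server uses different credentials, the string in the rule keeps the same key=value; layout. The helper below is a small sketch that prints that string for given values. The function name rest_api_credentials is hypothetical, and how the rule treats passwords containing spaces or semicolons is not covered by this article, so such values are an open assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the credential string in the layout shown
# for Rule 2. Arguments: user, password, port (default 82, as in the rule).
rest_api_credentials() {
    printf 'rest_api_user=%s; rest_api_pswd=%s; rest_api_port=%s;\n' \
        "$1" "$2" "${3:-82}"
}
```

Running rest_api_credentials admin admin reproduces the default string shown above, which makes it easy to compare a customized value against the expected layout before pasting it into the rule.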

Step-by-step: add a remote storage server

  1. On the remote storage server, enable REST API access under System Settings → Administration → REST API access.
  2. In Checkmk, create a new host with a name starting with storage-server (e.g. storage-server-220). The REST API rule will apply automatically via the ~storage-server pattern.
  3. Update the credentials in the rule if the remote storage server uses different values.
  4. Run service discovery on the new host and verify services are discovered.
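
Before creating the host in step 2, it can help to confirm from the Checkmk console that the remote REST API port is reachable at all. The following is a sketch under assumptions: rest_api_port_open is a hypothetical helper, it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command being available in the container, and it only tests the TCP connection, not the credentials.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds
# within 3 seconds. Port 82 is the default from the pre-configured rule.
rest_api_port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-82}" 2>/dev/null
}

# Example call (replace the address with the remote storage server):
#   rest_api_port_open 192.0.2.20 && echo "port 82 reachable"
```

A non-zero return usually means REST API access is not enabled on the remote storage server or a firewall is blocking the port.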

Explicit hosts configuration

Rules with tilde-prefixed entries (e.g. ~storage-server, ~local-storage-server) apply to every monitored host whose name begins with the given string:

  • storage-server-220, storage-server-221, …
  • local-storage-server-220, local-storage-server-221, …
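
The prefix behavior described above can be illustrated with a few lines of shell. This is only a model of the matching rule as stated in this article (host name begins with the string after the tilde), not Checkmk's own matching code, and rule_applies is a hypothetical name.

```shell
#!/bin/sh
# Illustration only: return 0 if the host name begins with the string
# that follows the tilde in the Explicit Hosts entry.
rule_applies() {
    prefix=${1#"~"}          # drop the leading tilde
    case $2 in
        "$prefix"*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example calls:
#   rule_applies '~storage-server' storage-server-220        -> applies
#   rule_applies '~storage-server' local-storage-server-220  -> does not apply
```

The second example shows why the REST API rule does not pick up the local host: local-storage-server-220 contains storage-server but does not begin with it.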

Storage space and automatic cleanup

  • Allocated storage: approximately 4 GB is reserved for data collection.
  • Automatic cleanup: Checkmk removes older data when free space falls below 300 MB.

Console access

The Checkmk container provides a web-based terminal for administrative tasks such as migration, site management, and troubleshooting.

URL:

 https://<storage-server-ip>:4200/checkmk/

Login: enter admin as the username and your storage server administrator password.

After logging in, you are in a regular user shell. For most administrative tasks you need root access:

 sudo su -

From root, you can manage the Checkmk site directly:

 omd status                              # check site status
 omd stop dssmonitor                     # stop the site
 omd start dssmonitor                    # start the site
 su - dssmonitor -c "cmk -II local-storage-server && cmk -O"   # run discovery

Email notifications

Email notifications are automatically configured from the storage server's own email settings — no separate SMTP configuration is needed inside Checkmk.

How it works

  1. Configure email on the storage server: System Settings → Administration → Email notifications. Set the SMTP server, From address and To address.
  2. Once saved:
    • the From address is used as the sender for Checkmk notification emails;
    • the To address is used as the destination — Checkmk contacts are configured with a placeholder (admin@localhost) that the email gateway rewrites to your configured To address;
    • changes to either address take effect immediately; no Checkmk restart needed.
  3. Additional customization is available in the Checkmk GUI:
    • notification rules under Setup → Events → Notifications;
    • per-user email settings under Setup → Users.
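
The placeholder rewrite in step 2 can be pictured as a simple substitution on the outgoing message. The snippet below is an illustration only: the real email gateway is not shown in this article, and rewrite_placeholder and its to_addr argument are hypothetical.

```shell
#!/bin/sh
# Illustration only: replace the placeholder recipient with the configured
# To address in a message read from stdin. The caller supplies to_addr.
rewrite_placeholder() {
    to_addr=$1
    sed "s|admin@localhost|$to_addr|g"
}

# Example:
#   printf 'To: admin@localhost\n' | rewrite_placeholder ops@example.com
```

Because the substitution happens when the message is sent, the Checkmk contact can keep the placeholder address permanently, which is why changing the To address on the storage server needs no change inside Checkmk.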

Important notes

  • Email delivery uses the storage server's central SMTP proxy — the same gateway used by ownCloud, cron jobs, etc.
  • The default notification rule emails only for actual state changes (OK↔WARN↔CRIT↔UNKNOWN). Scheduled-downtime start/end and flapping events do not generate emails.
  • Changes you make in Setup → Events → Notifications (customising the default rule or adding your own rules) are preserved across reboots.
  • To send a test email, use Setup → Events → Notifications → Test notifications.

Automatic service discovery

Checkmk automatically manages the list of monitored services:

  • new services (e.g. from a newly imported pool) are added to monitoring within 2 hours;
  • vanished services (e.g. from an exported pool or a removed plugin) are cleaned up within 2 hours;
  • changes are activated automatically — no manual Activate Changes click needed for discovered services.

So, when you import a new ZFS pool, its capacity, health, snapshot age and compression services appear in Checkmk on their own. When you export a pool, the corresponding services are removed instead of lingering in UNKNOWN state. When a monitoring plugin is added or removed, services are updated accordingly.

Manual discovery

To avoid waiting for the automatic cycle, trigger discovery manually:

  • via the GUI: go to the host → Setup → Services → Full service scan
  • via the console: su - dssmonitor -c "cmk -II local-storage-server && cmk -O"
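
When several hosts need rediscovery (for example the local host plus remote storage servers), the console command can be wrapped in a small loop. This is a sketch that assumes it is run as the site user (after su - dssmonitor); rediscover is a hypothetical function name, while cmk -II and cmk -O are the same commands used above.

```shell
#!/bin/sh
# Hypothetical helper, to be run as the site user (e.g. after
# "su - dssmonitor"): rediscover services on each named host, then
# reload the monitoring core once at the end.
rediscover() {
    for h in "$@"; do
        cmk -II "$h" || return 1   # stop at the first host that fails
    done
    cmk -O
}

# Example:
#   rediscover local-storage-server storage-server-220
```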

Reboot behavior

The monitoring container is designed to produce zero spurious notification emails during system reboots:

  • At shutdown: Checkmk automatically enters scheduled maintenance mode before the container stops, so no "service unavailable" alerts fire while the system is going down. The maintenance state persists regardless of how long the system stays powered off (overnight, weekend, extended maintenance).
  • At boot: downtimes clear automatically as services resume reporting. Each service's downtime is released once that service has been reporting steadily (any state except UNKNOWN) for at least two minutes, so brief UNKNOWN flickers during early boot do not escape as notifications. Services that stay UNKNOWN — for example an unmounted pool or a stale monitoring entry — keep their downtime, so no notification flood can occur. Pool-specific downtimes are additionally cleared the moment each pool mounts.
  • Boot ordering: the container starts only after ZFS pools are imported and the monitoring agent has collected fresh data, so the first check cycle sees everything already in its normal state.
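
The two-minute rule for releasing a service's downtime at boot can be sketched as follows. This is not the container's actual code; it is a model of the rule as described above. The input format (one "epoch-seconds STATE" sample per line, oldest first) and the name downtime_releasable are assumptions made for the example.

```shell
#!/bin/sh
# Model of the stability window: return 0 if the most recent unbroken run
# of non-UNKNOWN samples spans at least the window (default 120 seconds).
# stdin: "<epoch-seconds> <STATE>" per line, oldest first (assumed format).
downtime_releasable() {
    awk -v window="${1:-120}" '
        $2 == "UNKNOWN" { in_streak = 0; next }            # streak broken
        !in_streak      { in_streak = 1; start = $1 }      # new streak begins
                        { last = $1 }                       # latest good sample
        END             { exit !(in_streak && last - start >= window) }
    '
}

# Example:
#   printf '0 OK\n60 WARN\n120 OK\n' | downtime_releasable && echo release
```

Any UNKNOWN sample resets the window, so a service that flickers during early boot has to report steadily for the full window again before its downtime is released, and a service that stays UNKNOWN never qualifies.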

If a pool fails to import after a reboot, its monitoring services stay in scheduled maintenance, so Checkmk sends no notification for them. The not-mounted state is visible directly in the storage server GUI, which is the authoritative source for this condition.

Renaming the site

The default monitoring site is named dssmonitor. If you prefer a different name (e.g. to match your host or organization):

  1. Open the Checkmk console at https://<ip>:4200/checkmk/
  2. Log in with admin and the storage server administrator password
  3. Run:
 sudo su -
 omd stop dssmonitor
 omd mv dssmonitor <newname>
 omd start <newname>

After renaming, the Checkmk GUI is accessible at https://<ip>:4080/<newname>.

Note: the automated email configuration and notification rule apply to the site that is active when setup_mailer runs. After a rename, the new site name is picked up automatically the next time you change email settings in the storage server GUI.

Migrating from a previous revision

If you upgraded from an earlier revision (e.g. rev 03), your Checkmk configuration (users, passwords, monitoring rules, views, dashboards) can be carried over to the new container.

The previous revision's data is stored on disk as an AUFS changes-layer image and remains available after the upgrade. The migration helper reads it, creates a standard Checkmk backup, and restores it into the current container using the built-in omd backup / omd restore mechanism.

Steps

  1. Open the Checkmk console at https://<ip>:4200/checkmk/
  2. Log in with admin and the storage server administrator password
  3. Run:
 sudo /usr/local/bin/migrate_from_prev_rev
  4. Wait for the script to complete. It will list each site it migrated.
  5. Log in to the Checkmk GUI. Your old users, rules, and views should be present.

What is migrated

  • All Checkmk sites (including renamed sites or additional user-created sites)
  • WATO configuration: hosts, rules, contacts, notification rules, tags, groups
  • Users and passwords (cmkadmin password is preserved; admin is added with PAM auth)
  • Per-user preferences: sidebar layout, saved views, dashboards
  • Discovered service definitions (autochecks)

What is NOT migrated

  • Historical metric data (RRD files) and monitoring logs
  • Temporary files and runtime state
  • Backup job definitions (these were stored in volatile storage)

Re-running

The migration script runs at most once. To re-run (e.g. after a factory reset), remove the marker file:

 sudo rm /var/lib/xc-checkmk/migration-done
 sudo /usr/local/bin/migrate_from_prev_rev
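
To see whether the migration has already run before deciding to remove the marker, the marker file can simply be tested for existence. A minimal sketch, assuming the marker path shown above; migration_state is a hypothetical helper name and the optional argument exists only so the check can be pointed at another path.

```shell
#!/bin/sh
# Hypothetical helper: report whether the migration marker file exists.
migration_state() {
    marker=${1:-/var/lib/xc-checkmk/migration-done}
    if [ -e "$marker" ]; then
        echo "done"
    else
        echo "pending"
    fi
}
```

If migration_state prints done, the migration script will not run again until the marker file is removed as shown above.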

Known issues

  • Two-Factor Authentication (2FA): if 2FA is enabled on the storage server (via the oe_2fa module), Checkmk login may not work correctly. The current authentication method (Apache basic auth) does not support the second-factor prompt. 2FA support for Checkmk is planned for a future revision.
  • Checkmk REST API: external access to the Checkmk REST API (e.g. for automation scripts using Bearer token authentication) is not supported in this revision. REST API support is planned for a future revision.

For further customization or troubleshooting, refer to the upstream Checkmk documentation or contact Open-E support.