Health Monitoring rev 04

Ai-B (talk | contribs)
Add Known Issues section (2FA + REST API deferred to future rev)
Ai-B (talk | contribs)
Full page update: simplified to 2 rules (removed SSH CLI), expanded console section, added known issues, Checkmk branding

Revision as of 15:59, 16 April 2026

Health Monitoring integrates the Checkmk monitoring stack into JovianDSS as a self-contained LXC container. It collects metrics from the local Storage Server out-of-the-box and can be extended to monitor additional Storage Servers over the network. Notifications for state changes are delivered by email using the Storage Server's existing SMTP settings.

Note: Health Monitoring is delivered as an optional Small Update (the xc-checkmk module). This article describes revision 04. If your system was updated to a newer revision, refer to the matching Extension:Health_Monitoring_rev_NN article.

What's new in revision 04

Revision 04 is a major refresh of the container and a behavior change for several end-user-visible areas. Highlights:

  • Platform — container rebased on Debian 12 (bookworm); Checkmk updated to 2.1.0p49.
  • Login uses the Storage Server admin password — the admin user is authenticated via PAM against the same credentials you use on the Storage Server GUI. Changing the admin password on the Storage Server immediately applies here; no separate password management. The legacy cmkadmin user is preserved for backward compatibility.
  • Email uses the Storage Server's SMTP settings — notifications are sent through the same email gateway configured under System Settings → Administration → Email notifications. The To address is applied automatically via a placeholder (admin@localhost) that is rewritten at send time, so changes to the Storage Server's email settings take effect immediately without touching Checkmk.
  • Automatic service discovery — newly imported pools and their datasets appear in monitoring automatically (typically within 2 hours). Exported pools and removed plugins are cleaned up the same way. No manual Activate Changes click is required for discovered services.
  • Zero spurious notifications around reboots — Checkmk is placed in scheduled maintenance before the container stops, and the maintenance state is cleared at boot once services have settled. Pool-specific downtimes are cleared event-driven when each pool mounts, so a pool that fails to import still produces a legitimate alert once the downtime window expires.
  • No "flapping" emails — brief outages during reboots, pool imports, or network blips no longer generate flap notifications. Flap state is still visible in the GUI.

Accessing the Checkmk GUI

Once the Storage Server boots with the Health Monitoring Small Update installed, the Checkmk GUI is available at:

 https://<storage-server-ip>:4080/dssmonitor
  • Username: admin (the same account as on the Storage Server GUI)
  • Password: your Storage Server administrator password

The Checkmk web interface uses the same authentication as the Storage Server GUI — when you change your admin password on the Storage Server, it automatically applies to Checkmk as well. No separate password management is needed.

Note: a legacy cmkadmin user exists in the system for internal compatibility with Checkmk's management tooling (omd, cmk-passwd, automation). It has a random password and is not intended for interactive login. If you need to use it (e.g. for scripting), reset its password from the console: sudo su - dssmonitor -c "cmk-passwd cmkadmin".
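
The URL scheme above can be captured in a small helper, which is handy once a site has been renamed. This is a minimal sketch: the function name checkmk_gui_url and its parameters are hypothetical, and only the port (4080) and default site name (dssmonitor) come from this article.

```shell
#!/bin/sh
# Build the Checkmk GUI URL for a Storage Server.
# $1 = Storage Server IP or host name
# $2 = site name (optional; defaults to dssmonitor as described above)
# The function name and argument order are illustrative, not part of the product.
checkmk_gui_url() {
    host="$1"
    site="${2:-dssmonitor}"
    printf 'https://%s:4080/%s\n' "$host" "$site"
}
```

A quick reachability probe could then look like curl -k -s -o /dev/null -w '%{http_code}\n' -u admin "$(checkmk_gui_url 192.0.2.10)". The -k flag assumes a self-signed certificate; drop it if the server presents a trusted one.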

Local Storage Server configuration

The local Storage Server is pre-configured in the Checkmk container as local-storage-server (in the storage-servers folder). It receives monitoring data from the local Checkmk agent.

Pre-configured monitoring rules

The default monitoring rules are located at:

 Setup → Agents → Other integrations → Individual program call instead of agent access

Two rules are pre-configured:

Rule 1: Local Storage Server (Enabled)

  • Purpose: monitor the local Storage Server via a shared agent output file
  • Status: enabled by default
  • Explicit Hosts entry: ~local-storage-server — the tilde (~) makes the rule apply to every host name beginning with local-storage-server.
  • Customization: if you prefer a more descriptive host name (e.g. local-storage-server-220), create the new hostname and remove the default local-storage-server entry.

Rule 2: Remote Storage Server via REST API (Disabled)

  • Purpose: monitor additional Storage Servers via the REST API
  • Status: disabled by default
  • Explicit Hosts entry: ~storage-server
  • Default credentials:
 rest_api_user=admin; rest_api_pswd=admin; rest_api_port=82;

Prerequisites on each remote Storage Server:

  1. Enable REST API under System Settings → Administration → REST API access.
  2. Update the command-line credentials above if the remote password differs from admin.
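
When several remote servers use different credentials, it may help to generate the credential string in the same format as the default shown above instead of typing it by hand. The sketch below only formats text; the function name build_rule_credentials is hypothetical, and how special characters in a password are handled by the rule is not covered here.

```shell
#!/bin/sh
# Format REST API credentials in the "key=value; " layout used by the default rule.
# $1 = user, $2 = password, $3 = port (defaults mirror the values in this article).
# Illustrative helper only: it prints a string and changes nothing in Checkmk.
build_rule_credentials() {
    user="${1:-admin}"
    pswd="${2:-admin}"
    port="${3:-82}"
    printf 'rest_api_user=%s; rest_api_pswd=%s; rest_api_port=%s;\n' \
        "$user" "$pswd" "$port"
}
```

The printed line can then be pasted into the rule's command line in place of the default credentials.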

Step-by-step: add a remote Storage Server

  1. On the remote Storage Server, enable REST API access under System Settings → Administration → REST API access.
  2. In Checkmk, create a new host with a name starting with storage-server (e.g. storage-server-220). The REST API rule will apply automatically via the ~storage-server pattern.
  3. Update the credentials in the rule if the remote Storage Server uses different values.
  4. Run service discovery on the new host and verify services are discovered.

Explicit hosts configuration

Rules with tilde-prefixed entries (e.g. ~storage-server, ~local-storage-server) apply to every monitored host whose name begins with the given string:

  • storage-server-220, storage-server-221, …
  • local-storage-server-220, local-storage-server-221, …
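
The prefix behavior can be illustrated with a short shell function. This is a simplified model that treats the text after the tilde as a literal prefix, as described above; it is not Checkmk's own matching code, and the name rule_matches is hypothetical.

```shell
#!/bin/sh
# Return 0 if an Explicit Hosts entry applies to a host name, 1 otherwise.
# $1 = entry (e.g. ~storage-server), $2 = host name
# Simplified model: "~text" matches any host name beginning with "text";
# an entry without a tilde must equal the host name exactly.
rule_matches() {
    entry="$1"
    host="$2"
    case "$entry" in
        "~"*)
            prefix="${entry#"~"}"
            case "$host" in
                "$prefix"*) return 0 ;;
                *)          return 1 ;;
            esac
            ;;
        *)
            [ "$entry" = "$host" ]
            ;;
    esac
}
```

Under this model, ~storage-server applies to storage-server-220 but not to local-storage-server, because the latter does not begin with storage-server. That keeps the local and remote rules from overlapping.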

Storage space and automatic cleanup

  • Allocated storage: approximately 4 GB is reserved for data collection.
  • Automatic cleanup: Checkmk removes older data when free space falls below 300 MB.
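
To see how close the container is to the cleanup threshold, the free space can be compared against the 300 MB figure from the console. The sketch below is an assumption-laden example: the function name is hypothetical, and /omd/sites/dssmonitor is the usual OMD site location, which may not be the exact mount the 4 GB allocation refers to.

```shell
#!/bin/sh
# Threshold taken from this article; adjust if your revision differs.
CLEANUP_THRESHOLD_MB=300

# Return 0 if the given free space (in MB) is below the cleanup threshold.
below_cleanup_threshold() {
    [ "$1" -lt "$CLEANUP_THRESHOLD_MB" ]
}

# Example use (path is an assumption, see above):
#   free_mb=$(df -Pm /omd/sites/dssmonitor | awk 'NR==2 {print $4}')
#   below_cleanup_threshold "$free_mb" && echo "cleanup expected soon"
```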

Console access

The Checkmk container provides a web-based terminal for administrative tasks such as migration, site management, and troubleshooting.

URL:

 https://<storage-server-ip>:4200/checkmk/

Login: enter admin as the username and your Storage Server administrator password.

After logging in, you are in a regular user shell. For most administrative tasks you need root access:

 sudo su -

From root, you can manage the Checkmk site directly:

 omd status                              # check site status
 omd stop dssmonitor                     # stop the site
 omd start dssmonitor                    # start the site
 su - dssmonitor -c "cmk -II local-storage-server && cmk -O"   # run discovery

Email notifications

Email notifications are automatically configured from the Storage Server's own email settings — no separate SMTP configuration is needed inside Checkmk.

How it works

  1. Configure email on the Storage Server: System Settings → Administration → Email notifications. Set the SMTP server, From address and To address.
  2. Once saved:
    • the From address is used as the sender for Checkmk notification emails;
    • the To address is used as the destination — Checkmk contacts are configured with a placeholder (admin@localhost) that the email gateway rewrites to your configured To address;
    • changes to either address take effect immediately; no Checkmk restart needed.
  3. Additional customization is available in the Checkmk GUI:
    • notification rules under Setup → Events → Notifications;
    • per-user email settings under Setup → Users.

Important notes

  • Email delivery uses the Storage Server's central SMTP proxy — the same gateway used by ownCloud, cron jobs, etc.
  • The default notification rule emails only for actual state changes (OK↔WARN↔CRIT↔UNKNOWN). Scheduled-downtime start/end and flapping events do not generate emails.
  • To send a test email, use Setup → Events → Notifications → Test notifications.

Automatic service discovery

Checkmk automatically manages the list of monitored services:

  • new services (e.g. from a newly imported pool) are added to monitoring within 2 hours;
  • vanished services (e.g. from an exported pool or a removed plugin) are cleaned up within 2 hours;
  • changes are activated automatically — no manual Activate Changes click needed for discovered services.

For example, when you import a new ZFS pool, its capacity, health, snapshot age and compression services appear in Checkmk automatically. When you export a pool, the corresponding services are removed instead of lingering in UNKNOWN state. When a monitoring plugin is added or removed, services are updated accordingly.

Manual discovery

To avoid waiting for the automatic cycle, trigger discovery manually:

  • via the GUI: go to the host → Setup → Services → Full service scan
  • via the console: su - dssmonitor -c "cmk -II local-storage-server && cmk -O"
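
For hosts other than local-storage-server, the same console command applies with the host name swapped in. The helper below is a sketch that prints the command for review rather than running it; the name discovery_command is hypothetical. Note that cmk -II discards the host's existing discovered services and rediscovers them from scratch, and cmk -O reloads the monitoring core configuration.

```shell
#!/bin/sh
# Print the manual discovery command for a host.
# $1 = host name (default: local-storage-server)
# $2 = site name (default: dssmonitor)
# Nothing is executed; run the printed line from a root shell in the console.
discovery_command() {
    host="${1:-local-storage-server}"
    site="${2:-dssmonitor}"
    printf 'su - %s -c "cmk -II %s && cmk -O"\n' "$site" "$host"
}
```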

Reboot behavior

The monitoring container is designed to produce zero spurious notification emails during system reboots:

  • At shutdown: Checkmk automatically enters scheduled maintenance mode before the container stops, so no "service unavailable" alerts fire while the system is going down.
  • At boot: pool-specific downtimes are cleared automatically the moment each pool mounts; downtimes on other services (SMART, CPU, latency) are cleared once monitoring has confirmed three consecutive healthy checks. Monitoring then resumes normally.
  • Boot ordering: the container starts only after ZFS pools are imported and the monitoring agent has collected fresh data, so the first check cycle sees everything already in its normal state.

If a real problem persists after the boot window (e.g. a pool fails to import), it is reported normally once the maintenance period for that pool expires.
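
The "three consecutive healthy checks" condition can be pictured as a simple streak counter. The sketch below illustrates only that counting logic; the container's real implementation is not documented here, and the function name settled and the state strings passed to it are assumptions.

```shell
#!/bin/sh
# Return 0 if the most recent three check results are all OK.
# Arguments: check states in chronological order (oldest first).
# Any non-OK state resets the streak, so only a trailing run of OKs counts.
settled() {
    streak=0
    for state in "$@"; do
        if [ "$state" = "OK" ]; then
            streak=$((streak + 1))
        else
            streak=0
        fi
    done
    [ "$streak" -ge 3 ]
}
```

For instance, settled CRIT OK OK OK succeeds, while settled OK OK CRIT OK OK does not, because the CRIT result interrupts the run.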

Renaming the site

The default monitoring site is named dssmonitor. If you prefer a different name (e.g. to match your host or organization):

  1. Open the Checkmk console at https://<ip>:4200/checkmk/
  2. Log in with admin and the Storage Server administrator password
  3. Run:
 sudo su -
 omd stop dssmonitor
 omd mv dssmonitor <newname>
 omd start <newname>

After renaming, the Checkmk GUI is accessible at https://<ip>:4080/<newname>.

Note: the automated email configuration and notification rule apply to the site active at the time setup_mailer runs. If you rename and then change email settings in the Storage Server GUI, the new site name is picked up automatically.
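
Before renaming, it can be worth checking the new name and previewing the exact commands. The sketch below prints the three omd commands from the steps above without running them. The function name rename_commands is hypothetical, and the naming constraints it enforces (starts with a letter; only letters, digits and underscores; at most 16 characters) reflect OMD's usual rules as an assumption that should be verified against omd on your system.

```shell
#!/bin/sh
# Validate a new site name and print the rename commands for review.
# $1 = current site name, $2 = new site name
# Nothing is executed; the output is meant to be run from a root shell.
rename_commands() {
    old="$1"
    new="$2"
    case "$new" in
        ""|[!A-Za-z]*|*[!A-Za-z0-9_]*)
            echo "invalid site name: $new" >&2
            return 1
            ;;
    esac
    if [ "${#new}" -gt 16 ]; then
        echo "site name too long: $new" >&2
        return 1
    fi
    printf 'omd stop %s\nomd mv %s %s\nomd start %s\n' "$old" "$old" "$new" "$new"
}
```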

Migrating from a previous revision

If you upgraded from an earlier revision (e.g. rev 03), your Checkmk configuration (users, passwords, monitoring rules, views, dashboards) can be carried over to the new container.

The previous revision's data is stored on disk as an AUFS changes-layer image and remains available after the upgrade. The migration helper reads it, creates a standard Checkmk backup, and restores it into the current container using the built-in omd backup / omd restore mechanism.

Steps

  1. Open the Checkmk console at https://<ip>:4200/checkmk/
  2. Log in with admin and the Storage Server administrator password
  3. Run:
 sudo /usr/local/bin/migrate_from_prev_rev
  4. Wait for the script to complete. It will list each site it migrated.
  5. Log in to the Checkmk GUI. Your old users, rules, and views should be present.

What is migrated

  • All Checkmk sites (including renamed sites or additional user-created sites)
  • WATO configuration: hosts, rules, contacts, notification rules, tags, groups
  • Users and passwords (cmkadmin password is preserved; admin is added with PAM auth)
  • Per-user preferences: sidebar layout, saved views, dashboards
  • Discovered service definitions (autochecks)

What is NOT migrated

  • Historical metric data (RRD files) and monitoring logs
  • Temporary files and runtime state
  • Backup job definitions (these were stored in volatile storage)

Re-running

The migration script runs at most once. To re-run (e.g. after a factory reset), remove the marker file:

 sudo rm /var/lib/xc-checkmk/migration-done
 sudo /usr/local/bin/migrate_from_prev_rev
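
To find out whether the migration has already run before deleting anything, the marker file can be tested directly. This is a minimal sketch: the function name migration_already_ran is hypothetical, and it assumes the marker path given above is the only thing the script checks.

```shell
#!/bin/sh
# Return 0 if the migration marker file exists, 1 otherwise.
# $1 = marker path (optional; defaults to the path named in this article)
migration_already_ran() {
    marker="${1:-/var/lib/xc-checkmk/migration-done}"
    [ -e "$marker" ]
}

# Example use:
#   if migration_already_ran; then echo "marker present"; else echo "not migrated yet"; fi
```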

Known issues

  • Two-Factor Authentication (2FA): if 2FA is enabled on the Storage Server (via the oe_2fa module), Checkmk login may not work correctly. The current authentication method (Apache basic auth) does not support the second-factor prompt. 2FA support for Checkmk is planned for a future revision.
  • Checkmk REST API: external access to the Checkmk REST API (e.g. for automation scripts using Bearer token authentication) is not supported in this revision. REST API support is planned for a future revision.

For further customization or troubleshooting, refer to the upstream Checkmk documentation (https://docs.checkmk.com/2.1.0/en/) or contact Open-E support.