Critical system error response policy
Important! These settings apply only to a single-node configuration, or to a cluster configuration in which the other node is unavailable.
For a cluster configuration with both nodes available, the policy is set to immediate reboot in all cases.
A system reboot may be necessary when a critical error is detected. The administrator may choose to handle each category of error differently.
Possible critical errors are divided into three categories:
- ZFS pool I/O suspend: errors from this group are raised when an uncorrectable I/O failure is encountered during a read or write operation on the pool. The I/O operation is suspended and the system awaits a reboot.
- Kernel oops or bug: a kernel oops is a deviation from the correct behavior of the Linux kernel that produces a specific error log. Such an error is not fatal to the system but may endanger its stability. A kernel oops often precedes a kernel panic, which causes the system to shut down immediately. A kernel bug is an internal error in the kernel code. Such errors put the system integrity at risk. It is highly recommended to reboot immediately to avoid unexpected failures.
- Out-of-memory error: this error, abbreviated as OOM, refers to the state in which no additional memory can be allocated for use by programs or the operating system. Memory must be freed or added to restore system operation. Once this error occurs, the system remains unresponsive until the memory issue is resolved. It is highly recommended to reboot the system at the earliest opportunity.
For each of these categories, one of the following behavior patterns can be configured:
- Immediate: the system reboots the machine immediately after the error occurs (the event will not be recorded in the event viewer).
- Automatic: the system restarts 30 seconds after the error occurs.
- Manual: the system prompts for a manual restart.
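The rules above can be summarized as a small decision model. The following Python sketch is illustrative only: the names `ErrorCategory`, `Policy`, `effective_policy`, and `reboot_delay_seconds` are hypothetical and are not part of the product's interface. It assumes only what this page states: three error categories, three behavior patterns, a 30-second delay for Automatic, and a forced immediate reboot when both cluster nodes are available.

```python
from enum import Enum
from typing import Dict, Optional


class ErrorCategory(Enum):
    # The three critical error categories described on this page.
    ZFS_POOL_IO_SUSPEND = "zfs_pool_io_suspend"
    KERNEL_OOPS_OR_BUG = "kernel_oops_or_bug"
    OUT_OF_MEMORY = "out_of_memory"


class Policy(Enum):
    # The three configurable behavior patterns.
    IMMEDIATE = "immediate"
    AUTOMATIC = "automatic"
    MANUAL = "manual"


# Delay stated on this page for the Automatic pattern.
AUTOMATIC_DELAY_SECONDS = 30


def effective_policy(
    configured: Dict[ErrorCategory, Policy],
    category: ErrorCategory,
    both_nodes_available: bool,
) -> Policy:
    """Return the policy that actually applies to an error category."""
    # In a cluster with both nodes available, the configured value is
    # ignored and the policy is always an immediate reboot.
    if both_nodes_available:
        return Policy.IMMEDIATE
    # Single node, or cluster with the other node unavailable:
    # the administrator's per-category setting applies.
    return configured[category]


def reboot_delay_seconds(policy: Policy) -> Optional[int]:
    """Seconds until reboot, or None when a manual restart is awaited."""
    if policy is Policy.IMMEDIATE:
        return 0
    if policy is Policy.AUTOMATIC:
        return AUTOMATIC_DELAY_SECONDS
    return None  # Manual: the system only prompts for a restart.


# Hypothetical per-category configuration chosen for illustration.
example_config = {
    ErrorCategory.ZFS_POOL_IO_SUSPEND: Policy.AUTOMATIC,
    ErrorCategory.KERNEL_OOPS_OR_BUG: Policy.IMMEDIATE,
    ErrorCategory.OUT_OF_MEMORY: Policy.MANUAL,
}
```

For example, with `example_config`, an out-of-memory error on a single node resolves to `Policy.MANUAL`, while the same error in a cluster with both nodes available resolves to `Policy.IMMEDIATE` regardless of the configured value.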