Cisco Bug: CSCvs90143 - ESC enters fault state after health status monitor deadlocks
May 31, 2020
- Cisco Elastic Services Controller
Known Affected Releases
Symptom: The ESC health monitoring service can deadlock, resulting in the loss of critical network and storage services required to resolve the ESC HA role. When this condition latches, both ESCs in an HA cluster go into fault state and no longer respond to deployment requests. This condition can be confirmed by running health.sh on both ESC nodes. The command output will show that the ESC instances are in drbd backup mode and that escmanager is stopped. In addition, the following can be observed:
- These logs in /var/log/esc/escmanager.log:
  Feb 1 09:44:08 ESC-230-31 Keepalived_vrrp: /opt/cisco/esc/esc-init/esc_monitor.py -s -t network exited with status 1
  Feb 1 09:44:08 ESC-230-31 Keepalived_vrrp: /opt/cisco/esc/esc-init/esc_monitor.py -s -t storage exited with status 1
- Missing file: /tmp/.esc_tmp.lock
Conditions: The condition is entered when the file /tmp/.esc_tmp.lock is removed. On removal, the ESC health monitoring service attempts to re-create the file but cannot because of an underlying SELinux policy. As a result, the monitoring service cannot start and the ESC manager service stops. The lock file can be removed when /tmp is cleaned up (10-day cycle) or on ESC VM reboot. Testing has confirmed that these operations do not always result in the removal of the lock file.
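The two indicators described above (the esc_monitor.py failure lines and the missing lock file) could be checked with a small script. The following is a minimal sketch, not a Cisco-provided tool: the function names are hypothetical, and the only details taken from the bug description are the lock file path and the log message text. It does not run health.sh or inspect SELinux, and it should be run with permissions sufficient to read the log.

```python
import os
import re

# Lock file path as given in the bug description.
LOCK_FILE = "/tmp/.esc_tmp.lock"

# Matches the failure lines quoted in the bug description, e.g.
# ".../esc_monitor.py -s -t network exited with status 1".
MONITOR_FAILURE = re.compile(
    r"esc_monitor\.py -s -t (network|storage) exited with status 1"
)


def failed_monitor_checks(log_lines):
    """Return the sorted monitor targets (network/storage) that reported status 1."""
    targets = set()
    for line in log_lines:
        match = MONITOR_FAILURE.search(line)
        if match:
            targets.add(match.group(1))
    return sorted(targets)


def lock_file_missing(path=LOCK_FILE):
    """Return True when the lock file does not exist at the given path."""
    return not os.path.exists(path)


def summarize(log_lines, lock_path=LOCK_FILE):
    """Collect both indicators in one dictionary (hypothetical helper)."""
    return {
        "lock_file_missing": lock_file_missing(lock_path),
        "failed_checks": failed_monitor_checks(log_lines),
    }
```

To use it, read the lines of the log file named in the symptom and pass them to summarize(); a result showing the lock file missing together with failed network and storage checks is consistent with this bug, though it is only a heuristic and does not replace the health.sh confirmation.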