Cisco Bug: CSCvs90143 - ESC enters fault state after health status monitor deadlocks

Last Modified

May 31, 2020

Products (1)

  • Cisco Elastic Services Controller

Known Affected Releases

5.0(0.124) 5.1

Description (partial)

Symptom:
The ESC health monitoring service can deadlock, resulting in the loss of the critical network and storage services required to resolve the ESC HA role.  When this condition latches, both ESCs in an HA cluster go into fault state and no longer respond to deployment requests.

This condition can be confirmed by running health.sh on both ESC nodes.  The command output will show that the ESC instances are in drbd backup mode and that escmanager is stopped.  In addition, the following can be observed:

[1] The following log entries in /var/log/esc/escmanager.log:
 Feb  1 09:44:08 ESC-230-31 Keepalived_vrrp[1907]: /opt/cisco/esc/esc-init/esc_monitor.py -s -t network exited with status 1
 Feb  1 09:44:08 ESC-230-31 Keepalived_vrrp[1907]: /opt/cisco/esc/esc-init/esc_monitor.py -s -t storage exited with status 1

[2] Missing file: /tmp/.esc_tmp.lock
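
The two indicators above can be checked together with a short script. The following is a minimal sketch, not a Cisco-provided tool: it assumes Python 3 is available on the ESC node and that the log lines have the format shown in [1]. The function names and the regular expression are illustrative; only the two file paths come from this bug description.

```python
import os
import re

# Paths taken from the bug description; adjust if your installation differs.
LOCK_FILE = "/tmp/.esc_tmp.lock"
LOG_FILE = "/var/log/esc/escmanager.log"

# Matches lines such as:
#   ... esc_monitor.py -s -t network exited with status 1
MONITOR_FAILURE = re.compile(
    r"esc_monitor\.py -s -t (network|storage) exited with status (\d+)"
)


def find_monitor_failures(log_lines):
    """Return (check_type, exit_status) for every non-zero monitor exit."""
    failures = []
    for line in log_lines:
        match = MONITOR_FAILURE.search(line)
        if match and match.group(2) != "0":
            failures.append((match.group(1), int(match.group(2))))
    return failures


def check_symptoms(lock_file=LOCK_FILE, log_file=LOG_FILE):
    """Report whether the lock file is missing and which checks failed."""
    lock_missing = not os.path.exists(lock_file)
    try:
        with open(log_file, errors="replace") as handle:
            failures = find_monitor_failures(handle)
    except OSError:
        failures = []  # log unreadable; treat as no evidence
    return {"lock_missing": lock_missing, "monitor_failures": failures}


if __name__ == "__main__":
    print(check_symptoms())
```

A result with `lock_missing` set to `True` and both `network` and `storage` entries in `monitor_failures` is consistent with the symptoms described here, but it does not replace the health.sh check on both nodes.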

Conditions:
The condition is entered when the file /tmp/.esc_tmp.lock is removed.  On removal, the ESC health monitoring service attempts to re-create the file but is blocked by an underlying SELinux policy.  As a result, the monitoring service cannot start and the ESC manager service stops.

The lock file can be removed when /tmp is cleaned up (10-day cycle) or on ESC VM reboot.  Testing has confirmed that these operations do not always result in the removal of the lock file.