Cisco Bug: CSCvt43958 - Hyperflex Zookeeper - change memory settings related to max usage and cleanup.
Sep 20, 2020
- Cisco HyperFlex HX-Series
Known Affected Releases
3.5(2g) 3.5(2h) 4.0(1a) 4.0(2a) 4.0(2b)
Temporary downgrade of cluster resiliency from HEALTHY to WARNING state, because the main storfs process is restarted. The restart happens because the kernel OOM killer kills storfs under increased memory pressure on the controller VM. In some cases, if there are 2 or more simultaneous OOM faults (multiple nodes under high memory pressure), the cluster may shut down, causing a storage outage. There is no data loss in either case.

Symptom: When the OOM killer kills the main storfs process on a given controller, the resiliency state of the cluster turns to WARNING, but is eventually and automatically restored to HEALTHY. Under extreme conditions, if 2 or more nodes fault simultaneously, the cluster may shut down and may have to be restored manually using the CLI. There will be no data loss, but workload VMs may suffer a storage outage (APD, All Paths Down) for the duration of the cluster downtime.

Conditions: We have observed these OOMs under heavy load, such as a large number of VMs being snapshotted for backup purposes. In our extensive internal testing, without a heap limit in place for the ZooKeeper (ZK) process, we have seen OOM affect just a single node in the cluster. Users may see cluster health transition from HEALTHY to WARNING.
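The symptom described above starts with the kernel OOM killer terminating storfs, so one way to confirm it is to look for the kernel's "Killed process" message naming storfs. The following is a minimal sketch, not a Cisco tool. It assumes the controller VM's kernel log uses the standard Linux "Killed process <pid> (<name>)" wording and that the log text has already been collected. The function name find_oom_kills and the sample lines are hypothetical.

```python
import re

# Standard Linux kernel wording when the OOM killer terminates a process.
# Assumption: the controller VM kernel log uses this same wording.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, process_name="storfs"):
    """Return (pid, line) pairs for OOM kills of the given process name."""
    hits = []
    for line in log_lines:
        match = KILLED_RE.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample lines for illustration only, not taken from a real cluster.
    sample = [
        "[1000.1] Out of memory: Killed process 4321 (storfs) total-vm:100kB",
        "[1000.2] Killed process 99 (java) total-vm:5kB",
        "[1000.3] some unrelated kernel message",
    ]
    for pid, line in find_oom_kills(sample):
        print(pid, line)
```

In practice the input could come from any source of kernel messages that is available on the node. The sketch only filters lines and draws no conclusion about cluster health.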