Cisco Bug: CSCvv55870 - DNAC:S&P:112C:Identitymgmt service continuously restarting on DR 6+1 scale cluster after 18 days run
Oct 16, 2020
- Cisco DNA Center
Known Affected Releases
This is an internal class of error, rarely seen, observed on a scale-load DR solution cluster over a long run time. In this case it started as an identitymgmt failure, but the logs indicate that one of the mongo pods went into a name-resolution (nslookup) failure, meaning the k8s control plane has not fully wired up the pod. You may check (taking mongodb-0 as an example):

kubectl describe pod -n maglev-system mongodb-0
kubectl get ep -n maglev-system external-mongodb-0 -o yaml
kubectl describe ep -n maglev-system external-mongodb-0

All will indicate the pod as Running but in NotReady state. This is service impacting, and in this release it requires manual intervention to heal the runtime.

Symptom: The external symptom may appear as an "identitymanager" failure, but the logs indicate one of the mongo instances is in "Not Ready" state (and hence "nslookup" fails). One or more service pods may be running while name lookup fails, and the commands above for the respective pod will show it as "Not Ready".

If you check "journalctl -u kubelet" you will see this error trace:

Aug 30 02:56:30 maglev-master-172-21-21-10 kubelet: E0830 02:56:30.128184 138328 desired_state_of_world_populator.go:298] Error processing volume "mongodb-data" for pod "mongodb-0_maglev-system(465cdfdd-9b62-4062-9fe1-eb06354944c4)": error processing PVC "maglev-system"/"mongodb-data-mongodb-0": failed to fetch PVC maglev-system/mongodb-data-mongodb-0 from API server.
err=Get https://127.0.0.1:9443/api/v1/namespaces/maglev-system/persistentvolumeclaims/mongodb-data-mongodb-0: read tcp 127.0.0.1:42958->127.0.0.1:9443: use of closed network connection

The signatures to look for in the kubelet logs are "desired_state_of_world_populator.go", "error processing PVC", and "failed to fetch PVC".

Workaround: Under this condition, the best recovery option is to restart the kubelet systemd service:

systemctl restart kubelet

(Re: https://github.com/kubernetes/kubernetes/issues/87615 and https://github.com/golang/go/issues/39750)

Conditions: The externally visible condition is that "nslookup" for a Running pod fails. This is a rarely seen issue, still open in the Kubernetes infrastructure.
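The log check described above can be scripted. Below is a minimal sketch that tests a captured kubelet log stream for all three signatures; the function name has_pvc_fetch_failure is hypothetical and not part of any Cisco or Kubernetes tooling, and the journalctl/systemctl usage in the comment assumes root access on the affected node.

```shell
#!/bin/sh
# Sketch: detect the kubelet PVC-fetch failure signature from this bug.
# The function takes the log text as its argument so it can be run against
# a captured sample as well as live "journalctl -u kubelet" output.

has_pvc_fetch_failure() {
  log="$1"
  # All three signatures from the bug report must appear in the stream.
  printf '%s\n' "$log" | grep -q 'desired_state_of_world_populator.go' &&
  printf '%s\n' "$log" | grep -q 'error processing PVC' &&
  printf '%s\n' "$log" | grep -q 'failed to fetch PVC'
}

# Typical (hypothetical) usage on an affected node, applying the workaround:
#   if has_pvc_fetch_failure "$(journalctl -u kubelet --since '1 hour ago')"; then
#       systemctl restart kubelet
#   fi
```

This only automates the detection step; confirming the NotReady pod state with the kubectl commands above is still advisable before restarting kubelet.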