Cisco Bug: CSCvu67110 - RHV - VMMMgr crashes/high CPU on APIC.
Aug 27, 2020
- Cisco Application Policy Infrastructure Controller (APIC)
Known Affected Releases
Symptom: This defect covers a patch for a condition that is similar to CSCvn15769 but is not covered by the patch for CSCvn15769.

There are recurring crashes and core dumps on different Cisco APICs (which are VMM domain shard leaders), high CPU utilization by the VMMMgr process (around 200%, that is, two fully loaded CPU cores), and multiple inventory sync issues. These issues prevent the VMMMgr process from processing any operational or configuration changes that are made on the RHVs.

The issues can be cleared by repeatedly restarting the vmmmgr process (the aforementioned cores are NOT caused by the process restarts). However, restarting a DME is not a recommended workaround.

The decoded core files and the vmmmgr logs show the following:

> Decoded cores consistently show vmmmgr coring when the following functions are called:
> vmm_rhev::RHEVController::getHvs
> ... in vmm_rhev::RHEVController::getHvs(vmm::Connection&, std::map<base::String, mo::Mo*, std::less<base::String>, std::allocator<std::pair<base::String const, mo::Mo*> > >&, comp::CtrlrMo*)
> vmm_rhev::RHEVController::getVms
> ... in vmm_rhev::RHEVController::getVms(vmm::Connection&, std::map<base::String, mo::Mo*, std::less<base::String>, std::allocator<std::pair<base::String const, mo::Mo*> > >&, comp::CtrlrMo&)
> vmm_rhev::RHEVController::getInventory
> ... in vmm_rhev::RHEVController::getInventory() () from /vol/ifc-rel-imgs/3.2-1m/mgmt/usr/lib64/libsvc_ifc_vmmmgr.so

The VMMMgr logs show similar problems when running GETs against the objects listed above, such as hypervisors and VMs. cURL (the easy interface) appears to be erroring out, and some unexpected URLs are seen when the above get functions are performed.
All domains are affected by this:

> 23971||18-10-16 10:29:26.255+02:00||ifc_vmmmgr||DBG4||||getDataCenterId||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/RhevController.cc||1587
> 23971||18-10-16 10:29:26.257+02:00||rest_client||ERROR||||CURL failure: curl_easy_perform returned 6||../common/src/restclient/./CurlImpl.cc||269
> 23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||Error in URL: datacenters||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/RhevConnection.cc||50
> 23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||Error response: ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/RhevConnection.cc||64
> 23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||RHEV-sfb-pro01: sfbpro01: 0x562e47413510: Data Center sfbpro01 not found||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/RhevController.cc||414
> 23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||RHEV-sfb-pro01: sfbpro01: 0x562e47413510: Failed to get LNode inventory ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/Controller.cc||2105
> ...snip...
> 23971||18-10-16 10:29:26.260+02:00||rest_client||ERROR||||CURL failure: curl_easy_perform returned 6||../common/src/restclient/./CurlImpl.cc||269
> 23971||18-10-16 10:29:26.260+02:00||ifc_vmmmgr||WARN||||Error in URL: vms?search=datacenter=sfbpro01||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/RhevConnection.cc||50
> 23971||18-10-16 10:29:26.260+02:00||ifc_vmmmgr||WARN||||Error response: ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/RhevConnection.cc||64
> 23971||18-10-16 10:29:26.260+02:00||ifc_vmmmgr||WARN||fn=[getVmInventory]||RHEV-sfb-pro01: sfbpro01: 0x562e47413510: Failed to get Vm inventory ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/Controller.cc||2166

> VMMMgr is also repeatedly showing 'error 289'.
Some examples:

> 23358||18-10-16 10:29:24.428+02:00||ifc_vmmmgr||INFO||||RHEV-str-pro01: RHVP_STR: 0x562e3e920010: Action: ACT_TASK_GET_HV_ADJ ( 21 ) errorCode: 289 l Ret: 1||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/Controller.cc||1081
> 23340||18-10-16 10:29:24.430+02:00||ifc_vmmmgr||INFO||co=doer:5:1:0x2800000000ba35ab:1||processStimulus - Received errorCode: 289||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/Manager.cc||575
> 23340||18-10-16 10:29:24.430+02:00||ifc_vmmmgr||INFO||co=doer:5:1:0x2800000000ba35ab:1||Received errorCode: 289||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/Manager.cc||498
> ...

Conditions: The vmmmgr crashes are due to concurrent access by two threads to a library that is not thread safe. This library is involved in sending REST requests. The concurrent access is frequent (and is possibly also the cause of the high CPU utilization) because the retrieval of adjacency information continuously fails and is retried for the following HTTPS path at the RHV Controller IP:

/ovirt-engine/api/hosts/<host guid>/nics/<nic guid>/linklayerdiscoveryprotocolelements

The GET operations fail with the following error:

"interface type not support lldp"

As the adjacency retrieval operations fail and are retried, they frequently overlap with the periodic inventory refresh, which results in the crashes.
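The overlap described under Conditions can be illustrated with a minimal Python sketch. It is not the VMMMgr code and not the actual fix; it only shows the general technique of serializing calls into a client that is not thread safe. All class and function names (UnsafeRestClient, SerializedRestClient, run_workers) are hypothetical, and the request paths are placeholders taken from the log excerpts and the path above.

```python
import threading
import time


class UnsafeRestClient:
    """Hypothetical stand-in for a REST library that is not thread safe."""

    def __init__(self):
        self._in_use = False
        self.overlaps = 0  # times a caller entered get() while another was inside

    def get(self, path):
        if self._in_use:
            self.overlaps += 1
        self._in_use = True
        time.sleep(0.001)  # simulate request latency
        self._in_use = False
        return path


class SerializedRestClient:
    """Wrapper that lets only one thread at a time call into the client."""

    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()

    def get(self, path):
        with self._lock:
            return self._client.get(path)


def run_workers(client, rounds=20):
    """Run an adjacency-retry loop and an inventory-refresh loop concurrently."""

    def adjacency_retries():
        # Placeholder path; the GUIDs are elided in the bug report.
        for _ in range(rounds):
            client.get("hosts/<host guid>/nics/<nic guid>/linklayerdiscoveryprotocolelements")

    def inventory_refresh():
        for _ in range(rounds):
            client.get("vms?search=datacenter=sfbpro01")

    threads = [
        threading.Thread(target=adjacency_retries),
        threading.Thread(target=inventory_refresh),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    raw = UnsafeRestClient()
    run_workers(SerializedRestClient(raw))
    print("overlapping calls with lock:", raw.overlaps)
```

With the wrapper in place, the two loops never enter the underlying client at the same time. Passing the raw client to run_workers directly will usually, though not deterministically, record overlapping calls. A lock of this kind only removes the concurrent access; it does not stop the adjacency retrieval from failing and being retried.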