Cisco Bug: CSCvu67110 - RHV - VMMMgr crashes/high CPU on APIC.

Last Modified

Aug 27, 2020

Products (1)

  • Cisco Application Policy Infrastructure Controller (APIC)

Known Affected Releases


Description (partial)

This defect covers a patch for a condition similar to CSCvn15769 but not covered in the patch for it.

There are recurring crashes and core dumps on different Cisco APICs (the VMM domain shard leaders), high CPU utilization for the vmmmgr process (around 200%, i.e. two fully utilized CPU cores), and multiple inventory (inv) sync issues.

These issues prevent the vmmmgr process from processing any operational or configuration changes made on the RHV controllers.

The symptoms can be cleared temporarily by repeatedly restarting the vmmmgr process (the aforementioned cores are NOT caused by the process restarts). However, restarting a DME is not a recommended workaround.
The decoded core files as well as the vmmmgr logs have shown the following:

>	Decoded cores consistently show vmmmgr crashing in the following functions: 
>	vmm_rhev::RHEVController::getHvs 
>	... in vmm_rhev::RHEVController::getHvs(vmm::Connection&, std::map<base::String, mo::Mo*, std::less<base::String>, std::allocator<std::pair<base::String const, mo::Mo*> > >&, comp::CtrlrMo*)
>	vmm_rhev::RHEVController::getVms 
>	... in vmm_rhev::RHEVController::getVms(vmm::Connection&, std::map<base::String, mo::Mo*, std::less<base::String>, std::allocator<std::pair<base::String const, mo::Mo*> > >&, comp::CtrlrMo&)
>	vmm_rhev::RHEVController::getInventory 
>	... in vmm_rhev::RHEVController::getInventory() () from /vol/ifc-rel-imgs/3.2-1m/mgmt/usr/lib64/

	The vmmmgr logs show corresponding failures when running GETs against the objects listed above (hypervisors, VMs): curl_easy_perform returns errors, and some unexpected URLs appear during these GET operations. All VMM domains are affected:

>	23971||18-10-16 10:29:26.255+02:00||ifc_vmmmgr||DBG4||||getDataCenterId||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/||1587
>	23971||18-10-16 10:29:26.257+02:00||rest_client||ERROR||||CURL failure: curl_easy_perform returned 6||../common/src/restclient/./||269
>	23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||Error in URL: datacenters||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/||50
>	23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||Error response: ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/||64
>	23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||RHEV-sfb-pro01: sfbpro01: 0x562e47413510: Data Center  sfbpro01 not found||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/||414
>	23971||18-10-16 10:29:26.257+02:00||ifc_vmmmgr||WARN||||RHEV-sfb-pro01: sfbpro01: 0x562e47413510: Failed to get LNode inventory ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/||2105
>	23971||18-10-16 10:29:26.260+02:00||rest_client||ERROR||||CURL failure: curl_easy_perform returned 6||../common/src/restclient/./||269
>	23971||18-10-16 10:29:26.260+02:00||ifc_vmmmgr||WARN||||Error in URL: vms?search=datacenter=sfbpro01||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/||50
>	23971||18-10-16 10:29:26.260+02:00||ifc_vmmmgr||WARN||||Error response: ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/rhev/||64
>	23971||18-10-16 10:29:26.260+02:00||ifc_vmmmgr||WARN||fn=[getVmInventory]||RHEV-sfb-pro01: sfbpro01: 0x562e47413510: Failed to get Vm inventory ||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/||2166
>	vmmmgr also repeatedly logs errorCode 289. Some examples: 
>	23358||18-10-16 10:29:24.428+02:00||ifc_vmmmgr||INFO||||RHEV-str-pro01: RHVP_STR: 0x562e3e920010: Action: ACT_TASK_GET_HV_ADJ ( 21 ) errorCode: 289 l
Ret: 1||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/||1081
>	23340||18-10-16 10:29:24.430+02:00||ifc_vmmmgr||INFO||co=doer:5:1:0x2800000000ba35ab:1||processStimulus - Received errorCode: 289||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/||575
>	23340||18-10-16 10:29:24.430+02:00||ifc_vmmmgr||INFO||co=doer:5:1:0x2800000000ba35ab:1||Received errorCode: 289||../svc/vmmmgr/src/gen/ifc/app/./imp/vmm/||498
>	...

The vmmmgr crashes are due to concurrent access by two threads to a library that is not thread safe. This library is involved in sending REST requests. The reason for this frequent concurrent access (and possibly the high CPU utilization) is that the retrieval of adjacency information continuously fails and is retried for the following HTTPS path on the RHV controller IP:
    /ovirt-engine/api/hosts/<host guid>/nics/<nic guid>/linklayerdiscoveryprotocolelements
The GET operations fail with the error "interface type not support lldp". As the adjacency retrieval operations fail and are retried, they frequently overlap with the periodic inventory refresh, resulting in the crashes.
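The failure mechanism above can be illustrated with a minimal Python sketch (all names hypothetical; the actual vmmmgr code is C++ and not shown in this bug): a shared, non-thread-safe REST client is serialized behind a lock so the inventory-refresh and adjacency-polling threads never overlap inside the library, and a permanent error such as "interface type not support lldp" is treated as non-retryable so the poller stops hammering the endpoint:

```python
import threading

class PermanentError(Exception):
    """An error that will not succeed on retry (e.g. LLDP unsupported on a NIC type)."""

class NotThreadSafeRestClient:
    """Stand-in for a REST library that must not be entered by two threads at once."""
    def get(self, path: str) -> str:
        if path.endswith("linklayerdiscoveryprotocolelements"):
            raise PermanentError("interface type not support lldp")
        return "ok"

class SerializedClient:
    """Wrap the non-thread-safe client so concurrent callers never overlap in it."""
    def __init__(self, client: NotThreadSafeRestClient):
        self._client = client
        self._lock = threading.Lock()

    def get(self, path: str) -> str:
        with self._lock:          # one thread inside the library at a time
            return self._client.get(path)

def fetch_adjacency(client: SerializedClient, path: str, max_retries: int = 3):
    """Retry transient failures, but give up immediately on permanent ones."""
    for _ in range(max_retries):
        try:
            return client.get(path)
        except PermanentError:
            return None           # do not keep retrying a request that cannot succeed
    return None

client = SerializedClient(NotThreadSafeRestClient())
print(client.get("/ovirt-engine/api/hosts"))  # ok
print(fetch_adjacency(client, "/nics/x/linklayerdiscoveryprotocolelements"))  # None
```

The two ingredients mirror the fix described: serializing access removes the crash, and classifying the LLDP error as permanent removes the continuous retry loop (and the CPU burn) that triggered the overlap in the first place.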
Bug details contain sensitive information and therefore require an account to be viewed.

Bug Details Include

  • Full Description (including symptoms, conditions and workarounds)
  • Status
  • Severity
  • Known Fixed Releases
  • Related Community Discussions
  • Number of Related Support Cases
Bug information is viewable for customers and partners who have a service contract. Registered users can view up to 200 bugs per month without a service contract.