Cisco Bug: CSCup23821 - NPUMGR Restart causing MME eNB SCTP failures and impact to SGs/S1AP
Mar 07, 2018
- Cisco ASR 5000 Series
Known Affected Releases
Symptom:

=== Problem Description ===

Today BMC experienced another outage due to an SFT failure on the demux card. The issue is similar to the one reported last week under SR 630277117, but it occurred on a different site running 16.0. That issue was discussed in this mailer earlier, and the past conversation is included below for reference.

A Micro Engine on the NPU of the demux card goes into a bad state and causes the outage. The PSC itself does not restart automatically, and a manual migration is required to recover. This issue has become extremely hot, since the customer is now seeing it on multiple sites. When it occurs, the MME loses ~30% of the connected eNodeBs. BMC would like an answer on what is causing this issue on their network and how to prevent it from happening.

=== Logs snippet ===

2014-May-23+13:40:49.190 card 1-cpu1: me_fail_count = 1
2014-May-23+13:40:49.191 card 1-cpu1: me_fail_count = 1
2014-May-23+13:40:49.340 card 1-cpu1: SFT FAILURE @ Total Sent 147513744 : lost 35, err 0, Sent 35, Total Err 0
Fri May 23 13:40:55 2014 Internal trap notification 1206 (VLRAssocDown) VLR association down; vpn mme_ctx service mme_svc_sgs vlr BMPH1 address1 126.96.36.199 address2 188.8.131.52 port 29118
Fri May 23 13:40:55 2014 Internal trap notification 1242 (VLRDown) VLR down; vpn mme_ctx service mme_svc_sgs vlr BMPH1
Fri May 23 13:41:04 2014 Internal trap notification 1099 (ManagerRestart) facility npumgr instance 1 card 1 cpu 1
Fri May 23 13:41:04 2014 Internal trap notification 151 (TaskRestart) facility npumgr instance 1 on card 1 cpu 1

Conditions:
1. Due to a momentary hardware issue within the NPU hardware on card 1, we encountered loss of monitoring packets (a kind of heartbeat internal to the chassis). This condition resulted in an npumgr-1 restart, which is expected behaviour. The momentary hardware issue is evident from the Frame Parser MicroEngine being stuck at 100% utilization. The npumgr restarting due to the SF monitoring failure appears to be caused by a MicroEngine lockup.
2. 2014-May-23+13:40:50.370 card 1-cpu1: Frame Parser | 100.00% | 45.84% | 27.50% | 0.00% |
3. 2014-May-23+13:40:50.880 card 1-cpu1: Frame Parser | 100.00% | 45.84% | 27.50% | 0.00% |
4. Due to the npumgr restart on the demux card, eNodeB associations may have failed during this time.
5. Once npumgr restarted and came back, traffic should ideally have returned to a normal state over a period of time. However, per your observation, for some of the eNodeBs it did not.
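The conditions above describe a recognizable log signature: an "SFT FAILURE" line accompanied by the Frame Parser MicroEngine pegged at 100% utilization. As a minimal sketch (not a Cisco tool), the following hypothetical Python helper scans exported syslog lines for that combined signature, so an operator could flag the lockup before eNodeB associations drop. The function name, threshold parameter, and log-field layout are assumptions based only on the snippet in this bug.

```python
import re

# Pattern for the SFT failure line, e.g.:
# "... card 1-cpu1: SFT FAILURE @ Total Sent 147513744 : lost 35, ..."
SFT_FAILURE = re.compile(r"SFT FAILURE")

# Pattern for the first utilization column of the Frame Parser row, e.g.:
# "... Frame Parser | 100.00% | 45.84% | 27.50% | 0.00% |"
FRAME_PARSER = re.compile(r"Frame Parser\s*\|\s*([\d.]+)%")

def microengine_lockup(lines, threshold=100.0):
    """Return True if the log lines show both an SFT failure and a
    Frame Parser MicroEngine utilization at or above the threshold."""
    saw_sft = False
    saw_pegged_me = False
    for line in lines:
        if SFT_FAILURE.search(line):
            saw_sft = True
        m = FRAME_PARSER.search(line)
        if m and float(m.group(1)) >= threshold:
            saw_pegged_me = True
    return saw_sft and saw_pegged_me

logs = [
    "2014-May-23+13:40:49.340 card 1-cpu1: SFT FAILURE @ Total Sent 147513744 : lost 35, err 0, Sent 35, Total Err 0",
    "2014-May-23+13:40:50.370 card 1-cpu1: Frame Parser | 100.00% | 45.84% | 27.50% | 0.00% |",
]
print(microengine_lockup(logs))  # True for the snippet in this bug
```

Requiring both indicators together avoids alerting on a transient SFT loss alone, which the bug notes can clear on its own once npumgr restarts.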
Bug details contain sensitive information and therefore require a Cisco.com account to be viewed.
Bug Details Include
- Full Description (including symptoms, conditions and workarounds)
- Known Fixed Releases
- Related Community Discussions
- Number of Related Support Cases