Guest

Preview Tool

Cisco Bug: CSCuq93120 - DPC4 shutdown due to heartbeat issues / rebooted causing MMIO switchover

Last Modified

Feb 28, 2017

Products (1)

  • Cisco ASR 5000 Series

Known Affected Releases

15.0(500)

Description (partial)

Symptom:
DPC 4 shutdown due to heartbeat issues:
 
Sep 16 20:36:54 [10.192.98.223.203.236] evlogd: [local-60sec54.641] [hat 3014 critical] [5/0/6030 <hatsystem:0> hat_hb_lib.c:743] [software internal system syslog] HAT instance 0 found Critical task hatcpu/41 failed on cpu 4/1. Missed heartbeat sequence 1782542 ns_age 3 bounce_code N/A uptime 1783440.
hb 1782532 :43.440, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782533 :44.441, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782534 :45.441, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782535 :46.442, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782536 :47.443, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782537 :48.443, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782538 :49.444, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782539 :50.444, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782540 :51.445, RV, - 00000000 00000000, tx 0ms, rx 0ms, rtt 0ms
hb 1782541 :52.446, BV, - 00000000 00000000, tout 1193ms, age 3
afd 17733546 :53.079, rtt 0ms
afd 17733547 :53.180, rt
..
Sep 16 20:36:54 [10.192.98.223.203.236] evlogd: [local-60sec54.666] [hat 3033 error] [5/0/6030 <hatsystem:0> atsystem_fail.c:1463] [hardware internal system critical-info diagnostic] Card error detected on card 4 device CPU_1 reason CPU_CRITICAL_TASK_FAILURE
Sep 16 20:36:54 [10.192.98.223.203.236] evlogd: [local-60sec54.681] [csp 7019 critical] [5/0/6113 <cspctrl:0> spctrl_events.c:3954] [hardware internal system diagnostic] The Data Processing Card with serial number SAD152902ZB in slot 4 has failed and will be reset and brought back online. (Device=CPU_1, Reason=CPU_CRITICAL_TASK_FAILURE, Status=[BOARD:] [CPU0 MB: Boot Done HB_cpu: 0E:0F Error ID: None BPL: None] [CPU1 MB: Boot Done HB_cpu: 0F:00 Error ID: None BPL: None] [CPU2 BPL: None] [CPU3 BPL: None])
 
Card 10 was then brought up as Active:
 
Sep 16 20:36:54 [10.192.98.223.203.236] evlogd: [local-60sec54.723] [snmp 22002 info] [5/0/6113 <cspctrl:0> trap_api.c:367] [software internal system syslog] Internal trap notification 55 (CardActive) card 10 type Data Processing Card
 
Card 10 starts to have memory issues on multiple SESSMGRs:

Sep 16 20:36:56 [10.192.98.223.203.236] evlogd: [local-60sec56.244] [resmgr 14508 error] [10/1/4563 <rmmgr:101> _resource_cpu.c:3631] [software internal system critical-info syslog] The task sessmgr-5132 is way over its memory limit! Allocated 235520K, Using 270044K
Sep 16 20:36:56 [10.192.98.223.203.236] evlogd: [local-60sec56.244] [resmgr 14508 error] [10/1/4563 <rmmgr:101> _resource_cpu.c:3631] [software internal system critical-info syslog] The task sessmgr-8 is way over its memory limit! Allocated 235520K, Using 335564K
 
MIO Card 5 then experiences a task restart (The Core File has been attached to the SR):
 

Fatal Signal 11: Segmentation fault
  PC: [08c0a246/X] sn_csp_get_pport_info_cleanup()
  Faulty address: 0x8618
  Signal from: kernel
  Signal detail: address not mapped to object
  Process: card=5 cpu=0 arch=X pid=7744 cpu=~1% argv0=snmp
  time: 2014-Sep-16+20:40:43 UTC
  Recent errno: 115 Operation now in progress
  Stack (15288@0xffffa000):
    [08c0a246/X] sn_csp_get_pport_info_cleanup() sp=0xffffa5d8
    [00ceaced/X] snmp_retrieve_port_data() sp=0xffffa648
    [00ceb0a5/X] snmp_get_port_data() sp=0xffffa658
    [00ceb100/X] var_starPortTable() sp=0xffffa678
    [0aa204a4/X] search_subtree_vars() sp=0xffffaae8
    [0aa20b95/X] getStatPtr() sp=0xffffad48
    [0aa1fd9c/X] handle_one_var() sp=0xffffafc8
    [0aa1fc38/X] handle_
 
After this task restart, there are quite a bit of logs that we need to get investigated by the BU to see what happened, it doesn't look good:
 
Sep 16 20:40:45 [10.192.98.223.203.236] evlogd: [local-60sec45.317] [stat 31025 error] [5/0/7705 <bulkstat:0> stat_schema.c:97013] [software internal system syslog] Unable to retrieve Error in getting pport info rc=3
..
Sep 16 20:40:49 [10.192.98.223.203.236] evlogd: [local-60sec49.071] [snmp 22002 info] [1/0/4563 <rmmgr:10> trap_api.c:16006] [software internal system syslog] Internal trap notification 1217 (MemoryWarn) facility rmmgr instance 10 card 1 cpu 0 allocated 23552 used 24020
..
Sep 16 20:40:58 [10.192.98.223.203.236] evlogd: [local-60sec37.978] [afmgr 187002 warning] [4/1/4565 <afmgr:41> fmgr_ipc_msgs.c:512] [software internal system critical-info syslog] afmgr_ipc_rx_sct_message: Error updating Switch Fabric SSR DB entry in SCT for 12=4/2
..
Sep 16 20:41:35 [10.192.98.223.203.236] evlogd: [local-60sec35.636] [resmgr 14508 error] [5/0/6035 <rmmgr:50> _resource_cpu.c:3631] [software internal system critical-info syslog] The task rmctrl-0 is way over its memory limit! Allocated 66560K, Using 94000K
..
Sep 16 20:44:43 [10.192.98.223.203.236] evlogd: [local-60sec43.551] [sitmain 4053 warning] [5/0/6111 <dcardctrl:0> sit_api.c:508] [software internal system critical-info syslog] sit operation failed rm api failed return code (SN_STATUS_ACCESS_DENIED)
Sep 16 20:44:43 [10.192.98.223.203.236] evlogd: [local-60sec43.747] [resmgr 14701 warning] [5/0/6094 <rmctrl:0> rmctrl_events.c:579] [software internal system critical-info syslog] The resources needed for task npumgr/41 could not be allocated to any active CPU.  Reason:CPU 4/1: No data No cpus matched criteria (fl:0,min:0.00,inc:2/[41:29],exc:0/[0:0],exc:0/[0:0])
..
Sep 16 20:46:43 [10.192.98.223.203.236] evlogd: [local-60sec43.944] [csp 7906 error] [5/0/6030 <hatsystem:0> csp_api_msg.c:1322] [software internal system syslog] sn_csp_notify_card_failed_nonblocking_callback: bounce
..
 
Then there were additional task restart records on card 5 that state "User Initiated" but that is really the system collecting data, it's not a real task restart.
 
Here's another interesting log:
 
Sep 16 20:48:46 [10.192.98.223.203.236] evlogd: [local-60sec46.482] [snmp 22022 error] [5/0/27015 <snmp:0> agt_slot.c:78] [software internal system syslog] Unable to retrieve Chassis Table, internal error code 3
 
Card 6 now took over:
 
Sep 16 20:48:47 [10.192.98.223.203.236] evlogd: [local-60sec47.561] [evlog 2134 info] [6/0/8383 <evlogd:1> odule_persist.c:2165] [software internal system critical-info syslog] Evlogd : write new record at empty slot 5
Sep 16 20:48:47 [10.192.98.223.203.236] evlogd: [local-60sec47.561] [hat 3052 warning] [6/0/8385 <hatsystem:1> hatsystem.c:714] [software internal system critical-info syslog] OOB response from card 5 cpu 0 172.16.4.1:
Messenger peer 20c1060053fd6929 (cls 5) unknown to hatcpu/162000-50 pid 6031 card 5 cpu 0
Messenger req 20c1060054189e0d not found in wip_hash
Loadavg: 2.24 2.33 2.41 7/295 734
Ip: 1 64 20425818990 0 0 0 0 0 20413728602 10868611435 0 0 0 0 0 0 0 0 0
Tcp: 1 200 120000 -1 76439639 26141381 3768 81305 3998 18292629206 10162156722 7806 0 78298
Udp: 479948440 7171061 2725087 123803269 44 0
 
Then we have the last log and a gap in the SYSLOG data:
 
Sep 16 20:48:47 [10.192.98.223.203.236] evlogd: [local-60sec47.561] [hat 3073 critical] [6/0/8385 <hatsystem:1> hatsystem_msg.c:907] [software internal system critical-info syslog] primary hatsystem dead! triggering spc switchover: LINUX_RC: Done [hb_cpu: 0C:D4, hb_sys: 0000003C:DE]
 
Sep 16 21:14:10 [10.192.98.223.200.212] evlogd: [local-60sec10.723] [bgp 85000 critical] [5/0/16780 <bgp:2> zebos_bgp_api.c:1113] [software internal system critical-info] Route Modifier value of 0 has been updated for neighbor 10.119.91.16 in AS number 65120

Conditions:
HW: ASR 5500
SW: 15.0 (54944)
Product/Feature: PGW
 
Time of Incident: Sep 16 20:36:54 UTC
Duration of Incident: ~38 minutes (From the time of 1st Heartbeat failures to the point the SYSLOG Server started receiving messages again.)
Reoccurrences: Once
Bug details contain sensitive information and therefore require a Cisco.com account to be viewed.

Bug Details Include

  • Full Description (including symptoms, conditions and workarounds)
  • Status
  • Severity
  • Known Fixed Releases
  • Related Community Discussions
  • Number of Related Support Cases
Bug information is viewable for customers and partners who have a service contract. Registered users can view up to 200 bugs per month without a service contract.