Cisco Bug: CSCuq22781 - DB timeout during pruning causes HA failure / split brain
Jan 22, 2016
- Cisco Mobility Services Engine
Known Affected Releases
Symptom:
A DB timeout during database pruning can cause an HA failure, after which both MSEs start tracking clients (split brain).

Logs before entering the split-brain stage:
16 Jul 2014 23:30:59,687 INFO [DbMonitor-10-209-1-185] - PrimaryDbMonitor - Gap Status: null
16 Jul 2014 23:30:59,687 ERROR [DbMonitor-10-209-1-185] - PrimaryDbMonitor - monitorDbStatus: Error retrieving replication status: Reason - java.lang.NullPointerException
java.lang.NullPointerException
    at com.cisco.mse.ha.db.PrimaryDbMonitor.getArchiveDestStatus(PrimaryDbMonitor.java:378)
    at com.cisco.mse.ha.db.PrimaryDbMonitor.monitorDbStatus(PrimaryDbMonitor.java:276)
    at com.cisco.mse.ha.db.PrimaryDbMonitor.run(PrimaryDbMonitor.java:228)
    at java.lang.Thread.run(Thread.java:662)
16 Jul 2014 23:31:00,690 WARN [DbMonitor-10-209-1-185

DB timeout soon after:
16 Jul 2014 23:31:00,690 WARN [DbMonitor-10-209-1-185] - PrimaryDbMonitor - Shutting down DB monitor for peer: 10.209.1.185
16 Jul 2014 23:31:02,215 INFO [AppMonitor] - AppManagerPrimary - Checking for sub-service: Aeroscout Tag Engine
16 Jul 2014 23:31:02,616 INFO [HM:10-209-1-185] - HealthMonitorPrimaryInst - Heartbeats are up but DB monitor is down. Shutting down heartbeats also
16 Jul 2014 23:31:02,668 ERROR [HM:10-209-1-185] - HealthMonitorPrimaryInst - Failed to get status from secondary server about the database. Marking state to be primary lost secondary

Conditions:
MSE 22.214.171.124 in HA mode while the database pruning task is being executed.
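To check whether a saved log shows the same sequence of events as the excerpt above, a small script can search for the four key messages in order. The following is a minimal sketch, not Cisco tooling: the function names and event labels are hypothetical, the message fragments are copied from the excerpt in this bug, and the timestamp pattern assumes the format shown there.

```python
import re

# Message fragments copied from the log excerpt in this bug. The labels are
# hypothetical names chosen for this sketch only.
SIGNATURES = [
    ("replication_status_error", "Error retrieving replication status"),
    ("db_monitor_shutdown", "Shutting down DB monitor for peer"),
    ("heartbeats_shutdown", "Heartbeats are up but DB monitor is down"),
    ("primary_lost_secondary", "Marking state to be primary lost secondary"),
]

# Timestamp format as it appears in the excerpt, e.g. "16 Jul 2014 23:30:59,687".
TIMESTAMP = re.compile(r"(\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2},\d{3})")


def find_signature_events(lines):
    """Return (timestamp, label) pairs for lines containing a known fragment."""
    events = []
    for line in lines:
        for label, fragment in SIGNATURES:
            if fragment in line:
                match = TIMESTAMP.match(line.strip())
                events.append((match.group(1) if match else None, label))
                break
    return events


def looks_like_split_brain(lines):
    """True if all four fragments occur in the same order as in the excerpt."""
    expected = [label for label, _ in SIGNATURES]
    position = 0
    for _, label in find_signature_events(lines):
        if position < len(expected) and label == expected[position]:
            position += 1
    return position == len(expected)
```

Passing the lines of a log file (for example `open(path).read().splitlines()`) to `looks_like_split_brain` reports whether the full sequence is present. A match only indicates that the log resembles this excerpt; it does not confirm the root cause.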