Cisco Bug: CSCvt14410 - AA leader can deadlock in "running but in error" state
May 31, 2020
- Cisco Elastic Services Controller
Known Affected Releases
Symptom: On initial ESC AA deployment, or across an AA switchover, there is a window in which an internal DNS lookup can fail while ESC is restarting services. The DNS lookup failure stalls ESC's access to its database, which causes a database migration to fail. Once this condition has latched, the ESC cluster is deadlocked and cannot process requests until the condition is cleared.

In this state the ESC health check returns the following:

[admin@user-2 esc]$ health.sh
============== ESC =================
quagga (pgid 16916) is running
consul_template (pgid 3028) is running
vimmanager (pgid 11084) is running
monitor (pgid 12347) is running
mona (pgid 12440) is running
drbd (pgid 3002) is master
consul (pgid 2646) is running
etsi is stopped
pgsql (pgid 16059) is running
elector (pgid 10910) is leader
filesystem (pgid 0) is running
confd (pgid 17082) is phase0
geo (pgid 2611) is primary
escmanager (pgid 18316) is running but in error
=======================================
ESC HEALTH FAILED

The following ESC manager log confirms the database migration failure:

2020-02-20 18:36:42.618 http-nio-0.0.0.0-8080-exec-8 ERROR [Slf4jLog.java:error:52] [tid=d244d56d-625b-4e8b-af5d-c12aedf63e6d] Migration of schema "esc_schema" to version 1.52 failed! Changes successfully rolled back.
2020-02-20 18:36:42.992 http-nio-0.0.0.0-8080-exec-8 ERROR [ESCManager.java:goOperational:375] [tid=d244d56d-625b-4e8b-af5d-c12aedf63e6d] Database migration was not successful.
com.cisco.esc.db.migration.exceptions.DatabaseMigrationException: org.flywaydb.core.api.FlywayException: Migration failed !
    at com.cisco.esc.db.migration.service.FlywayBasedMigration.migrateDatabase(FlywayBasedMigration.java:163)

Conditions: The window in which the DNS failure can occur is short, and its length depends on how long it takes to restart the dependent service. The problem has only been observed in internal testing under conditions where ESC is purposely resource constrained to simulate worst-case operational conditions.
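
As an illustration only, the health check output shown in the Symptom could be screened for this state with a small script. The sketch below is not part of ESC; the function names parse_health and matches_bug_signature are hypothetical, and the parsing assumes the output keeps the "name (pgid N) is state" layout seen above. A match only suggests the condition; the ESC manager log entry for the failed database migration is still needed to confirm it.

```python
import re

# Assumed line layout, taken from the sample output in this bug report:
#   "escmanager (pgid 18316) is running but in error"
#   "etsi is stopped"   (no pgid when the service is stopped)
SERVICE_LINE = re.compile(r"^(?P<name>\S+)(?: \(pgid \d+\))? is (?P<state>.+)$")


def parse_health(output):
    """Return {service name: state} for every service line in the output."""
    states = {}
    for line in output.splitlines():
        match = SERVICE_LINE.match(line.strip())
        if match:
            states[match.group("name")] = match.group("state")
    return states


def matches_bug_signature(output):
    """True if the output shows the combination reported in this bug:
    the node is the elected leader, escmanager is running but in error,
    and the overall health check failed."""
    states = parse_health(output)
    return (
        states.get("elector") == "leader"
        and states.get("escmanager") == "running but in error"
        and "ESC HEALTH FAILED" in output
    )


# Sample copied from the health check output in this bug report.
SAMPLE = """\
============== ESC =================
quagga (pgid 16916) is running
consul_template (pgid 3028) is running
vimmanager (pgid 11084) is running
monitor (pgid 12347) is running
mona (pgid 12440) is running
drbd (pgid 3002) is master
consul (pgid 2646) is running
etsi is stopped
pgsql (pgid 16059) is running
elector (pgid 10910) is leader
filesystem (pgid 0) is running
confd (pgid 17082) is phase0
geo (pgid 2611) is primary
escmanager (pgid 18316) is running but in error
=======================================
ESC HEALTH FAILED
"""

if __name__ == "__main__":
    print(matches_bug_signature(SAMPLE))
```

In practice the text would come from running health.sh on the leader node and passing its captured output to matches_bug_signature.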