Cisco Bug: CSCvt14410 - AA leader can deadlock in "running but in error" state

Last Modified

May 31, 2020

Products (1)

  • Cisco Elastic Services Controller

Known Affected Releases

5.0 5.1

Description (partial)

Symptom:
On initial ESC AA deployment, or across an AA switchover, there is a window in which an internal DNS lookup can fail while ESC is restarting services. The DNS lookup failure stalls ESC's access to its database, resulting in a database migration failure. Once this condition has latched, the ESC cluster is deadlocked and cannot process requests until the condition is cleared.

In this state, the ESC health check returns the following:

[admin@user-2 esc]$ health.sh
============== ESC =================
quagga (pgid 16916) is running
consul_template (pgid 3028) is running
vimmanager (pgid 11084) is running
monitor (pgid 12347) is running
mona (pgid 12440) is running
drbd (pgid 3002) is master
consul (pgid 2646) is running
etsi is stopped
pgsql (pgid 16059) is running
elector (pgid 10910) is leader
filesystem (pgid 0) is running
confd (pgid 17082) is phase0
geo (pgid 2611) is primary
escmanager (pgid 18316) is running but in error
=======================================
ESC HEALTH FAILED
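
The health output above can be checked programmatically. The following is a minimal sketch, not part of ESC: it assumes the text printed by health.sh has already been captured as a string, and the function names (parse_health, escmanager_in_error) and the SAMPLE excerpt are illustrative only. A match indicates the same health state shown in this report; it does not by itself prove that this bug is the cause.

```python
import re

# Matches status lines such as
#   "escmanager (pgid 18316) is running but in error"
#   "etsi is stopped"
# The "(pgid N)" part is optional because some services print no pgid.
LINE_RE = re.compile(r"^(?P<name>\S+)(?: \(pgid (?P<pgid>\d+)\))? is (?P<state>.+)$")

# Excerpt of the health output quoted in this report.
SAMPLE = """\
============== ESC =================
drbd (pgid 3002) is master
etsi is stopped
elector (pgid 10910) is leader
escmanager (pgid 18316) is running but in error
=======================================
ESC HEALTH FAILED
"""


def parse_health(output):
    """Return a dict mapping service name to its reported state."""
    services = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            services[match.group("name")] = match.group("state")
    return services


def escmanager_in_error(output):
    """True when this node is the elected leader and escmanager reports
    'running but in error', as in the output shown in this report."""
    services = parse_health(output)
    return (
        services.get("elector") == "leader"
        and services.get("escmanager") == "running but in error"
    )


if __name__ == "__main__":
    print(parse_health(SAMPLE))
    print(escmanager_in_error(SAMPLE))
```

Banner lines, the shell prompt, and the final "ESC HEALTH FAILED" line do not match the pattern and are ignored.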

The following ESC manager log confirms the database migration failure:

2020-02-20 18:36:42.618 http-nio-0.0.0.0-8080-exec-8 ERROR [Slf4jLog.java:error:52] [tid=d244d56d-625b-4e8b-af5d-c12aedf63e6d] Migration of schema "esc_schema" to version 1.52 failed! Changes successfully rolled back.
2020-02-20 18:36:42.992 http-nio-0.0.0.0-8080-exec-8 ERROR [ESCManager.java:goOperational:375] [tid=d244d56d-625b-4e8b-af5d-c12aedf63e6d] Database migration was not successful.
com.cisco.esc.db.migration.exceptions.DatabaseMigrationException: org.flywaydb.core.api.FlywayException: Migration failed !
        at com.cisco.esc.db.migration.service.FlywayBasedMigration.migrateDatabase(FlywayBasedMigration.java:163)
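
To look for the same failure in a captured ESC manager log, the messages above can be matched with a short script. This is a sketch under assumptions: the log text is passed in as a string (the log file location is not given in this report), the patterns are derived only from the two ERROR lines quoted above, and the function name find_migration_failures is hypothetical.

```python
import re

# Derived from the log line quoted in this report:
#   2020-02-20 18:36:42.618 ... Migration of schema "esc_schema" to version 1.52 failed! ...
MIGRATION_FAILED_RE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*'
    r'Migration of schema "(?P<schema>[^"]+)" to version (?P<version>[\d.]+) failed!'
)

# Second marker line quoted in this report.
NOT_SUCCESSFUL = "Database migration was not successful."


def find_migration_failures(log_text):
    """Return (timestamp, schema, version) for each failed-migration line."""
    failures = []
    for line in log_text.splitlines():
        match = MIGRATION_FAILED_RE.match(line)
        if match:
            failures.append(
                (match.group("timestamp"), match.group("schema"), match.group("version"))
            )
    return failures


def has_migration_error(log_text):
    """True when the log contains either marker of the migration failure."""
    return bool(find_migration_failures(log_text)) or NOT_SUCCESSFUL in log_text
```

Called on the two ERROR lines quoted above, find_migration_failures returns one tuple containing the timestamp, the schema name esc_schema, and the version 1.52.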

Conditions:
The window in which the DNS failure can occur is short and depends on how long it takes to restart the dependent service.

The problem has only been observed in internal testing, under conditions where ESC is purposely resource constrained to simulate worst-case operational conditions.