Cisco Bug: CSCuo89748 - GUI forces to close frequently due to "event sequencing is skewed"
Apr 21, 2017
- Cisco Unified Computing System
Known Affected Releases
2.1(2.154)B 2.2(1.135)A 2.2(1.141)A 2.2(1.160)B 2.2(2)AS5 2.2(2.110)A 2.2(2.117)A 2.2(2.146)A 2.2(2c)A
Due to a separate issue on this setup on the AG side, the discovery FSMs keep getting re-triggered. This results in a continuous flood of events generated by the DME: MO change events caused by the FSMs, events for the FSM state transitions themselves, and eventRecords generated for those state transitions. Because the buffer for event records is limited, deleting old event records to make space for new ones also sends MO change events to the GUI.

At some point, because the system is so overloaded and CPU utilization is high at both the DME and netstack level, the DME's MoChangeEvent writes to the buffer fail with error 105, which means "no buffer space available". The DME has no retry or reliable delivery mechanism for MO change events sent to the GUI, so these failed writes are never retried. The GUI crashes because some of the events the DME sent out were never received by it.

This is not a normal scenario. The system is so heavily overloaded that the socket buffers are out of space, and it is possible that the GUI is also not performant enough to consume the events as fast as they are generated at this scale. UCSM is not designed for a scale at which FSMs keep running forever, so we would normally never run into this scenario; if we do, other things are wrong in the system and must be diagnosed. This issue has been seen before.

A couple of things we could do are to see how we can make the GUI event consumer more performant, or to explore ways of making the DME more performant in an overloaded system. A good-to-have fix, in my opinion, is to provide a knob to turn off certain kinds of events dynamically when we run into this type of scenario, so that the GUI is still usable. This would improve the debuggability of the system and would be a nice-to-have feature.
I will try to run this idea by Mark and Sebastien and see if we can implement it in the near future.

Symptom: The GUI crashes with the message "Fatal Error: event sequencing is skewed, would you like to login again or exit".

Conditions: This would normally never occur. It is generally seen only on a scale system where things are not working properly. For example, on this system discovery was being re-triggered for all 100 or more blades, causing excessive events to be generated.
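The failure mode described in this note can be illustrated with a small toy model. This is only a sketch and not UCSM code: it assumes the GUI checks a per-event sequence number (the error text suggests this, but the note does not confirm it), and the names `BoundedSender`, `Consumer`, `muted_kinds`, and the event kinds are all hypothetical. A sender with a fixed-size buffer drops events when the buffer is full and never retries them, so the consumer eventually sees a gap. Muting one class of events, as in the proposed knob, keeps the volume within the buffer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int   # monotonically increasing number assigned by the sender
    kind: str  # hypothetical event class, e.g. "mo_change"


class SequenceError(Exception):
    """Raised when the consumer sees a gap in the sequence numbers."""


class BoundedSender:
    """Toy sender: fixed-size buffer, no retry, so a full buffer loses the event."""

    def __init__(self, capacity, muted_kinds=()):
        self.buffer = deque()
        self.capacity = capacity
        self.muted_kinds = set(muted_kinds)  # hypothetical "knob" from the note
        self.next_seq = 1
        self.dropped = 0

    def emit(self, kind):
        if kind in self.muted_kinds:
            return  # muted events are never numbered, so they leave no gap
        event = Event(self.next_seq, kind)
        self.next_seq += 1
        if len(self.buffer) >= self.capacity:
            self.dropped += 1  # stands in for the failed write; never retried
            return
        self.buffer.append(event)

    def drain(self):
        events = list(self.buffer)
        self.buffer.clear()
        return events


class Consumer:
    """Toy consumer: expects every sequence number exactly once, in order."""

    def __init__(self):
        self.expected = 1

    def consume(self, events):
        for event in events:
            if event.seq != self.expected:
                raise SequenceError(
                    f"expected seq {self.expected}, got {event.seq}"
                )
            self.expected += 1


def run(muted_kinds=()):
    """Send three bursts of events and report whether the consumer stays in sync."""
    sender = BoundedSender(capacity=4, muted_kinds=muted_kinds)
    consumer = Consumer()
    try:
        for _ in range(3):
            for _ in range(2):
                sender.emit("mo_change")
            for _ in range(4):
                sender.emit("fsm_transition")
            consumer.consume(sender.drain())
    except SequenceError as exc:
        return f"skewed: {exc}"
    return "ok"
```

In this toy run, `run()` returns `"skewed: expected seq 5, got 7"` because two events from the first burst were dropped, while `run(muted_kinds=("fsm_transition",))` returns `"ok"`. The model says nothing about which event classes would be safe to turn off in a real system.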