One of the things that comes up quite a lot in the consulting work I do with customers is the concept of multiple failures in the context of Business Continuity Planning [BCP] and Disaster Recovery [DR] discussions. In simple terms, two or more failures on their own could have relatively benign effects but combined the impact can have a severity far beyond the sum of the individual severities – a sort of negative synergy.
Looking for, understanding and mitigating the risk of these kind of failures is difficult in simple systems and nigh-on impossible in complex ones. History is littered with examples of these kind of failures in various settings. Some examples:
These are obviously some rather serious examples but smaller examples occur all the time within businesses (not just within IT) and their impact can be significant for the affected. I come across lots of situations where these kind of multiple failures have occured and I hope to share some of these in the future along with some thoughts on how to analyse infrastructures for and mitigate this kind of risk.
One instance I can share occured this week. A failure of the mains supply occured at one of a customer’s sites, likely caused by the nasty weather here lately, which caused the APC UPS to send a message to all physical servers letting them know to shutdown. All servers with the APC software shut down automatically whilst the Hyper-V hosts were shut down manually – all good so far. Several hours later, the power was restored and everything powered on automatically as it was supposed to. DCs booted, VMs restored from their saved states and so on.
That would be the end of a normal story. What happened next was unfortunate.
As a result of everything being off for several hours, the inclement weather caused the temperature in the server room to drop to about 15C, below the minimum temperature setpoint configured for the environmental monitoring probe on the UPS. This temperature out-of-range event caused a subsequent message to the phsyical servers which triggered another shutdown.
The temperature in the room gradually increased as the other equipment (PBX, SAN, switches, etc) remained powered on – along with the Hyper-V nodes that didn’t have the APC software installed. The physical servers didn’t turn back on again because the automatic power on only triggers when power is restored to the server which a temperature violation doesn’t cause.
The environment contains one physical and one virtual domain controller. The physical domain controller was off at this point as a result of the second shutdown whilst the virtual remained up. Unfortunately, the stop action for the DC’s VM was Save State, not Shutdown. This meant that the DC resumed at the time at which its state was saved – several hours before. This time was then propogated out to the other physical servers and then to the other VMs via Hyper-V’s time sync service causing a time skew that generated a whole load of Kerberos issues.
To resolve the issues, the physical servers were booted up remotely via their iLO interfaces, the domain controllers were resynched with an external time source followed by the Hyper-V machines and then most servers were rebooted to clear any errors and ensure correct service startup. Obviously this took some diagnosing and a lot of manual work which could have been avoided.
To recap, here are the issues:
- Initial power failure
- Low temperature alarm
- APC software not installed on Hyper-V nodes
- Incorrect stop action on DC VMs
On their own, not issues – together, several hours of early morning headaches!