Multiple Failures – UPS Edition

One of the things that comes up quite a lot in the consulting work I do with customers is the concept of multiple failures in the context of Business Continuity Planning [BCP] and Disaster Recovery [DR] discussions. In simple terms, two or more failures on their own could have relatively benign effects but combined the impact can have a severity far beyond the sum of the individual severities – a sort of negative synergy.

Looking for, understanding and mitigating the risk of these kinds of failures is difficult in simple systems and nigh-on impossible in complex ones. History is littered with examples of these kinds of failures in various settings. Some examples:

These are obviously some rather serious examples, but smaller ones occur all the time within businesses (not just within IT) and their impact can be significant for those affected. I come across lots of situations where these kinds of multiple failures have occurred and I hope to share some of them in the future, along with some thoughts on how to analyse infrastructures for this kind of risk and mitigate it.

One instance I can share occurred this week. The mains supply failed at one of a customer’s sites, most likely because of the nasty weather here lately, and the APC UPS sent a message to all physical servers telling them to shut down. All servers with the APC software shut down automatically whilst the Hyper-V hosts were shut down manually – all good so far. Several hours later, the power was restored and everything powered on automatically as it was supposed to. DCs booted, VMs restored from their saved states and so on.

That would be the end of a normal story. What happened next was unfortunate.

As a result of everything being off for several hours, the inclement weather caused the temperature in the server room to drop to about 15°C, below the minimum temperature setpoint configured for the environmental monitoring probe on the UPS. This temperature out-of-range event caused the UPS to send a second message to the physical servers, which triggered another shutdown.

The temperature in the room gradually increased as the other equipment (PBX, SAN, switches, etc.) remained powered on – along with the Hyper-V nodes that didn’t have the APC software installed. The physical servers didn’t turn back on again because automatic power-on only triggers when power is restored to the server, and recovering from a temperature violation doesn’t restore power; it was never lost.
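The asymmetry between the two UPS events is easier to see in a toy model. The sketch below is purely illustrative: the class, the event names and the server names are hypothetical and have nothing to do with the APC software's real interface. It only encodes the behaviour described above, where either event can shut a server down but only the return of mains power brings it back.

```python
from dataclasses import dataclass


@dataclass
class PhysicalServer:
    """Toy model of a server that powers on only when mains power returns."""
    name: str
    has_ups_agent: bool  # True if the UPS shutdown software is installed
    powered_on: bool = True

    def handle_ups_event(self, event: str) -> None:
        # Hypothetical event names, chosen for this illustration only.
        if event in ("mains_failure", "temperature_out_of_range"):
            # Either event shuts the server down, but only if the agent is there.
            if self.has_ups_agent:
                self.powered_on = False
        elif event == "mains_restored":
            # The only event that triggers automatic power-on.
            self.powered_on = True


def replay(server: PhysicalServer, events: list[str]) -> bool:
    """Apply the events in order and return the final power state."""
    for event in events:
        server.handle_ups_event(event)
    return server.powered_on


incident = ["mains_failure", "mains_restored", "temperature_out_of_range"]
file_server = PhysicalServer("file-server", has_ups_agent=True)
hyperv_node = PhysicalServer("hyperv-node", has_ups_agent=False)

print(replay(file_server, incident))  # server with the agent ends up off
print(replay(hyperv_node, incident))  # node without the agent stays on
```

Nothing in the sequence ever sends a second "power restored" signal, which is why the servers with the agent stayed off until someone intervened.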

The environment contains one physical and one virtual domain controller. The physical domain controller was off at this point as a result of the second shutdown whilst the virtual one remained up. Unfortunately, the stop action for the DC’s VM was Save State, not Shutdown. This meant that the DC resumed with its clock set to the time at which its state was saved – several hours earlier. This time was then propagated out to the other physical servers and then to the other VMs via Hyper-V’s time sync service, causing a time skew that generated a whole load of Kerberos issues.
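To put a number on why a few hours of skew is fatal: by default, Kerberos in Active Directory tolerates a clock difference of five minutes between client and domain controller. The snippet below is a minimal sketch of that comparison. The timestamps are made up for illustration (the exact times from the incident aren't recorded here) and the function name is my own.

```python
from datetime import datetime, timedelta

# Default Kerberos clock-skew tolerance in Active Directory (5 minutes).
MAX_SKEW = timedelta(minutes=5)


def exceeds_kerberos_skew(client_time: datetime,
                          dc_time: datetime,
                          tolerance: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by more than the tolerance."""
    return abs(client_time - dc_time) > tolerance


# Hypothetical times: the VM's state was saved at 02:00 and resumed
# four hours later still believing it was 02:00.
real_time = datetime(2013, 1, 1, 6, 0)
stale_dc_time = datetime(2013, 1, 1, 2, 0)

print(exceeds_kerberos_skew(real_time, stale_dc_time))  # True: authentication would fail
```

Any machine still holding the correct time is hours outside the tolerance relative to the resumed DC, and vice versa.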

To resolve the issues, the physical servers were booted up remotely via their iLO interfaces, the domain controllers were resynchronised with an external time source, followed by the Hyper-V machines, and then most servers were rebooted to clear any errors and ensure correct service startup. Obviously this took some diagnosing and a lot of manual work which could have been avoided.

To recap, here are the issues:

  • Initial power failure
  • Low temperature alarm
  • APC software not installed on Hyper-V nodes
  • Incorrect stop action on DC VMs

On their own, not issues – together, several hours of early morning headaches!

2012 Core APC PowerChute Network Shutdown

Hit an issue with APC PowerChute Network Shutdown on Windows Server 2012 Core running Hyper-V:

PowerChute cannot communicate with the Network Management Card

PCNS is NOT receiving the data from the NMC.

The client was successfully installing and the IP was registering on the NMC but PCNS wouldn’t connect.


Make sure that the firewall rule “PCNS NMC Communication Port (UDP 3052)” is enabled for all profiles, not just Public, which is all that’s selected by default.

Here’s the full set of installation steps:

  1. Install the correct version of PCNS for Windows Server 2012 from the command line.
  2. Connect to the server via an MMC console using the Windows Firewall with Advanced Security snap-in.
  3. For each of the three PCNS rules, open properties, head to the Advanced tab and enable Private and Domain.
  4. Connect to https://server:6547/
  5. Run through the configuration wizard as normal
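Since Core has no local GUI, the profile change in step 3 can also be made from the command line with `netsh advfirewall firewall set rule`. The sketch below builds that command in Python. It assumes Python is available somewhere you can run it with admin rights against the host (it isn't on a default Core install), and it assumes the rule's name matches the one quoted above exactly. The function names are my own, so treat this as a starting point rather than a tested procedure; the same `netsh` line can simply be typed at an elevated prompt instead.

```python
import subprocess


def build_enable_all_profiles_command(rule_name: str) -> list[str]:
    """Build the netsh command that applies an existing rule to all profiles."""
    return [
        "netsh", "advfirewall", "firewall", "set", "rule",
        f"name={rule_name}",
        "new", "profile=any",
    ]


def enable_all_profiles(rule_name: str) -> None:
    """Run the command; needs an elevated session on the Windows host."""
    subprocess.run(build_enable_all_profiles_command(rule_name), check=True)


# Rule name as quoted in this post; check it against your own server first.
command = build_enable_all_profiles_command(
    "PCNS NMC Communication Port (UDP 3052)"
)
print(" ".join(command))
```

Repeat for the other two PCNS rules, using whatever names they carry on your server.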

SCOM Hyper-V 2008 2012 Event Log Issue

I’ve been doing some work with System Center Operations Manager (SCOM) 2012 SP1 for a customer lately and was hit by an issue that I couldn’t seem to find an answer on. The environment incorporates both Hyper-V 2008 R2 and Hyper-V 2012 servers and for the latter, the following alert was being fired:

The Windows Event Log Provider is still unable to open the Microsoft-Windows-Hyper-V-Image-Management-Service-Admin event log on computer ‘’.
The Provider has been unable to open the Microsoft-Windows-Hyper-V-Image-Management-Service-Admin event log for 720 seconds.

Most recent error details: The specified channel could not be found. Check channel configuration.

SCOM Hyper-V 2008 2012 Alerts

The same alert was also being generated for Microsoft-Windows-Hyper-V-Network-Admin.


It seems that the management packs (MPs) for Hyper-V 2008 are incorrectly targeting the 2012 servers and looking for event logs that don’t exist in 2012.


To solve this, stop the monitors from targeting the 2012 servers:

Head into the Authoring workspace and then under Management Pack Objects click Monitors. From there, click Scope on the toolbar. Select View all targets then click Clear All at the bottom. Enter “Hyper-V” into the box at the top and then click Select All then OK:

SCOM Hyper-V 2008 2012 Scope MPs

Expand Hyper-V Virtual Hard Disk and Hyper-V Virtual Network:

SCOM Hyper-V 2008 2012 Monitors

Open the properties of Mounted Drive Read-only, Port Connectivity and Port Disconnectivity and head to the Event Log tab. This shows the event log targeted by each monitor which, as you’ll see, are the ones we’re having a problem with:

SCOM Hyper-V 2008 2012 Mounted Drive Read-only Properties
SCOM Hyper-V 2008 2012 Port Connectivity Properties
SCOM Hyper-V 2008 2012 Port Disconnectivity Properties

For each of these three monitors, you need to disable them for the 2012 servers. For Mounted Drive Read-only, first right click on the monitor and choose:

SCOM Hyper-V 2008 2012 Override Menu

You can select the Windows Server 2012 Computer Group for this monitor:

SCOM Hyper-V 2008 2012 Group Selection

For the Port Connectivity and Port Disconnectivity monitors you’ll need to disable them “For a specific object of class: Hyper-V Virtual Network” and pick the objects that relate to each of your 2012 Hyper-V machines. For some reason, picking a group as above doesn’t work.

Reset the unhealthy monitors and clear the alerts and you should be good to go.

My thanks go to Kevin Greene for his post on half of the issue which led me down the right path to solving both alerts.