What’s driving the need for ‘black box’ technology in power converters?

Blog • March 15, 2022

When the ‘Uptime Institute’ polled 152 data center managers in 2020 [1] on what was the primary cause of their organization’s most recent significant outage, 37% said ‘power’. The Uptime Institute also reported that 16% of 146 managers polled in 2020 estimated the cost of the outage as more than $1M. Historically there have even been outages costing over $100M.

With increased reliance on ‘the cloud’ for everything from social media to banking, it has therefore never been more important to avoid unplanned outages. Of course, there are established techniques to achieve this, typically employing redundancy in functionality and particularly in power supplies, which can otherwise be a ‘single point’ of failure. However, knowing the degree of redundancy, the reliability of the remaining functioning equipment and the acceptable maximum time to repair or change-out faulty equipment is vital – having back-up which itself has a high failure rate is of limited use if redundancy can’t be reinstated quickly.

Redundancy is effective – but only if you monitor

It's a basic rule of redundancy that you must know when the redundant unit has kicked in after a failure. Otherwise, you would be unaware, for an indeterminate time, that a further single failure could take down the whole system. It is therefore common to monitor power rails before any ‘gating’ diodes to provide health information, prompting repair or replacement. Another basic rule is that aggregated monitoring must not itself become a hazard: as a common connection to the redundant elements, it could induce a failure in all the monitored equipment, for example by injecting a high voltage after an insulation failure.
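As a minimal sketch of the first rule, the snippet below classifies a redundant power system from per-rail health flags sampled before the gating diodes. The `RailStatus` type, the `redundancy_state` function and the state names are hypothetical and chosen only for illustration; how the flags are actually read depends on the hardware.

```python
from dataclasses import dataclass


@dataclass
class RailStatus:
    name: str
    power_good: bool  # hypothetical flag, sampled before the gating diode


def redundancy_state(rails, required):
    """Classify the system given how many healthy rails the load needs."""
    failed = [r.name for r in rails if not r.power_good]
    spare = (len(rails) - len(failed)) - required
    if spare < 0:
        state = "load at risk"      # fewer healthy rails than the load needs
    elif spare == 0:
        state = "redundancy lost"   # one more failure takes the system down
    else:
        state = "redundant"
    return state, failed


# Example: a 1+1 arrangement in which supply B has dropped out.
rails = [RailStatus("A", True), RailStatus("B", False)]
print(redundancy_state(rails, required=1))
```

The point of the middle state is that the load is still powered, so without monitoring nobody would notice that the safety margin has gone.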

Many power converters in critical positions will have a ‘DC OK’ or ‘Power Good’ signal which can be used to indicate that part of a redundant power system has fallen out of specification. However, modern converters often also have a degree of digital control and monitoring, which can provide not only an alarm on failure but also a ‘snapshot’ of the converter conditions as the failure is registered. This can include actual output current and voltage and, critically, the temperature of the part. The function now becomes analogous to a ‘black box’ event data recorder. The information can typically be interrogated over an I2C interface using PMBus® commands.
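To give a feel for what interrogating such a snapshot involves, here is a hedged sketch that decodes raw 16-bit words in the PMBus LINEAR11 format (5-bit signed exponent, 11-bit signed mantissa). It assumes the device reports current and temperature in LINEAR11, which is not confirmed here for any particular converter; output voltage is commonly reported in a different format governed by VOUT_MODE and is left out. The raw words in the example are made up, and the bus read itself is not shown.

```python
def decode_linear11(word: int) -> float:
    """Decode a PMBus LINEAR11 word: value = mantissa * 2**exponent."""
    exponent = (word >> 11) & 0x1F   # upper 5 bits, two's complement
    mantissa = word & 0x7FF          # lower 11 bits, two's complement
    if exponent > 0x0F:
        exponent -= 0x20
    if mantissa > 0x3FF:
        mantissa -= 0x800
    return mantissa * 2.0 ** exponent


# Hypothetical raw words, as they might come back from READ_IOUT (0x8C)
# and READ_TEMPERATURE_1 (0x8D) after a fault has been flagged.
raw_snapshot = {"READ_IOUT": 0xF064, "READ_TEMPERATURE_1": 0x0055}

snapshot = {name: decode_linear11(word) for name, word in raw_snapshot.items()}
print(snapshot)
```

In a real system the words would be fetched with an I2C/SMBus word read from the converter's address; that part depends on the host platform and is deliberately omitted.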

A further enhancement is to write the data to non-volatile memory (NVM) in the converter, so that even if its powertrain is catastrophically damaged, diagnostic data may still be recoverable. The principle also applies to non-redundant arrangements where sudden loss of function caused by a failed converter or load may be tolerable in the short term, but it would still be useful to know the conditions under which the failure occurred. If it is the load that has failed, and this is signaled in some way, the converter’s current and perhaps temperature monitoring might also give a clue as to what has happened.

To achieve this, ‘time stamping’ of the recorded data from a power converter can be used to associate a failure log with external events. For example, if a load goes short-circuit, the converter will shut down and could log the event and time for later correlation with some other external event. An example of a power converter with this one-time-programmable (OTP) version of ‘black-box’ functionality is Flex Power Modules’ BMR350 series. This is a 1200 W peak, baseplate-cooled, quarter-brick DC/DC converter with a PMBus® interface allowing access to the ‘event data recorder’ information collected under failure conditions.
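The correlation step described above can be sketched as follows. The record layout (dictionaries with `time`, `event` and `what` keys), the `correlate` function and the five-second window are all assumptions made for illustration; a real converter may store a relative counter rather than wall-clock time, in which case it would first need converting to a common time base.

```python
from datetime import datetime, timedelta


def correlate(fault_log, external_events, window=timedelta(seconds=5)):
    """Pair each logged fault with external events that occurred close in time."""
    matches = []
    for fault in fault_log:
        nearby = [
            ev["what"]
            for ev in external_events
            if abs(ev["time"] - fault["time"]) <= window
        ]
        matches.append((fault["event"], nearby))
    return matches


# Hypothetical data: one converter shutdown and two system-level events.
fault_log = [
    {"time": datetime(2022, 3, 15, 10, 0, 2), "event": "output short-circuit shutdown"},
]
external_events = [
    {"time": datetime(2022, 3, 15, 10, 0, 0), "what": "board hot-swapped in slot 4"},
    {"time": datetime(2022, 3, 15, 11, 30, 0), "what": "firmware update"},
]

print(correlate(fault_log, external_events))
```

Here only the hot-swap falls inside the window, which would make it the first thing to investigate.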

Picture: Flex Power Modules’ BMR350 with inbuilt OTP fault event data recorder


Other power converters have continuously operating event recorders

An extension to this is to continuously monitor and time-stamp ‘life events’ during normal operation. These could be cumulative operating hours, unusual current demands, self-resets after an overvoltage transient or a range of other parameters. The data could be used to identify trends such as a gradual temperature increase due to blocked fan filters, or a loss of power conversion efficiency over time. With analysis, wear-out failures could be predicted and Condition Based Maintenance (CBM) implemented. This is a regime where parts are replaced when they need to be and before failure, rather than at arbitrary fixed intervals or only when failure occurs. This saves labor and hardware costs while maximizing up-time.
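One simple way to turn such a log into a maintenance prompt is to fit a straight line to the logged temperatures and estimate when it would reach a chosen limit. The sketch below does this with an ordinary least-squares fit; the function name, the sample data and the 70 °C limit are hypothetical, and a real CBM scheme would likely use a more robust model.

```python
def hours_until_limit(hours, temps_c, limit_c):
    """Estimate operating hours left before a linear temperature trend hits limit_c.

    Returns None if there is no upward trend to extrapolate.
    """
    n = len(hours)
    mean_h = sum(hours) / n
    mean_t = sum(temps_c) / n
    sxx = sum((h - mean_h) ** 2 for h in hours)
    sxy = sum((h - mean_h) * (t - mean_t) for h, t in zip(hours, temps_c))
    if sxx == 0:
        return None                      # all samples taken at the same time
    slope = sxy / sxx                    # degrees C per operating hour
    if slope <= 0:
        return None                      # stable or cooling: nothing to predict
    intercept = mean_t - slope * mean_h
    crossing = (limit_c - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical log: temperature creeping up 1 degree C every 100 hours.
logged_hours = [0, 100, 200, 300]
logged_temps = [60.0, 61.0, 62.0, 63.0]
print(hours_until_limit(logged_hours, logged_temps, limit_c=70.0))
```

With this made-up data the trend reaches the limit roughly 700 operating hours after the last sample, which is the kind of figure that could be used to schedule a filter change before a fault occurs.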

An example of a power converter with this ability to continuously over-write the event data recorder is the Flex Power Modules BMR491. This is a 2450 W peak, baseplate-cooled, quarter-brick DC/DC converter with a PMBus® interface again allowing access to the ‘event data recorder’ information.


Picture: Flex Power Modules’ BMR491 with inbuilt rewritable life event data recorder


The ‘Uptime Institute’ makes the point that hardware reliability is improving, but rate of roll-out and reliance on new data centers is such that monitoring, lifetime prediction and fault diagnosis are still an increasing concern. ‘Black-box’ functionality in power converters is set to help.

Reference

[1] https://uptimeinstitute.com/an...