Friday, July 10, 2009
Why are there so many data center outages?
So far this month we have seen major and prolonged outages in Toronto, Seattle, and Dallas, and, unfortunately, this is a trend that is likely to continue.
Why did it happen now?
Each situation was different, but this is generally the time of year when failures happen. Because outside temperatures are high, electrical loads are at their peak. The critical load itself does not vary seasonally, but the power used to cool the data center (and the rest of the building, in a multiple-use facility) peaks in summer.
Why this year?
A combination of factors: first, the age of the equipment; second, a potential lack of maintenance (although I am not pointing any specific fingers); third, and probably most significant, increased critical load. We have been seeing critical loads in telco hotels and multi-user data centers going up 5 to 10% each year, and sometimes more. Facilities that were once lightly loaded are now edging close to or past the comfort level of 80-85% of rated load. The unfortunate corollary is that, without significant intervention, we can expect more failures next summer.
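To put rough numbers on that growth, here is a minimal sketch of the compound-growth arithmetic. The 60% starting load is a made-up illustration, not a figure from any of the facilities mentioned, and the function name is my own. It estimates how long a facility has before it reaches the 80% comfort level at 5% and 10% annual growth.

```python
import math

def years_until_threshold(current_pct, threshold_pct, annual_growth):
    """Years of compound growth before load reaches threshold_pct of rated load.

    current_pct and threshold_pct are percentages of rated load;
    annual_growth is a fraction (0.05 means 5% per year).
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    if current_pct >= threshold_pct:
        return 0.0  # already at or past the threshold
    # Solve current * (1 + g)^n = threshold for n.
    return math.log(threshold_pct / current_pct) / math.log(1 + annual_growth)

# Assumed starting point: a facility at 60% of rated load (hypothetical).
for growth in (0.05, 0.10):
    years = years_until_threshold(60.0, 80.0, growth)
    print(f"{growth:.0%} growth: about {years:.1f} years to reach 80%")
# prints:
# 5% growth: about 5.9 years to reach 80%
# 10% growth: about 3.0 years to reach 80%
```

Under those assumptions, a facility that looked comfortable three to six years ago is at the edge of its comfort level today.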
What do we do?
To some extent, it means starting from the beginning again. I am not pointing fingers at any of these recent failures, but in reviewing facilities and facility failures we often find that the initial commissioning was either inadequate or nonexistent. To keep a facility online, operators should:
1. Review commissioning and design records to make sure that they are adequately developed and that maintenance, monitoring, and operating procedures are adequately defined.
2. Make sure that all critical system maintenance, including such items as torque checks on bus duct, cycling of breakers, and load testing, is current.
3. Know the loads on critical parts of the system and where they stand relative to safe working capacity. Make sure that maintenance is appropriate to those loads, and manage loads down or add capacity if required.
4. Make sure that disaster plans and recovery procedures address a wide range of failure scenarios, including destruction of proximate equipment.
These and related procedures will not eliminate all failures, but they will eliminate many.
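As an illustration of item 3, the check of load against safe working capacity could be sketched as follows. The component names and kW readings are hypothetical, and the 80% and 85% cut-offs are simply the comfort level cited earlier, not limits taken from any equipment standard.

```python
def classify_load(measured_kw, rated_kw, comfort=0.80, upper=0.85):
    """Return (utilization, status) for one component.

    comfort and upper are assumed thresholds taken from the
    80-85% comfort level; adjust them to the facility's own limits.
    """
    if rated_kw <= 0:
        raise ValueError("rated_kw must be positive")
    utilization = measured_kw / rated_kw
    if utilization >= upper:
        status = "past comfort level"
    elif utilization >= comfort:
        status = "within comfort band"
    else:
        status = "ok"
    return utilization, status

# Hypothetical readings: (component, measured kW, rated kW).
readings = [
    ("UPS A", 410.0, 500.0),
    ("UPS B", 300.0, 500.0),
    ("Chiller feed", 270.0, 300.0),
]

for name, measured, rated in readings:
    utilization, status = classify_load(measured, rated)
    print(f"{name}: {utilization:.0%} of rated load, {status}")
```

Anything reported in or past the comfort band is a candidate for managing load down or adding capacity.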