Few situations are more frightening for a company than a data center outage. Server downtime carries a hefty financial cost, potentially reaching millions of dollars for every minute the network and data remain unavailable. Beyond the immediate financial hit, reduced productivity, lost opportunities, brand damage, and potential data loss can have trickle-down effects that follow a business for years to come.
Most facilities have a good idea of what causes data center outages, but they may not have the right systems and procedures in place to combat them. By properly assessing data center downtime risks and taking proactive preventative measures, colocation facilities can significantly reduce the risks of data center outages.
For all the emphasis on the technical challenges of maintaining server uptime, a substantial body of research points to human error as one of the top causes of data center downtime. In fact, some of the highest-profile data center outages experienced by major companies over the last few years can be traced back to either an accident or outright negligence.
Fortunately, there are more ways to reduce the risk of human error than ever before. Automated systems driven by artificial intelligence are already improving accountability and efficiency in data center operations, eliminating many of the repetitive tasks that are most likely to result in errors or oversights. AI alone is not enough, however: it must be paired with a detailed and disciplined SOP and MOP program and effective training of the supporting technicians and supervisors, since even AI systems can be affected by power or environmental events. Once the MOPs are completed, they support that training and solidify process and procedural discipline. It is critically important to keep knowledge and skills fresh through interactive involvement with the systems, mock outages, and "dry run" maintenance and restoral procedures, all of which serve to build confidence and knowledge. Implementing formalized, scalable processes to guide remote hands personnel is also vital because it removes uncertainty and confusion from data center operations.
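To make the idea of procedural discipline concrete, a MOP can be modeled in software as an ordered checklist in which every step must be explicitly signed off before the procedure closes. This is a minimal illustrative sketch, not any particular DCIM product; the `Mop` and `MopStep` names and the sample battery-replacement steps are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class MopStep:
    description: str
    completed: bool = False

@dataclass
class Mop:
    title: str
    steps: list = field(default_factory=list)

    def complete_step(self, index: int) -> None:
        # Technicians sign off steps one at a time, in order of execution.
        self.steps[index].completed = True

    def is_ready_to_close(self) -> bool:
        # The procedure only closes when every step has been signed off,
        # which is what removes uncertainty from the process.
        return all(step.completed for step in self.steps)

mop = Mop("UPS battery replacement", [
    MopStep("Verify redundant UPS is carrying the load"),
    MopStep("Isolate battery string and confirm zero voltage"),
    MopStep("Swap battery and torque terminals to spec"),
    MopStep("Restore string and confirm charge current"),
])
mop.complete_step(0)
print(mop.is_ready_to_close())  # False until every step is signed off
```

The same structure scales naturally to remote hands work: the checklist travels with the ticket, and nothing is considered done until the list says so.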
Few data center downtime events create larger headlines than those caused by a cyberattack. Whether it’s a distributed denial of service (DDoS) attack or a ransomware situation, cyberthreats can take a variety of forms and are always evolving to counter the latest security measures. With the growing usage of public cloud services and the proliferation of Internet of Things (IoT) devices, companies need to continually reassess their readiness to confront a potential attack from unexpected places.
The connectivity options available to a carrier-neutral data center make it uniquely suited to confronting the threat of a DDoS attack. Blended ISP connections like vX\defend can provide the redundancy needed to circumvent these attacks without having to compromise network performance. Advanced data analytics that monitor data center operations can also identify suspicious patterns in traffic or unusual network activity that might be associated with a cyberattack. By leveraging their agility, colocation facilities can spot and react to threats before they have a chance to cause data center downtime.
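The traffic analytics described above often boil down to comparing current activity against a rolling baseline and flagging anything far outside it. This is a simplified sketch of that idea, not the detection logic of vX\defend or any specific product; the `detect_spikes` function and the sample traffic figures are invented for illustration.

```python
from statistics import mean, stdev

def detect_spikes(samples, window=10, sigma=3.0):
    """Flag request-rate samples that exceed the rolling baseline
    by more than `sigma` standard deviations."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sd = mean(baseline), stdev(baseline)
        # A sudden jump far above recent behavior may indicate a DDoS surge.
        if sd > 0 and (samples[i] - mu) / sd > sigma:
            alerts.append(i)
    return alerts

# Steady traffic around 1,000 req/s, then a sudden surge
traffic = [1000, 1010, 990, 1005, 995, 1002, 998, 1007, 993, 1001, 20000]
print(detect_spikes(traffic))  # the surge at index 10 is flagged
```

Production systems layer far more sophistication on top (per-source analysis, protocol fingerprinting, automatic mitigation), but the baseline-and-deviation principle is the same.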
While there’s plenty of talk about virtualized infrastructure and networks, the hardware that makes those high-powered computing resources possible is still very much physical. And like any other piece of equipment, it’s going to wear out eventually. Whether it’s a server reaching the end of its five-year expected lifespan or a UPS backup battery dying before it should, equipment failure is one of the most common causes of data center outages.
Once again, advanced analytics and automated monitoring systems driven by machine learning can come to the rescue. With today's powerful data center infrastructure management (DCIM) tools, facilities can monitor the overall health of their own equipment as well as colocated assets. While it may not be possible to predict every failure, sophisticated algorithms can monitor equipment performance continually to anticipate when hardware is reaching the end of its lifecycle or is prone to break down. When these problems are identified, data center personnel can plan to switch out faulty or outdated equipment without having to take critical systems offline. With the right redundancies, backups, and emergency spares in place, even an unexpected failure can be managed without compromising network performance.
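At its simplest, anticipating end-of-life can mean fitting a trend line to periodic health readings and projecting when a metric will cross a replacement threshold. The sketch below illustrates this with hypothetical UPS battery capacity checks; the `estimate_days_to_threshold` function and the sample figures are assumptions for the example, not the method of any particular DCIM tool.

```python
def estimate_days_to_threshold(readings, threshold):
    """Fit a straight line to (day, capacity) readings and project when
    capacity will cross `threshold`. Returns None if capacity isn't declining."""
    n = len(readings)
    xs = [d for d, _ in readings]
    ys = [c for _, c in readings]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Ordinary least-squares slope of capacity vs. time
    slope = sum((x - x_mean) * (y - y_mean) for x, y in readings) / \
            sum((x - x_mean) ** 2 for x in xs)
    if slope >= 0:
        return None  # capacity flat or improving; no end-of-life projection
    intercept = y_mean - slope * x_mean
    return (threshold - intercept) / slope

# Monthly UPS battery capacity checks: (day, % of rated capacity)
history = [(0, 100), (30, 98), (60, 96), (90, 94)]
print(estimate_days_to_threshold(history, threshold=80))  # projects day 300
```

With a projection like this in hand, personnel can schedule the battery swap during a maintenance window long before the unit fails under load.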
Although not as common as hardware failures, software-related problems can easily cause data center downtime under the right (or wrong) circumstances. Outdated software may create gaps in security, for instance, or poorly tested operating system patches might corrupt mission-critical applications. Bugs always pose a significant threat, laying a foundation for future errors if they’re not addressed promptly. With many companies running their networks over virtualized servers, the potential consequences of software failure are even greater.
Routine monitoring and updating of critical systems is essential to keeping software applications functioning smoothly. Automated testing that puts software systems through a wide variety of simulations to evaluate readiness and integrity can expose problems and prepare data center personnel to deal with them. By taking nothing for granted when it comes to software compatibility and performance, data centers will be ready to respond in the event that something does fail at a critical moment.
Preparing for human and equipment-related data center downtime is one thing; preparing to deal with the impact of a natural disaster is quite another. Although most data centers have sufficient power backups and connectivity redundancies to deal with anything Mother Nature can throw at them, a good disaster plan must also account for the broader impact of the event. Sure, the facility may not lose power, but will the roads around it be flooded? How long will it take for the local power grid to come back online?
The best strategy for avoiding data center outages related to natural disasters is to locate the facility in a relatively safe area. Coastal areas and flood plains pose a significant risk, as do areas prone to tornadoes and wildfires. When a facility is exposed to these threats, it's important to have both a disaster preparedness plan and a disaster recovery plan. Preparing for a potential event involves testing all emergency systems, both for functionality and for monitoring and alarming. All personnel should be trained and certified by the local disaster recovery organization, and accommodations such as quarters, food, and other necessities should be kept on hand. All redundancy functions should be exercised and confirmed operational, and all safety measures must be in place before an event occurs. If the area is subject to unpredictable events such as tornadoes or earthquakes, maintain an aggressive schedule for testing preparedness. Any items or data that can be stored off site must be maintained to the same standard. This ensures that even if a data center outage does occur, customer data will remain accessible, diminishing the potential impact of the downtime.
Taking steps to avoid data center outages should be the highest priority for data center managers, whether they’re operating a private or colocation facility. Fortunately, today’s data centers have more tools than ever before that help them to fortify their infrastructure and keep their systems up and running to deliver superior levels of server uptime.
Ross is a Regional Vice President, Operations at vXchnge and is responsible for managing all 14 data center locations. With more than 30 years of experience, Ross has managed data center construction, engineering, repair and maintenance, leading him to the emerging business of colocation.