Although organizations often take every precaution imaginable, the threat of server downtime is difficult to eliminate entirely. For some industries, downtime is a minor inconvenience, but for others it can cause serious disruptions with lasting consequences. With even a few minutes of downtime likely to cost dearly in lost productivity and opportunity, companies are turning to data centers to keep their mission-critical network systems up and running no matter the circumstances.
Identifying the primary causes of server downtime is the first step in establishing policies and procedures to deliver reliable services. While there are myriad ways for data center servers to go down, most failures can be broken down into one of five categories.
Various studies over the last several years have ranked human error as either the most frequent or second most frequent cause of server downtime. Whether through accident or negligence, many of the highest-profile service outages of the last few years can be traced directly back to human error. While it’s impossible to guard against human error completely, data centers and other organizations can take significant steps toward reducing the likelihood of error and increasing accountability to deal with problems when they do occur.
Some of these measures include accurately documenting routine tasks, imposing more stringent policies on device usage, and providing ongoing education to reinforce processes and policies. As automation through artificial intelligence and predictive analytics becomes more common in data centers, the threat of human error may diminish as a consequence.
One of the more high profile downtime causes, cyberattacks usually make big headlines when they occur. Network vulnerabilities create opportunities for hackers to infiltrate systems, allowing them to steal data, shut down applications, and lock out users with ransomware. Even if a system is relatively secure, it may still be vulnerable to a distributed denial of service (DDoS) attack that can paralyze and crash servers that aren’t prepared to withstand the traffic spike. For many organizations, even the threat of such an attack is enough to cause them to cave in to hackers extorting “protection fees.”
With the proliferation of Internet of Things (IoT) devices, the overall attack surface of many companies’ networks is increasing. While there are many ways these devices can be used to enhance security, they do pose a risk if they’re not adequately secured. Testing and simulations using predictive analytics can help to identify vulnerabilities in network infrastructure, and sophisticated algorithms can monitor and log suspicious activity to provide higher levels of security against cyberattacks.
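One simple way monitoring of this kind works in practice is to flag traffic rates that deviate sharply from a baseline, since a sudden surge can indicate a DDoS-style attack. The sketch below is a minimal, hypothetical illustration (the function name and thresholds are my own, not from any specific product); it uses a median-based score because a robust baseline isn’t skewed by the very spike it’s trying to detect:

```python
import statistics

def flag_traffic_spikes(requests_per_minute, threshold=5.0):
    """Return the indices of minutes whose request rate deviates
    sharply from the baseline, using a median-based score so the
    baseline itself isn't distorted by an extreme surge."""
    med = statistics.median(requests_per_minute)
    # Median absolute deviation: a robust stand-in for standard deviation.
    mad = statistics.median(abs(r - med) for r in requests_per_minute)
    if mad == 0:
        return []  # perfectly flat traffic; nothing to flag
    return [
        i for i, rate in enumerate(requests_per_minute)
        if (rate - med) / mad > threshold
    ]

# Mostly steady traffic with one abnormal surge at index 5.
rates = [120, 118, 125, 122, 119, 4800, 121, 117]
print(flag_traffic_spikes(rates))  # → [5]
```

A real system would feed flagged intervals into alerting and logging pipelines rather than simply printing them, but the underlying idea, comparing live traffic against a learned baseline, is the same.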
Sometimes equipment just breaks. It’s an unpleasant truth, but data center physical infrastructure is always vulnerable to failure of some kind, making it one of the leading causes of downtime. Whether it’s a server going down, an uninterruptible power supply (UPS) battery failure, or a data center cooling system malfunction, hardware presents a wide range of potential problems for IT departments and data center personnel. Part of the challenge here is that many failures can’t be predicted. While predictive analytics can identify some problems and estimate when some equipment is due to fail, unexpected events can often trigger widespread outages.
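The "estimate when equipment is due to fail" part of predictive analytics can be as simple as extrapolating a degradation trend. The sketch below is a hypothetical illustration (the function and threshold values are assumptions, not a real vendor API): it fits a straight line to periodic UPS battery capacity readings and projects the charge cycle at which capacity will cross a failure threshold, so the battery can be replaced before it fails in service:

```python
def estimate_cycles_until_failure(capacities, failure_threshold=0.6):
    """Fit a linear trend to battery capacity readings (fraction of
    rated capacity, one reading per charge cycle) and extrapolate the
    cycle index at which capacity crosses the failure threshold.
    Returns None if no downward trend is detected."""
    n = len(capacities)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(capacities) / n
    # Ordinary least-squares slope and intercept.
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, capacities))
    slope /= sum((x - x_mean) ** 2 for x in xs)
    if slope >= 0:
        return None  # capacity isn't degrading; nothing to extrapolate
    intercept = y_mean - slope * x_mean
    return (failure_threshold - intercept) / slope

# Capacity dropping 2% per cycle from 100%: crosses 60% at cycle 20.
readings = [1.00, 0.98, 0.96, 0.94, 0.92]
print(estimate_cycles_until_failure(readings))  # → 20.0
```

Production systems use far richer models and more signals (temperature, internal resistance, load history), but even this crude trend line captures the core idea: schedule replacement from data rather than waiting for the failure.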
Outdated hardware is particularly vulnerable to failure, leading many companies to blame service outages on “old servers.” Many organizations have chosen to forgo the cost of updating these systems and have instead turned to data center as a service (DCaaS) offerings from data centers with more up-to-date equipment and many built-in redundancies. Although data centers have not proven completely immune to equipment failure, they typically have enough redundancies in place to keep downtime to a minimum.
Although software failures are less common than hardware failures, network systems are only as effective as the software they’re running. When operating systems are updated with patches that haven’t gone through proper testing, entire applications can become corrupted and bring networks screeching to a halt. However, outdated software is often just as problematic because it lacks current security measures or drivers to keep high-traffic networks up and running. Bugs in operating systems also present vulnerabilities that are easily exploited by malware. In any case, software remains one of the more pervasive causes of downtime.
The move to server virtualization has been beneficial for solving server problems, but it also means there are more applications running in a network, many of which have the potential to create problems for other applications. To combat the risk of software failure, companies like Netflix operate on the assumption that mission-critical software will fail, and they run simulations and experiments to ensure they’re ready to cope when it does.
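The core of that “assume it will fail” approach can be sketched in a few lines. The example below is a minimal, hypothetical illustration of failure injection (the decorator and service names are invented for this sketch, not Netflix’s actual tooling): a dependency is deliberately made flaky, and the caller must prove it can degrade gracefully instead of crashing:

```python
import random

def flaky(failure_rate=0.3):
    """Decorator that randomly injects failures into a service call,
    in the spirit of chaos testing: callers must handle the outage."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("injected failure")
            return fn(*args, **kwargs)
        return inner
    return wrap

@flaky(failure_rate=0.5)
def fetch_recommendations(user_id):
    # Stand-in for a call to a personalization service.
    return ["title-%d" % user_id]

def fetch_with_fallback(user_id):
    """Serve a generic list when the personalized service is down:
    the page stays up, just slightly degraded."""
    try:
        return fetch_recommendations(user_id)
    except ConnectionError:
        return ["generic-top-10"]
```

Running `fetch_with_fallback` repeatedly exercises both the happy path and the degraded path, which is exactly the property a failure-injection experiment is meant to verify before a real outage does.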
While not quite as catastrophic a threat as they may sound, natural disasters still pose significant dangers to networks. Modern data centers have extensive safeguards in place to protect their operations from the effects of hurricanes, flooding, and earthquakes. Backup and redundant systems provide a data center facility with reliable power and cooling. In recent weather events, such as Hurricane Harvey in 2017 and Hurricane Sandy in 2012, the data center facilities held up quite well, but many of them faced difficulties due to the condition of the local infrastructure around them. Disrupted power services and inaccessible roads in the wake of the storms posed more of a threat than the storms themselves.
Smaller weather events like lightning strikes and excessive heat have actually proven to be more serious causes of downtime than scarier-sounding events like hurricanes. As the demand for data centers increases and more facilities are built in less hospitable locations, strategies for dealing with both high-profile natural disasters and more common events like lightning, tornadoes, and wildfires will become increasingly vital for maintaining service uptime.
Organizations need to have a plan for coping with server downtime. Even if they take every precaution to protect their own systems, they must also plan how to respond if their cloud provider or another service provider experiences a significant outage. Given the potentially damaging costs of extended downtime, companies must think long and hard about how to keep their services running as much as possible and about the processes for bringing critical systems back online should they falter.
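One concrete building block of such a recovery process is retrying a failed dependency with exponential backoff, so a service rides out a brief provider outage instead of failing immediately. The sketch below is a generic, hypothetical illustration (the function name and delay values are assumptions, not any particular library’s API):

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on ConnectionError with exponentially
    increasing delays. The `sleep` function is injectable so tests
    can observe the delays without actually waiting."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the outage
            sleep(base_delay * (2 ** attempt))

# Simulate a provider that fails twice, then recovers.
calls = {"n": 0}
def provider():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("provider outage")
    return "ok"

delays = []
print(call_with_backoff(provider, sleep=delays.append))  # → ok
print(delays)  # → [0.5, 1.0]
```

Real recovery plans layer failover to redundant systems and human escalation on top of this, but automated, bounded retries are usually the first line of defense when a provider falters.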