Companies often invest millions of dollars into their IT infrastructure, building systems to their exact specifications to deliver the performance their business needs. They can go to great lengths safeguarding the system’s security, building in multiple paths of redundancy, and backing up both servers and offsite data to ensure that availability is maintained at all times. They use predictive analytics to anticipate user traffic and its impact on power and cooling demands to better manage their system’s health and optimize its performance.
And then a technician accidentally switches the data floor’s temperature controls from °F to °C, and suddenly the room turns into a sauna. The cooling system can’t handle the increased temperature and the servers overheat, taking the whole network down with them.
As technology has become a huge part of most companies’ business operations, maintaining an effective IT infrastructure is crucial to success. Since that infrastructure is implemented and managed by fallible human beings, however, it’s worth considering how different types of human error can impact an organization’s data operations.
Sadly, human error is the largest cause of server downtime, accounting for as much as 75% of it according to some reports. This is both unsurprising and puzzling. On the one hand, many industries struggle to reduce human error, so a field as complex as IT is bound to be vulnerable to it. On the other hand, little seems to have been done to address the problem. Most data centers even factor human error into their SLA uptime calculations. If it’s taken for granted that mistakes occur frequently in the IT space, then more rigorous training and procedures should be put in place to reduce them.
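To put the SLA point in concrete terms, an uptime percentage can be converted into the amount of downtime it permits. The sketch below is only illustrative arithmetic: the function name and the uptime targets are assumptions chosen for the example, not figures from any particular provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime a given uptime percentage permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


# Illustrative targets only; real SLAs define their own terms and exclusions.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours of downtime per year")
```

Seen this way, a single hour-long outage caused by a mistake uses up more than a full year's allowance at a 99.99% target, which is why error rates end up inside these calculations.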
Many organizations make the critical mistake of seeing an error as a one-time event. There is value, however, in treating each error as a data point. By tracking what kinds of mistakes were made and when, companies can begin to develop a better picture of the vulnerabilities and gaps in their operations. This approach has the added benefit of encouraging people to report errors rather than hide them. The purpose isn’t to punish people for their mistakes, but rather to learn from them in order to make improvements. Near misses should also be included in this reporting; just because a mistake didn’t cause the server to crash this time doesn’t mean the same mistake won’t take it down next month.
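One way such tracking could look in practice is a simple log of reports that is tallied by category, with near misses counted alongside actual outages. This is a minimal sketch, not a description of any real tool: the `ErrorReport` fields, the category names, and the sample entries are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ErrorReport:
    """One reported mistake or near miss (all field names are hypothetical)."""
    day: date
    category: str        # e.g. "wrong setting", "skipped step"
    caused_outage: bool  # False marks a near miss


def summarize(reports):
    """Return (category, total reports, near misses), most frequent first."""
    totals = Counter(r.category for r in reports)
    near_misses = Counter(r.category for r in reports if not r.caused_outage)
    return [(cat, n, near_misses[cat]) for cat, n in totals.most_common()]


# Invented sample data, used only to show the shape of the summary.
reports = [
    ErrorReport(date(2024, 1, 8), "wrong setting", False),
    ErrorReport(date(2024, 2, 2), "skipped step", True),
    ErrorReport(date(2024, 2, 19), "wrong setting", True),
    ErrorReport(date(2024, 3, 5), "wrong setting", False),
]

for category, total, near in summarize(reports):
    print(f"{category}: {total} reports, {near} near misses")
```

A category that keeps reappearing, even when most of its entries are near misses, points to a process worth redesigning rather than to an individual worth blaming.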
In many cases, training and procedures are actually part of the problem. If someone makes a mistake, they’re typically required to review proper procedures before returning to work. But if the same problems keep occurring, the real problem might not actually be human error, but ineffective processes. Developing a facility’s standard operating procedures (SOPs) may sound like an easy task, but they need to be written in such a way that people can comprehend them, understand why they’re important, and follow them accordingly.
This isn’t something that just happens overnight. Take, for example, the amount of training the US Navy puts sailors through in order to crew a nuclear submarine. Processes are laid out clearly and simply before they’re drilled in through seemingly endless repetition. All too often, IT departments don’t perform regular exercises to practice how to respond in a critical situation. In some instances, there may not be a policy at all, with IT taking more of a reactive role to simply fix problems after they develop rather than trying to head them off at the first sign of trouble.
If the same types of human error keep occurring in an IT department, chances are that the problem has less to do with personnel and more to do with poor training or poorly designed procedures and tools. Many facilities have even taken to designing their systems to accommodate human error, building in failsafes and automated tasks to keep humans out of the equation altogether. While this can be effective, it’s also a bit shortsighted and wasteful. There’s little reason to believe that a better-designed SOP and improved training couldn’t go a long way toward reducing human error in a data center environment.
Most of the time, human error is simply the result of someone making a mistake or being in the wrong place at the wrong time. In this respect, human error is distinct from “malicious intentions,” which refers to someone taking willful action to inflict damage on a company or its customers. Of course, the distinction usually doesn’t matter to the people who must suffer the consequences. With so many high-profile system outages and cyberattacks, customers and businesses want to know that their valuable data is being kept secure.
Humans form a uniquely weak link in the IT infrastructure chain. Factors like fatigue, sickness, or simple distraction can cause people to lose focus and make mistakes at inopportune moments. Carelessness with passwords, access protocols, and devices can also introduce a variety of security threats into an IT environment. In many cases, people may not even realize their actions pose a security risk.
While human error is typically attributed to IT personnel, human-related security risks can come from anywhere within an organization. From losing a company computer that has access to valuable cloud-based data to opening a malicious phishing email, there are many ways that people can compromise data security. Every employee within an organization needs to be made fully aware of proper procedures and policies regarding devices, network access, and security best practices to avoid data breaches caused by human error.
As technology continues to be an important element of every company’s day-to-day operations, the potential costs of human error have increased commensurately. Organizations need to put more thought and resources into training IT personnel on a regular basis to put them in the best position to carry out operations without disruption. Redesigning SOPs to make them easier to follow and better suited to the realities of the IT environment can also help to reduce mistakes. Better security education throughout all ranks of the organization is also critical to avoiding breaches due to carelessness.
As the Marketing Manager for vXchnge, Kaylie handles the coordination and logistics of tradeshows and events. She is responsible for social media marketing and brand promotion through various outlets. She enjoys developing new ways and events to capture the attention of the vXchnge audience.