What Makes a Data Center Fault Tolerant? | TRG Datacenters (2022)

Data availability and uptime are now primary concerns for businesses in all industries. As more companies rely on digital systems for the vast majority of their processes, the focus on data availability keeps growing. As a result, we’re seeing far more conversations about how to achieve the best possible SLA uptime, and which processes companies should put in place to protect themselves from the damage that unexpected downtime can cause.

Fault tolerance is one of the key talking points amongst IT professionals today, but relatively few outside of the IT sector have a good understanding of what this term really means, particularly in the context of data centers. With fault tolerance becoming increasingly important as time goes on, it’s worth taking the time to understand what is meant by the term, and how a good level of knowledge around fault tolerance could result in more reliable systems for your entire business.

What is a fault tolerant data center?

The phrase fault tolerant is often used to describe data centers. Seen as a standard of quality and a sure sign of reliability, a fault tolerant data center is one that has no single point of failure. Facilities are purpose-built to avoid such a point of failure and fully equipped with a range of technology that significantly improves the fault tolerance of the center as a whole.

A high level of fault tolerance can make a real impact on the reliability of a data center, but it’s not the only thing that companies need to consider. Data center downtime can also be avoided by practicing fault avoidance. Continuous monitoring systems, good training practices, and meticulous maintenance all come together to help prevent faults from occurring, thereby keeping downtime to a minimum.

Data centers like ours are built with fault tolerance in mind. TRG’s facility has been built to avoid any single point of failure.

Understanding the tier system

A tier system has long been used to help explain the capabilities of different data centers. The system is composed of four tiers, each giving a clear indication of the performance of different sites. The four levels are Tier I (Basic Capacity), Tier II (Redundant Capacity), Tier III (Concurrently Maintainable), and Tier IV, which is the tier that denotes fault tolerance. Let’s take a closer look at what these tiers mean.

Tier I: Basic Capacity

Tier I data centers are amongst the most affordable options. While they do not provide the high levels of fault tolerance that Tier IV centers will, they are usually sufficient for the needs of companies looking for a basic level of support for existing systems. These data centers tend to include features like cooling equipment, engine generators, and an uninterruptible power supply.

Tier II: Redundant Capacity

The basic level of service that Tier I data centers provide is improved by those in the Tier II bracket. These data centers also include redundant power and cooling components, which help companies to complete maintenance tasks without disrupting systems. Such components are also useful in limiting the chance of downtime caused by equipment failures.

Tier III: Concurrently Maintainable

Tier III data centers provide a clear benefit to companies that are always looking to expand and improve the service they offer. They are built in such a way that shutdowns are never required during maintenance tasks, and equipment can be replaced with no need for any downtime at all. This is achieved through the addition of a redundant delivery path, which is used for power and cooling, alongside all the redundant critical components of a Tier II data center.

Tier IV: Fault Tolerance

The highest level of reliability and security is provided by Tier IV data centers. Widely known as fault tolerant data centers, these facilities must have two parallel power and cooling systems. This means that, should any equipment failure or interruption occur, the center’s generators, cooling systems, double electrical rooms, and purpose-designed infrastructure will minimize the risk of downtime.
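
To see why two parallel systems matter, here is a rough sketch of the arithmetic. It assumes the two paths fail independently of each other, which real facilities only approximate, and the 99.9% figure is a hypothetical input rather than a measured one.

```python
def parallel_availability(single_path: float, paths: int = 2) -> float:
    """Availability of `paths` redundant paths when any one path is enough.

    Assumes the paths fail independently, which is a simplification.
    """
    if not 0.0 <= single_path <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if paths < 1:
        raise ValueError("at least one path is required")
    # The service is down only when every path is down at the same time.
    return 1.0 - (1.0 - single_path) ** paths


# Hypothetical figure: one power-and-cooling path that is up 99.9% of the time.
one_path = 0.999
two_paths = parallel_availability(one_path, paths=2)
print(f"single path:        {one_path:.4%}")
print(f"two parallel paths: {two_paths:.4%}")
```

Under these assumptions, duplicating a 99.9% path takes the combined figure to roughly 99.9999%. Shared dependencies, such as a common fuel supply, would lower that number in practice.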

The Importance of Fault Avoidance

While infrastructure plays a big role in ensuring data center availability, the biggest improvements in uptime are found when facilities look beyond fault tolerance and start practicing fault avoidance. In fact, tiers can mean little in terms of data center availability without fault avoidance.

Put simply, fault avoidance aims to limit downtime considerably, with an approach that centers on prevention rather than cure. Years of experience operating data centers have taught us that downtime can be avoided altogether with the right level of monitoring, thorough maintenance, and well-trained personnel.

24/7 Staff

A 24/7 facilities team and designated Primary Alert Watcher (PAW) provide continuous monitoring, a vital part of any good fault avoidance strategy. This ensures that any issues are picked up on quickly, and an immediate response can be organized. As a result, more serious problems will be avoided, and downtime can be minimized.

Monitoring

A Building Management System (BMS) and a Building Automation System (BAS) are two of the most important tools for data center monitoring and active fault avoidance. In simple terms, a BMS lets operators monitor systems and gather insights from them, whereas a BAS goes a step further, offering automated responses based on those insights. These automatic responses often include control over ventilation, cooling, heating and more. Both systems also use Programmable Logic Controllers that let operators monitor individual pieces of equipment or the building in its entirety.
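
The difference between monitoring and automated response can be sketched in a few lines. This is an illustration only: the `Reading` type, the 27 °C threshold, and the fan-speed rule are all hypothetical, and real BMS and BAS products expose far richer interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    sensor: str
    celsius: float


# Hypothetical threshold chosen for illustration only.
MAX_INLET_CELSIUS = 27.0


def bms_check(reading: Reading) -> Optional[str]:
    """BMS-style monitoring: report the problem and leave the decision to an operator."""
    if reading.celsius > MAX_INLET_CELSIUS:
        return (f"ALERT {reading.sensor}: {reading.celsius:.1f} C "
                f"exceeds {MAX_INLET_CELSIUS:.1f} C")
    return None


def bas_respond(reading: Reading, fan_speed_pct: int) -> int:
    """BAS-style automation: act on the same reading by raising fan speed."""
    if reading.celsius > MAX_INLET_CELSIUS:
        return min(100, fan_speed_pct + 20)  # cap at full speed
    return fan_speed_pct


hot = Reading(sensor="rack-12-inlet", celsius=29.5)
print(bms_check(hot))                       # message for the operator
print(bas_respond(hot, fan_speed_pct=60))   # new fan speed chosen automatically
```

The same reading feeds both functions. The only difference is whether a person or the system decides what happens next.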

Balancing Predictive and Preventative Maintenance

Maintenance should be a key consideration for businesses hoping to avoid downtime. In fault avoidance, having a maintenance regime is crucial for preventing incidents before they occur. There are two main types of maintenance:

  • Preventative – Regularly scheduled maintenance undertaken on the advice of suppliers
  • Predictive – Monitoring equipment and leveraging data to understand where the most likely point of failure will be

Practicing effective fault avoidance involves finding a healthy mix of both of these regimes.
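
The two approaches can be contrasted with a minimal sketch. The function names, the 90-day interval, and the vibration readings are hypothetical, and the straight-line fit is a deliberately simple stand-in for whatever model a real predictive maintenance tool would use.

```python
from datetime import date, timedelta
from typing import List, Optional


def next_preventative_service(last_service: date, interval_days: int) -> date:
    """Preventative: the next service date follows a fixed schedule."""
    return last_service + timedelta(days=interval_days)


def intervals_until_limit(readings: List[float], limit: float) -> Optional[float]:
    """Predictive: estimate how many more sampling intervals remain before
    the latest reading, growing at the fitted slope, reaches `limit`.

    Returns None when the trend is flat or falling.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    # Least-squares slope through equally spaced samples.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    if readings[-1] >= limit:
        return 0.0
    return (limit - readings[-1]) / slope


# Hypothetical supplier guidance: service every 90 days.
print(next_preventative_service(date(2022, 1, 1), interval_days=90))

# Hypothetical weekly vibration readings (mm/s) with an alarm limit of 6.0.
print(intervals_until_limit([1.0, 2.0, 3.0, 4.0], limit=6.0))
```

A healthy mix would use the schedule as the baseline and bring a service forward when the trend suggests the limit will be reached first.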

Formalized Training

Human error is another leading cause of downtime, which is why reducing it should be part of any good fault avoidance strategy. Businesses practicing fault avoidance will need to prioritize staff training and formalize Methods of Procedure (MOPs) to be followed in the event of an incident. These procedures should always be peer-reviewed, and they must include clear guidelines as to when team members should stop any interventions to minimize risk.

The Commercial Considerations of Fault Tolerance

What is the cost of implementing fault tolerance?

When thinking about what goes into making a data center fault tolerant, it’s also important to consider the commercial practicalities. In small-scale facilities, fault tolerant designs are usually delivered in a 2N configuration, meaning every critical system is duplicated and the costs are essentially doubled. Add the cost of maintaining these systems (which requires specialist 24/7 teams) and it generally won’t make sense to implement fault tolerance for capacities below several megawatts (the threshold is around 4-5 MW). In these instances it is much more economically viable to use a larger colocation data center that can benefit from economies of scale.

The fact remains, large data center providers can achieve fault tolerance at a much lower overhead cost. As a result, customers can benefit from better value for money. Economies of scale let larger facilities invest in 24/7 staffing for better continuity of service. Customers then also get the peace of mind that best-in-breed experts are working to provide these services.

Ultimately, most large scale data centers are able to provide Tier IV facilities at the same cost it takes to build a Tier II data center yourself.

Your time

Another core consideration should be your team’s time, and where it is best spent. The most successful results come when a team does what it is best at – and doesn’t waste time on tasks outside its own capabilities.

Implementing fault tolerance requires a high level of facilities management and critical systems design. Ask yourself if this is within your organization’s core competencies because, if not, fault tolerance could be little more than a distraction that keeps you away from your core purposes. Even worse, it could turn into a costly mistake that requires a complete rework. This is another reason why larger data center providers benefit from having the time, resources and expertise to invest into achieving fault tolerance at a high level.

What About The Cloud?

We can’t think about fault tolerant data centers without also addressing The Cloud. You’ll find the most successful organizations will use a hybrid strategy for fault tolerance. This essentially means hosting the primary system in a colocation facility, and then using The Cloud as a backup target in a disaster recovery location.

Looking at The Cloud and its role in fault tolerance, business continuity and disaster recovery, the fact remains that there are still various risks involved – especially at hyperscale. Any error or problem could potentially bring the whole system down. This is one of the instances where scale can work against you, as the risks grow the more you scale up in The Cloud.

This is why it’s still not always the best idea to be on a giant shared computer grid where one error can throw the whole thing off. In our experience, best practice is to host systems in a locally provided colocation data center, and then leverage a hybrid strategy for backups.

You’ll find colocation data centers will have the features and service levels to provide cloud-like experiences. In TRG’s case, we have cloud on-ramps and multi-site capabilities that let us provide the same fault tolerance as cloud providers.

Make downtime a thing of the past with a fault tolerant data center

Our data centers are all designed with fault tolerance and fault avoidance in mind, offering everything ambitious organizations need to ensure their work is never interrupted. If you’d like to hear more about what a fault tolerant data center could do for your company, or are interested in exploring the options further, contact us.

FAQs

How do you ensure fault tolerance?

To ensure fault tolerance, enterprises need to purchase an inventory of formatted computer equipment and a secondary uninterruptible power supply device. The goal is to prevent the crash of key systems and networks, focusing on issues related to uptime and downtime.

What are fault tolerant servers?

Fault tolerance describes a superior level of availability characterized by five nines (99.999%) uptime or better. Fault-tolerant systems can deliver these levels of availability because they can “tolerate”, or withstand, both hardware and software “faults” or failures.
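
To put an uptime percentage in perspective, it helps to convert it into the downtime it permits. The short calculation below is a sketch that assumes a 365-day year. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime per year, compared with nearly nine hours at 99.9%.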

What needs to be created to make a server fault tolerant?

Hardware systems are backed up by systems that are the same or equivalent. A server, for example, can be made fault-tolerant by operating two identical servers in parallel and mirroring all actions to the backup server. Software systems can likewise be backed up by additional instances of the software.

Which of the following are features of fault tolerance?

A fault tolerant system may have one or more of the following characteristics:
  • No Single Point of Failure. ...
  • No Single Point Repair Takes the System Down. ...
  • Fault isolation or identification. ...
  • Fault containment. ...
  • Robustness or Variability Control.

What is a good example of fault tolerance?

A twin-engine airplane is a fault tolerant system – if one engine fails, the other one kicks in, allowing the plane to continue flying. Conversely, a car with a spare tire is highly available. A flat tire will cause the car to stop, but downtime is minimal because the tire can be easily replaced.

What are the two basic concepts used to make systems fault-tolerant?

Fault-tolerant software assures system reliability by using protective redundancy at the software level. There are two basic techniques for obtaining fault-tolerant software: the recovery block (RB) scheme and N-version programming (NVP). Both schemes are based on software redundancy and assume that coincidental software failures are rare.
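
Both techniques can be illustrated with a toy example. This is a simplified sketch: the integer square root functions, including the deliberately buggy one, are made up for the demonstration, and a full recovery block would also restore saved state before trying the next alternate, which pure functions like these do not need.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def recovery_block(alternates: Sequence[Callable[[int], int]],
                   acceptance_test: Callable[[int, int], bool],
                   x: int) -> int:
    """Recovery block: try alternates in order until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("every alternate failed the acceptance test")


def n_version(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """N-version programming: run every version and return the majority answer."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version simply casts no vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among the versions")
    return value


def buggy_isqrt(x: int) -> int:
    return x // 2  # deliberately wrong for most inputs


def float_isqrt(x: int) -> int:
    return int(x ** 0.5)


def is_integer_sqrt(x: int, r: int) -> bool:
    return r * r <= x < (r + 1) * (r + 1)


print(recovery_block([buggy_isqrt, math.isqrt], is_integer_sqrt, 16))
print(n_version([math.isqrt, float_isqrt, buggy_isqrt], 16))
```

In both calls the faulty version is masked: the recovery block rejects its answer and falls through to the next alternate, while the voter outnumbers it two to one.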

What are the types of fault tolerance?

A fault-tolerant system may be able to tolerate one or more fault types, including: i) transient, intermittent or permanent hardware faults, ii) software and hardware design errors, iii) operator errors, or iv) externally induced upsets or physical damage.

What is the difference between fault tolerance and high availability?

The difference between fault tolerance and high availability is this: a fault tolerant environment has no service interruption but a significantly higher cost, while a highly available environment has minimal service interruption.

What is fault tolerance in AWS?

Fault tolerance is the ability of a workload to remain operational with zero downtime or data loss in the event of a disruption. In a fault-tolerant environment, instances of the same workload are typically hosted on two or more independent sets of servers.

How do you provide fault tolerance for data center equipment failure?

To be fault-tolerant, a data center must have two parallel power and cooling systems with no single point of failure (also known as 2N). Building or co-locating in a tier four center is hardly cost-effective for most companies, and the jump from tier three to tier four provides a marginal gain in availability.

What are the issues in fault tolerance?

For a system to be fault tolerant, many separate issues are involved: fault confinement, fault detection, fault masking, retry, diagnosis, reconfiguration, recovery, restart, repair, and reintegration.

How is fault tolerance managed in computer networks?

Fault tolerance is reliant on aspects like load balancing and failover, which remove the risk of a single point of failure. It will typically be part of the operating system's interface, which enables programmers to check the performance of data throughout a transaction.
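
Failover can be sketched as trying each backend in turn until one answers. The backends below are hypothetical stand-ins for real servers, and a production load balancer would add health checks and timeouts on top of this.

```python
from typing import Callable, Sequence


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to each backend in order and return the first success."""
    failures = 0
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError:
            failures += 1  # this backend is unavailable, try the next one
    raise ConnectionError(f"all {failures} backends failed")


def failing_backend(request: str) -> str:
    # Hypothetical primary server that is currently unreachable.
    raise ConnectionError("primary unreachable")


def healthy_backend(request: str) -> str:
    # Hypothetical secondary server that handles the request.
    return f"handled {request}"


print(call_with_failover([failing_backend, healthy_backend], "GET /status"))
```

Because the caller only sees the final result, the failed primary is no longer a single point of failure for this request.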

What is fault tolerance in big data?

Fault tolerance is the property of a system that keeps a service running continuously even during faults. Fault-tolerant systems are built on two key concepts: fault detection and recovery [44]. Both can be achieved through various fault-tolerance approaches.

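Which of the following best describes fault tolerance?

Fault tolerance is best described as the ability of a system to keep operating correctly when one or more of its components fail. It is not the ability of an application to automatically correct user mistakes.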

How does fault tolerance work?

In VMware vSphere, the Fault Tolerance feature avoids "split-brain" situations, which can lead to two active copies of a virtual machine after recovery from a failure. Atomic file locking on shared storage is used to coordinate failover so that only one side continues running as the Primary VM and a new Secondary VM is respawned automatically.

How will you achieve fault tolerance in the cloud?

Fault tolerance in cloud computing is about designing a blueprint for continuing the ongoing work whenever a few parts are down or unavailable. This helps the enterprises to evaluate their infrastructure needs and requirements, and provide services when the associated devices are unavailable due to some cause.

What is the difference between fault tolerance and fault resilience?

These distinctions are important, because it is possible to regard a fault tolerant service as suffering no down time even if the machine it is running on crashes, whereas the potential data fault in a fault resilient service counts toward down time.

How is fault tolerance achieved in a network?

The Internet is fault-tolerant because there are usually multiple paths between devices, allowing messages to sometimes be sent even when parts of the network fail.

How can we ensure that a new software product is fault free?

Fault detection can be achieved through various validation techniques. This includes devising comprehensive test cases, continuous integration and testing, cross-verification using traceability matrix, automated testing, and so on.

How do you achieve fault tolerance in Microservices?

The solution to this problem is to use a fallback in case of failure of a microservice. This aspect of a microservice is called fault tolerance. Fault tolerance can be achieved with the help of a circuit breaker. It is a pattern that wraps requests to external services and detects when they fail.
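
A minimal circuit breaker might look like the sketch below. The class, its parameter names, and the default thresholds are illustrative assumptions rather than the API of any particular library, and production implementations usually add thread safety and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Wraps calls to an external service and falls back once it keeps failing."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to wait before retrying
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()           # open: skip the failing service
            self.opened_at = None           # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0                   # success closes the circuit
        return result


def unreliable_service() -> str:
    # Hypothetical downstream microservice that is currently failing.
    raise RuntimeError("service unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
for _ in range(3):
    print(breaker.call(unreliable_service, fallback=lambda: "cached response"))
```

After two failures the breaker opens, so the third call returns the fallback without touching the failing service at all. Once `reset_after` seconds have passed, a single trial call decides whether the circuit closes again.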

Article information

Author: Fredrick Kertzmann

Last Updated: 12/03/2022
