Linux — High Availability Cluster Management — Part 1 (Theory)

Guilherme Lopes
Apr 20, 2021

In this post, I will summarize what I learned in the course Linux — High Availability Cluster Management by David Clinton on Pluralsight, add some personal insights, and expand on the content.

This post will be split into two parts. Part 1 covers mostly theory and lays the groundwork for understanding the most important concepts. Part 2 will be dedicated to practice.

Introduction

The course opens its discussion of high availability with the problem of having a Single Point of Failure (SPOF). To understand this simple concept, consider the following:

Imagine we own a website hosted on a single laptop connected to the internet, which is solely responsible for handling all of our users' requests. Someday we may accidentally hit the cable and cut the connection. That is all it takes to cause a period of downtime and possibly some angry clients emailing us. The cable in our example is our SPOF.

Single Point of Failure — a failing connection would cause downtime in the system

So how do we avoid having a SPOF in our system? One of the most common approaches is to replicate the server, place the replicas behind a load balancer, and distribute the load among them. In this case, the system would only stop working if all servers stopped working at once, which is much less likely.

A system with Redundancy — continues working even if one server is down
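
To put a rough number on "much less likely", here is a minimal sketch. It assumes, purely for illustration, that each server is down with the same probability and that servers fail independently of each other; the 1% figure and the function name prob_all_down are hypothetical. Real failures are often correlated (shared power, network, or a bad deployment), so treat the result as an optimistic estimate.

```python
def prob_all_down(p_down: float, replicas: int) -> float:
    """Probability that every replica is down at once.

    Assumes each replica fails independently with probability p_down.
    """
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_down ** replicas


if __name__ == "__main__":
    # Hypothetical figure: each server is unavailable 1% of the time.
    for n in (1, 2, 3):
        print(f"{n} server(s): P(all down) = {prob_all_down(0.01, n):.6%}")
```

Under these assumptions, every extra replica multiplies the chance of a total outage by p_down, which is why even a second server helps so much.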

Base Concepts

Nowadays, with the advent of cloud computing and on-demand hardware, it is easy to get virtual servers from providers like AWS or Azure (there are many others) and scale the number of servers up or down as needed. But before we dig deeper into how to set up a system for testing, let's first clarify some terminology.

  • Node: independent Virtual Machine (VM) or physical server. Can also be called a POD.
  • Cluster: a group of node peers
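  • Server Failure: state of unresponsiveness of one or more nodes
  • Failover: reassignment of a task to a new node when there is a server failure
  • Failback: return of a task to its original node once that node has recovered
  • Replication: keeping copies of the same data on multiple nodes
  • Redundancy: spare nodes or environments held in reserve to take over on failure
  • Split Brain: error state in which nodes lose communication with each other and each side acts as if it were the only one active
  • Fencing: isolating or shutting down unresponsive nodes so they cannot interfere with the rest of the cluster
  • Quorum: minimum number of nodes that must agree before some task can proceed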
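
To make quorum and split brain a little more concrete, here is a minimal sketch. It assumes the common strict-majority rule (more than half of the cluster must be reachable). Real cluster managers may use other rules such as weighted votes or tie-breaker devices, and the function name has_quorum is hypothetical.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True when strictly more than half of the cluster is reachable.

    Assumption: a simple majority rule with one vote per node.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes > cluster_size // 2


if __name__ == "__main__":
    # A 5-node cluster partitioned 3 / 2: only the larger side may proceed.
    print(has_quorum(3, 5), has_quorum(2, 5))
    # A 4-node cluster partitioned 2 / 2: neither side has a majority.
    print(has_quorum(2, 4), has_quorum(2, 4))
```

With this rule, at most one side of a partition can ever hold quorum, which is what prevents two halves of a split cluster from both acting as the active one. It is also one reason clusters are often built with an odd number of nodes.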

Now that we understand the terminology, let's start from the bottom. Which types of clusters can we have?

  • Active / Active: all nodes in the cluster actively handle requests
  • Active / Passive: some nodes are not used actively; they stay on standby as failover nodes and take over when an active server fails

As explained before, the routing is done through a load balancer.
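
As a rough illustration of that routing, here is a minimal sketch of a round-robin load balancer that skips nodes marked as down. Round robin is only one possible strategy and is an assumption here, not something the course prescribes. The class name RoundRobinBalancer and the node names are hypothetical, and a real load balancer would detect failures through health checks rather than manual calls.

```python
class RoundRobinBalancer:
    """Hands out nodes in turn, skipping any node currently marked as down."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._healthy = set(self._nodes)
        self._next = 0  # index of the node to try next

    def mark_down(self, node):
        """Record a server failure so the node stops receiving requests."""
        self._healthy.discard(node)

    def mark_up(self, node):
        """Put a recovered node back into rotation."""
        if node in self._nodes:
            self._healthy.add(node)

    def pick(self):
        """Return the next healthy node, or raise if every node is down."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node in self._healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
    print([lb.pick() for _ in range(3)])
    lb.mark_down("node-b")  # simulate a server failure
    print([lb.pick() for _ in range(4)])
```

Requests keep flowing as long as at least one node is healthy; only when every node is down does pick() fail, which matches the redundancy argument from the introduction.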

Now, imagine we had several nodes, or even several clusters, but they all connected to the same database. In this setup, called a Shared Disk Cluster, we still have a SPOF. To solve this, it is common to replicate the database and add redundancy, which gives us a system architecture called a Shared-Nothing Cluster (SNC). It is important to point out that every system architecture has trade-offs, so in the case of an SNC a request can take a little longer, but that cost is accepted because high availability matters more for the system.

Availability

According to BMC Blogs:

Availability refers to the percentage of time that the infrastructure, system, or solution remains operational under normal circumstances in order to serve its intended purpose

Mean Time Between Failures (MTBF)

This is the average time our system, or a part of it, runs before it fails. This can be related to:

  • Lifespan of storage drives, RAM, motherboards, etc.
  • Reliability of Internet connection
  • Reliability of Software
  • Frequency of operator error
  • Property crime rates
  • Frequency of power outages

Mean Time To Repair (MTTR)

The average time needed to repair the system after a failure (e.g. one of those mentioned above). This can be related to:

  • Time needed to get operator on-site
  • Location and date of backup images
  • Access to replacement components
  • Access to needed human expertise

We can calculate availability using the following formula:

A = MTBF / ( MTBF + MTTR )
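
As a worked example of the formula, here is a small sketch. The MTBF and MTTR figures are made up for illustration, and the function names and the 365-day year are assumptions of this sketch.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR), with both values in the same unit."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mttr_hours < 0:
        raise ValueError("mttr_hours must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(a: float) -> float:
    """Hours per year the system is expected to be unavailable."""
    return (1.0 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: fails on average every 1000 hours, takes 1 hour to repair.
    a = availability(mtbf_hours=1000, mttr_hours=1)
    print(f"availability: {a:.4%}")
    print(f"expected downtime: {expected_downtime_hours_per_year(a):.2f} hours/year")
```

Because the formula depends only on the ratio MTTR / MTBF, halving the repair time improves availability exactly as much as doubling the time between failures.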

This is the end of Part 1. More content may be added over time, but for now this is enough theory for us to start the practice and see what we have learned in action.

Guilherme Lopes

Tech enthusiast. DevOps, Infrastructure and SRE are my areas of expertise.