Configuring High Availability [ClearOS Documentation]

Configuring High Availability

This howto focuses on specific features of ClearBOX that provide high availability. In particular, ClearBOX uses network bypass hardware that can provide both inline fail-open and fail-over capability. It is also designed to handle fencing. This makes ClearBOX a unique platform for delivering high availability in a self-contained way.

We've defined below how we use certain terms in this article and on this site. Some of these technologies can be complex and difficult to implement. If you need an expert, consider posting questions to the forum or using ClearCARE to resolve your issues.

Nearly every solution mentioned in this article requires significant command-line work and often uses methods that are just starting to be engineered on ClearOS. This is not a precise howto, and many of the evolving links will be rough documents that still require significant familiarity with Linux and clustering concepts. Beware of the leopard.

Network Bypass

Network Bypass is a technology which allows two network interfaces to effectively 'short' themselves together. When a trigger condition is reached, the 8 pins of the two network ports are joined together and simultaneously disconnected from the chipset of the network interface. The result is that the two network interfaces together behave as a coupler.

Fencing

Fencing is the process by which a member of a cluster or a fail-over group is removed from connectivity by intervening at the communication, power or other critical point. This is important because clustered members can cause significant damage to data and data flow if allowed to continue while in a failed or semi-failed state. Fencing allows one server to prohibit the other server from interaction with the data or network if it detects the failure on the other server.

Because ClearBOX uses network bypass, a secondary server can effectively fence the primary by interrupting its communication with the network.
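
The fencing decision described above can be sketched roughly as follows. This is a minimal illustration in Python, not ClearBOX code: the Peer class, the engage_bypass callback and the timeout value are hypothetical names chosen for the example, since the actual bypass hardware interface is not described in this article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Peer:
    """Hypothetical view of the primary as seen by the secondary."""
    name: str
    last_heartbeat: float  # timestamp (seconds) of the last heartbeat received
    fenced: bool = False


def fence_if_silent(peer: Peer, now: float, timeout: float,
                    engage_bypass: Callable[[str], None]) -> bool:
    """Fence the peer when no heartbeat arrived within `timeout` seconds.

    `engage_bypass` stands in for whatever actually cuts the peer off
    (on ClearBOX, triggering the network bypass); it is an assumption here.
    Returns True if the peer is fenced after this call.
    """
    if peer.fenced:
        return True
    if now - peer.last_heartbeat > timeout:
        engage_bypass(peer.name)  # isolate the peer first ...
        peer.fenced = True        # ... then record it so it is fenced only once
        return True
    return False


if __name__ == "__main__":
    actions = []
    primary = Peer("primary", last_heartbeat=100.0)
    # 3 seconds of silence with a 5 second timeout: leave the primary alone.
    print(fence_if_silent(primary, 103.0, 5.0, actions.append))  # False
    # 10 seconds of silence: fence it.
    print(fence_if_silent(primary, 110.0, 5.0, actions.append))  # True
    print(actions)  # ['primary']
```

The point of the sketch is the ordering: the secondary cuts the primary off before it takes over, so a half-failed primary cannot keep touching the data or the network.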

Fail-open

Fail-open is a condition whereby a contiguous network path is supplied in the event of a failure. Fail-open is commonly used where the device (ClearBOX) is running as a network bridge. In this bridge mode the server may be performing firewall, or gateway services like content filtration or protocol filtering. When a fail-open condition occurs the hardware will cause the bypass to short and turn the device (ClearBOX) into a glorified coupler.

Additionally, servers in this mode can be deployed either in series or in parallel, as in the failover configuration, to achieve the desired path and conditions.

Fail-over

Fail-over is a condition where all services supplied by the device are handed over to a secondary device that is similarly configured. In this state, the primary device may be malfunctioning, down, or even off, and the backup will assume all operations. This situation is sometimes referred to as Active/Passive.

In this situation the primary server is on the top of the image and the failover is on the bottom. This allows the bottom server to short circuit the path and effectively fence the top server in case of a failure.

Active/Active

This is often the most beautiful and elegant solution to clustering. It is also as hard to achieve as some peace negotiations between warring factions. It supposes that your services, network paths, protocol design and clients are all open to the concept of getting along, and that they were engineered with such keen foresight into the future as to appear super intelligent.

Some early and common uses of active/active include DNS, which was engineered to allow a first come, first served approach to returning data and which contains replication mechanisms within the protocol and data hierarchy. Sadly, it is not tenable or efficient for all data systems to work this way because many systems need more authoritative, responsive, and realtime solutions.

Configuring Heartbeat

The steps for configuring heartbeat on ClearOS (and in particular ClearBOX) can be found here

Beyond Heartbeat

Heartbeat is only one element of high availability. Additional elements to consider are:

Clustered Data support in volume management
Data Replication
Service Cluster Support
Data integrity and avoiding split-brain activity
STONITH
Fail-Open, Active/Passive, Active/Active, other clustering options.

Keeping it simple

There is often a temptation to throw everything into redundancy when redundancy is an option. Be careful when considering what to make redundant and whether to put all your eggs in one basket (more complexity can cause an increase in downtime).

For example, let's say that you put together a failover configuration for your firewalls and content filtration. Your business requires them to be mission critical and you know that there are costs associated with a failover (sometimes you can lose state on a packet or cannot transition a session properly). You realize that this hardware is underutilized and you decide to place your database and web application on this same platform. Later you find that your gateway is failing over often because load from your web application sometimes causes the heartbeat to believe that the box is down, which needlessly transitions the boxes to a fail-over state.
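
To make that failure mode concrete, here is a toy model in Python. It assumes a monitor that declares the peer dead after a number of consecutive late heartbeat replies; the function name, the deadline and the delay figures are all made up for illustration and are not Heartbeat's actual configuration or behavior.

```python
def failovers(reply_delays, deadline, max_missed):
    """Count failovers for a sequence of heartbeat reply delays (seconds).

    A reply slower than `deadline` counts as a missed beat; `max_missed`
    consecutive misses trigger one failover and reset the counter.
    """
    missed = 0
    count = 0
    for delay in reply_delays:
        if delay > deadline:
            missed += 1
            if missed >= max_missed:
                count += 1
                missed = 0
        else:
            missed = 0
    return count


# Gateway only: replies are always fast.
gateway_only = [0.1] * 10
# Gateway plus a busy web application: load spikes delay some replies.
with_web_app = [0.1, 0.1, 2.5, 3.0, 0.1, 0.1, 2.8, 2.9, 0.1, 0.1]

print(failovers(gateway_only, deadline=2.0, max_missed=2))  # 0
print(failovers(with_web_app, deadline=2.0, max_missed=2))  # 2
```

Nothing actually failed in the second run. The box was merely busy, yet the gateway changed hands twice, and each transition carries the costs mentioned above.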

Another important rule: don't use high availability when a service or protocol does NOT need it. For example, configuring your master LDAP server to fail over automatically to another 'master' LDAP server is unnecessary, because it is simple and scalable to configure a replicate to take over in a permanent failure situation. Likewise, setting up a failover DNS server would be pointless because the service is already redundant when properly configured as master/slave.

The usual suspects

The typical things that greatly benefit from clustering and high availability are:

Critical path: This is where you have certain data that must flow a certain way and the failure of the server in that path is detrimental. This may include firewall, protocol filtration, content filtration, proxy or IDS/IPS. Typically fail-over or fail-open methods are used.
Critical data: Consistent, accurate, dynamic data is difficult to scale across disparate data sources. The challenges include correct versioning, locking, lock release, and split-brain. Once you overcome these barriers you can do some amazing things with clustered data, including multiple points of access, increased speeds, resiliency and redundancy, and much more.
Critical service: This is often joined at the hip with critical data because the service is often reliant on data. However, critical data is often deployed in active/passive mode, while critical services can often escape that condition and be provided as active/active.

Critical Path

This is by far the easiest clustering solution on ClearOS at this time. ClearOS is powerful as a gateway solution, and failing over to a secondary server is fairly trivial when using a solution like ClearBOX. Unfortunately, some additional services at the gateway layer will need attention if they are used. ClearOS does not yet support a software-based way to fail over a gateway address, but this will be coming soon. However, failing over a physical port is doable now with ClearBOX. This causes packet state to be reset and requires the passive box to take over the ARP record.
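
One common way for a standby machine to take over an ARP record is to broadcast a gratuitous ARP announcing that the gateway IP now lives at its own MAC address. The article does not say which mechanism ClearBOX uses, so treat the following Python sketch purely as an illustration of the idea. It only builds the frame and does not send it (sending requires a raw socket and root privileges); the MAC and IP values are placeholders.

```python
import socket
import struct


def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    sha = bytes.fromhex(mac.replace(":", ""))  # sender hardware address
    spa = socket.inet_aton(ip)                 # sender protocol address
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    eth = broadcast + sha + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += sha + spa          # sender MAC and IP
    arp += b"\x00" * 6 + spa  # target MAC left empty, target IP equals sender IP
    return eth + arp


frame = gratuitous_arp("00:11:22:33:44:55", "192.0.2.1")
print(len(frame))  # 42
print(frame.hex())
```

What makes the ARP "gratuitous" is that the sender and target IP are the same: nobody asked, but every neighbor that sees the broadcast can refresh its cache entry for that address.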

Here is a list of critical path items under ClearOS and the dependencies that will have to be solved for it to properly failover.

IDS/IPS
- None
Firewall
- None
Gateway Antimalware
- None
Gateway Antispam
- None
Proxy Server (transparent mode)
- None
Proxy Server (user authentication)
- Need directory replication between servers (master/replicate)
Content Filtration
- None
MultiWAN
- You will need sufficient bypass segments
- Alternately, you can implement software takeover (not currently supported)
OpenVPN, PPTP, IPSec
- Need directory replication between servers (master/replicate)
Dynamic VPN
- Not yet supported
Protocol Filter
- None
Bandwidth/QoS
- None
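
When planning a gateway failover it can help to treat the list above as a lookup table. The sketch below, in Python, simply transcribes that list into a dictionary and reports which of the services you run still have something to solve; the function name is made up for the example.

```python
# Transcribed from the list above; an empty list means no dependencies.
DEPENDENCIES = {
    "IDS/IPS": [],
    "Firewall": [],
    "Gateway Antimalware": [],
    "Gateway Antispam": [],
    "Proxy Server (transparent mode)": [],
    "Proxy Server (user authentication)": [
        "Directory replication between servers (master/replicate)"],
    "Content Filtration": [],
    "MultiWAN": [
        "Sufficient bypass segments, or software takeover (not currently supported)"],
    "OpenVPN, PPTP, IPSec": [
        "Directory replication between servers (master/replicate)"],
    "Dynamic VPN": ["Not yet supported"],
    "Protocol Filter": [],
    "Bandwidth/QoS": [],
}


def failover_prerequisites(services):
    """Return {service: [dependencies]} for the services that have any."""
    unknown = [s for s in services if s not in DEPENDENCIES]
    if unknown:
        raise KeyError(f"not in the critical path list: {unknown}")
    return {s: DEPENDENCIES[s] for s in services if DEPENDENCIES[s]}


for service, deps in failover_prerequisites(
        ["Firewall", "Content Filtration", "OpenVPN, PPTP, IPSec"]).items():
    print(service, "->", "; ".join(deps))
```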

Critical Data

Critical Data is currently supportable on ClearOS using an Active/Passive method provided by DRBD. This lets you keep a copy of any data dynamically replicated to a slave. The recommended design is to run DRBD between two backend data stores, which then present the data as an iSCSI target to the ClearOS server(s), which in turn present that data through services.

It's important to note that your services must be cluster aware and the data must be on a clustered file system in order to be accessed by multiple front end services. Just because your data is clustered on the backend does not mean that multiple front end services can safely access it without knowing about each other. Imagine trying to mount a backend data store running an ext3 partition on two servers without them knowing about each other. They would quickly destroy the data on that drive.
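
The reason two unaware servers destroy the data can be shown with a deliberately simplified model. This Python sketch is not how ext3 works internally; it only assumes that each server caches the list of free blocks when it mounts the disk and never learns about the other server's writes. All class and file names are invented for the example.

```python
class SharedDisk:
    """A backend store that both servers can reach, e.g. over iSCSI."""
    def __init__(self, size):
        self.blocks = [None] * size


class NaiveNode:
    """A server that mounts the disk as if it were the only user."""
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        # Free-block list is read once at mount time and cached privately.
        self.free = [i for i, b in enumerate(disk.blocks) if b is None]

    def write_file(self, data):
        block = self.free.pop(0)  # trusts its own, possibly stale, cache
        self.disk.blocks[block] = (self.name, data)
        return block


disk = SharedDisk(4)
node_a = NaiveNode("node-a", disk)
node_b = NaiveNode("node-b", disk)

node_a.write_file("invoice.db")
node_b.write_file("mail.idx")

print(disk.blocks[0])  # ('node-b', 'mail.idx')
```

Both servers believed block 0 was free, so the second write silently replaced the first. A clustered file system exists precisely to coordinate this kind of metadata between nodes.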

Critical Services

Here is where it gets tricky. Services are fickle about how they work. In some cases, the services on ClearOS are simply too limited in scope to be clustered, redundant, or even fail-over capable or aware. Here are our recommended paths for keeping a service available in the event of a single failure. These are guidelines at this point and more will come.

OpenLDAP
- ClearOS supports Master/Replicate mode
- Multi-Master or failover of the master is not typically needed
- Promotion of Replicate to master recommended
RADIUS
- Failover IP/Round Robin
- Dependent on OpenLDAP
DHCP Server
- DNSMasq is not capable of coordinating leases
DNS Server
- DNSMasq is not capable of coordinating leases
- Manual duplication of hosts files required
  - Consider rsync
Samba
- Run ClearOS as a backup domain controller
- Replication of scripts coming
- Profile replication not yet supported
- Dependent on OpenLDAP (must be able to write to OpenLDAP)
Samba File sharing services
- By default Samba is not cluster aware. Consider Samba-CTDB
  - Samba-CTDB is best if this server is NOT a domain controller
- Failover IP/Round Robin
- Data should be housed in file system with cluster support
FTP
- FTP server is not cluster aware
- Failover IP/Round Robin
- Consider Samba-CTDB
- Data should be housed in file system with cluster support
Email (SMTP)
- Consider store and forward technologies as a simple way to get you 80% there (included if your domain is hosted on ClearSDN)
- Data must be housed in file system with cluster support
- Active/Passive
- Heartbeat required
- Data should be housed in file system with cluster support
IMAP/POP3
- Data must be housed in file system with cluster support
- Active/Passive
- Heartbeat required
- Data should be housed in file system with cluster support
Anti-SPAM (using ClearOS solely for antimalware, forwards to other server)
- Multiple MX records solve most of this nicely
- No coordinated quarantine at this point
Web Server
- Clustered file system recommended or read-only access to static content.
- Failover IP/Round Robin
- Consider Samba-CTDB
MySQL
- Number of different ways to do this
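
Several entries above suggest "Failover IP/Round Robin". As a rough picture of what round robin with failover means from the client side, here is a small Python sketch. It assumes a fixed list of server addresses and some health check; the function name, the addresses (from the documentation range 192.0.2.0/24) and the health check are illustrative only.

```python
def pick_server(servers, is_up, start):
    """Try each server once in round-robin order, beginning at index `start`.

    Returns (server, next_start). Raises RuntimeError if none is up.
    """
    n = len(servers)
    for offset in range(n):
        idx = (start + offset) % n
        if is_up(servers[idx]):
            return servers[idx], (idx + 1) % n
    raise RuntimeError("no server available")


servers = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
down = {"192.0.2.11"}  # pretend the second server has failed

position = 0
for _ in range(4):
    server, position = pick_server(servers, lambda s: s not in down, position)
    print(server)
# 192.0.2.10
# 192.0.2.12
# 192.0.2.10
# 192.0.2.12
```

Requests are spread across the healthy servers and the failed one is skipped. Note that this only helps if every server can serve the same data, which is why the entries above keep pointing back to a file system with cluster support.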

Conclusion

ClearOS is a powerful system for implementing fail-over, fail-open, and active/active services. It inherits this functionality by virtue of its legacy, innovation, and ties to all forms of Open Source code. The common question we get is, 'Is ClearOS capable of doing X?'. The answer is a loaded one because, while ClearOS is capable of doing many things, it is designed to do common tasks. As the world moves to a model of both local and in-the-cloud data, and as companies become more dependent on access to data, we will see clustered solutions become more mainstream. Those tasks should be easy too, and ClearOS is a great place to see them evolve.
