This howto focuses on specific features of ClearBOX that provide high availability. In particular, ClearBOX uses network bypass hardware capable of providing both inline fail-open and fail-over capability. Moreover, it is uniquely designed to handle fencing. This makes ClearBOX a unique platform for delivering high availability in a self-contained package.
We've provided definitions of how we will use terms in this article and on this site. Some of these technologies can be complex and difficult to implement. If you need an expert, consider posting questions to the forum or using ClearCARE to resolve your issues.
Nearly every solution mentioned in this article requires significant command line work and often uses methods that are just starting to be engineered on ClearOS. This is not a precise howto, and many of the evolving links will be rough documents that still require significant familiarity with Linux and clustering concepts. Beware of the leopard.
Network bypass is a technology which allows two network devices to effectively 'short' their physical interfaces together. When a trigger condition is reached, the 8 pins of the two network ports join together and simultaneously disconnect from the network interface's chipset. The result is that the two network interfaces together act as a simple coupler.
Fencing is the process by which a member of a cluster or a fail-over group is removed from connectivity by intervening at a communication, power, or other critical point. This is important because clustered members can cause significant damage to data and data flow if allowed to continue while in a failed or semi-failed state. Fencing allows one server to prohibit the other server from interacting with the data or network if it detects a failure on the other server.
Because ClearBOX uses network bypass, a secondary server can effectively fence the primary by interrupting its communication with the network.
Fail-open is a condition whereby a contiguous network path is supplied in the event of a failure. Fail-open is commonly used where the device (ClearBOX) is running as a network bridge. In this bridge mode the server may be performing firewall or gateway services like content filtration or protocol filtering. When a fail-open condition occurs, the hardware causes the bypass to short and turns the device (ClearBOX) into a glorified coupler.
Additionally, servers in this mode can be deployed either in series or in parallel, as with the failover configuration, to achieve the desired path and conditions.
Fail-over is a condition where all services supplied by the device are handed over to a secondary device that is similarly configured. In this state the primary device may be malfunctioning, down, or even powered off, and the backup will assume all operations. This situation is sometimes referred to as Active/Passive.
In this situation the primary server is on the top of the image and the failover server is on the bottom. This allows the bottom server to short-circuit the path and effectively fence the top server in case of a failure.
This is often the most beautiful and elegant solution to clustering. It is also as hard to achieve as some peace negotiations between warring factions. It supposes that your services, network paths, protocol design, and clients are so open to the concept of getting along, and were engineered with such keen foresight, as to appear super-intelligent.
Some early and common uses of active/active include DNS, which was engineered to allow a first-come, first-served approach to returning data and contains replication mechanisms within the protocol and data hierarchy. Sadly, it is not tenable or efficient for all data systems to work this way because many systems need more authoritative, responsive, and realtime solutions.
The steps for configuring heartbeat on ClearOS (and in particular ClearBOX) can be found here
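As a rough sketch of what that configuration involves, a minimal two-node Heartbeat (Linux-HA version 1 style) setup might look like the following; the node names, heartbeat interface, and shared address here are illustrative assumptions, not values from this article:

```
# /etc/ha.d/ha.cf -- identical on both nodes
logfacility local0
keepalive 2          # seconds between heartbeat packets
deadtime 30          # declare the peer dead after 30s of silence
bcast eth1           # dedicated heartbeat interface
auto_failback on
node clearbox1 clearbox2

# /etc/ha.d/haresources -- resources preferred on clearbox1
clearbox1 192.168.1.1
```

Both nodes must carry identical copies of these files (plus a matching /etc/ha.d/authkeys); heartbeat then brings the shared address up on whichever node is active.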
Heartbeat is only one element of high availability. Additional elements to consider are:
There is often a temptation to throw everything into redundancy when redundancy is an option. Be careful when considering what to make redundant and whether to throw all your eggs in one basket: more complexity can itself cause an increase in downtime.
For example, let's say that you put together a failover configuration for your firewalls and content filtration. Your business requires them to be mission critical, and you know that there are costs associated with a failover event (sometimes you can lose state on a packet or cannot transition a session properly). You realize that this hardware is underutilized and you decide to place your database and web application on the same platform. Later you find that your gateway is failing over often because your web application sometimes causes heartbeat to believe the box is down, needlessly transitioning the boxes to a fail-over state.
Another important rule is: don't use high availability when a service or protocol does NOT need it. For example, configuring your master LDAP server to fail over automatically to another 'master' LDAP server is unnecessary, because configuring a replica to take over in a permanent failure situation is a simple and scalable solution. Likewise, setting up a failover DNS server would be pointless because the service is already redundant when properly configured as master/slave.
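To illustrate the DNS case: redundancy is built into the protocol itself, so a properly configured slave simply pulls the zone from the master and keeps answering if the master goes away. A minimal BIND slave zone stanza might look like this (the zone name and master address are placeholders):

```
// named.conf on the slave server -- zone name and master IP are placeholders
zone "example.com" {
    type slave;
    masters { 192.0.2.1; };
    file "slaves/example.com.db";
};
```

Clients listing both servers as resolvers will transparently use whichever one answers, with no failover machinery required.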
The typical things that greatly benefit from clustering and high availability are:
This is by far the easiest clustering solution on ClearOS at this time. ClearOS is powerful as a gateway solution, and failing over to a secondary server is somewhat trivial when using a solution like ClearBOX. Unfortunately, there are additional services at the gateway layer that will need attention if certain services are used. ClearOS does not yet support a software-based ability to fail over a gateway address, but this is coming soon. However, failing over a physical port is doable now with ClearBOX. This causes packet state to be reset and requires the passive box to take over the ARP record.
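A sketch of what the passive box must do when it takes over the gateway address: claim the IP, then send gratuitous ARP so LAN hosts update their ARP caches to the new MAC. The address, interface, and DRY_RUN guard are illustrative assumptions; on a real box you would run the commands directly as root.

```shell
# Assumed values -- substitute your own gateway address and LAN interface
GATEWAY_IP="192.168.1.1"
LAN_IF="eth0"

# DRY_RUN=1 (the default here) prints each command instead of running it,
# so the sketch can be inspected on any machine without root privileges.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

TAKEOVER_LOG=$(
    # 1. Bring the shared gateway address up on the passive box
    run ip addr add "$GATEWAY_IP/24" dev "$LAN_IF"
    # 2. Announce our MAC for that address; arping -U sends
    #    unsolicited (gratuitous) ARP and ships with iputils
    run arping -U -c 3 -I "$LAN_IF" "$GATEWAY_IP"
)
echo "$TAKEOVER_LOG"
```

In a heartbeat setup these steps would live in a resource script rather than being run by hand.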
Here is a list of critical path items under ClearOS and the dependencies that will have to be solved for it to fail over properly.
Critical data is currently supportable on ClearOS using an Active/Passive method provided by DRBD. This allows you to have any data dynamically replicated to a slave. A recommended design is to run DRBD between two backend data stores, present that data as an iSCSI target to the ClearOS server(s), and have those servers present the data through services.
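As a rough sketch, a two-node DRBD resource for the backend pair might look like this; the hostnames, disks, and addresses are placeholders, and the resulting /dev/drbd0 device on the primary is what would then be exported as the iSCSI target:

```
# /etc/drbd.d/r0.res -- hostnames, disks, and IPs are placeholders
resource r0 {
    protocol C;                    # synchronous replication
    on store1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;       # backing partition on store1
        address   10.0.0.1:7788;   # replication link
        meta-disk internal;
    }
    on store2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;       # backing partition on store2
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}
```

Protocol C waits for the write to reach both nodes before acknowledging, which is what you want for critical data at the cost of some write latency.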
It's important to note that your services must be cluster aware and the data must be on a clustered file system in order to be accessed by multiple front end services. Just because your data is clustered on the backend does not mean that your multiple front end services can access it without knowing about each other. Imagine trying to mount a backend data store running an ext3 partition on two servers without them knowing about each other. They would quickly destroy the data on that drive.
Here is where it gets tricky. Services are fickle about how they work. In some cases, the services on ClearOS are simply too limited in their scope to be clustered, redundant, or even fail-over capable or aware. Here are our recommended paths for resolving the need to make a service capable of presenting its capability in the event of a single failure. These are guidelines at this point and more will come.
ClearOS is a powerful system for implementing fail-over, fail-open, and active/active services. It inherits this functionality by virtue of its legacy, innovation, and ties to all forms of Open Source code. The common question we get is, 'Is ClearOS capable of doing X?'. The answer is a loaded one because while ClearOS is capable of doing many things, it is designed to do common tasks. As the world moves to a model of both local and in-the-cloud data, and as companies become more dependent on access to data, we will see clustered solutions become more mainstream. Those tasks should be easy too, and ClearOS is a great place to see them evolve.