Configure basic Linux High Availability Cluster in Ubuntu with Corosync

Jellyfish Cluster – photo by robin on flickr

[Read also: HA Cluster with DRBD file sync which adds file sync configuration between cluster nodes]

[UPDATED on March 7, 2017: tested the configuration also with Ubuntu 16.04 LTS]

This post shows how to configure a basic High Availability cluster in Ubuntu using Corosync (cluster manager) and Pacemaker (cluster resource manager), software available in the Ubuntu repositories (tested on Ubuntu 14.04 and 16.04 LTS). More information regarding Linux HA can be found here.

The goal of this post is to set up a freeradius service in HA. To do this we use two Ubuntu 14.04 or 16.04 LTS Server nodes, announcing a single virtual IP from the active cluster node. Notice that in this scenario each freeradius cluster instance is a standalone instance; I don’t cover application replication/synchronization between the nodes (rsync or shared disk via DRBD). Maybe I can do a new post in the future 🙂 [I did the post]

Convention:

  • PRIMARY – the name of the primary node
  • PRIMARY_IP – the IP address of the primary node
  • SECONDARY – the name of the secondary node
  • SECONDARY_IP – the IP address of the secondary node
  • VIP – the IP announced from the master node of the cluster

First of all we install the needed packages

PRIMARY/SECONDARY# apt-get install pacemaker
PRIMARY# apt-get install haveged

and then we can start configuring Corosync, generating on the PRIMARY node the key to be shared between the cluster nodes (using the haveged package).

PRIMARY# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
[...]
Press keys on your keyboard to generate entropy (bits = 1000).
Writing corosync key to /etc/corosync/authkey.

Now we can remove the haveged package and copy the shared key from the PRIMARY to the SECONDARY node

PRIMARY# apt-get remove --purge haveged
PRIMARY# apt-get autoremove
PRIMARY# apt-get clean
PRIMARY# scp /etc/corosync/authkey user@SECONDARY:/tmp
SECONDARY# mv /tmp/authkey /etc/corosync
SECONDARY# chown root:root /etc/corosync/authkey
SECONDARY# chmod 400 /etc/corosync/authkey
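
If you want to be sure the key arrived intact, an optional sanity check is to compare checksums and permissions on the two nodes:

PRIMARY# md5sum /etc/corosync/authkey
SECONDARY# md5sum /etc/corosync/authkey
PRIMARY/SECONDARY# ls -l /etc/corosync/authkey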

We are now ready to configure both cluster nodes, telling Corosync about the cluster members, binding IPs and other settings. To do this edit /etc/corosync/corosync.conf and add a new section (nodelist) at the end of the file on both the PRIMARY and SECONDARY nodes, as follows.

[Ubuntu 16.04] don’t add the “name: …” line in the nodelist section: the corosync version installed in 16.04 does not support this directive and your cluster will not start. By default the node names are taken from the host name (uname -n).
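
If you are on 16.04 and want the node names used in the examples below (primary and secondary), you can set the host names before starting the cluster services; a minimal sketch, the names being just the ones used in this post:

PRIMARY# hostnamectl set-hostname primary
SECONDARY# hostnamectl set-hostname secondary
PRIMARY/SECONDARY# uname -n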

file: /etc/corosync/corosync.conf
[...]
totem {
[...]
interface {
 # The following values need to be set based on your environment 
 ringnumber: 0
 bindnetaddr: <PRIMARY_IP or SECONDARY_IP based on the node>
 mcastaddr: 226.94.1.1
 mcastport: 5405
 }
}
[... end of file ...]

nodelist {
 node {
  ring0_addr: <PRIMARY_IP>
  name: primary    # node name (e.g. primary); DON'T ADD THIS LINE IN 16.04
  nodeid: 1        # node numeric ID (e.g. 1)
 }
 node {
  ring0_addr: <SECONDARY_IP>
  name: secondary  # node name (e.g. secondary); DON'T ADD THIS LINE IN 16.04
  nodeid: 2        # node numeric ID (e.g. 2)
 }
}

Now we configure Corosync to use the Cluster Resource Manager Pacemaker. To do this create the new file /etc/corosync/service.d/pcmk with the following content

[Ubuntu 16.04] First create the /etc/corosync/service.d/ directory with the command # mkdir /etc/corosync/service.d/

file: /etc/corosync/service.d/pcmk
service {
 name: pacemaker
 ver: 1
}

Then enable Corosync by setting the START parameter to yes

file: /etc/default/corosync
START=yes

Corosync is ready to be started. Start it and verify with the following commands

PRIMARY/SECONDARY# service corosync start
[...]
PRIMARY/SECONDARY# service corosync status
● corosync.service - Corosync Cluster Engine
 Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
 Active: active (running) since [...]
[...]
PRIMARY/SECONDARY# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(<PRIMARY_IP>) 
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.740229595.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.740229595.ip (str) = r(0) ip(<SECONDARY_IP>)
runtime.totem.pg.mrp.srp.members.740229595.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.740229595.status (str) = joined
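
Another optional check is the totem ring status: on each node corosync-cfgtool should report ring 0 as active with no faults.

PRIMARY/SECONDARY# corosync-cfgtool -s
Printing ring status.
[...]
 status = ring 0 active with no faults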

Now it’s time to configure pacemaker, our Cluster Resource Manager.
We enable pacemaker at boot time, setting the service start priority to 20 (corosync has 19), then we start the service

PRIMARY/SECONDARY# update-rc.d pacemaker defaults 20 01
PRIMARY/SECONDARY# service pacemaker start
[...]
PRIMARY/SECONDARY# service pacemaker status
● pacemaker.service - Pacemaker High Availability Cluster Manager
 Loaded: loaded (/lib/systemd/system/pacemaker.service; enabled; vendor preset: enabled)
 Active: active (running) since [...]
[...]

All the services are (hopefully) in the right state, and we can check with the crm utility.

[Ubuntu 14.04] the node names will be the ones defined in the file /etc/corosync/corosync.conf

[Ubuntu 16.04] the node names will be taken from the host names (uname -n)

PRIMARY/SECONDARY# crm status
Last updated: [...]
Last change: [...] via crm_node on primary
Stack: corosync
Current DC: primary (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
0 Resources configured

Online: [ primary secondary ]

We see both nodes (primary and secondary) online, with the numeric ID of the current node shown next to the current DC.

Now that the cluster infrastructure is ok we do some fine tuning:

  • stonith disable: we disable automatic fencing (forced power-off) of failed nodes, which in a two-node cluster like this is of little use;
  • quorum policy disable: in a two-node cluster we want the cluster up and running even with a single node.
PRIMARY# crm configure property stonith-enabled=false
PRIMARY# crm configure property no-quorum-policy=ignore
PRIMARY/SECONDARY# crm configure show
node $id="1" primary
node $id="2" secondary
property $id="cib-bootstrap-options" \
 dc-version="1.1.10-42f2063" \
 cluster-infrastructure="corosync" \
 stonith-enabled="false" \
 no-quorum-policy="ignore"

We are ready to add resources (Resource Agents) to Pacemaker and, as we said before, we will add an IP address (VIP) and the freeradius system service (which we need to install first)

PRIMARY/SECONDARY# apt-get install freeradius

A Resource Agent is “a standardized interface for a cluster resource. It translates a standard set of operations into steps specific to the resource or application, and interprets their results as success or failure.” (have a look here for more information).

We can use two kinds of Resource Agents:

  • LSB: those found in the /etc/init.d/ directory and provided by the OS. freeradius will be one of these;
  • OCF: specific resources that can also be downloaded and installed from the web; an extension of the LSB resources. VIP will be one of these (see below for how to list the available agents).
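
If you want to browse what is available on your nodes before configuring anything, the crm shell can list the known agents and their parameters (output omitted); a quick look:

PRIMARY# crm ra classes                      # resource agent classes (lsb, ocf, service, ...)
PRIMARY# crm ra list lsb                     # LSB scripts from /etc/init.d/ (freeradius appears here once installed)
PRIMARY# crm ra list ocf heartbeat           # OCF agents from the heartbeat provider (IPaddr2 is one of them)
PRIMARY# crm ra info ocf:heartbeat:IPaddr2   # parameters accepted by the IPaddr2 agent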

First we configure the VIP, which is an OCF resource called IPaddr2 (bound to the eth0 interface)

PRIMARY# crm configure primitive vip1 ocf:heartbeat:IPaddr2 params ip="<VIP>" nic="eth0" op monitor interval="10s"
PRIMARY# crm configure show
node $id="1" primary
node $id="2" secondary
primitive vip1 ocf:heartbeat:IPaddr2 \
 params ip="<VIP>" nic="eth0" \
 op monitor interval="10s" \
 meta target-role="Started"
[...]
PRIMARY# crm status
Last updated: [...]
Last change: [...] via cibadmin on primary
Stack: corosync
Current DC: primary (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
1 Resources configured

Online: [ primary secondary ]

vip1 (ocf::heartbeat:IPaddr2): Started primary
PRIMARY#

The VIP (resource vip1) is started on the primary node and we can check this directly from the nodes

PRIMARY# ip addr show 
[...]
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
 [...]
 inet <PRIMARY_IP> brd <PRIMARY_BROADCAST> scope global eth0
 valid_lft forever preferred_lft forever
 inet <VIP>/32 brd <VIP_BROADCAST> scope global eth0
 valid_lft forever preferred_lft forever
 [...]

SECONDARY# ip addr show
[...]
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
 [...]
 inet <SECONDARY_IP> brd <SECONDARY_BROADCAST> scope global eth0
 valid_lft forever preferred_lft forever
 [...]
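
From any other host on the same subnet (shown here as CLIENT) you can also verify that the VIP answers and that it is currently served by the primary node; a trivial check:

CLIENT# ping -c 3 <VIP>
CLIENT# ip neigh show | grep <VIP>     # the MAC address shown should be the primary node's eth0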

On the network side we are now OK; let’s proceed with freeradius clustering. We add the LSB resource on the PRIMARY node

PRIMARY# crm configure primitive freeradius lsb:freeradius \
 op monitor interval="5s" timeout="15s" \
 op start interval="0" timeout="15s" \
 op stop interval="0" timeout="15s" \
 meta target-role="Started"

We have two resources configured (vip1 and freeradius) and the cluster could start each resource on a different node. So we clone the freeradius resource, allowing the freeradius service to be active on both nodes at the same time (in this particular case this is the right choice and it makes cluster switches faster)

PRIMARY# crm configure clone freeradius-clone freeradius
PRIMARY# crm_mon
Online: [ primary secondary ]

vip1 (ocf::heartbeat:IPaddr2): Started primary
 Clone Set: freeradius-clone [freeradius]
 Started: [ primary secondary ]

One last tuning step.
We define resource colocation, telling the cluster that one resource depends on the location of another resource. This configuration ensures that all the resources involved run on the master cluster node at the same time.

PRIMARY# crm configure colocation vip1-freeradius inf: vip1 freeradius-clone
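
To verify the whole setup end to end you can point a RADIUS client at the VIP; a minimal sketch, assuming you have defined a test client (shared secret testing123) and a test user (bob/password) in the freeradius configuration of both nodes (these names are just placeholders):

CLIENT# radtest bob password <VIP> 0 testing123

If you then put the primary node in standby (crm node standby) and repeat the request, it should be answered through the same VIP by the secondary node.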

Now we have the cluster up and running, enjoy!

7 thoughts on “Configure basic Linux High Availability Cluster in Ubuntu with Corosync”

  1. Thank you, this is a good read. I tested this with 2 Ubuntu 16.04 servers. I was able to down the interface on the primary and it failed over to the other node no problem. How do I restore the VIP on the primary after recovery?


    1. Hi Jon,
      you can test resource failover by setting a node in standby.
      When a node goes into standby it exits the cluster (this is like a node failure) and all the resources move to the other node.
      Then, after resource migration is complete, you can put the node back into the cluster (set it online).
      You can do this on both nodes to test the failover.

      Into the active physical node
      # crm node standby
      # crm_mon
      [ you will see the node in standby status with the resources migrated to the other node ]

      After resource migration
      # crm node online
      # crm_mon
      [ you will see the node in online status and the resources active on the other node ]

      If you want to migrate resources from one node to the other, check this
      https://unix.stackexchange.com/questions/170986/pacemaker-migrate-resource-without-adding-a-prefer-line-in-config

      This is how to migrate resources; look at the last answer, it’s important to use the unmigrate command to avoid side effects
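
      As a minimal sketch of that approach, with the resource name used in this post:

      Move vip1 to the other node
      # crm resource migrate vip1 secondary

      When the move is complete, remove the constraint created by migrate (otherwise vip1 stays pinned to that node)
      # crm resource unmigrate vip1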


      1. thank you very much for the insight!
        Is it possible to load balance traffic to a virtual IP?

        For example: I want to set up a SIP load balancer, instead of failing over to the standby node I would like to distribute to several nodes. Any protips appreciated


      2. > Is it possible to load balance traffic to a virtual IP?
        of course yes.
        You need to configure a specific load balancing service like HAProxy http://www.haproxy.org/

        I found a tutorial that you can also use

        Set up HAProxy with Pacemaker/Corosync on Ubuntu 16.04

        This Document roughly describes a HAProxy Cluster Setup on Ubuntu 16.04 based on an example Configuration with 3 Nodes

        This Document is still work in Progress; the Following Stuff still needs to be done:

        • Explain the crm configure steps
        • explain Miscellaneous CRM Commands for Cluster Management
        • Add all the external resources used.
        • Add a simple HAProxy Configuration for testing purposes

        Example Installation

        This example Installation consists of three Nodes with the following names and IP Addresses:

        • haproxy01-test 10.0.0.11

        • haproxy02-test 10.0.0.12

        • haproxy03-test 10.0.0.13

        • VIRTUAL IP 10.0.0.10

        The Network they are on is: 10.0.0.0/24

        If you would like to apply the Steps shown here to another environment, you need to replace all Network Addresses with the ones used in your Environment.

        Prerequisites

        The Following Prerequisites must be met for this to work:

        • All Nodes must have a valid Network Configuration and must be on the same Network.
        • All Nodes must be able to download and install Standard Ubuntu Packages.
        • Root Access to every Node is needed.

        Installation and Configuration of Pacemaker

        This must be run on every Node

        # Upgrade Ubuntu Installation
        sudo apt update
        sudo apt upgrade -y
        # Install pacemaker and haproxy Packages
        sudo apt install pacemaker haproxy -y
        systemctl stop corosync
        systemctl stop haproxy
        systemctl disable haproxy

        This must be run on the primary Node only (i.e haproxy01-test 10.0.0.11):

        # Installation of haveged package to generate better random numbers for Key Generation
        sudo apt install haveged -y
        # Corosync Key generation:
        sudo corosync-keygen
        # Removal of the no longer needed haveged package
        sudo apt remove haveged -y

        Now we need to Copy the generated Key from the primary node over to the secondary nodes:

        scp /etc/corosync/authkey USER@10.0.0.12:/tmp/corosync-authkey
        scp /etc/corosync/authkey USER@10.0.0.13:/tmp/corosync-authkey

        This must be run on the two secondary Nodes (i.e. haproxy02-test 10.0.0.12 and haproxy03-test 10.0.0.13):

        sudo mv /tmp/corosync-authkey /etc/corosync/authkey
        sudo chown root: /etc/corosync/authkey
        sudo chmod 400 /etc/corosync/authkey

        After this you need to create the Following minimal Corosync Configuration File (/etc/corosync/corosync.conf) on every Node:

        totem {
          version: 2
          cluster_name: haproxy-prod
          transport: udpu
        
          interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0
            broadcast: yes
            mcastport: 5407
          }
        }
        
        nodelist {
          node {
            ring0_addr: 10.0.0.11
          }
          node {
            ring0_addr: 10.0.0.12
          }
          node {
            ring0_addr: 10.0.0.13
          }
        }
        
        quorum {
          provider: corosync_votequorum
        }
        
        logging {
          to_logfile: yes
          logfile: /var/log/corosync/corosync.log
          to_syslog: yes
          timestamp: on
        }
        
        service {
          name: pacemaker
          ver: 1
        }
        

        Inside the interface portion you can find the bindnetaddr value, which must be set to the corresponding Network Address

        Inside the nodelist every node is represented by its IP Address; if you happen to have fewer or more than three nodes, you must adjust them here.

        This must also be run on every Node:

        # Enable and restart Corosync Service
        sudo systemctl restart corosync.service
        sudo systemctl enable corosync.service
        # Enable and restart Pacemaker Service
        update-rc.d pacemaker defaults 20 01
        sudo systemctl restart pacemaker.service
        sudo systemctl enable pacemaker.service

        To make sure corosync is up and running, run the command sudo crm status. The Output should tell you that the Stack in use is corosync and that there are three Nodes configured; it should look like this:

        crm status:
        Last updated: Fri Oct 16 14:38:36 2015
        Last change: Fri Oct 16 14:36:01 2015 via crmd on primary
        Stack: corosync
        Current DC: primary (1) - partition with quorum
        Version: 1.1.10-42f2063
        3 Nodes configured
        0 Resources configured
        
        
        Online: [ primary secondary ]
        

        The following Steps can be run on any (one) Node, because right now corosync should keep the Cluster Configuration in Sync:

        sudo crm configure property stonith-enabled=false
        sudo crm configure property no-quorum-policy=ignore
        sudo crm configure primitive VIP ocf:heartbeat:IPaddr2 \
        params ip="10.0.0.10" cidr_netmask="24" nic="ens160" \
        op monitor interval="10s" \
        meta migration-threshold="10"
        sudo crm configure primitive res_haproxy lsb:haproxy \
        op start timeout="30s" interval="0" \
        op stop timeout="30s" interval="0" \
        op monitor interval="10s" timeout="60s" \
        meta migration-threshold="10"
        sudo crm configure group grp_balancing VIP res_haproxy

        The last Thing you need to do is to keep your haproxy Configuration in sync on every node.
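
        One simple manual way to do that, as a sketch with the node addresses of this example (a config management tool or csync2 would also work):

        # Copy the edited config from haproxy01-test to the other nodes
        scp /etc/haproxy/haproxy.cfg USER@10.0.0.12:/tmp/haproxy.cfg
        scp /etc/haproxy/haproxy.cfg USER@10.0.0.13:/tmp/haproxy.cfg
        # On each of the other nodes, put it in place
        sudo mv /tmp/haproxy.cfg /etc/haproxy/haproxy.cfg
        # On any node, restart the clustered haproxy resource so the new configuration is loaded
        sudo crm resource restart res_haproxy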


  2. Great post
    I am facing a few problems,
    here is my status
    service corosync staatus
    Usage: /etc/init.d/corosync {start|stop|restart|force-reload}
    root@www:~# service corosync status
    * corosync is running

    it only shows this

    corosync-cmapctl | grep members
    runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
    runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(x.x.x.x)
    runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
    runtime.totem.pg.mrp.srp.members.2.status (str) = joined

    (my SSH port is 22 on one node and 2222 on the other)

