Overview

The Geo Redundancy function provides support for deploying EdgeView on RAMP at two geographically separate data centers to provide redundancy if one data center goes out of service. The deployment includes two EdgeView sites (datacenter-1 (dc1) and datacenter-2 (dc2)) and a third site (referred to as datacenter-0 (dc0) or the 'witness' site) that operate in an ACTIVE/STANDBY configuration on top of a fully replicated database cluster of three separate database nodes.  The feature provides resiliency against a single failure: a single failure should not bring the entire system down, though some services may be briefly interrupted while the system switches from ACTIVE/STANDBY to STANDBY/ACTIVE.  The feature does not claim to assure system uptime or availability against two or more simultaneous failures.

Limitations  

  • Only the management data regarding the devices, stored in MySQL, are replicated.
  • Cassandra data are not replicated.  As such:
    • Call data are available only on the DC that was active when they were ingested from EdgeMarcs.
    • WAN Health data will not be accurate after a switch in DC, since that data are stored in Cassandra.
    • Peak call data from EdgeView 14 will not be visible in EdgeView 16.
  • Reports and troubleshooting data, such as packet captures and diagnostic snapshots, are not replicated.  Thus, captured items are accessible only when the DC on which they were captured is currently active.
  • These limitations may be addressed in future releases.

Requirements

Minimum Requirements:

  • AlmaLinux 8 or 9, or Ubuntu 20.04 LTS Virtual Machine.
    • Use a dedicated virtual machine with no other production applications running on it. Install and update the base OS packages to the latest versions.
    • Ensure Internet connectivity to automatically install and configure software dependencies.
  • vCPU/Core: 4*
    • Example: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50 GHz
  • RAM: 8 GB*
  • NIC interface: 1
  • Disks mounted to the Virtual Machine: 2**
    • Host OS: 100 GB minimum
    • Witness (DC0)/Docker storage: 
      • Minimum 100 GB*.  Recommended 200 GB.
  • Docker version 25.0.5 or later
  • Docker Compose version 2.26.1
  • EdgeView 16.4.1 or higher

*CPU, memory, and disk resources should have monitoring and alerting configured using tools native to the environment. It is recommended to monitor and alert at generally safe thresholds, for example 75% of capacity, and to increase the resources available to the system as necessary.

**Configure and mount an additional disk/file system at /var/lib/docker/ prior to any installation activities. This provides greater flexibility and reliability on production systems. A sample configuration guide is provided in the Configuring Storage for EdgeView and Docker section.
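
For orientation only, below is a minimal sketch of formatting and mounting a dedicated disk at /var/lib/docker/, assuming the new device appears as /dev/sdb (a hypothetical name; confirm with lsblk). Refer to the Configuring Storage for EdgeView and Docker section for the complete procedure.

  # Hypothetical device name; confirm with lsblk before formatting.
  mkfs.xfs /dev/sdb
  mkdir -p /var/lib/docker
  echo '/dev/sdb /var/lib/docker xfs defaults 0 0' >> /etc/fstab
  mount -a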

Note

If you choose to install the EdgeView product on a bare metal server, Ribbon recommends managing the storage provided to the OS through a logical volume manager to enable redundancy and facilitate expanding storage capacity over time. However, due to the operational complexity, Ribbon does not recommend this option for most environments. Additionally, configuring such an environment is considered outside the scope of this document. Contact your Ribbon Sales representative for guidance.


Network Port Requirements:

Additional ports MUST be opened for the Galera database replication to function.  The ports required on all three data centers (DC0, DC1, and DC2) are as follows (a firewall example follows the list):

  • 3306 - MySQL client connections
  • 4567 - Galera cluster replication traffic
  • 4568 - Galera incremental state transfer
  • 4444 - Galera full state transfer (SST) from a donor node
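
As an illustration, on an AlmaLinux host using firewalld, these ports could be opened as follows (a sketch only; adapt to the firewall tooling in your environment):

  firewall-cmd --permanent --add-port=3306/tcp
  firewall-cmd --permanent --add-port=4567/tcp
  firewall-cmd --permanent --add-port=4568/tcp
  firewall-cmd --permanent --add-port=4444/tcp
  firewall-cmd --reload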

Installation Process

Geo Redundancy deployment can be a complex process.  Contact your Ribbon Account prime for consultation before deploying a Geo Redundancy setup of EdgeView.


Step I: Load Balancer

EdgeView with Geo Redundancy requires a customer-provided load balancer for operation.  Load balancer software is not provided with the Geo Redundancy feature.  For illustrative purposes, this section uses HAProxy as an example of how a load balancer could be deployed.  Consult your Ribbon Account prime if a different load balancer is used.

Purpose of the Load Balancer

  • The Geo Redundancy feature is implemented as an ACTIVE/STANDBY or STANDBY/ACTIVE pair – only one datacenter is 'ACTIVE' at a time.  The 'STANDBY' datacenter is not enabled for device management.
    • The Geo Redundancy feature will not work if traffic is randomly or arbitrarily sent to the STANDBY instead of the current ACTIVE system.
  • The load balancer must monitor the health of the two backend datacenters.  If the currently designated ACTIVE system becomes unresponsive or unhealthy, the load balancer must redirect traffic to the STANDBY system, which then becomes the new ACTIVE datacenter.
  • The load balancer represents a single, publicly exposed entry point into the overall system for devices and users to connect to.  The load balancer directs all incoming traffic to the current ACTIVE server.
    • In the general usage of the term 'load balancer', traffic would be distributed evenly across all available backend systems.  At this time, the Geo Redundancy feature supports only one available or ACTIVE backend system.
    • The load balancer must not distribute incoming requests to the STANDBY system.  That is, the load balancer must not be configured to use round-robin or random distribution across both datacenters.  Traffic must only be sent to the ACTIVE backend system, of which there is currently only one.

[Figure: Load balancer]

Example with HAProxy

The load balancer must forward traffic on the following ports:

  • 443 - HTTPS
  • 80 - HTTP
  • 5671 - AMQPS
  • 8022 - device management
  • 8443 - Open Device GUI
  • 9142 - Cassandra connection for legacy support

To monitor the health of the datacenters, the load balancer should perform an HTTPS GET to '/scc/v1/systeminfo/hastatus' on each data center; a manual test example follows the list below.

  • A 2xx class response code indicates that the backend datacenter is healthy – 200 for the ACTIVE system, 202 for the STANDBY.
  • A 5xx class response, or no response/timeout, indicates that the backend datacenter is not healthy and should not have traffic routed to it.
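
The health check can be exercised manually with curl.  This sketch assumes the dc1.example.com hostname used in the configuration examples below; substitute your datacenter FQDN:

  # Prints only the HTTP status code: expect 200 (ACTIVE) or 202 (STANDBY).
  # -k skips server certificate verification, matching the HAProxy settings below.
  curl -sk -o /dev/null -w '%{http_code}\n' https://dc1.example.com/scc/v1/systeminfo/hastatus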

For HAProxy, the global settings should be adjusted to skip attempts to verify the server HTTPS certificate.  The global settings should include:

  ssl-default-server-options ssl-min-ver TLSv1.1 no-tls-tickets
  ssl-server-verify none

HAProxy must be configured to handle the incoming HTTPS traffic with, by HAProxy terms, a frontend definition:

frontend ft_https_app
  mode tcp
  bind :443 name app
  default_backend bk_https_app

NOTE: The load balancer must not terminate the SSL/TLS connection – the HTTPS traffic must be passed through as raw TCP in order for the backend datacenter to receive client certificates.

Then configure the HAProxy backend for HTTPS – the example uses dc1.example.com for the first datacenter FQDN/IP address, and dc2.example.com for the second datacenter.  This should be replaced by the appropriate values for the installation.

backend bk_https_app
  stick-table type ip size 2 nopurge
  stick on dst
  option httpchk GET /scc/v1/systeminfo/hastatus HTTP/1.1
  http-check send hdr Host %H
  server s1 dc1.example.com:443 check inter 15s check-ssl
  server s2 dc2.example.com:443 check inter 15s check-ssl backup

The configuration sets up the two backend server definitions representing the datacenters and enables the health check over SSL/TLS at 15-second intervals.  One server is initially designated as 'backup' to indicate that this is an ACTIVE/STANDBY arrangement, not round-robin distribution.  HAProxy is configured to 'stick' to the currently designated ACTIVE datacenter and not fail back automatically.  The current ACTIVE/STANDBY status of the datacenters is stored in an internal stick table.

The configurations for the other forwarded ports reference the internal stick table set up by the HTTPS configuration, so they use the same information on the current ACTIVE/STANDBY status.  As an example, for the AMQPS traffic:

frontend ft_amqps_app
  mode tcp
  bind :5671 name app
  default_backend bk_amqps_app

backend bk_amqps_app
  stick on dst table bk_https_app
  server s1 dc1.example.com:5671 track bk_https_app/s1
  server s2 dc2.example.com:5671 backup track bk_https_app/s2

The AMQPS traffic routing is configured to track the ACTIVE/STANDBY status determined by the HTTPS health checks.  The configuration for the remaining ports (80, 8022, 8443, and 9142) is similar; a sketch for port 8022 follows.
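
For instance, the device management port 8022 could follow the same pattern.  This is a sketch only; the frontend and backend names are illustrative:

frontend ft_devmgmt_app
  mode tcp
  bind :8022 name app
  default_backend bk_devmgmt_app

backend bk_devmgmt_app
  stick on dst table bk_https_app
  server s1 dc1.example.com:8022 track bk_https_app/s1
  server s2 dc2.example.com:8022 backup track bk_https_app/s2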

These configuration excerpts are provided as guidelines or examples.  An actual HAProxy deployment may require additional configuration or tuning specific to the installation network.


Step II: Witness (DC0) Installation

1) Go to the VM reserved for the witness and install Docker there. The Docker version must be 25.0.5 or greater.
2) Download the witness build files (ev-witnesspkg.tar.xz, install-witness.sh, MD5SUM-witness.txt) to the /opt/ directory.
3) Execute the command below to start the installation.

./install-witness.sh 


The script prompts for a few details; set them as follows:

  1. RAMP_EV_CLUSTER_BOOTSTRAP: true
  2. RAMP_EV_CLUSTER_HOST: IP address of host (witness VM IP)
  3. RAMP_EV_CLUSTER_HOST_LIST: Witness VM’s IP address, 1st Edgeview VM’s IP address, 2nd Edgeview VM’s IP address
  4. RAMP_EV_PROXY_ADDRESS: IP address of load balancer VM

4) Verify that the container was created successfully and is in a healthy state by executing the command below.

docker ps
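
If preferred, the output can be narrowed to container names and health status using standard docker ps formatting (the witness container name depends on the deployment):

  # Shows container names and status; healthy containers report "(healthy)".
  docker ps --format 'table {{.Names}}\t{{.Status}}'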


Step III: 1st Edgeview (DC1) Installation

1) Download the EdgeView 16.4.1 build files to the /opt/ directory on the VM reserved for the 1st Edgeview (DC1).
2) Execute the command below to start the installation.  Refer to Install and Configure EdgeView for more details.

Note

Perform the initial SCC configuration only after the DC1 and DC2 Edgeview installations are completed.


./install-ev.sh

The script prompts for a few details; set them as follows (an illustrative set of values appears after step 3):

  1. network versions: 1 (i.e., IPv4 only)
  2. RAMP_EV_GEOREDUNDANCY_ENABLED: true
  3. GUI_IP: IP address of load balancer VM
  4. EV_IP: IP address of load balancer VM
  5. RAMP_EV_GEOREDUNDANCY_DATACENTER: 1 for the 1st Edgeview (DC1)
  6. RAMP_EV_CLUSTER_BOOTSTRAP: false
  7. RAMP_EV_CLUSTER_HOST: IP address of the current Edgeview (DC1) VM
  8. RAMP_EV_CLUSTER_HOST_LIST: Witness VM’s IP address, 1st Edgeview (DC1) VM’s IP address, 2nd Edgeview (DC2) VM’s IP address

3) Verify that the container was created successfully and is in a healthy state by executing the command below.

docker ps 
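
For illustration only, with hypothetical addresses (witness 10.0.0.10, DC1 10.0.1.10, DC2 10.0.2.10, load balancer 10.0.0.1), the DC1 answers might look like the following.  The exact prompt wording and list format are determined by the installer:

  network versions: 1
  RAMP_EV_GEOREDUNDANCY_ENABLED: true
  GUI_IP: 10.0.0.1
  EV_IP: 10.0.0.1
  RAMP_EV_GEOREDUNDANCY_DATACENTER: 1
  RAMP_EV_CLUSTER_BOOTSTRAP: false
  RAMP_EV_CLUSTER_HOST: 10.0.1.10
  RAMP_EV_CLUSTER_HOST_LIST: 10.0.0.10, 10.0.1.10, 10.0.2.10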

 

Step IV: 2nd Edgeview (DC2) Installation 

Repeat Step III and provide RAMP_EV_GEOREDUNDANCY_DATACENTER: 2 and RAMP_EV_CLUSTER_HOST: IP address of the current (DC2) Edgeview VM.


Step V: Accessing the GR-setup 

Access your GR setup through the load balancer IP address.  For the initial SCC configuration, refer to step 7 of Install and Configure EdgeView.


Upgrading EV (Edge View) from Standalone to GR (Geo Redundancy) setup

To upgrade Edgeview from Standalone to GR, contact your Ribbon Account support prime.


Backup / Restore Procedure

Database backup process

To back up the database, perform the following steps:

  1. First, ensure all three sites are healthy.
  2. Pause replication on the dc0 (witness) site.
    1. Change directory: cd /opt/ramp-ev-witness
    2. Run: ./scripts-witness/replication-stop.sh
  3. Back up the database.
    1. There are several methods and software packages that can back up the MySQL database – including simply creating a copy or an archive of the database directory.  A mysqldump sketch follows this list.
    2. The mysqldump program is already installed in the MySQL container.  See the official reference for more information: https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html
  4. Resume replication on the dc0 (witness) site:
    1. Change directory: cd /opt/ramp-ev-witness
    2. Run: ./scripts-witness/replication-start.sh
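
As a sketch of the mysqldump approach: the container name ev-mysql and the MYSQL_ROOT_PASSWORD environment variable below are assumptions for illustration; use the actual container name from docker ps and your deployment's credentials.

  # Container name and credential variable are assumptions; adjust to your deployment.
  docker exec ev-mysql sh -c \
    'exec mysqldump --all-databases --single-transaction -uroot -p"$MYSQL_ROOT_PASSWORD"' \
    > /opt/backup/edgeview-db-$(date +%Y%m%d).sql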

Database restore process

To restore the database, perform the following steps:

  1. The system must be shut down.
  2. If this is not a fresh install, clean out the database files on the dc0 site.
  3. Restore the database backup on dc0 (a restore sketch follows this list).
    1. If the restore process requires the database to be running:
      1. Start the dc0 site in bootstrap mode to (re)initialize the cluster.
      2. Restore the database.
    2. If the restore must be done while the database is not running:
      1. Restore the database.
      2. Start the dc0 site in bootstrap mode to (re)initialize the cluster.
  4. Start the dc1 site.
    1. Wait for the dc1 site to report as healthy.
      1. In the worst case, the dc1 site may have to perform a full database sync with the dc0 site.
      2. During the full sync, the dc0 site will be in 'donor' mode.
  5. Start the dc2 site.
    1. Ideally, the dc2 site should not be started until the dc0 site has left 'donor' mode, so that the dc2 site can sync, if necessary, from dc0 instead of from dc1.
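
As a sketch of restoring a mysqldump file while the database is running (the step 3.1 variant), using the same hypothetical container name and an example backup path:

  # Container name and backup path are assumptions; adjust to your deployment.
  docker exec -i ev-mysql sh -c \
    'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' \
    < /opt/backup/edgeview-db-20240101.sql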

Usage

Prerequisites

  • The user must be familiar with the concept of software containers.
  • The user must understand how to execute commands within the EdgeView containers.

Use Cases

The following are representative use cases only and do not include all the actions that can be performed on RAMP.

  1. Manually switch processing to the standby site, setting the current active site to 'STANDBY' or 'NOTREADY'.

    1. In the gr-sidecar container of the active site, execute the following command to force STANDBY status:

      curl -s -X PUT -d STANDBY http://127.0.0.1:8000/dcstatus/v1/admin
    2. Or, to force NOTREADY status:

      curl -s -X PUT -d NOTREADY http://127.0.0.1:8000/dcstatus/v1/admin
    3. IMPORTANT: after the other site has gone active, execute the following in the gr-sidecar container:

      curl -s -X PUT -d ACTIVE http://127.0.0.1:8000/dcstatus/v1/admin
      • This does NOT make the now-standby site ACTIVE – it enables the site to go ACTIVE.  Without setting the desired admin status to ACTIVE, the standby site cannot go ACTIVE, which means the system cannot fail back to that site.
  2. Gracefully shut down the system.

    1. Determine which of the two EdgeView sites – dc1 or dc2 – is the standby site and which is the active site.
      • Directly connect to the EdgeView GUI of dc1 and dc2 and examine the ACTIVE or STANDBY status on the GUI (do not connect through the user-provided proxy/load balancer).
      • Or, execute this command within the gr-sidecar container on dc1 and dc2, which returns the current status:

      curl http://127.0.0.1:8000/dcstatus/v1/datacenter

    2. Administratively set the standby site to 'NOTREADY'.  In the gr-sidecar container of the standby site, execute the following command to force NOTREADY status:

      curl -s -X PUT -d NOTREADY http://127.0.0.1:8000/dcstatus/v1/admin

    3. Shut down the active site.
    4. Shut down the standby site.
    5. Shut down the dc0 site (the witness site).

      NOTE: dc0 should always be shut down last.  Do not shut down dc0 before the two EdgeView sites.



  3. Restart the system.
    1. Start the dc0 (witness) site in bootstrap mode and wait for the system to report as healthy.
      1. NOTE: dc0 should always be started first, before the other two EdgeView sites.
      2. See the below section on 'Bootstrap Cluster From Witness' for the exact commands.
    2. Start the dc1 site.
    3. Wait for the dc1 site to report as healthy (a cluster status check sketch appears at the end of this section).
      1. The dc1 site may have to perform a full database sync with the dc0 site – this will happen automatically if needed.
      2. During the full sync, the dc0 site will be in 'donor' mode.
    4. Start the dc2 site.
      1. The dc2 site may have to perform a full database sync with the dc0 site – this will happen automatically if needed.
      2. During the full sync, the dc0 site will be in 'donor' mode.
    5. While (re)starting dc1 and dc2 simultaneously is possible, it is not recommended.
      1. Only one of dc1 or dc2 can synchronize with dc0 at a given time.  So even if both are started together, one will end up being blocked while the other site synchronizes first.
      2. Because one site is blocked from synchronizing until the other is done, warnings or error messages that the database is not responsive may appear while the database instance is blocked.
  4. Recover from losing a single site/node.
    1. If a site has been completely lost, it can be replaced by a (re)install.
    2. Delete or clear any persistent database volumes to remove any corrupt or leftover database files.
    3. Start the (re)installed site.
      1. The site should rejoin the cluster and synchronize.
      2. The synchronization process may take some time as the (re)installed site must download an entire copy of the database from the other cluster members.
  5. Recovering from losing two sites.
    1. This is out of scope for the product feature.  The Geo Redundancy feature provides resiliency against a single failure.  Two simultaneous site failures are not something the Geo Redundancy feature can provide automatic recovery for.
    2. If two sites fail and/or must be re-installed, a manual recovery may be possible.  As long as one site is still functional, follow the steps for recovering from losing a single site for each site in sequence:
      1. Recover/restart one of the failed sites.
      2. Wait for the site to report as healthy
      3. Recover/restart the second site.
  6. Recovering from losing all three sites
    1. This is out of scope for the product feature.  The Geo Redundancy feature provides resiliency against a single failure.  Three simultaneous site failures are not something the Geo Redundancy feature can provide recovery for.
    2. Recovery will be a manual process depending on the nature and severity of the failures.
      1. If at least one site can be (re)started, then after it is bootstrapped, the process from 'Recovering from losing two sites' can be followed.
      2. If no sites can be recovered at all, then the system will need to be reinstalled completely and restored from backups.
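
To check that the database cluster has formed and is synced (referenced from the restart use case above), the standard Galera status variables can be queried from within the MySQL container.  The container name ev-mysql is a hypothetical example; use the actual name from docker ps:

  # On a healthy cluster, wsrep_cluster_size should be 3 and
  # wsrep_local_state_comment should be 'Synced'.
  docker exec ev-mysql sh -c \
    'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW STATUS LIKE \"wsrep_%\""'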