Monitoring the HFE Node in AWS

Feature Overview

In most SBC SWe deployments in the Amazon Web Services (AWS) cloud environment, a High-Availability Front End (HFE) node is deployed as the entity that communicates with external peers. You can monitor this essential network component by monitoring the flow of events that come from the HFE node. Once it is deployed and configured, the HFE node sends an "HFEState" event every two seconds to indicate that it is up and running. An interruption in the flow of these events can indicate that the HFE node has failed and that some type of action is warranted. To enable sending these events, the IAM (Identity and Access Management) permission policy for the HFE node must include the "events:PutRule" and "cloudwatch:PutMetricData" permissions (see Example: IAM "putRule" Permission for the additional statement).

You can use an AWS CloudWatch alarm to detect when the HFE node has stopped sending HFEState events and to alert you. For example, an alarm can be configured to send an email alert when triggered. Alternatively, you can develop your own script, or use an AWS mechanism such as an AWS Lambda function, to automatically trigger a recovery action.

For example, you can configure an alarm to count the HFEState events received from the HFE node within a specified time interval and to trigger if the count is lower than expected. Because the node sends one event every two seconds, a 10-second interval should contain about five events, and at least four even when the interval boundaries fall between events. The threshold for triggering an alarm state can therefore be an HFEState event count of less than three. The alarm can also be configured to treat missing event data as bad (breaching the threshold), so that a complete absence of events triggers the alarm as well. You can then create a script or implement a Lambda function that reboots the HFE node when the HFEState event count triggers an alarm state. For an HFE node that was previously working and had been sending events successfully, a reboot can enable the node to recover and resume working. Refer to the AWS documentation on topics such as Amazon CloudWatch Alarms and AWS Lambda for more information on the capabilities they provide.
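The alarm described above might be created along the following lines with boto3. This is a minimal sketch rather than a Ribbon-supplied configuration: the namespace "HFE/Monitoring", the metric name "HFEState", the "InstanceId" dimension, and the SNS topic ARN passed in are all assumptions, so substitute the values your HFE node actually publishes. A 10-second period also assumes the metric is stored as a high-resolution metric.

```python
def build_hfe_alarm_params(instance_id, sns_topic_arn,
                           period_seconds=10, min_events=3):
    """Return keyword arguments for the CloudWatch put_metric_alarm call.

    The alarm enters the ALARM state when fewer than `min_events` HFEState
    samples are counted in one period, and treats missing data as breaching.
    """
    return {
        "AlarmName": "HFEStatusCheck",
        "Namespace": "HFE/Monitoring",   # assumed namespace
        "MetricName": "HFEState",        # assumed metric name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],  # assumed dimension
        "Statistic": "SampleCount",      # count events rather than average them
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(min_events),
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_hfe_alarm(cloudwatch, instance_id, sns_topic_arn):
    """Create or update the alarm using a boto3 CloudWatch client."""
    params = build_hfe_alarm_params(instance_id, sns_topic_arn)
    cloudwatch.put_metric_alarm(**params)
```

To apply it, pass a client created with boto3.client("cloudwatch", region_name="us-east-1") to create_hfe_alarm, together with the HFE instance ID and the ARN of the SNS topic that should receive the notification. Keeping the parameters in a separate function makes the threshold and period easy to review before anything is sent to AWS.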

Note

Ribbon is not responsible for maintaining or debugging AWS or other third-party tools that you incorporate into a monitoring mechanism for the HFE node.

The specifics of the alerting mechanism you implement and the threshold that triggers an alert are up to you, and should take into account how quickly you want to act in response to a possible error condition. However, Ribbon recommends that you monitor the event count over a period long enough to avoid false alerts that the HFE node has failed, particularly if you implement an automatic response when an alert is triggered. Keep in mind that the HFE node begins to send HFEState events only after it is successfully deployed and configured. If you begin monitoring the node before configuration completes, no HFEState events are being sent yet, and any automatic action can be triggered repeatedly without giving the node an opportunity to recover.
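One way to apply both recommendations in a custom script is to act only after several consecutive low periods, and only if the node was seen sending events earlier. The following is an illustrative sketch; the function name, the default of three low periods, and the idea of keeping a list of per-period counts are assumptions, not part of the HFE node or of CloudWatch.

```python
def should_trigger_recovery(event_counts, min_events=3, required_low_periods=3):
    """Decide whether to act, given per-period HFEState counts (oldest first).

    Returns True only if (a) at least one earlier period was healthy, which
    shows the node finished configuration and was sending events, and
    (b) each of the most recent `required_low_periods` periods is below
    `min_events`.
    """
    if len(event_counts) < required_low_periods + 1:
        return False  # not enough history to judge
    recent = event_counts[-required_low_periods:]
    earlier = event_counts[:-required_low_periods]
    was_healthy = any(count >= min_events for count in earlier)
    all_low = all(count < min_events for count in recent)
    return was_healthy and all_low
```

With these defaults, a node that has never sent events (for example, one still being configured) does not trigger recovery, and a single quiet period followed by normal traffic does not either.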

Important

Be sure to disable any automatic action performed by a custom script or by a Lambda function before you upgrade, intentionally shut down, or manually reboot the HFE node. HFEState events are not sent during such times, which can trigger an invalid alarm state and subsequent actions that are unnecessary or damaging.
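If the automatic action is driven by CloudWatch alarm actions, one possible way to suspend it during maintenance is sketched below, using the CloudWatch disable_alarm_actions and enable_alarm_actions calls. The helper name is hypothetical, and the sketch covers only actions attached to the alarm; a script that polls the alarm state on its own must be stopped separately.

```python
from contextlib import contextmanager


@contextmanager
def alarm_actions_suspended(cloudwatch, alarm_names):
    """Disable the actions of the named alarms while the block runs.

    `cloudwatch` is a boto3 CloudWatch client. Actions are re-enabled when
    the block exits, even if the maintenance step raises an exception.
    """
    cloudwatch.disable_alarm_actions(AlarmNames=alarm_names)
    try:
        yield
    finally:
        cloudwatch.enable_alarm_actions(AlarmNames=alarm_names)
```

Wrap the maintenance step in "with alarm_actions_suspended(client, ["HFEStatusCheck"]):" and keep the wait for the node to resume sending HFEState events inside the block, so that actions are not re-enabled while the alarm is still in the ALARM state.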

Example: IAM "putRule" Permission

The following is an example statement (in JSON format) for creating an IAM policy for an HFE node. It includes the "events:PutRule" and "cloudwatch:PutMetricData" actions required to enable sending regular AWS events.

Example IAM Permission Statement

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeAddresses",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeRegions",
                "s3:Get*",
                "s3:List*",
                "ec2:ModifyInstanceAttribute",
                "ec2:DescribeSubnets",
                "events:PutRule",
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        }
    ]
}

Example: Check for Alarm State

The following is an example that uses the AWS CLI "describe-alarms" command to check the alarm state. Refer to https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/describe-alarms.html for more information.

Example: describe-alarms Command to Check Alarm State

#!/bin/bash
# Poll the "HFEStatusCheck" alarm once per second and print its state.
while true
do
    result=$(aws cloudwatch describe-alarms --region us-east-1 --alarm-names "HFEStatusCheck" | grep StateValue)
    echo "$(date) Checking HFE state up notification. $result"
    ### Check state OK/ALARM and take action accordingly.
    sleep 1
done
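The same check can be made from Python with the boto3 describe_alarms call, which returns the state as a field rather than as text to be filtered. This is a minimal sketch; the helper name is hypothetical, and the alarm name "HFEStatusCheck" is taken from the shell example above.

```python
def get_alarm_state(cloudwatch, alarm_name="HFEStatusCheck"):
    """Return the alarm's StateValue, or None if no such alarm exists.

    `cloudwatch` is a boto3 CloudWatch client. CloudWatch reports the state
    as 'OK', 'ALARM', or 'INSUFFICIENT_DATA'.
    """
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    alarms = response.get("MetricAlarms", [])
    return alarms[0]["StateValue"] if alarms else None
```

A polling script can call get_alarm_state(boto3.client("cloudwatch", region_name="us-east-1")) in a loop and take action when the result is "ALARM".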

Example: Reboot Action

The following is an example showing the use of a Lambda function to reboot the instance. The function expects to be invoked through an SNS notification sent by the alarm.

Example: Reboot Action Using a Lambda Function

import json

import boto3

# Enter the region your instances are in. Include only the region without specifying Availability Zone; e.g., 'us-east-1'
region = 'us-east-1'


def lambda_handler(event, context):
    ec2 = boto3.client('ec2', region_name=region)
    print("HFE lambda_handler called.")
    print(type(event))

    # The alarm notification arrives through SNS; its Message field is a JSON string.
    message = event['Records'][0]['Sns']['Message']
    trigger = json.loads(message)
    # The first dimension of the alarm is assumed to carry the HFE instance ID.
    instance_id = trigger['Trigger']['Dimensions'][0]['value']
    print("Will reboot HFE instance: " + str(instance_id))

    ec2.reboot_instances(InstanceIds=[instance_id])
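The handler above reads the instance ID from the first dimension of the alarm. If the alarm has more than one dimension, selecting the dimension by name may be more robust. The sketch below shows that variation together with a trimmed sample event for local testing; the dimension name "InstanceId", the helper name, and the sample values are assumptions for illustration.

```python
import json


def extract_instance_id(event, dimension_name="InstanceId"):
    """Return the instance ID carried in an SNS-wrapped CloudWatch alarm event.

    Looks the dimension up by name instead of by position, and returns None
    if the alarm has no dimension with that name.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    for dimension in message["Trigger"]["Dimensions"]:
        if dimension.get("name") == dimension_name:
            return dimension["value"]
    return None


# Trimmed sample of the event shape the handler receives (values are made up).
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "AlarmName": "HFEStatusCheck",
                "NewStateValue": "ALARM",
                "Trigger": {
                    "MetricName": "HFEState",
                    "Dimensions": [
                        {"name": "InstanceId", "value": "i-0123456789abcdef0"},
                    ],
                },
            }),
        },
    }],
}
```

Calling extract_instance_id(sample_event) on this sample yields the made-up instance ID, which makes it possible to exercise the parsing logic without rebooting anything.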
