Packets that directly cause an SBC fault can lead to a catastrophic failure of an SBC service, known as a packet-stimulated fault avalanche. Such packets appear for various reasons, for example when a new Session Initiation Protocol (SIP) endpoint is added, a peering endpoint or gateway (GW) is upgraded or replaced, a configuration changes on a peer, or a new call scenario is introduced. The SBC does not currently check for double faults, in which the SBC experiences a failover followed by another failover. Double faults cause call loss.

The goal of Avalanche Fault Detection and Control is not to prevent individual crashes, but to detect and control continuous "avalanche" faults that can lead to complete service outages. This feature uses the information from the existing faults to attempt to prevent future faults.


The fault avalanche feature does not:

  • Prevent the first crash that an offending packet causes,
  • Prevent crashes caused by similar SIP messages (such as the same call flow) that have different key values,
  • Prevent crashes that a packet does not directly cause (for example, a crash due to memory build-up),
  • Avoid fault avalanches across separate SBC clusters, or
  • Avoid crashes in non-SIP modules.

The fault avalanche feature tracks potentially problematic values of key types in SIP packets (see Tracking for more information). To track these values the fault avalanche feature extracts and saves values from the following fields of the SIP packet that causes a crash:

  • Call ID
  • Called Party Address
  • Calling Party Address (From or P-Asserted-Identity header)
  • Source IP of the Packet

Each key type is associated with a threshold value. The threshold value indicates the maximum number of crashes allowed for a particular value of the key type before the SBC blocks that key value (see Blocking for more information). The SBC defines the threshold values so that more specific blocks trigger before less specific blocks. The following table shows the default key type thresholds.

Key Type            Threshold
Call ID             0
Calling + Called    1
Calling Party       3
Called Party        3
Source IP           5
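
As a rough illustration of how the default thresholds order the blocks from most specific to least specific, the following Python sketch lists how many crashes each key type needs before a block applies. The dictionary and function names are hypothetical, and the "count exceeds threshold" rule is taken from the Blocking section; this is a sketch, not the SBC implementation.

```python
# Default thresholds from the table above. The key names are illustrative only.
DEFAULT_THRESHOLDS = {
    "call_id": 0,
    "calling_and_called": 1,
    "calling_party": 3,
    "called_party": 3,
    "source_ip": 5,
}


def crashes_until_blocked(key_type: str) -> int:
    """Smallest crash count that exceeds the threshold for this key type."""
    return DEFAULT_THRESHOLDS[key_type] + 1


# Key types ordered by how soon their block triggers (most specific first).
trigger_order = sorted(DEFAULT_THRESHOLDS, key=crashes_until_blocked)

for key_type in trigger_order:
    print(f"{key_type}: blocked after {crashes_until_blocked(key_type)} crash(es)")
```

Under this reading, a threshold of 0 for Call ID means a single crash is enough to block that Call ID, while a source IP is blocked only after six crashes.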

The SBC determines the calling party for a SIP packet in the following order:

  1. If present, the P-Asserted-Identity header user part or telephone number
  2. If present, the P-Preferred-Identity header user part or telephone number
  3. If not anonymous, the From header user part or telephone number


Note

The SBC cannot monitor the fault count for the user or users if no usable calling party information is available. However, the SBC can control faults based on the source IP.
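
The selection order above can be sketched in Python as follows. This is a minimal sketch that assumes the headers have already been parsed into a dictionary mapping each header name to its user part or telephone number; the function name, the dictionary layout, and the simple "anonymous" check are assumptions made for illustration.

```python
from typing import Optional


def select_calling_party(headers: dict) -> Optional[str]:
    """Return the calling party using the precedence described above.

    `headers` is a hypothetical pre-parsed mapping of header name to
    user part or telephone number. Returns None when no usable calling
    party is available, in which case only source IP control applies.
    """
    for name in ("P-Asserted-Identity", "P-Preferred-Identity"):
        value = headers.get(name)
        if value:
            return value

    from_user = headers.get("From")
    # Assumption: an anonymous caller is marked by the literal user "anonymous".
    if from_user and from_user.lower() != "anonymous":
        return from_user
    return None
```

For example, a message that carries both a From header and a P-Asserted-Identity header is tracked under the P-Asserted-Identity value.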


The SBC obtains the called party for a SIP packet from either the Request URI of a SIP request or the To header URI of a SIP response.

Ribbon recommends that you set the thresholds according to how strict (or lenient) you prefer to be with faults in a cluster. If, at any moment, you do not want the SBC to block a specific key element or source, you can instruct the SBC using CLI commands.


The Layer 3 source IP address of the packet determines the peer IP address.

Tracking

Before processing each SIP message, the SBC records the message's key elements in a special location. On every crash, the SBC uses these key elements to create a fault record that contains the values of the key types (see the Default Key Type Thresholds table) of the SIP packet that caused the failure. This fault record, which survives a reboot, is saved locally and broadcast to all other SBC instances in the cluster. If a later packet matches the recorded values, the SBC can discard that packet to avoid another fault.

Every SBC maintains a table of the local crash counts by key type and key value with the data from the fault records. The default aging period for fault records is 30 minutes, so the SBC maintains records only within the last 30 minutes.
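
The crash-count table and the aging period can be pictured with the following Python sketch. The FaultRecord class, its fields, and the function name are hypothetical; only the 30-minute default aging period comes from the text.

```python
from collections import Counter
from dataclasses import dataclass

DEFAULT_AGING_SECONDS = 30 * 60  # default aging period of 30 minutes


@dataclass(frozen=True)
class FaultRecord:
    """Hypothetical fault record: crash time plus (key type, key value) pairs."""
    timestamp: float
    keys: tuple


def crash_counts(records, now, aging=DEFAULT_AGING_SECONDS):
    """Count crashes per (key type, key value), ignoring expired records."""
    counts = Counter()
    for record in records:
        if now - record.timestamp < aging:
            counts.update(record.keys)
    return counts


records = [
    FaultRecord(0.0, (("source_ip", "10.0.0.1"), ("call_id", "a1"))),
    FaultRecord(1000.0, (("source_ip", "10.0.0.1"), ("call_id", "b2"))),
]
counts = crash_counts(records, now=2000.0)
```

In this example the first record is older than 30 minutes at time 2000 seconds, so only the second record contributes to the counts.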

A fault record is sent to the standby SBC instance within two seconds of the fault. For the SBC SWe Cloud, a fault record is broadcast throughout the SBC cluster within ten seconds of the fault.

Once a fault record expires, the SBC no longer considers it for tracking or protection functionality and removes it within five minutes of expiration. The expiration interval is a global configurable (faultAvalancheControl faultRecAgeingTimeOut).

The SBC discards the fault record that it receives from another SBC in the cluster if the fault record is a duplicate of an existing fault record. Otherwise, the SBC treats the fault record as if it is locally generated.
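
The handling of records received from other SBCs can be sketched as follows. The source does not say how the SBC identifies a duplicate, so this Python sketch assumes that two records are duplicates when all of their fields are equal; the record layout and function name are hypothetical.

```python
def accept_remote_record(known_records: set, record: tuple) -> bool:
    """Store a record received from another SBC unless it is a duplicate.

    Returns True when the record is new and is now treated like a locally
    generated record, and False when it is discarded as a duplicate.
    Assumption: duplicates are detected by equality of the whole record.
    """
    if record in known_records:
        return False
    known_records.add(record)
    return True


# Hypothetical record layout: (origin SBC, crash time, key type, key value)
known = {("sbc-1", 1700000000, "call_id", "abc123")}
is_new = accept_remote_record(known, ("sbc-2", 1700000042, "call_id", "xyz789"))
is_duplicate = not accept_remote_record(known, ("sbc-1", 1700000000, "call_id", "abc123"))
```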

Blocking

The system maintains a (logical) block table with an entry for every key type and key value combination where the number of all unexpired fault records exceeds the threshold for that key type.

The block entries update within two seconds of when the SBC receives or creates a new fault record, and within five minutes of a fault record expiration.

When the SBC receives a SIP Request or Response message, the system drops the packet if any of its key type and key value combinations matches a blocking entry.
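
A minimal Python sketch of the logical block table and the drop decision follows. It assumes each unexpired fault record is a dictionary of key type to key value and that a combination is blocked when its record count exceeds the threshold for its key type; the function names and data layout are hypothetical.

```python
from collections import Counter


def build_block_table(unexpired_records, thresholds):
    """Return the set of (key type, key value) pairs whose count exceeds the threshold."""
    counts = Counter()
    for record in unexpired_records:
        counts.update(record.items())
    return {
        (key_type, key_value)
        for (key_type, key_value), count in counts.items()
        if count > thresholds[key_type]
    }


def should_drop(packet_keys, block_table):
    """Drop the packet when any of its key type and value pairs is blocked."""
    return any(pair in block_table for pair in packet_keys.items())


thresholds = {"call_id": 0, "source_ip": 5}
records = [{"call_id": "abc123", "source_ip": "10.0.0.1"}]
block_table = build_block_table(records, thresholds)

same_call = {"call_id": "abc123", "source_ip": "10.0.0.1"}
other_call = {"call_id": "xyz789", "source_ip": "10.0.0.1"}
```

With a single fault record, only the Call ID is blocked: a retransmission with the same Call ID is dropped, while a different call from the same source IP is still processed.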

The following figure illustrates the input processing of received SIP messages for the fault avalanche feature.

Bootstrap

When an SBC SWe Cloud instantiates, it obtains all unexpired fault records. Subsequent processing is the same whether the SBC SWe Cloud creates the fault records locally, or receives the fault records from another SBC in a non-bootstrap scenario.

After an SBC SWe Cloud reboots and initializes, it obtains all unexpired fault records. Subsequent processing is the same whether the SBC locally creates the fault records or receives the fault records from another SBC in a non-reboot scenario.

When an SBC reboots, it uses locally persisted fault records to initialize. Subsequent processing is the same whether the SBC creates the fault records locally or receives the fault records from another SBC in a non-reboot scenario.

An SBC uses the present fault records when that SBC activates from standby to active.

Provisioning and Monitoring


Note

On an upgrade from a release that does not support this functionality to a release with support, all configurables of this functionality are set to their default values.

On an upgrade from a release that supports this functionality, all configurables of this functionality maintain their existing values.

To access the Fault Avalanche Control parameters:

  1. Log on to the SBC.
  2. Navigate to AllSystemFac Nonblocking Entries → Fault Avalanche Control.


Fault Avalanche Control Parameters

Parameter: Fault Avalanche Control
Default: N/A
Description: This parameter controls the fault avalanche issue.

Parameter: Fac State
Default: Enabled
Description: Use this flag to enable or disable the Fault Avalanche Control feature. When you update this flag from enabled to disabled, the system deletes the existing fault records and blocking entries. This update does not impact the fault records that this system might have previously broadcast to other SBCs in the cluster.

  • Disabled (default) - The SBC does not perform tracking or blocking.
  • Enabled

Parameter: Fac Block Suspects
Default: Disabled
Description: Determines if future SIP messages are blocked.

  • Enabled
  • Disabled (default)

Note: Ribbon recommends testing this flag in the lab environment before enabling it in the production environment.

Parameter: Calling Party Threshold
Default: 3
Description: <0-999> - The number of crashes the specific calling party causes, after which the SBC drops the SIP messages that carry the same calling party address.

Parameter: Called Party Threshold
Default: 3
Description: <0-999> - The number of crashes the specific called party causes, after which the SBC drops the SIP messages that carry the same called party address.

Parameter: Call ID Threshold
Default: 0
Description: <0-999> - The number of crashes the specific call-ID causes, after which the SBC drops the SIP messages that carry the same call-ID.

Parameter: Source IP Threshold
Default: 5
Description: <0-999> - The number of crashes the SIP messages from a specific source IP address cause, after which the SBC drops the SIP messages from the same source IP address.

Parameter: Calling NCalled Party Threshold
Default: 1
Description: <0-999> - The number of crashes the specific calling and called party combination causes, after which the SBC drops the SIP messages that carry the same calling and called party addresses.

Parameter: Fault Rec Ageing Time Out
Default: 30
Description: <15-60> - The timeout (in minutes) for fault record aging.