In this section:
This section covers common issues found in the SBC SWe in Azure setup and the actions necessary to verify and troubleshoot them.
Every time I start the SBC instance, it stops after a few minutes
This results from submitting invalid user-data. Submit only valid JSON to the SBC. For more information on valid JSON, refer to the SBC Userdata topic.
Action Steps:
To verify whether the problem occurs due to invalid JSON, perform the following steps:
- In the portal, go to Virtual machines.
- Select the instance.
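If you want to confirm the JSON is valid before resubmitting it, the following is a minimal sketch; userData.json is a hypothetical name for the user-data file you plan to submit:
# Validate the user-data file locally; prints which result applies.
# 'userData.json' is a placeholder for your own user-data file.
python3 -m json.tool userData.json > /dev/null && echo "Valid JSON" || echo "Invalid JSON"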
The HFE.log continually gets the error message "Connection error ongoing - No connection to SBC PKT ports from HFE"
If the message "Connection error ongoing - No connection to SBC PKT ports from HFE" is continually written to HFE.log, it indicates that the HFE node cannot connect to the SBCs.
Action Steps:
Perform the following verification steps:
Using the CLI, verify that PKT0 and PKT1 are configured correctly. For more information on this process, refer to the Configuring PKT Ports topic.
Verify the IPs listed in the HFE_conf.log are the ones attached to the SBC (a command-line sketch for cross-checking these values follows this list):
Go to /opt/HFE/log/.
Find the logs that specify the IPs for the SBC; the logs are in the form:
<SBC instance name> - IP for <pkt0 | pkt1> is <IP>
Find the Alias IPs for the SBC:
Go to Virtual machines.
Click on the SBC instance.
Go to Settings > Networking.
Go to the PKT0 interface.
Click on the network interface.
Go to Settings > IP configurations.
Verify the secondary IP matches the IP listed in HFE_conf.log.
Repeat for the PKT1 interface.
Check that the security groups are correct:
Go to Network security groups.
Select the security group.
Go to Inbound security rules.
Verify the endpoint IPs are allowed.
Check that the routes are correct:
Go to Route tables.
Select the route table.
Click on Routes and verify the routes point to the eth2 IP on the HFE node.
Click on Subnets and verify that the route table is associated with both subnets.
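The portal checks above can be cross-checked from the command line. This is a minimal sketch: the grep assumes the log form shown earlier, and <pkt0NicName>, <pkt1NicName>, and <resourceGroup> are hypothetical placeholders for your own resource names:
# On the HFE node: list the SBC PKT IPs that the HFE discovered.
grep "IP for" /opt/HFE/log/HFE_conf.log
# With the Azure CLI: list the IP configurations (including secondary IPs) on the PKT NICs
# and compare them with the IPs reported in HFE_conf.log.
az network nic ip-config list --nic-name <pkt0NicName> --resource-group <resourceGroup> --output table
az network nic ip-config list --nic-name <pkt1NicName> --resource-group <resourceGroup> --output table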
Calls are failing to reach the SBC from the HFE node
Action Steps:
- Verify there are no error logs in the HFE.log.
- Verify the endpoint of the traffic is allowed access through the network security groups. For more information, refer to the Create Network Security Group topic.
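As a quick check for the first step, you can scan the log directly on the HFE node. A minimal sketch, assuming HFE.log resides in the /opt/HFE/log/ directory referenced elsewhere in this guide:
# Show the most recent error lines in the HFE log.
grep -i "error" /opt/HFE/log/HFE.log | tail -n 20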
I am unable to log on to my HFE node via the mgmt interface
This indicates that either there is a configuration issue or the firewall rules have not been updated correctly.
Action Steps:
- Verify the IP you are trying to SSH from is allowed through the network security group. For more information, refer to Create Network Security Group topic.
Verify that the IP you are trying to SSH from is correctly listed in the HFE node user-data. If necessary, update the appropriate line containing "REMOTE_SSH_MACHINE_IP":
/bin/echo "REMOTE_SSH_MACHINE_IP=\"10.27.178.4\"" >> $NAT_VAR
For more information, refer to the Custom Data Example topic.
- The HFE script may fail before creating the routes. In such cases:
- Attempt to SSH in to NIC0 on the HFE node.
- Check the logs for errors in the /opt/HFE/log/ directory. For more information, refer to the HFE Node Logging topic.
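Once you are logged in over NIC0, the following sketch shows one way to confirm how far the HFE script got; the *.log file naming and the grep pattern are assumptions based on the log names used in this guide:
# Check whether the HFE script reached the point of creating routes.
ip route show
# Scan the HFE logs for the point of failure.
grep -iE "error|fail" /opt/HFE/log/*.log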
Instance does not get Accelerated NICs
Occasionally the SWe instance starts without the accelerated NICs; in this case, performance is not guaranteed.
Action Steps:
Execute the following command to confirm the availability of the Mellanox NICs.
> lspci | grep Mellanox
The sample output given below indicates the presence of Mellanox NICs:
83df:00:02.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] (rev 80)
9332:00:02.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] (rev 80)
If the Mellanox NICs are not present, de-allocate the instance and start it again.
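You can also confirm from outside the VM whether accelerated networking is enabled on a given NIC. This is a sketch using the Azure CLI, with <nicName> and <resourceGroup> as placeholders for your own resource names:
# Returns true when accelerated networking is enabled on the NIC.
az network nic show --name <nicName> --resource-group <resourceGroup> --query enableAcceleratedNetworking --output tsv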
I am unable to update apt or install using apt on the HFE
If a command such as sudo apt update causes an error such as "connection timed out", you must add a manual route on the HFE node.
The following action steps will interrupt calls.
Action Steps:
- Get the management interface gateway IP:
- Open /opt/HFE/log/HFE_conf.log.
- Search the log for the line that specifies the eth1 gateway.
Example for HFE 2.1:
2020-11-02 11:45:47 ETH1_GW 10.2.0.1
Add the route to make traffic go through the management interface:
sudo ip route add 0.0.0.0/0 via <Gateway IP> dev <management interface name> metric 10
Example:
ip route add 0.0.0.0/0 via 10.2.0.1 dev eth1 metric 10
Run the apt command:
sudo apt update
Remove the route you just added:
sudo ip route delete 0.0.0.0/0 via <Gateway IP> dev <management interface name> metric 10
Example:
ip route delete 0.0.0.0/0 via 10.2.0.1 dev eth1 metric 10
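The add/update/delete sequence above can also be run as a single helper script. This is a minimal sketch that assumes eth1 is the management interface and 10.2.0.1 is its gateway; substitute the values found in HFE_conf.log:
#!/bin/bash
# Temporarily route all traffic via the management interface, run apt, then remove the route.
GW_IP="10.2.0.1"   # ETH1_GW value from /opt/HFE/log/HFE_conf.log
MGMT_IF="eth1"     # management interface name
sudo ip route add 0.0.0.0/0 via "$GW_IP" dev "$MGMT_IF" metric 10
sudo apt update
# The delete runs regardless of the apt result, so call traffic recovers.
sudo ip route delete 0.0.0.0/0 via "$GW_IP" dev "$MGMT_IF" metric 10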
My HFE script fails with 'Failed to Set Route'
This issue is generally caused by Linux failing to configure the network interfaces before the HFE_AZ.sh script is run.
Action Steps:
Verify whether any of the network interfaces (eth0, eth1, eth2) have a state of 'DOWN', using 'ip addr' (a shorter check is sketched after this list).
An example of a good instance is:
rbbn@SBC-Terraform:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:11:e1:ac brd ff:ff:ff:ff:ff:ff
    inet 10.2.1.4/24 brd 10.2.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fe11:e1ac/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:11:eb:7c brd ff:ff:ff:ff:ff:ff
    inet 10.2.0.7/24 brd 10.2.0.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fe11:eb7c/64 scope link
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:11:e2:5c brd ff:ff:ff:ff:ff:ff
    inet 10.2.3.4/24 brd 10.2.3.255 scope global eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fe11:e25c/64 scope link
       valid_lft forever preferred_lft forever
5: enP1s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 00:0d:3a:11:e1:ac brd ff:ff:ff:ff:ff:ff
6: enP3s3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth2 state UP group default qlen 1000
    link/ether 00:0d:3a:11:e2:5c brd ff:ff:ff:ff:ff:ff
rbbn@SBC-Terraform:~$
- If any of the states for eth0, eth1, or eth2 is DOWN, follow these steps to fix the issue:
- Reboot the instance and re-verify (this generally fixes the issue).
- If the issue is not fixed, perform either of the following steps:
- Destroy and re-orchestrate the whole HFE setup.
- Use Azure's Redeploy option to redeploy the VM:
- In the Azure portal, go to the VM.
- In the panel, scroll down and select Redeploy + reapply (under Support + Troubleshooting).
- Select the Redeploy button.
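A shorter way to spot a DOWN interface than reading the full 'ip addr' output is sketched below; it assumes a reasonably recent iproute2 that supports the -br (brief) flag:
# One line per interface; any line showing DOWN needs attention.
ip -br link show
# Or print only the interfaces that are DOWN (no output means all are UP).
ip -br link show | awk '$2 == "DOWN"'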
My HFE script fails with 'Authorization still failing after 15 minutes'
This issue is caused either by the User Assigned Managed Identity not having the correct roles, or by Linux not configuring the network interfaces properly on the HFE.
Action Steps:
- Verify the Identity is attached to the VM:
Go to Virtual machines.
Click on the HFE instance.
Go to Settings > Identity.
- Select User assigned.
- Verify the Identity is listed.
- Verify the role attached to the identity is correct (requires Azure CLI):
Get the principalId for the Identity:
az identity show --name <identityName> --resource-group <resourceGroupName> --subscription <subscriptionId>
Get the roleDefinitionName:
az role assignment list --assignee <principalId> --subscription <subscriptionId>
Get the role definition, and verify that the actions contain the correct permissions (refer to the Create Role topic; a chained version of these commands is sketched after this list):
az role definition list --name <roleDefinitionName> --subscription <subscriptionId>
- If the previous steps are correct, follow the instructions in the topic My HFE script fails with 'Failed to Set Route'.
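The three az commands above can be chained so that the output of one feeds the next. A minimal sketch, assuming you are logged in to the Azure CLI and using <identityName>, <resourceGroupName>, and <subscriptionId> as placeholders:
# Resolve the identity's principalId, find its role, and print the role's permitted actions.
principalId=$(az identity show --name <identityName> --resource-group <resourceGroupName> --subscription <subscriptionId> --query principalId --output tsv)
roleName=$(az role assignment list --assignee "$principalId" --subscription <subscriptionId> --query "[0].roleDefinitionName" --output tsv)
az role definition list --name "$roleName" --subscription <subscriptionId> --query "[0].permissions[0].actions"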