Hardware alarms are generated on the AMC671 (MCH) to indicate the status of system components in the chassis and local components residing on the card.

The alarms are accessed from the alarm dashboard in the Web UI. The alarm dashboard indicates system status by showing the number of critical, major, and minor alarms generated by system events. Open an individual alarm in the dashboard to determine its origin. To access the alarms, see section Alarm Dashboard Web User Interface in the DSC Alarms Guide. To view details of individual alarms, see section View Alarms.

The hardware alarms are generated by hardware sensors on the AMC671 (MCH) card. In addition to the local sensors, the MCH card reports chassis alarms generated by system components. There are two types of hardware sensors on the DSC 8000:

  • Threshold sensors
  • Discrete sensors
Note

When an alarm is raised for a system component, each MCH card in the chassis reports the alarm. The slot number for the alarm is set to the reporting MCH's slot number. Each instance of the hwmon application receives notification from both MCH cards in the chassis, and both applications respond with an alarm. Therefore, you can see up to four alarms and traps for a single sensor event on a system component: each of the two hwmon instances raises one alarm for each of the two reporting MCH cards.

Note

When viewing alarm details, the Event Context Name identifies the system component that caused the alarm, and the Event Context Slot identifies the MCH card reporting the event. Use the MCH slot number to determine which chassis is reporting the system component alarm event. For a multi-shelf DSC 8000 system, the MCH cards are identified by the slot number: MCH1 (slot 1) and MCH14 (slot 14) reside in the control shelf; MCH15 (slot 15) and MCH28 (slot 28) reside in the first expansion shelf; MCH29 (slot 29) and MCH42 (slot 42) reside in the second expansion shelf; and finally, MCH43 (slot 43) and MCH56 (slot 56) reside in the third expansion shelf.
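The slot ranges listed in the note above follow a regular pattern: each shelf appears to span 14 slots, with an MCH card in the first and last slot of the shelf. The following Python sketch maps an Event Context Slot value to its shelf under that assumption. The function and constant names are hypothetical and are not part of the DSC software.

```python
# Illustrative sketch only. SLOTS_PER_SHELF is inferred from the slot
# ranges in the note (1/14, 15/28, 29/42, 43/56); it is an assumption.
SLOTS_PER_SHELF = 14
SHELF_NAMES = (
    "control shelf",
    "first expansion shelf",
    "second expansion shelf",
    "third expansion shelf",
)


def shelf_for_mch_slot(slot: int) -> str:
    """Return the shelf that houses the MCH card reporting from `slot`."""
    if not 1 <= slot <= SLOTS_PER_SHELF * len(SHELF_NAMES):
        raise ValueError(f"slot {slot} is outside the documented range")
    # Position of the slot within its own shelf, counted from 1.
    position = (slot - 1) % SLOTS_PER_SHELF + 1
    if position not in (1, SLOTS_PER_SHELF):
        raise ValueError(f"slot {slot} is not a documented MCH slot")
    return SHELF_NAMES[(slot - 1) // SLOTS_PER_SHELF]
```

For example, an alarm whose Event Context Slot is 28 maps to the first expansion shelf, and a slot such as 5, which the note does not list as an MCH slot, is rejected.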

Threshold sensors and the corresponding SNMP threshold alarms are described in the following sections:

  • MCH Hardware Threshold Sensors and Alarms
  • SNMP Threshold Sensor Alarms

Discrete sensors and the corresponding SNMP alarms are described in the following sections:

  • MCH Hardware Discrete Sensors and Alarms
  • SNMP Discrete Sensor Alarms

Note

Alarm details provided in the Web UI include the hardware sensor's Portal ID and IPMI number, which are dynamically assigned. The ID assignments change when a card is removed from or inserted into the DSC 8000 chassis; therefore, do not use them to identify a hardware alarm.

MCH Hardware Threshold Sensors and Alarms

All threshold sensors available on the AMC671 (MCH) card generate alarms. An alarm is generated by a threshold sensor event. Threshold sensor events consist of voltage or temperature values crossing a pre-defined threshold level.

The following table lists and describes the threshold hardware sensors on AMC671 (MCH) cards.

MCH threshold hardware sensors available in HWMON

 
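| IPMI Sensor Name | Alias in HWMON | Description | Units | SNMP Alarms | Alarm Event |
|---|---|---|---|---|---|
| Mezz Temp | Mezz Temp | Temperature sensor on the 10G Ethernet switch fabric of the MCH | degrees C | 6348 - 6351 | Critical and Major alarms on threshold crossings (except Minor starting in 15.0) |
| Temp | Temp 1 | One of four temperature sensors on the MCH. The Temp 1 sensor monitors the outlet temperature of the card. | degrees C | 6348 - 6351 | Critical and Major alarms on threshold crossings (except Minor starting in 15.0) |
| Temp | Temp 2 | One of four temperature sensors on the MCH. The Temp 2 sensor monitors the inlet temperature of the card. | degrees C | 6348 - 6351 | Critical and Major alarms on threshold crossings (except Minor starting in 15.0) |
| Temp | Temp 3 | One of four temperature sensors on the MCH. The Temp 3 sensor monitors the temperature on the 1G Ethernet switch on the card. | degrees C | 6348 - 6351 | Critical and Major alarms on threshold crossings (except Minor starting in 15.0) |
| Temp | Temp 4 | One of four temperature sensors on the MCH. The Temp 4 sensor monitors the temperature of the CPU device on the card. | degrees C | 6348 - 6351 | Critical and Major alarms on threshold crossings (except Minor starting in 15.0) |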

 

Interpreting Threshold Sensor Events

Threshold sensor event severity levels are defined as follows:

  • Noncritical: This is a warning that one or more operating specifications are somewhat out of normal range, but there is not yet a problem to be addressed. Noncritical events are for information only, and they do not indicate that the AMC671 MCH is outside of operating limits. In general, no action is required. However, in certain contexts, system/shelf management software may initiate preventive action. For example, if several cards in a shelf report upper noncritical temperature events, the shelf manager may decide to increase fan speed.

  • Critical: The AMC671 MCH is operating within specified tolerances, but one or more specifications are getting close to the critical thresholds. Critical events indicate that the card is still within its operating limits, but it is close to exceeding one of those limits. Possible action in this case is to closely monitor the alarming sensor and take more aggressive action if it approaches the nonrecoverable threshold.

  • Nonrecoverable: The AMC671 MCH is no longer operating within specified tolerances. Nonrecoverable events indicate that the card may no longer be functioning because it is now outside of its operating limits. Action is likely required or has already been taken by the local hardware/firmware. For example, a processor may shut itself down because its maximum die temperature was exceeded, or a shelf manager may deactivate the card because the processor is too hot.

MCH Temperature Threshold Sensor Levels

A threshold sensor triggers an SNMP alarm when a pre-defined temperature threshold level is crossed by a monitored temperature. The following table shows the temperature threshold levels for temperature sensors on AMC671 (MCH) cards.

MCH Temperature Threshold Sensor Levels

| Sensor Name | Units | Lower Non-Recoverable Threshold (Alarm 6350) | Lower Critical Threshold (Alarm 6348) | Lower Non-Critical Threshold (Alarm 6346) | Upper Non-Critical Threshold (Alarm 6344) | Upper Critical Threshold (Alarm 6342) | Upper Non-Recoverable Threshold (Alarm 6340) |
|---|---|---|---|---|---|---|---|
| Temp 1 (Outlet Temp) | degrees C | NA | NA | NA | 80C | 90C | 125C |
| Temp 2 (Inlet Temp) | degrees C | NA | NA | NA | 80C | 90C | 125C |
| Temp 3 (Base Switch) | degrees C | NA | NA | NA | 80C | 90C | 125C |
| Temp 4 (Base CPU) | degrees C | NA | NA | NA | 80C | 90C | 125C |
| Mezz Temp | degrees C | NA | NA | NA | 80C | 90C | 125C |

Caution

Some versions of DSC software generate minor temperature alarms when the monitored temperature crosses the Upper Non-Critical (UNC) temperature threshold. Minor temperature threshold crossing events are required for stable operation of the cooling sub-system and the resulting alarms are ignored.

Starting with DSC software Release 15.0, temperature alarms are generated only when the monitored temperature crosses the Upper Critical (UC) or Upper Non-Recoverable (UNR) temperature threshold, which generates a major or critical alarm, respectively.
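The relationship between the temperature thresholds and alarm severities can be sketched in Python as follows. This is an illustration, not the hwmon implementation. The threshold values come from the table above. The function name, the `release_15_or_later` flag, the use of a greater-than-or-equal comparison for a threshold crossing, and the assumption that earlier releases map UC and UNR to the same severities are all hypothetical.

```python
from typing import Optional

# Upper thresholds shared by all MCH temperature sensors (from the table).
UPPER_NON_CRITICAL_C = 80.0
UPPER_CRITICAL_C = 90.0
UPPER_NON_RECOVERABLE_C = 125.0


def temperature_alarm_severity(
    temp_c: float, release_15_or_later: bool = True
) -> Optional[str]:
    """Return the alarm severity for a reading, or None if no alarm applies.

    Assumes a threshold counts as crossed when the reading is at or above it.
    """
    if temp_c >= UPPER_NON_RECOVERABLE_C:
        return "critical"
    if temp_c >= UPPER_CRITICAL_C:
        return "major"
    # Minor alarms on the UNC threshold are only generated by some
    # software versions earlier than Release 15.0.
    if temp_c >= UPPER_NON_CRITICAL_C and not release_15_or_later:
        return "minor"
    return None
```

With these assumptions, a reading of 85 degrees C produces no alarm on Release 15.0 or later, and a minor alarm on an earlier release that reports UNC crossings.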

SNMP Threshold Sensor Alarms

The following table lists the SNMP alarms registered when a threshold sensor event occurs on the AMC671 (MCH) card.

DSC 8000 SNMP Threshold Sensor Alarms

 

MCH Hardware Discrete Sensors and Alarms

Discrete sensors return values of 'on' and 'off' or 'true' and 'false' to the system software. Each entity in the system has a 'Version Change' sensor that reports the entity's FRU state. These states are described in the Intelligent Platform Management Interface Specification, Second Generation (v2.0).

The following table lists and describes the discrete hardware sensors on AMC671 (MCH) cards.

MCH discrete hardware sensors available in HWMON

 
| IPMI Sensor Name | Alias in HWMON | Description | SNMP Alarms | Alarm Event |
|---|---|---|---|---|
| Hot-swap | Hot-swap MCH 1 (note 1) | This is the hot-swap sensor for the MCH 1. The hot-swap sensor reports the MCH card's FRU state. These states are described in PICMG® 3.0 AdvancedTCA® Base Specification. | 6310 slotExtracted; 6311 slotInserted | Alarm raised on detection of state transitions for M0 (not-installed), M1 (inactive), and M7 (lost-communication). Alarm is cleared on detection of state transition for M4 (active). |
| Hot-swap | Hot-swap MCH 2 (note 1) | This is the hot-swap sensor for the MCH 2. The hot-swap sensor reports the MCH card's FRU state. These states are described in PICMG® 3.0 AdvancedTCA® Base Specification. | 6310 slotExtracted; 6311 slotInserted | Alarm raised on detection of state transitions for M0 (not-installed), M1 (inactive), and M7 (lost-communication). Alarm is cleared on detection of state transition for M4 (active). |
| POWER GOOD | Mezz POWER GOOD | This is a PICMG boolean sensor (false = 1, true = 2) that indicates if the power sub-system on the 10G Ethernet switch portion of the MCH card is functional. | 6320 powerFaultAssert; 6321 powerFaultDeassert | On state transition to 'false', a powerFaultAssert Alarm is raised. On state transition to 'true', a powerFaultDeassert (clearing) Alarm is raised. |
| IPMB Physical | | | N/A | No Alarms are associated with this entity. |
| Version Change | CU1-Version Change (note 2) | This sensor reports the FRU state on Cooling Unit 1. The sensor returns a one bit value assigned to eight possible FRU conditions. For example, if bit 0 is set, the condition defined by the value of 00h is present. | N/A | Informational sensor only. No alarms are generated. |
| Version Change | CU2-Version Change (note 2) | This sensor reports the FRU state on Cooling Unit 2. The sensor returns a one bit value assigned to eight possible FRU conditions. For example, if bit 0 is set, the condition defined by the value of 00h is present. | N/A | Informational sensor only. No alarms are generated. |
| Version Change | PM1-Version Change (note 2) | This sensor reports the FRU state on Power Module 1. The sensor returns a one bit value assigned to eight possible FRU conditions. For example, if bit 0 is set, the condition defined by the value of 00h is present. | N/A | Informational sensor only. No alarms are generated. |
| Version Change | PM2-Version Change (note 2) | This sensor reports the FRU state on Power Module 2. The sensor returns a one bit value assigned to eight possible FRU conditions. For example, if bit 0 is set, the condition defined by the value of 00h is present. | N/A | Informational sensor only. No alarms are generated. |
| Version Change | PM4-Version Change (note 2) | This sensor reports the FRU state on Power Module 4. The sensor returns a one bit value assigned to eight possible FRU conditions. For example, if bit 0 is set, the condition defined by the value of 00h is present. | N/A | Informational sensor only. No alarms are generated. |

1 Each entity in the system has a hot-swap sensor that reports the entity's FRU state. These states are described in PICMG® 3.0 AdvancedTCA® Base Specification. The sensor returns a one-bit value for each of the eight states, M0 - M7, as defined in the specification. For example, if bit 0 is set, the FRU is in state M0. Similarly, if bit 4 is set, the sensor returns a value of 16 (00010000b), which is the Normal (Active) state, M4.

The state values include:

[7] – 1b: FRU Operational State M7 = Communication Lost

[6] – 1b: FRU Operational State M6 = FRU Deactivation In Progress

[5] – 1b: FRU Operational State M5 = FRU Deactivation Request

[4] – 1b: FRU Operational State M4 = FRU Active

[3] – 1b: FRU Operational State M3 = FRU Activation in Progress

[2] – 1b: FRU Operational State M2 = FRU Activation Request

[1] – 1b: FRU Operational State M1 = FRU Inactive

[0] – 1b: FRU Operational State M0 = FRU Not Installed
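The bit-to-state mapping above can be decoded with a few lines of Python. The sketch below is illustrative only. The state names come from the list above, and the raise and clear conditions come from the discrete sensor table. The function names are hypothetical, as is the choice to let a raise condition take precedence if more than one bit is set.

```python
from typing import List

# Bit position -> FRU operational state, as listed above.
FRU_STATES = {
    0: "M0 FRU Not Installed",
    1: "M1 FRU Inactive",
    2: "M2 FRU Activation Request",
    3: "M3 FRU Activation in Progress",
    4: "M4 FRU Active",
    5: "M5 FRU Deactivation Request",
    6: "M6 FRU Deactivation In Progress",
    7: "M7 Communication Lost",
}
RAISE_STATES = {0, 1, 7}  # transitions that raise an alarm
CLEAR_STATES = {4}        # transition that clears the alarm


def decode_hot_swap(value: int) -> List[int]:
    """Return the M-state numbers whose bits are set in `value`."""
    return [bit for bit in range(8) if value & (1 << bit)]


def hot_swap_alarm_action(value: int) -> str:
    """Return 'raise', 'clear', or 'none' for a hot-swap sensor value."""
    states = set(decode_hot_swap(value))
    if states & RAISE_STATES:
        return "raise"
    if states & CLEAR_STATES:
        return "clear"
    return "none"
```

For example, a sensor value of 16 decodes to state M4 and clears the alarm, while a value of 128 decodes to state M7 and raises it.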

2 Each entity in the DSC 8000 system has a version change sensor that reports the entity's FRU state. These states are described in the Intelligent Platform Management Interface Specification Second Generation, v2.0. There are six entities in the DSC 8000 system that are managed by the MCH. The sensor named 'Version Change' monitors the MCH card (MCH 1 or MCH 2). The other entities monitored by version change sensors are: two cooling units (CU1 and CU2) and three power modules (PM1, PM2, and PM4).

The 'Version Change' sensor returns a one-bit value for each of eight possible FRU state changes. For example, if bit 0 is set, the condition defined by offset 00h is present. The eight conditions are as follows:

00h: hardware change detected (informational). This offset does not indicate whether the hardware change was successful or not, only that a change occurred.

01h: firmware or software change detected (informational).

02h: hardware incompatibility detected

03h: firmware or software incompatibility detected

04h: entity has an invalid or unsupported hardware version

05h: entity contains an invalid or unsupported firmware or software version

06h: hardware change detected on entity was successful (de-assertion event = unsuccessful)

07h: software or firmware change detected on entity was successful (de-assertion event = unsuccessful)
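As a rough illustration, the offsets above can be turned into readable text by testing each bit of the sensor value. The descriptions in this Python sketch are shortened from the list above. The function name is hypothetical, and treating the raw sensor value as a plain integer bit mask is an assumption.

```python
from typing import List

# Offset (bit position) -> condition, shortened from the list above.
VERSION_CHANGE_CONDITIONS = {
    0x00: "hardware change detected",
    0x01: "firmware or software change detected",
    0x02: "hardware incompatibility detected",
    0x03: "firmware or software incompatibility detected",
    0x04: "invalid or unsupported hardware version",
    0x05: "invalid or unsupported firmware or software version",
    0x06: "hardware change was successful",
    0x07: "software or firmware change was successful",
}


def describe_version_change(value: int) -> List[str]:
    """Return the conditions whose bits are set, in offset order."""
    return [
        description
        for offset, description in sorted(VERSION_CHANGE_CONDITIONS.items())
        if value & (1 << offset)
    ]
```

Because these sensors are informational only, output like this is suited to logging or display rather than to raising alarms.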

 

SNMP Discrete Sensor Alarms

The following table lists the SNMP alarms that are registered when a discrete sensor event occurs on the AMC671 (MCH) card.

DSC 8000 SNMP Discrete Sensor Alarms

 
| SNMP Alarm Number | Alarm Name | Clearing Alarm |
|---|---|---|
| 6310 | slotExtracted | 6311 |
| 6311 | slotInserted | N/A |
| 6320 | powerFaultAssert | 6321 |
| 6321 | powerFaultDeassert | N/A |
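The Clearing Alarm column pairs each raised alarm with the alarm that clears it. A monitoring script could use those pairs to track which alarms are still outstanding, as in the Python sketch below. The alarm numbers come from the table above. The function name, the event tuple format, and the decision to track alarms per reporting slot are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Raised alarm number -> clearing alarm number, from the table above.
CLEARED_BY = {6310: 6311, 6320: 6321}


def active_alarms(events: Iterable[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Return the (slot, alarm number) pairs that remain raised.

    `events` is a sequence of (slot, alarm number) pairs in arrival order.
    """
    # Invert the mapping: clearing alarm number -> alarm it clears.
    clears = {clearing: raised for raised, clearing in CLEARED_BY.items()}
    active: Set[Tuple[int, int]] = set()
    for slot, number in events:
        if number in CLEARED_BY:
            active.add((slot, number))
        elif number in clears:
            active.discard((slot, clears[number]))
    return active
```

For example, if slot 1 reports 6310, then 6320, then 6311, only the 6320 powerFaultAssert alarm remains active for that slot.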

 
