Sometimes we encounter Infiniband port related issues. The alerts can be triggered from OEM or any other monitoring tool.
 
Sample Alert from OEM 12c:

Example 1:
Port xx on dm01sw-ib3.netsoftmate.com is disconnected from port xx 

Example 2:
Cable is present on Port xx but the port is disabled.

This document provides the steps to resolve Infiniband Switch Port related issues mentioned above.

Unless otherwise stated, run all the commands from compute node 1.

Identify the Problematic Infiniband Switch Port

  • Using OEM 12c


Log in to OEM 12c using a web browser of your choice.

Click on Target -> Exadata.
From the list, select the appropriate Exadata cluster.
From the left pane, expand “IB Network”.
Select the Infiniband switch that has the problem.

OEM now displays the switch status. Any port with an issue is marked in RED.


In this example, OEM shows an issue with Port 35 on Infiniband Switch “dm01sw-ib3”.


  • Using IB Switch Commands

Verify-Topology

Oracle supplies a utility with Exadata, /opt/oracle.SupportTools/ibdiagtools/verify-topology, which is used to validate the InfiniBand network layout.

Verify the InfiniBand topology using the following command from a database server or Exadata Storage Server:

[root@dm01db01]# cd /opt/oracle.SupportTools/ibdiagtools/
[root@dm01db01]# ./verify-topology

The verify-topology utility can identify the following network connection problems:

  • Missing InfiniBand cable 
  • Missing InfiniBand connection
  • Incorrectly-seated cable 
  • Cable connected to the wrong endpoint

Sample run:

[root@dm01db01]# cd /opt/oracle.SupportTools/ibdiagtools/
[root@dm01db01]# ./verify-topology

        [ DB Machine Infiniband Cabling Topology Verification Tool ]
                [Version IBD VER 2.d ]
External non-Exadata-image nodes found:
…will check for ZFS if on SSC – else ignore

Found 2 leaf, 1 spine, 0 top spine switches

Check if all hosts have 2 HCAs to different switches……………[SUCCESS]
Leaf switch check: cardinality and even distribution…………..[SUCCESS]
Spine switch check: Are any Exadata nodes connected …………..[SUCCESS]
Spine switch check: Any inter spine switch links………………[SUCCESS]
Spine switch check: Any inter top-spine switch links…………..[SUCCESS]
Spine switch check: Correct number of spine-leaf links…………[SUCCESS]
Leaf switch check: Inter-leaf link check……………………..[SUCCESS]
Leaf switch check: Correct number of leaf-spine links………….[SUCCESS]


In the example above, no errors are reported.
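
When the check list is long, the result lines can be scanned with a short script. The following is a minimal Python sketch, assuming each check line ends with a status in square brackets as in the sample above. The function name failed_checks and the failing line in the example are made up for illustration; only the [SUCCESS] status appears in this article.

```python
import re

# A check line looks like "<description><filler>[STATUS]". The filler is a run of
# dots in real output; pasted copies sometimes carry the ellipsis character.
CHECK_LINE = re.compile(r"^(?P<name>.+?)[.\u2026\s]*\[(?P<status>[A-Z ]+)\]\s*$")


def failed_checks(output):
    """Return (check name, status) pairs for every check that is not SUCCESS."""
    failures = []
    for line in output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match is None:
            continue  # banner, blank line, or other non-check text
        status = match.group("status").strip()
        if status != "SUCCESS":
            failures.append((match.group("name").strip(), status))
    return failures


sample = "\n".join([
    "Found 2 leaf, 1 spine, 0 top spine switches",
    "Check if all hosts have 2 HCAs to different switches.....[SUCCESS]",
    # Hypothetical failing line, invented for this example only.
    "Leaf switch check: Inter-leaf link check.................[ERROR]",
])
print(failed_checks(sample))
```

An empty list means every check that was recognised reported SUCCESS.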

Listlinkup

Run the listlinkup command on the problematic Infiniband switch to check whether each InfiniBand port is up or down and enabled or disabled:

[root@dm01db01]# ssh root@dm01sw-ib3
[root@dm01sw-ib3 ~]# listlinkup
Connector  0A Not present
Connector  1A Not present
Connector  2A Not present
Connector  3A Not present
Connector  4A Not present
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is down (AutomaticHighErrorRate)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
Connector  8A Present <-> Switch Port 31 is up (Enabled)
Connector  9A Present <-> Switch Port 14 is up (Enabled)
Connector 10A Present <-> Switch Port 16 is up (Enabled)
Connector 11A Present <-> Switch Port 18 is up (Enabled)
Connector 12A Not present
Connector 13A Not present
Connector 14A Present <-> Switch Port 07 is up (Enabled)
Connector 15A Not present
Connector 16A Not present
Connector 17A Present <-> Switch Port 01 is up (Enabled)
Connector  0B Not present
Connector  1B Not present
Connector  2B Not present
Connector  3B Not present
Connector  4B Not present
Connector  5B Present <-> Switch Port 29 is up (Enabled)
Connector  6B Not present
Connector  7B Present <-> Switch Port 34 is up (Enabled)
Connector  8B Not present
Connector  9B Present <-> Switch Port 13 is up (Enabled)
Connector 10B Present <-> Switch Port 15 is up (Enabled)
Connector 11B Present <-> Switch Port 17 is up (Enabled)
Connector 12B Not present
Connector 13B Present <-> Switch Port 10 is up (Enabled)
Connector 14B Not present
Connector 15B Not present
Connector 16B Present <-> Switch Port 04 is up (Enabled)
Connector 17B Present <-> Switch Port 02 is up (Enabled)

There is an issue with port 35 on the Infiniband Switch “dm01sw-ib3”: it is down with the state AutomaticHighErrorRate.
This needs to be addressed.
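
With 36 connectors it is easy to miss the one bad line. Here is a minimal Python sketch that picks out every present connector whose port is not "up (Enabled)", assuming the line layout shown in the listlinkup output above. The function name problem_ports is an illustrative choice, not part of the switch software.

```python
import re

# Matches only connectors that are present, for example:
# "Connector  6A Present <-> Switch Port 35 is down (AutomaticHighErrorRate)"
LINK_LINE = re.compile(
    r"^Connector\s+(?P<connector>\d+[AB])\s+Present\s+<->\s+"
    r"Switch Port\s+(?P<port>\d+)\s+is\s+(?P<link>up|down)\s+\((?P<admin>[^)]+)\)"
)


def problem_ports(listlinkup_output):
    """Return (connector, port, link, admin state) for ports needing attention."""
    problems = []
    for line in listlinkup_output.splitlines():
        match = LINK_LINE.match(line.strip())
        if match and (match["link"] != "up" or match["admin"] != "Enabled"):
            problems.append(
                (match["connector"], int(match["port"]), match["link"], match["admin"])
            )
    return problems


sample = """\
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is down (AutomaticHighErrorRate)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
"""
print(problem_ports(sample))
```

For the three sample lines this reports only connector 6A, port 35.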

Ibswitches

Use this command to get the Infiniband switch LID number.

[root@dm01sw-ib3 ~]# ibswitches
Switch  : 0x002128469deca0a0 ports 36 “SUN DCS 36P QDR dm01sw-ib3 10.213.23.85” enhanced port 0 lid 3 lmc 0
Switch  : 0x002128469e45a0a0 ports 36 “SUN DCS 36P QDR dm01sw-ib2 10.213.23.84” enhanced port 0 lid 1 lmc 0

Here the lid number for dm01sw-ib3 is 3.
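
The LID lookup can also be done programmatically. Below is a minimal Python sketch, assuming the ibswitches line layout shown above, where the quoted description contains the switch host name. The helper name lid_for is hypothetical.

```python
import re

# One line per switch: GUID, port count, quoted node description, then "lid N".
# Curly quotes are accepted too because pasted output often carries them.
SWITCH_LINE = re.compile(
    r'^Switch\s*:\s*(?P<guid>0x[0-9a-fA-F]+)\s+ports\s+(?P<ports>\d+)\s+'
    r'["\u201c](?P<description>[^"\u201d\u2033]*)["\u201d\u2033]'
    r'.*?\blid\s+(?P<lid>\d+)'
)


def lid_for(ibswitches_output, switch_name):
    """Return the LID of the switch whose description contains switch_name."""
    for line in ibswitches_output.splitlines():
        match = SWITCH_LINE.match(line.strip())
        if match and switch_name in match.group("description").split():
            return int(match.group("lid"))
    return None


sample = "\n".join([
    'Switch  : 0x002128469deca0a0 ports 36 "SUN DCS 36P QDR dm01sw-ib3 10.213.23.85" enhanced port 0 lid 3 lmc 0',
    'Switch  : 0x002128469e45a0a0 ports 36 "SUN DCS 36P QDR dm01sw-ib2 10.213.23.84" enhanced port 0 lid 1 lmc 0',
])
print(lid_for(sample, "dm01sw-ib3"))
```

The function returns None when no line mentions the requested switch, so a typo in the name does not silently produce a wrong LID.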

Ibportstate

Use this command to identify the port state.

[root@dm01sw-ib3 ~]# ibportstate 3 35
PortInfo:
# Port info: Lid 3 port 35
LinkState:…………………..Down
PhysLinkState:……………….Disabled
LinkWidthSupported:…………..1X or 4X
LinkWidthEnabled:…………….1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:……………..2.5 Gbps

From the output above we can see that the port is disabled and the link speed is reduced to 2.5 Gbps.

Getportstatus

Use this command to get the port status.

[root@dm01sw-ib3 ~]# getportstatus 35
Port status for connector 6A Switch port 35
Adminstate:………………….Disabled (AutomaticHighErrorRate)
LinkWidthEnabled:…………….1X or 4X
LinkWidthSupported:…………..1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:…………………..Down
PhysLinkState:……………….Disabled
LinkSpeedActive:……………..2.5 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:…………………4096
OperVLs:…………………….VL0
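
Both ibportstate and getportstatus print one "Name:....value" field per line, so the same small parser covers them. The following Python sketch is an illustration only. The names parse_port_fields, port_problems, and HEALTHY are hypothetical, and the "healthy" values are simply the ones this article shows for a working port (Active, LinkUp, 4X, 10.0 Gbps), which may differ on other hardware.

```python
import re

# "Name:<filler>value" lines; the filler is dots (or pasted ellipsis characters).
FIELD_LINE = re.compile(r"^(?P<key>[A-Za-z]+):[.\u2026]*(?P<value>.*)$")

# Assumed healthy values, taken from the working-port output in this article.
HEALTHY = {
    "Adminstate": "Enabled",
    "LinkState": "Active",
    "PhysLinkState": "LinkUp",
    "LinkWidthActive": "4X",
    "LinkSpeedActive": "10.0 Gbps",
}


def parse_port_fields(output):
    """Turn ibportstate/getportstatus text into a dict (first occurrence wins)."""
    fields = {}
    for line in output.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match and match.group("value").strip():
            fields.setdefault(match.group("key"), match.group("value").strip())
    return fields


def port_problems(fields, healthy=HEALTHY):
    """List (field, actual, expected) for fields that differ from healthy values.

    Fields missing from the output (ibportstate has no Adminstate) are skipped.
    """
    return [
        (key, fields[key], expected)
        for key, expected in healthy.items()
        if key in fields and fields[key] != expected
    ]


sample = """\
Port status for connector 6A Switch port 35
Adminstate:......................Disabled (AutomaticHighErrorRate)
LinkWidthActive:.................4X
LinkState:.......................Down
PhysLinkState:...................Disabled
LinkSpeedActive:.................2.5 Gbps
"""
for problem in port_problems(parse_port_fields(sample)):
    print(problem)
```

Because the first occurrence of each field wins, the local port values are used when ibportstate also prints a Peer PortInfo section.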


  • Steps to resolve the IB port issue

Autodisable is a feature that can disable connectors in the presence of high error rates or suboptimal link speed or width.
The feature does not cause a problem by itself; it flags connectors that are in an abnormal state.
Autodisable was introduced in firmware 2.1 and does not apply to firmware 1.3.
Before upgrading or downgrading the firmware to a different version, check whether any auto-disabled ports exist and, if so, re-enable them with enableswitchport --automatic. This ensures compatible settings when moving between firmware versions.

Problematic Infiniband switch details:
Switch name       : dm01sw-ib3
Firmware version  : 2.1.3-4
Port number       : 35
Lid number        : 3

This solution applies to Infiniband switch firmware version “2.1.3-4”.

To reenable an autodisabled connector or IB switch port, on the leaf switch dm01sw-ib3 do the following:

[root@dm01sw-ib3 ~]# enableswitchport --automatic Switch 35
Enable connector 6A Switch port 35
Adminstate:………………….Enabled
LinkWidthEnabled:…………….1X or 4X
LinkWidthSupported:…………..1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:…………………..Down
PhysLinkState:……………….PortConfigurationTraining
LinkSpeedActive:……………..2.5 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:…………………4096
OperVLs:…………………….VL0
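
Right after the port is re-enabled, the output above still shows LinkState Down and PhysLinkState PortConfigurationTraining, so the link may need a short while before it reads LinkUp. A minimal Python sketch of a polling loop for that wait follows. The callable read_phys_state is a hypothetical stand-in for whatever reads the PhysLinkState field (for example by running getportstatus), and the attempt count and delay are arbitrary choices.

```python
import time


def wait_for_link_up(read_phys_state, attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Poll until the port reports LinkUp; return the list of states seen."""
    seen = []
    for attempt in range(attempts):
        state = read_phys_state()
        seen.append(state)
        if state == "LinkUp":
            break
        if attempt < attempts - 1:
            sleep(delay_seconds)  # pause before the next poll
    return seen


# Stand-in for a real query: replays the states in the order this article shows.
replay = iter(["PortConfigurationTraining", "PortConfigurationTraining", "LinkUp"])
states = wait_for_link_up(lambda: next(replay), sleep=lambda seconds: None)
print(states)
```

If the last state in the returned list is not LinkUp, the port did not come up within the allowed attempts and needs further investigation.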


  • Verify

Now verify the port status using the following commands.

Ibportstate command

[root@dm01sw-ib3 ~]# ibportstate 3 35
PortInfo:
# Port info: Lid 3 port 35
LinkState:…………………..Active
PhysLinkState:……………….LinkUp
LinkWidthSupported:…………..1X or 4X
LinkWidthEnabled:…………….1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:……………..10.0 Gbps
Peer PortInfo:
# Port info: Lid 3 DR path slid 65535; dlid 65535; 0,35 port 2
LinkState:…………………..Active
PhysLinkState:……………….LinkUp
LinkWidthSupported:…………..1X or 4X
LinkWidthEnabled:…………….1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:……………..10.0 Gbps

Getportstatus command

[root@dm01sw-ib3 ~]# getportstatus 35
Port status for connector 6A Switch port 35
Adminstate:………………….Enabled
LinkWidthEnabled:…………….1X or 4X
LinkWidthSupported:…………..1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:…………………..Active
PhysLinkState:……………….LinkUp
LinkSpeedActive:……………..10.0 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:…………………4096
OperVLs:…………………….VL0

Listlinkup command

[root@dm01sw-ib3 ~]# listlinkup
Connector  0A Not present
Connector  1A Not present
Connector  2A Not present
Connector  3A Not present
Connector  4A Not present
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is up (Enabled)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
Connector  8A Present <-> Switch Port 31 is up (Enabled)
Connector  9A Present <-> Switch Port 14 is up (Enabled)
Connector 10A Present <-> Switch Port 16 is up (Enabled)
Connector 11A Present <-> Switch Port 18 is up (Enabled)
Connector 12A Not present
Connector 13A Not present
Connector 14A Present <-> Switch Port 07 is up (Enabled)
Connector 15A Not present
Connector 16A Not present
Connector 17A Present <-> Switch Port 01 is up (Enabled)
Connector  0B Not present
Connector  1B Not present
Connector  2B Not present
Connector  3B Not present
Connector  4B Not present
Connector  5B Present <-> Switch Port 29 is up (Enabled)
Connector  6B Not present
Connector  7B Present <-> Switch Port 34 is up (Enabled)
Connector  8B Not present
Connector  9B Present <-> Switch Port 13 is up (Enabled)
Connector 10B Present <-> Switch Port 15 is up (Enabled)
Connector 11B Present <-> Switch Port 17 is up (Enabled)
Connector 12B Not present
Connector 13B Present <-> Switch Port 10 is up (Enabled)
Connector 14B Not present
Connector 15B Not present
Connector 16B Present <-> Switch Port 04 is up (Enabled)
Connector 17B Present <-> Switch Port 02 is up (Enabled)

Conclusion
In this article we have learned various Infiniband switch commands to identify the port status and resolve port related issues.




When working with Oracle Support on an Infiniband switch hardware Service Request, Oracle Support may ask you to upload an ILOM snapshot so that they can properly assess the hardware failure. On Exadata X4 and higher, you can collect the snapshot for an Infiniband switch through the web browser interface.

In this article I will demonstrate how to collect ILOM snapshot data for an Infiniband switch by connecting to the switch with a web browser.

Steps to collect ILOM Snapshot for IB Switch

  • Open a web browser (use something other than Internet Explorer) and enter the Infiniband Switch hostname.

Note: There is NO *-ILOM* in the hostname.

  • Enter root as the User Name, enter the root password, and click Log In.

 

Note: The browser may show a security warning. Ignore or override it by clicking I understand the risks / Add exception / Confirm Security Exception.

  • Select Maintenance -> Snapshot

  • This takes you to the Server Snapshot Utility page.

 


On this page, select Data Set “Normal”, select Transfer Method “Browser”, and click “Run”.

Normal: specifies that ILOM, operating system, and hardware information is collected.
The download file will be saved according to your browser settings.

Important Note: Do not enable the option “Collect Only Log Files from Data Set”. Doing so limits the snapshot to a much smaller subset of log files.
 


  • In the dialog box, specify the directory to which to save the file and the file name.

Click OK. The file is saved to the specified directory.
 



  • Upload the zip file to the Oracle Support SR for review.


Conclusion
In this article we have learned how to collect the ILOM snapshot diagnostic data for an Infiniband switch to investigate a hardware failure. Oracle Support commonly requests an ILOM snapshot for the IB switch when investigating hardware issues.
 

We received the following alert from OEM reporting that port 36 on IB switch dm01sw-iba01 has errors.

Alert message from OEM 12c:

Host=dm01db01.netsoftmate.com
Target type=Oracle Infiniband Switch
Target name=dm01sw-iba01.netsoftmate.com
Categories=Error
Message=Port 36 has 10 total errors, crossed warning (10) or critical ( ) threshold.
Severity=Warning
Event reported time=May 29, 2017 2:11:14 AM CDT
Target Lifecycle Status=Production
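
The alert text is a list of Key=Value lines, so the switch name and port number can be pulled out before logging in to the switch. Here is a minimal Python sketch, assuming the alert layout shown above. The function names parse_alert and alert_target are illustrative and are not part of OEM.

```python
import re


def parse_alert(text):
    """Split 'Key=Value' lines into a dict; lines without '=' are ignored."""
    alert = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator:
            alert[key.strip()] = value.strip()
    return alert


def alert_target(alert):
    """Return (short switch name, port number) named in the alert, if present."""
    match = re.search(r"\bPort (\d+)\b", alert.get("Message", ""))
    port = int(match.group(1)) if match else None
    switch = alert.get("Target name", "").split(".")[0] or None
    return switch, port


sample = """\
Host=dm01db01.netsoftmate.com
Target type=Oracle Infiniband Switch
Target name=dm01sw-iba01.netsoftmate.com
Message=Port 36 has 10 total errors, crossed warning (10) or critical ( ) threshold.
Severity=Warning
"""
print(alert_target(parse_alert(sample)))
```

For the alert above this yields the switch dm01sw-iba01 and port 36, which are the two values used in the commands that follow.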


Here are a few IB commands that can be used to identify a problem with an IB port.

Troubleshooting steps:

  • Log in to the problematic IB switch as the root user using PuTTY
login as: root
root@dm01sw-iba01.netsoftmate.com’s password:
Last login: Wed May 17 01:36:17 2017 from dm01db01.netsoftmate.com
You are now logged in to the root shell.
It is recommended to use ILOM shell instead of root shell.
All usage should be restricted to documented commands and documented
config files.
To view the list of documented commands, use “help” at linux prompt.

 
  • Use the listlinkup command to check the port status.
[root@dm01sw-iba01 ~]# listlinkup
Connector  0A Not present
Connector  1A Not present
Connector  2A Not present
Connector  3A Not present
Connector  4A Not present
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is up (Enabled)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
Connector  8A Present <-> Switch Port 31 is up (Enabled)
Connector  9A Present <-> Switch Port 14 is up (Enabled)
Connector 10A Present <-> Switch Port 16 is up (Enabled)
Connector 11A Present <-> Switch Port 18 is up (Enabled)
Connector 12A Not present
Connector 13A Present <-> Switch Port 09 is up (Enabled)
Connector 14A Present <-> Switch Port 07 is up (Enabled)
Connector 15A Present <-> Switch Port 05 is up (Enabled)
Connector 16A Present <-> Switch Port 03 is up (Enabled)
Connector 17A Present <-> Switch Port 01 is up (Enabled)
Connector  0B Not present
Connector  1B Not present
Connector  2B Not present
Connector  3B Not present
Connector  4B Not present
Connector  5B Not present
Connector  6B Present <-> Switch Port 36 is up (Enabled)
Connector  7B Present <-> Switch Port 34 is up (Enabled)
Connector  8B Present <-> Switch Port 32 is down (Enabled)
Connector  9B Present <-> Switch Port 13 is up (Enabled)
Connector 10B Present <-> Switch Port 15 is up (Enabled)
Connector 11B Present <-> Switch Port 17 is up (Enabled)
Connector 12B Present <-> Switch Port 12 is up (Enabled)
Connector 13B Present <-> Switch Port 10 is up (Enabled)
Connector 14B Present <-> Switch Port 08 is up (Enabled)
Connector 15B Present <-> Switch Port 06 is up (Enabled)
Connector 16B Present <-> Switch Port 04 is up (Enabled)
Connector 17B Present <-> Switch Port 02 is up (Enabled)


From the above output we can see that port 36 is up and Enabled, and no issues are reported for it.


  • Use the ibportstate and getportstatus IB commands to identify the port status.
First identify the lid number for the problematic IB switch with ibswitches.
Here the lid number for IB switch dm01sw-iba01 is 1.

[root@dm01sw-iba01 ~]# ibswitches
Switch  : 0x0010e0650e2ea0a0 ports 36 “SUN DCS 36P QDR dm01sw-iba01 10.21.50.2” enhanced port 0 lid 1 lmc 0
Switch  : 0x0010e0650d90a0a0 ports 36 “SUN DCS 36P QDR dm01sw-ibb01 10.21.50.3” enhanced port 0 lid 2 lmc 0

[root@dm01sw-iba01 ~]# ibportstate 1 36
PortInfo:
# Port info: Lid 1 port 36
LinkState:…………………..Active
PhysLinkState:……………….LinkUp
LinkWidthSupported:…………..1X or 4X
LinkWidthEnabled:…………….1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:……………..10.0 Gbps
Peer PortInfo:
# Port info: Lid 1 DR path slid 65535; dlid 65535; 0,36 port 2
LinkState:…………………..Active
PhysLinkState:……………….LinkUp
LinkWidthSupported:…………..1X or 4X
LinkWidthEnabled:…………….1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:……………..10.0 Gbps

[root@dm01sw-iba01 ~]# getportstatus 36
Port status for connector 6B Switch port 36
Adminstate:………………….Enabled
LinkWidthEnabled:…………….1X or 4X
LinkWidthSupported:…………..1X or 4X
LinkWidthActive:……………..4X
LinkSpeedSupported:…………..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:…………………..Active
PhysLinkState:……………….LinkUp
LinkSpeedActive:……………..10.0 Gbps
LinkSpeedEnabled:…………….2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:…………………4096
OperVLs:…………………….VL0-3


From the above output we can see that port 36 is Enabled, the LinkState is Active, and no issues are reported.


  • Use the ibdiagnet command to check the network quality and errors.
[root@dm01sw-iba01 ~]# ibdiagnet
Loading IBDIAGNET from: /usr/lib/ibdiagnet1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib/ibdm1.2
-I- Using port 0 as the local port.
-I- Discovering … 17 nodes (2 Switches & 15 CA-s) discovered.


-I—————————————————
-I- Bad Guids/LIDs Info
-I—————————————————
-I- skip option set. no report will be issued

-I—————————————————
-I- Links With Logical State = INIT
-I—————————————————
-I- No bad Links (with logical state = INIT) were found

-I—————————————————
-I- PM Counters Info
-I—————————————————
-I- No illegal PM counters values were found

-I—————————————————
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I—————————————————

-I—————————————————
-I- IPoIB Subnets Check
-I—————————————————
-I- Subnet: IPv4 PKey:0x0001 QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- No members found for group
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- No members found for group

-I—————————————————
-I- Bad Links Info
-I- No bad link were found
-I—————————————————
—————————————————————-
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0
    Link State Active Check                  0      0
    Performance Counters Report              0      0
    Partitions Check                         0      0
    IPoIB Subnets Check                      0      2

Please see /tmp/ibdiagnet.log for complete log
—————————————————————-

-I- Done. Run time was 15 seconds.



From the above output we can see that no errors are reported. The only warnings are the two “No members found for group” messages from the IPoIB Subnets Check.
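
The Stages Status Report at the end of the ibdiagnet output is the quickest part to check automatically. Below is a minimal Python sketch, assuming the report layout shown above (a STAGE header followed by rows ending in an error count and a warning count). The function name stage_findings is hypothetical.

```python
import re

# A report row: stage name, at least two spaces, error count, warning count.
STAGE_ROW = re.compile(r"^(?P<stage>\S.*?)\s{2,}(?P<errors>\d+)\s+(?P<warnings>\d+)$")


def stage_findings(output):
    """Return {stage: (errors, warnings)} for stages with a non-zero count."""
    findings = {}
    in_report = False
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("STAGE"):
            in_report = True  # rows of interest come after this header
            continue
        if not in_report:
            continue
        row = STAGE_ROW.match(text)
        if row and (int(row["errors"]) or int(row["warnings"])):
            findings[row["stage"]] = (int(row["errors"]), int(row["warnings"]))
    return findings


sample = """\
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0
    Link State Active Check                  0      0
    Performance Counters Report              0      0
    Partitions Check                         0      0
    IPoIB Subnets Check                      0      2

Please see /tmp/ibdiagnet.log for complete log
"""
print(stage_findings(sample))
```

For the report above only the IPoIB Subnets Check is returned, with 0 errors and 2 warnings.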


Conclusion:
In this article we have learned how to execute various IB Switch commands to identify the IB port errors or issues. 

