Tag: Clear fault

  • How To Clear Hardware Fault on Exadata Infiniband Switch Manually

    Introduction

    We had a fan failure (FAN2) on an Exadata Infiniband switch and scheduled the hardware replacement with Oracle. The Oracle Field Engineer came to the customer data center and replaced the faulty fan. The replacement was successful, but the fault was not cleared automatically: FAN2 was still marked as faulted in both the Infiniband BUI and the CLI.

    [Screenshot: FAN2 marked as faulted in the Infiniband Browser User Interface]


    In this article we will demonstrate how to clear the fault on an Infiniband switch after a hardware replacement.


    • Log in to the Infiniband switch as the root user (for example with PuTTY) and check the switch health. The output below shows that all fans are healthy.

    [root@dm01sw-iba01 ~]# env_test
    Environment test started:
    Starting Environment Daemon test:
    Environment daemon running
    Environment Daemon test returned OK
    Starting Voltage test:
    Voltage ECB OK
    Measured 3.3V Main = 3.28 V
    Measured 3.3V Standby = 3.39 V
    Measured 12V = 11.97 V
    Measured 5V = 5.02 V
    Measured VBAT = 3.14 V
    Measured 2.5V = 2.49 V
    Measured 1.8V = 1.79 V
    Measured I4 1.2V = 1.22 V
    Voltage test returned OK
    Starting PSU test:
    PSU 0 present OK
    PSU 1 present OK
    PSU test returned OK
    Starting Temperature test:
    Back temperature 40
    Front temperature 41
    SP temperature 57
    Switch temperature 55, maxtemperature 59
    Temperature test returned OK
    Starting FAN test:
    Fan 0 not present
    Fan 1 running at rpm 17004
    Fan 2 running at rpm 15696
    Fan 3 running at rpm 17004
    Fan 4 not present
    FAN test returned OK
    Starting Connector test:
    Connector test returned OK
    Starting Onboard ibdevice test:
    Switch OK
    All Internal ibdevices OK
    Onboard ibdevice test returned OK
    Starting SSD test:
    SSD test returned OK
    Starting Auto-link-disable test:
    Auto-link-disable test returned OK
    Environment test PASSED
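
    When several switches need checking, the env_test output can be scanned by a script instead of by eye. The following is a minimal Python sketch, assuming the output has already been captured as text. The function names are hypothetical, and the line patterns are inferred only from the sample output above; in particular, how a failing test is reported (anything other than "OK") is an assumption.

```python
import re

def failed_env_tests(output):
    """Return the names of env_test sub-tests that did not report OK.

    Assumption: every sub-test ends with a line of the form
    '<name> test returned <status>', as in the sample output.
    """
    failed = []
    for line in output.splitlines():
        match = re.match(r"\s*(.+?) test returned (\S+)", line)
        if match and match.group(2) != "OK":
            failed.append(match.group(1))
    return failed

def env_test_passed(output):
    """True if the overall PASSED line is present and no sub-test failed."""
    return "Environment test PASSED" in output and not failed_env_tests(output)

# Shortened sample taken from the env_test output shown above.
sample = """\
Starting FAN test:
Fan 0 not present
Fan 1 running at rpm 17004
FAN test returned OK
Starting SSD test:
SSD test returned OK
Environment test PASSED
"""

print(failed_env_tests(sample))  # []
print(env_test_passed(sample))   # True
```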

    • Check the fan speed. All fans that are present report a running speed.

    [root@dm01sw-iba01 ~]# getfanspeed
    Fan 0 not present
    Fan 1 running at rpm 17004
    Fan 2 running at rpm 15478
    Fan 3 running at rpm 17004
    Fan 4 not present
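
    The getfanspeed output is easy to turn into a small data structure for monitoring. This is a sketch only, assuming the two line formats seen above ("running at rpm N" and "not present"); the function name `parse_fan_speeds` is hypothetical.

```python
import re
from typing import Dict, Optional

def parse_fan_speeds(output: str) -> Dict[int, Optional[int]]:
    """Map fan number to rpm, or None when the fan slot is not present."""
    fans: Dict[int, Optional[int]] = {}
    for line in output.splitlines():
        match = re.match(
            r"\s*Fan (\d+) (?:running at rpm (\d+)|not present)", line
        )
        if match:
            rpm = match.group(2)
            fans[int(match.group(1))] = int(rpm) if rpm else None
    return fans

# Sample copied from the getfanspeed output above.
sample = """\
Fan 0 not present
Fan 1 running at rpm 17004
Fan 2 running at rpm 15478
Fan 3 running at rpm 17004
Fan 4 not present
"""

speeds = parse_fan_speeds(sample)
print(speeds)  # {0: None, 1: 17004, 2: 15478, 3: 17004, 4: None}
```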

    • Switch to the ilom-admin user

    [root@dm01sw-iba01 ~]# su - ilom-admin

    Oracle(R) Integrated Lights Out Manager

    Version 2.2.9-3 ILOM 3.2.11 r124039

    Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.

    Warning: HTTPS certificate is set to factory default.

    Hostname: dm01sw-iba01.netsoftmate.com

    ->

    • Check the fault table for faulty components. FAN2 is still reported as Faulted even though the fan has been replaced.

    -> show / -a -l 4 -o table fault_state
    Target                                  | Property                                     | Value
    ----------------------------------------+----------------------------------------------+--------------------------------------------------------------------
    /SYS                                    | fault_state                                  | OK
    /SYS/MB                                 | fault_state                                  | OK
    /SYS/PSU0                               | fault_state                                  | OK
    /SYS/PSU1                               | fault_state                                  | OK
    /SYS/FAN1                               | fault_state                                  | OK
    /SYS/FAN2                               | fault_state                                  | Faulted
    /SYS/FAN3                               | fault_state                                  | OK

    ->
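
    The fault_state table can also be filtered by a script so that only the faulted targets remain. The sketch below is in Python and assumes the three-column, pipe-separated layout shown above; `faulted_targets` is a hypothetical helper name.

```python
def faulted_targets(table):
    """Return targets whose fault_state value is anything other than OK."""
    faulted = []
    for line in table.splitlines():
        columns = [column.strip() for column in line.split("|")]
        # Data rows have exactly three columns; the header and the
        # separator line do not have 'fault_state' in the middle column.
        if len(columns) == 3 and columns[1] == "fault_state" and columns[2] != "OK":
            faulted.append(columns[0])
    return faulted

# Rows taken from the table above (column padding shortened).
sample = """\
Target      | Property      | Value
------------+---------------+---------
/SYS        | fault_state   | OK
/SYS/MB     | fault_state   | OK
/SYS/FAN1   | fault_state   | OK
/SYS/FAN2   | fault_state   | Faulted
/SYS/FAN3   | fault_state   | OK
"""

print(faulted_targets(sample))  # ['/SYS/FAN2']
```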

    • You can also run the following command to identify the faulted component

    -> show -d targets /SP/faultmgmt

     /SP/faultmgmt
        Targets:
            shell
            0 (/SYS/FAN2)

    • Clear the fault as shown below

    -> set /SYS/FAN2 clear_fault_action=true
    Are you sure you want to clear /SYS/FAN2 (y/n)? y
    Set 'clear_fault_action' to 'true'
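
    If more than one component is listed under /SP/faultmgmt, the matching clear commands can be generated from that listing. The following Python sketch only builds the command strings and does not run them. It assumes the "N (/SYS/...)" target format shown earlier, and both function names are hypothetical. Review the generated commands before using them, since a fault should only be cleared once the hardware has actually been replaced.

```python
import re

def faultmgmt_targets(output):
    """Extract faulted component paths from 'show -d targets /SP/faultmgmt'."""
    # Matches lines such as '0 (/SYS/FAN2)'; the 'shell' target is skipped.
    return re.findall(r"^\s*\d+ \((/\S+)\)\s*$", output, flags=re.MULTILINE)

def clear_fault_commands(targets):
    """Build one ILOM 'set ... clear_fault_action=true' command per target."""
    return ["set {} clear_fault_action=true".format(t) for t in targets]

# Sample copied from the faultmgmt listing shown earlier.
sample = """\
 /SP/faultmgmt
    Targets:
        shell
        0 (/SYS/FAN2)
"""

targets = faultmgmt_targets(sample)
print(targets)                        # ['/SYS/FAN2']
print(clear_fault_commands(targets))  # ['set /SYS/FAN2 clear_fault_action=true']
```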

    • Verify the fault is cleared

    -> show / -a -l 4 -o table fault_state
    Target                                  | Property                                     | Value
    ----------------------------------------+----------------------------------------------+--------------------------------------------------------------------
    /SYS                                    | fault_state                                  | OK
    /SYS/MB                                 | fault_state                                  | OK
    /SYS/PSU0                               | fault_state                                  | OK
    /SYS/PSU1                               | fault_state                                  | OK
    /SYS/FAN1                               | fault_state                                  | OK
    /SYS/FAN2                               | fault_state                                  | OK
    /SYS/FAN3                               | fault_state                                  | OK

    -> show -d targets /SP/faultmgmt

     /SP/faultmgmt
        Targets:
            shell

    • Verify from the Infiniband BUI

    Conclusion

    In this article we have learned how to identify a fault and clear it manually on an Exadata Infiniband switch. The ILOM commands come in handy for clearing the fault. You can also clear the fault using the Browser User Interface (BUI).

  • Exadata – Display and Clear Fault using Fault Manager (faultmgmt)

    Exadata Database Machine consists of a storage grid, compute grid, and network grid. Each grid, or hardware layer, is built with multiple high-performing, industry-standard Oracle servers to provide hardware and system fault tolerance. The hardware components are nevertheless subject to failure. The most common failure on Exadata is a hard disk failure on the storage cells. With the latest generations of Exadata, hardware failures are infrequent and less troublesome.


    The Exadata storage cells and compute nodes consist of several hardware components, such as:

    • Hard disk
    • Flash disk
    • Physical memory
    • Processor
    • IB ports
    • Motherboard
    • Batteries
    • Power supply
    • and so on





    In this article we will demonstrate how to view the hardware fault and clear it using ILOM fault manager (faultmgmt).




    Steps to display and clear hardware fault using faultmgmt:




    Step 1: Log in to the ILOM of the compute node where the fault occurred


    [root@dm01db01 ~]# ssh dm01db02-ilom
    Password:


    Oracle(R) Integrated Lights Out Manager


    Version 4.0.0.24 r121523


    Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.


    Warning: HTTPS certificate is set to factory default.


    Hostname: dm01db02-ilom


    Step 2: Check whether the fault manager is supported. If you get output like the following, the fault manager is supported.

    -> show /SP/faultmgmt/shell


     /SP/faultmgmt/shell
        Targets:


        Properties:


        Commands:
            cd
            show
            start




    Step 3: Start the fault manager shell


    -> start /SP/faultmgmt/shell
    Are you sure you want to start /SP/faultmgmt/shell (y/n)? y


    Step 4: Execute the following command to display the fault. Here we can see that there is no problem with the hardware itself, but the ILOM file system is 100% full.


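    faultmgmtsp> fmadm faulty
    ------------------- ------------------------------------ -------------- --------
    Time                UUID                                 msgid          Severity
    ------------------- ------------------------------------ -------------- --------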
    2018-06-17/15:55:32 2a854ad2-4a31-e829-e26c-c84ba212d7f2 ILOM-8000-JV   Major


    Problem Status           : open
    Diag Engine              : fdd 1.0
    System
       Manufacturer          : Oracle Corporation
       Name                  : Exadata X5-2
       Part_Number           : Exadata X5-2
       Serial_Number         : AK00XXXXXX


    System Component
       Manufacturer          : Oracle Corporation
       Name                  : ORACLE SERVER X5-2
       Part_Number           : 7090664
       Serial_Number         : 15XXXXXXXX
       Firmware_Manufacturer : Oracle Corporation
       Firmware_Version      : (ILOM)4.0.0.24
       Firmware_Release      : (ILOM)2017.09.23


    ----------------------------------------
    Suspect 1 of 1
       Problem class  : defect.ilom.fs.full
       Certainty      : 100%
       Affects        : /SYS/SP
       Status         : faulted


       FRU
          Status            : faulty
          Location          : /SYS/SP
          Manufacturer      : Oracle Corporation
          Name              : SP
          Part_Number       : PILOT3
          Chassis
             Manufacturer   : Oracle Corporation
             Name           : ORACLE SERVER X5-2
             Part_Number    : 7090664
             Serial_Number  : 1547NM10CX


    Description : An ILOM filesystem has exceeded the filesystem capacity
                  limit.


    Response    : The chassis wide service-required LED will be illuminated.


    Impact      : ILOM commands may fail, especially those which make
                  configuration changes.


    Action      : Please refer to the associated reference document at
                  http://support.oracle.com/msg/ILOM-8000-JV for the latest
                  service procedures and policies regarding this diagnosis.
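
    The summary line of the fmadm faulty output carries everything needed for the next step: the UUID, the message ID, and the severity. The Python sketch below extracts those fields, assuming the four-column summary format shown above. The names `FAULT_LINE` and `parse_fmadm_faulty` are hypothetical, and the resulting command is only printed, not executed.

```python
import re

# One summary row: time, UUID, msgid, severity (format taken from the
# sample output; other ILOM releases may differ).
FAULT_LINE = re.compile(
    r"^\s*(?P<time>\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msgid>\S+)\s+(?P<severity>\S+)\s*$",
    re.MULTILINE,
)

def parse_fmadm_faulty(output):
    """Return one dict per fault summary row found in the output."""
    return [match.groupdict() for match in FAULT_LINE.finditer(output)]

# Summary section copied from the output above.
sample = """\
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2018-06-17/15:55:32 2a854ad2-4a31-e829-e26c-c84ba212d7f2 ILOM-8000-JV   Major
"""

faults = parse_fmadm_faulty(sample)
for fault in faults:
    print(fault["msgid"], fault["severity"], "->", "fmadm acquit " + fault["uuid"])
```

    Keeping the message ID next to the UUID makes it easier to look up the referenced service document before deciding whether the fault can be acquitted.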


    Step 5: Execute the below command to clear the fault. Take the UUID from the output of the previous command.


    faultmgmtsp> fmadm acquit <UUID>


    faultmgmtsp> fmadm acquit 2a854ad2-4a31-e829-e26c-c84ba212d7f2


    Step 6: Verify that the fault is cleared


    faultmgmtsp> fmadm faulty
    No faults found


    Step 7: Exit from the fault manager


    faultmgmtsp> exit


    Step 8: Reset the ILOM service processor


    -> reset /SP
    Are you sure you want to reset /SP (y/n)? y
    Performing reset on /SP


    Step 9: Exit from the ILOM


    -> exit
    Connection to dm01db02-ilom closed.


    Step 10: Connect to the ILOM and verify that the SP has restarted


    [root@dm01db01 ~]# ssh dm01db02-ilom
    Password:


    Oracle(R) Integrated Lights Out Manager


    Version 4.0.0.24 r121523


    Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.


    Warning: HTTPS certificate is set to factory default.


    Hostname: dm01db02-ilom




    -> show -d properties /SP/clock uptime


     /SP/clock
        Properties:
            uptime = 0 days, 00:08:02
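
    A short uptime confirms that the SP was reset. For a scripted check, the uptime value can be parsed into a duration and compared against a limit. This is a Python sketch that assumes the "N days, HH:MM:SS" format shown above; the function names and the 30-minute default window are hypothetical choices, not ILOM values.

```python
import re
from datetime import timedelta

def parse_uptime(output):
    """Parse 'uptime = N days, HH:MM:SS' into a timedelta, or None."""
    match = re.search(r"uptime = (\d+) days?, (\d+):(\d+):(\d+)", output)
    if match is None:
        return None
    days, hours, minutes, seconds = (int(value) for value in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

def recently_reset(output, within=timedelta(minutes=30)):
    """True if the reported uptime is no longer than the given window."""
    uptime = parse_uptime(output)
    return uptime is not None and uptime <= within

# Sample copied from the /SP/clock output above.
sample = """\
 /SP/clock
    Properties:
        uptime = 0 days, 00:08:02
"""

print(parse_uptime(sample))    # 0:08:02
print(recently_reset(sample))  # True
```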




    Conclusion


    In this article we have learned how to display and clear a fault using the fault manager (faultmgmt). The Fault Management Shell is the preferred method for displaying the details of a diagnosed fault. Support for the faultmgmt command shell varies depending on the ILOM release level and server product model.