
  • Exadata: Clear Hardware Fault Post Hardware Replacement

    Introduction


    I was working on a hardware (processor) failure on an Exadata X5-2 compute node. An automatic SR was generated for the failure, and an Oracle Field Engineer contacted us and replaced the faulty hardware. Everything went smoothly up to this point, but we noticed that the fault was not cleared automatically after the replacement, so we ended up clearing the hardware fault manually.


    In this article I will demonstrate how to clear a hardware (processor) fault manually. The same steps can be used to clear a fault on other types of hardware by substituting the corresponding hardware name/path.


    • To identify the faulty hardware, execute the following ILOM command:

    [root@dm01db01 ~]# ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted"
    Connected. Use ^D to exit.
    -> show -d properties -level all /SYS/MB fault_state==Faulted
      /SYS/MB/P1
        Properties:
            type = Host Processor
            ipmi_name = MB/P1
            fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
            fru_version = 02
            fru_part_number = 060F
            fault_state = Faulted
            clear_fault_action = (none)




    -> Session closed
    Disconnected


    From the output above we can see that Processor P1 (/SYS/MB/P1) is reported as faulty and needs replacement.
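
When several components are faulted, it can help to reduce the output to just the target paths. The following is a minimal sketch, assuming the output has the same layout as shown above (each faulted target on its own indented line starting with /SYS). The function name list_faulted is hypothetical and is not part of ILOM or ipmitool.

```shell
#!/bin/sh
# list_faulted (hypothetical helper): read the output of the ILOM
# "show ... fault_state==Faulted" command on stdin and print one
# faulted target path (for example /SYS/MB/P1) per line.
list_faulted() {
    # Strip a trailing carriage return and surrounding whitespace,
    # then keep only the lines that are /SYS target paths.
    awk '{ sub(/\r$/, ""); gsub(/^[ \t]+|[ \t]+$/, "") } /^\/SYS\// { print }'
}
```

It could be used by piping the command from this step into it, for example: ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted" | list_faulted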


    You can also check for hardware failures using the Web ILOM.



    Steps to Clear a hardware fault post hardware replacement:


    • Identify the hardware fault

    [root@dm01db01 ~]# ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted"
    Connected. Use ^D to exit.
    -> show -d properties -level all /SYS/MB fault_state==Faulted
      /SYS/MB/P1
        Properties:
            type = Host Processor
            ipmi_name = MB/P1
            fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
            fru_version = 02
            fru_part_number = 060F
            fault_state = Faulted
            clear_fault_action = (none)




    -> Session closed
    Disconnected


    • Connect to the ILOM of the affected compute node

    [root@dm01db01 ~]# ssh dm01db01-ilom
    The authenticity of host 'dm01db01-ilom (10.10.10.11)' can't be established.
    RSA key fingerprint is 52:45:af:c4:08:29:c4:6a:15:d9:5f:6d:14:cb:23:b1.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'dm01db01-ilom,10.10.10.11' (RSA) to the list of known hosts.
    Password:


    Oracle(R) Integrated Lights Out Manager


    Version 3.2.8.24 r114580


    Copyright (c) 2016, Oracle and/or its affiliates. All rights reserved.


    Warning: HTTPS certificate is set to factory default.


    Hostname: dm01db01-ilom


    -> show -d properties -level all /SYS/MB fault_state==Faulted
      /SYS/MB/P1
        Properties:
            type = Host Processor
            ipmi_name = MB/P1
            fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
            fru_version = 02
            fru_part_number = 060F
            fault_state = Faulted
            clear_fault_action = (none)


    • Execute the following command to clear the fault

    -> set /SYS/MB/P1 clear_fault_action=true
    Are you sure you want to clear /SYS/MB/P1 (y/n)? y
    Set 'clear_fault_action' to 'true'
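
Because the same command applies to any faulted component once the path is substituted, a small helper can generate the exact line to type at the ILOM prompt. This is only a sketch: build_clear_cmd is a hypothetical name, and the path check (letters, digits, underscores and slashes under /SYS) is an assumption about what a valid target looks like, not an ILOM rule.

```shell
#!/bin/sh
# build_clear_cmd (hypothetical helper): print the ILOM command that
# clears the fault on one target. It only prints the command; it does
# not connect to ILOM or change anything.
build_clear_cmd() {
    case "$1" in
        # Reject paths containing anything other than the assumed
        # safe characters, so typos are caught before pasting.
        /SYS/*[!A-Za-z0-9_/]*) echo "invalid target: $1" >&2; return 1 ;;
        # Accept /SYS/ followed by at least one character.
        /SYS/?*) printf 'set %s clear_fault_action=true\n' "$1" ;;
        # Everything else (empty string, other trees) is rejected.
        *) echo "invalid target: $1" >&2; return 1 ;;
    esac
}
```

For the processor in this article, build_clear_cmd /SYS/MB/P1 prints the same set command used in this step.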


    • Verify the fault is cleared

    -> show -d properties -level all /SYS/MB fault_state==Faulted
    show: Query found no matches.


    No faulty hardware is reported, which confirms that the fault has been cleared.
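
The same check can be scripted from the compute node. The sketch below assumes the "Query found no matches" message shown above is what ILOM prints when nothing is faulted. The function name fault_cleared is hypothetical.

```shell
#!/bin/sh
# fault_cleared (hypothetical helper): read the output of the ILOM
# "show ... fault_state==Faulted" command on stdin. Exit status is 0
# when ILOM reported no matches, and non-zero otherwise (including
# when the output is empty, e.g. because the connection failed).
fault_cleared() {
    grep -q 'Query found no matches'
}
```

A possible use is: ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted" | fault_cleared && echo "no faults reported". Treating missing output as "not cleared" is a deliberate choice here, so that a failed connection is never mistaken for a healthy system.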


    You can also verify this from the Web ILOM.



    • Exit from ILOM

    -> exit
    Connection to dm01db01-ilom closed.
    [root@dm01db01 ~]#


    Conclusion


    In this article we have learned how to identify a hardware fault and clear it manually after the hardware has been replaced.