Exadata: Clear Hardware Fault Post Hardware Replacement

Introduction

I was working on a hardware (processor) failure on an Exadata X5-2 compute node. An automatic SR was generated for the failure, and an Oracle Field Engineer contacted us and replaced the faulty hardware. Everything went smoothly up to this point. However, we noticed that the fault was not cleared automatically even after the hardware replacement, so we ended up clearing it manually.

In this article I will demonstrate how to clear a hardware (processor) fault manually. The same steps can be used to clear faults on other types of hardware by substituting the hardware name/path.


  • To identify faulty hardware, execute the following ILOM command:

[root@dm01db01 ~]# ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted"
Connected. Use ^D to exit.
-> show -d properties -level all /SYS/MB fault_state==Faulted
  /SYS/MB/P1
    Properties:
        type = Host Processor
        ipmi_name = MB/P1
        fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
        fru_version = 02
        fru_part_number = 060F
        fault_state = Faulted
        clear_fault_action = (none)


-> Session closed
Disconnected

From the output above we can see that processor P1 (/SYS/MB/P1) is reported as faulty and needs replacement.

You can also check for hardware failures using the Web ILOM.
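
When more than one component is flagged, it can help to reduce the `show` output to just the target paths. The following is a minimal sketch, assuming the output layout shown above (a target path on its own line, followed by `name = value` property lines). The function name `faulted_paths` is my own and is not part of ILOM or ipmitool.

```shell
#!/bin/bash
# Hypothetical helper: print the ILOM target path of every component
# whose fault_state is Faulted. Reads the "show" output on standard input.
faulted_paths() {
    # Strip carriage returns in case the session output contains them.
    tr -d '\r' | awk '
        # A target line is a single field that starts with "/", e.g. /SYS/MB/P1.
        NF == 1 && substr($1, 1, 1) == "/" { target = $1; next }
        # Print the most recent target when its fault_state is Faulted.
        $1 == "fault_state" && $3 == "Faulted" && target != "" { print target }
    '
}
```

On the compute node this could be used as `ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted" | faulted_paths`, which for the output above should list only /SYS/MB/P1.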

Steps to clear a hardware fault after hardware replacement:


  • Identify the hardware fault

[root@dm01db01 ~]# ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted"
Connected. Use ^D to exit.
-> show -d properties -level all /SYS/MB fault_state==Faulted
  /SYS/MB/P1
    Properties:
        type = Host Processor
        ipmi_name = MB/P1
        fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
        fru_version = 02
        fru_part_number = 060F
        fault_state = Faulted
        clear_fault_action = (none)


-> Session closed
Disconnected


  • Connect to the ILOM of the problematic compute node

[root@dm01db01 ~]# ssh dm01db01-ilom
The authenticity of host 'dm01db01-ilom (10.10.10.11)' can't be established.
RSA key fingerprint is 52:45:af:c4:08:29:c4:6a:15:d9:5f:6d:14:cb:23:b1.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'dm01db01-ilom,10.10.10.11' (RSA) to the list of known hosts.
Password:

Oracle(R) Integrated Lights Out Manager

Version 3.2.8.24 r114580

Copyright (c) 2016, Oracle and/or its affiliates. All rights reserved.

Warning: HTTPS certificate is set to factory default.

Hostname: dm01db01-ilom

-> show -d properties -level all /SYS/MB fault_state==Faulted
  /SYS/MB/P1
    Properties:
        type = Host Processor
        ipmi_name = MB/P1
        fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
        fru_version = 02
        fru_part_number = 060F
        fault_state = Faulted
        clear_fault_action = (none)


  • Execute the following command to clear the fault

-> set /SYS/MB/P1 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P1 (y/n)? y
Set 'clear_fault_action' to 'true'


  • Verify the fault is cleared

-> show -d properties -level all /SYS/MB fault_state==Faulted
show: Query found no matches.

No faulty hardware was found.

You can also verify this from the Web ILOM.


  • Exit from ILOM

-> exit
Connection to dm01db01-ilom closed.
[root@dm01db01 ~]#
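
Since the same steps apply to other components once the path is substituted, the clear and verify steps can be sketched as two small shell helpers. This is only an illustration: the function names `clear_commands` and `fault_cleared` are hypothetical, and the restriction to targets under /SYS is my assumption based on the path seen in this article. The first helper only prints the `set` commands for review; it does not run anything against the ILOM.

```shell
#!/bin/bash
# Hypothetical helper: read ILOM target paths (one per line) on stdin and
# print the matching clear command for each, to be reviewed and then
# entered in the ILOM CLI.
clear_commands() {
    while IFS= read -r path; do
        case "$path" in
            # Assumption: only accept targets under /SYS; ignore anything else.
            /SYS/*) printf 'set %s clear_fault_action=true\n' "$path" ;;
        esac
    done
}

# Hypothetical helper: succeed when the ILOM query output on stdin
# contains the "no matches" message shown in the verification step.
fault_cleared() {
    grep -q 'Query found no matches'
}
```

For example, `printf '/SYS/MB/P1\n' | clear_commands` prints the same `set` command used above, and the output of the verification query can be piped into `fault_cleared` to get an exit status suitable for a simple check.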

Conclusion

In this article we have learned how to identify a hardware fault and clear it manually after the hardware has been replaced.