Tag: ILOM

  • Step By Step Exadata Storage Cell Rescue Process

     
    You may need to perform a storage cell rescue in the following situations:

    • Improper Battery Replacement
    • Improper Card Seating
    • Card Damage During Battery Replacement
    • Corrupted Root File System
    This article demonstrates, step by step, how to rescue an Exadata storage cell or server.
     
    Open a browser and enter the ILOM hostname or IP address of the Storage cell you want to rescue
    https://dm01cel02-ilom.netsoftmate.com
     
    Enter the root credentials

     
    On the left pane under “Remote Control”, click “Redirection”. Select “Use video redirection” and click “Launch Remote Console” button

     
    Click OK
     
     Click OK

     
    Click Continue

     
    Click Run

     
    Click Continue (not recommended)

     
    From the ILOM video console we can see that the root file system cannot be mounted due to corruption, and the server will reboot again in 60 seconds

     
    On the left pane under “Host Management” click on “Power Control”. From the drop down list Select “Power Cycle”

     
    Click Save

     
    Click OK

     
    Rebooting in progress

     
    Server is now rebooting

     
     
    Immediately press Ctrl+S on keyboard 

     
    Select the “CELL_USB_BOOT_CELLBOOT_usb_in_rescue_mode” option

     
    At this point, we have to continue the rescue process using the ILOM serial console

     
    As root, ssh to the storage cell ILOM and start the serial console

     
    Enter r and hit return

     
    Enter y and hit return

     
    Enter the rescue password sos1exadata. Enter n and hit return

     
    Enter the root user password 

     
    We are now in rescue mode. At this point, check that there are no file system issues and fix any other issues you may have. Consult Oracle Support if required
     
    Reboot the server again to complete the rescue process

     
    Hit return

     
    The server is powered off

     
    Power on the server using web ILOM as shown below

     
    Rescue process is completed and we got the root login prompt

     
     
    Login to the server as root user and perform the post rescue steps

      
    Verify the image version of the storage cell

     
     
    Post Storage Cell Rescue steps:
     
    [root@dm01cel02 ~]# imageinfo

    Kernel version: 4.1.12-94.8.4.el6uek.x86_64 #2 SMP Sat May 5 16:14:51 PDT 2018 x86_64
    Cell version: OSS_18.1.7.0.0AUG_LINUX.X64_180821
    Cell rpm version: cell-18.1.7.0.0_LINUX.X64_180821-1.x86_64

    Active image version: 18.1.7.0.0.180821
    Active image kernel version: 4.1.12-94.8.4.el6uek
    Active image activated: 2019-03-17 03:27:41 -0500
    Active image status: success
    Active system partition on device: /dev/md5
    Active software partition on device: /dev/md7

    Cell boot usb partition: /dev/sdm1
    Cell boot usb version: 18.1.7.0.0.180821

    Inactive image version: undefined
    Rollback to the inactive partitions: Impossible

    CellCLI> import celldisk all force
    No cell disks qualified for this import operation

    CellCLI> list physicaldisk
             12:0            PST0XV          normal
             12:1            PZNDSV          normal
             12:2            PT5Z4V          normal
             12:3            PU3XLV          normal
             12:4            PYAKLV          normal
             12:5            PV828V          normal
             12:6            PZE5NV          normal
             12:7            PYV0YV          normal
             12:8            PZKUXV          normal
             12:9            PYD86V          normal
             12:10           PZL15V          normal
             12:11           PZPLAV          normal
             FLASH_1_1       S2T7NCAHA00958  normal
             FLASH_2_1       S2T7NCAHA00986  normal
             FLASH_4_1       S2T7NCAHA00956  normal
             FLASH_5_1       S2T7NCAHA00947  normal

    CellCLI> list celldisk
             CD_00_dm01cel02        normal
             CD_01_dm01cel02        normal
             CD_02_dm01cel02        normal
             CD_03_dm01cel02        normal
             CD_04_dm01cel02        normal
             CD_05_dm01cel02        normal
             CD_06_dm01cel02        normal
             CD_07_dm01cel02        normal
             CD_08_dm01cel02        normal
             CD_09_dm01cel02        normal
             CD_10_dm01cel02        normal
             CD_11_dm01cel02        normal
             FD_00_dm01cel02        normal
             FD_01_dm01cel02        normal
             FD_02_dm01cel02        normal
             FD_03_dm01cel02        normal

    CellCLI> list griddisk
             DATA_DM01_CD_00_dm01cel02     active
             DATA_DM01_CD_01_dm01cel02     active
             DATA_DM01_CD_02_dm01cel02     active
             DATA_DM01_CD_03_dm01cel02     active
             DATA_DM01_CD_04_dm01cel02     active
             DATA_DM01_CD_05_dm01cel02     active
             DATA_DM01_CD_06_dm01cel02     active
             DATA_DM01_CD_07_dm01cel02     active
             DATA_DM01_CD_08_dm01cel02     active
             DATA_DM01_CD_09_dm01cel02     active
             DATA_DM01_CD_10_dm01cel02     active
             DATA_DM01_CD_11_dm01cel02     active
             DBFS_DG_CD_02_dm01cel02       active
             DBFS_DG_CD_03_dm01cel02       active
             DBFS_DG_CD_04_dm01cel02       active
             DBFS_DG_CD_05_dm01cel02       active
             DBFS_DG_CD_06_dm01cel02       active
             DBFS_DG_CD_07_dm01cel02       active
             DBFS_DG_CD_08_dm01cel02       active
             DBFS_DG_CD_09_dm01cel02       active
             DBFS_DG_CD_10_dm01cel02       active
             DBFS_DG_CD_11_dm01cel02       active
             RECO_DM01_CD_00_dm01cel02     active
             RECO_DM01_CD_01_dm01cel02     active
             RECO_DM01_CD_02_dm01cel02     active
             RECO_DM01_CD_03_dm01cel02     active
             RECO_DM01_CD_04_dm01cel02     active
             RECO_DM01_CD_05_dm01cel02     active
             RECO_DM01_CD_06_dm01cel02     active
             RECO_DM01_CD_07_dm01cel02     active
             RECO_DM01_CD_08_dm01cel02     active
             RECO_DM01_CD_09_dm01cel02     active
             RECO_DM01_CD_10_dm01cel02     active
             RECO_DM01_CD_11_dm01cel02     active

    [root@dm01cel02 ~]# cellcli -e list flashcache detail
             name:                   dm01cel02_FLASHCACHE
             cellDisk:               FD_03_dm01cel02,FD_01_dm01cel02,FD_02_dm01cel02,FD_00_dm01cel02
             creationTime:           2019-03-17T03:19:43-05:00
             degradedCelldisks:
             effectiveCacheSize:     11.64312744140625T
             id:                     574c3bd1-7a35-42ba-a03b-75f3a93edac7
             size:                   11.64312744140625T
             status:                 normal

    [root@dm01cel02 ~]# cellcli -e list flashlog detail
             name:                   dm01cel02_FLASHLOG
             cellDisk:               FD_03_dm01cel02,FD_00_dm01cel02,FD_01_dm01cel02,FD_02_dm01cel02
             creationTime:           2019-03-17T03:19:43-05:00
             degradedCelldisks:
             effectiveSize:          512M
             efficiency:             100.0
             id:                     73cd8288-c6d8-42c3-95a1-97ce287cf7d0
             size:                   512M
             status:                 normal
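The CellCLI listings above can also be checked with a script instead of by eye. The following is a minimal Python sketch, assuming you have captured the text of a default `list` command (for example with `cellcli -e list celldisk`); the function names `parse_cellcli_list` and `unhealthy` are hypothetical helpers, not part of any Oracle tool.

```python
def parse_cellcli_list(output, key_columns=1):
    """Parse default CellCLI LIST output into {name: status}.

    key_columns is the number of leading columns before the status:
    1 for celldisk/griddisk (name), 2 for physicaldisk (name, serial).
    """
    rows = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, prompts, and anything too short to hold a status.
        if len(fields) <= key_columns:
            continue
        if fields[0].startswith("CellCLI>") or fields[0].startswith("["):
            continue
        # Join the remaining fields so multi-word statuses stay intact.
        rows[fields[0]] = " ".join(fields[key_columns:])
    return rows


def unhealthy(rows, expected):
    """Return only the entries whose status differs from the expected one."""
    return {name: status for name, status in rows.items() if status != expected}


# Sample text copied from the 'list celldisk' output shown above.
sample = """\
         CD_00_dm01cel02        normal
         CD_01_dm01cel02        normal
         FD_00_dm01cel02        normal
"""

cell_disks = parse_cellcli_list(sample)
print(unhealthy(cell_disks, "normal"))
```

For cell disks and physical disks the expected status is `normal`; for grid disks it is `active`. An empty result means every listed disk is in the expected state.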

     
    SQL> select a.name,b.path,b.state,b.mode_status,b.failgroup
        from v$asm_diskgroup a, v$asm_disk b
        where a.group_number=b.group_number
        and b.failgroup='dm01cel02'
        order by 2,1;

    no rows selected

    SQL> alter diskgroup DBFS_DG add disk 'o/192.168.1.1;192.168.1.2/DBFS_DG_*_dm01cel02' force;

    Diskgroup altered.

     

    SQL> alter diskgroup DATA_DM01 add disk 'o/192.168.1.1;192.168.1.2/DATA_DM01_*_dm01cel02' force;

    Diskgroup altered.

     

    SQL> alter diskgroup RECO_DM01 add disk 'o/192.168.1.1;192.168.1.2/RECO_DM01_*_dm01cel02' force;

    Diskgroup altered.


     
    SQL> select * from v$asm_operation;

    GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE
    ------------ ----- ---- ---------- ---------- ---------- ---------- ---------- ----------- --------------------------------------------
               1 REBAL RUN           4          4     204367    3521267      13041         254
               3 REBAL WAIT          4
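As a rough cross-check of the rebalance estimate, the remaining time can be recomputed from the other columns. This is a small Python sketch under the assumption that EST_MINUTES is approximately the remaining work (EST_WORK minus SOFAR) divided by EST_RATE; that reading matches the figures in the row above but is not taken from Oracle documentation here. The function name is hypothetical.

```python
def remaining_minutes(sofar, est_work, est_rate):
    """Estimate minutes left in a rebalance from v$asm_operation columns.

    Assumption: remaining time = (EST_WORK - SOFAR) / EST_RATE, truncated
    to whole minutes. Returns None when no rate is available yet.
    """
    if est_rate <= 0:
        return None
    return (est_work - sofar) // est_rate


# Values from the RUN row for disk group 1 in the output above.
print(remaining_minutes(sofar=204367, est_work=3521267, est_rate=13041))  # 254
```

The estimate changes as the rate changes, so treat it as a guide and keep querying v$asm_operation until it returns no rows.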

     

    SQL> select * from v$asm_operation;

    no rows selected

    SQL> col path for a70
    SQL> set lines 200
    SQL> set pages 200
    SQL> select a.name,b.path,b.state,b.mode_status,b.failgroup
        from v$asm_diskgroup a, v$asm_disk b
        where a.group_number=b.group_number
        and b.failgroup='dm01cel02'
        order by 2,1;

    NAME                           PATH                                                                   STATE    MODE_ST FAILGROUP
    ------------------------------ ---------------------------------------------------------------------- -------- ------- ------------------------------
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_00_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_01_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_02_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_03_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_04_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_05_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_06_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_07_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_08_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_09_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_10_dm01cel02              NORMAL   ONLINE  dm01cel02
    DATA_DM01                     o/192.168.1.1;192.168.1.2/DATA_DM01_CD_11_dm01cel02              NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_02_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_03_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_04_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_05_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_06_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_07_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_08_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_09_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_10_dm01cel02                 NORMAL   ONLINE  dm01cel02
    DBFS_DG                        o/192.168.1.1;192.168.1.2/DBFS_DG_CD_11_dm01cel02                 NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_00_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_01_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_02_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_03_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_04_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_05_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_06_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_07_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_08_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_09_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_10_dm01cel02              NORMAL   ONLINE  dm01cel02
    RECO_DM01                     o/192.168.1.1;192.168.1.2/RECO_DM01_CD_11_dm01cel02              NORMAL   ONLINE  dm01cel02

    34 rows selected.
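The 34 ASM disks returned here correspond to the 34 grid disks listed on the cell earlier (12 DATA_DM01, 10 DBFS_DG, 12 RECO_DM01). A minimal Python sketch of that comparison follows, assuming the path format `o/<ip1>;<ip2>/<grid disk name>` seen in the output above; the helper names are hypothetical.

```python
def asm_path(grid_disk, cell_ips):
    """Build the ASM disk path for a grid disk, as shown in v$asm_disk.path."""
    return "o/" + ";".join(cell_ips) + "/" + grid_disk


def missing_from_asm(grid_disks, asm_paths, cell_ips):
    """Return grid disks that have no matching path in the ASM disk list."""
    present = set(asm_paths)
    return [g for g in grid_disks if asm_path(g, cell_ips) not in present]


# Small sample: two grid disks on the cell, only one visible in ASM.
cell_ips = ["192.168.1.1", "192.168.1.2"]
grid_disks = ["DATA_DM01_CD_00_dm01cel02", "DATA_DM01_CD_01_dm01cel02"]
asm_paths = ["o/192.168.1.1;192.168.1.2/DATA_DM01_CD_00_dm01cel02"]

print(missing_from_asm(grid_disks, asm_paths, cell_ips))
```

With the full lists from the cell and from ASM, an empty result indicates that every grid disk on the rescued cell has been added back to its disk group.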
     

     
    Conclusion
     
    In this article we have demonstrated the step-by-step procedure for rescuing a storage cell. You may need to rescue a storage cell for several reasons, such as a corrupted root file system, a kernel panic, or a server that reboots continuously. With the CELLBOOT USB, the storage cell rescue is straightforward.
     
  • How To Clear Hardware Fault on Exadata Infiniband Switch Manually

    Introduction

    We had a fan failure (FAN2) on an Exadata InfiniBand switch and scheduled the replacement of the faulty hardware with Oracle. The Oracle Field Engineer came to the customer data center and replaced the faulty fan. The replacement was successful; however, the fault was not cleared automatically, and the fan was still marked as faulted in both the InfiniBand BUI and the CLI.

    From Infiniband Browser User Interface


    In this article we will demonstrate how to clear the fault on Infiniband Switch after hardware replacement.


    • Log in to the InfiniBand switch as the root user using PuTTY and check the switch health. From the output below we can see that the fans are all good.

    [root@dm01sw-iba01 ~]# env_test
    Environment test started:
    Starting Environment Daemon test:
    Environment daemon running
    Environment Daemon test returned OK
    Starting Voltage test:
    Voltage ECB OK
    Measured 3.3V Main = 3.28 V
    Measured 3.3V Standby = 3.39 V
    Measured 12V = 11.97 V
    Measured 5V = 5.02 V
    Measured VBAT = 3.14 V
    Measured 2.5V = 2.49 V
    Measured 1.8V = 1.79 V
    Measured I4 1.2V = 1.22 V
    Voltage test returned OK
    Starting PSU test:
    PSU 0 present OK
    PSU 1 present OK
    PSU test returned OK
    Starting Temperature test:
    Back temperature 40
    Front temperature 41
    SP temperature 57
    Switch temperature 55, maxtemperature 59
    Temperature test returned OK
    Starting FAN test:
    Fan 0 not present
    Fan 1 running at rpm 17004
    Fan 2 running at rpm 15696
    Fan 3 running at rpm 17004
    Fan 4 not present
    FAN test returned OK
    Starting Connector test:
    Connector test returned OK
    Starting Onboard ibdevice test:
    Switch OK
    All Internal ibdevices OK
    Onboard ibdevice test returned OK
    Starting SSD test:
    SSD test returned OK
    Starting Auto-link-disable test:
    Auto-link-disable test returned OK
    Environment test PASSED

    • Check the FAN Speed. FAN looks good.

    [root@dm01sw-iba01 ~]# getfanspeed
    Fan 0 not present
    Fan 1 running at rpm 17004
    Fan 2 running at rpm 15478
    Fan 3 running at rpm 17004
    Fan 4 not present

    • Switch to the ilom-admin user

    [root@dm01sw-iba01 ~]# su - ilom-admin

    Oracle(R) Integrated Lights Out Manager

    Version 2.2.9-3 ILOM 3.2.11 r124039

    Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.

    Warning: HTTPS certificate is set to factory default.

    Hostname: dm01sw-iba01.netsoftmate.com

    ->

    • Now check the fault table for any faulty components. FAN2 is still shown as Faulted even though the fan was replaced with a new one.

    -> show / -a -l 4 -o table fault_state
    Target                                  | Property                                     | Value
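    ----------------------------------------+----------------------------------------------+--------------------------------------------------------------------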
    /SYS                                    | fault_state                                  | OK
    /SYS/MB                                 | fault_state                                  | OK
    /SYS/PSU0                               | fault_state                                  | OK
    /SYS/PSU1                               | fault_state                                  | OK
    /SYS/FAN1                               | fault_state                                  | OK
    /SYS/FAN2                               | fault_state                                  | Faulted
    /SYS/FAN3                               | fault_state                                  | OK

    ->

    • You can also execute the below command to identify the fault

    -> show -d targets /SP/faultmgmt

     /SP/faultmgmt
        Targets:
            shell
            0 (/SYS/FAN2)

    • Clear the fault as shown below

    -> set /SYS/FAN2 clear_fault_action=true
    Are you sure you want to clear /SYS/FAN2 (y/n)? y
    Set 'clear_fault_action' to 'true'

    • Verify the fault is cleared

    -> show / -a -l 4 -o table fault_state
    Target                                  | Property                                     | Value
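    ----------------------------------------+----------------------------------------------+--------------------------------------------------------------------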
    /SYS                                    | fault_state                                  | OK
    /SYS/MB                                 | fault_state                                  | OK
    /SYS/PSU0                               | fault_state                                  | OK
    /SYS/PSU1                               | fault_state                                  | OK
    /SYS/FAN1                               | fault_state                                  | OK
    /SYS/FAN2                               | fault_state                                  | OK
    /SYS/FAN3                               | fault_state                                  | OK

    -> show -d targets /SP/faultmgmt

     /SP/faultmgmt
        Targets:
            shell

    • Verify from the InfiniBand BUI
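The fault table printed by `show / -a -l 4 -o table fault_state` in the steps above is easy to scan with a script when there are many targets. This is a minimal Python sketch, assuming the pipe-separated three-column layout shown above; `faulted_targets` is a hypothetical helper, and the `set ... clear_fault_action=true` lines it prints are the same command used in the clearing step.

```python
def faulted_targets(table_output):
    """Return targets whose fault_state value is anything other than OK."""
    faulted = []
    for line in table_output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        # Data rows have exactly three columns; the header and the dashed
        # separator row do not have 'fault_state' in the Property column.
        if len(cols) == 3 and cols[1] == "fault_state" and cols[2] != "OK":
            faulted.append(cols[0])
    return faulted


# Sample rows in the same layout as the ILOM output above.
sample = """\
Target      | Property     | Value
------------+--------------+--------
/SYS        | fault_state  | OK
/SYS/FAN1   | fault_state  | OK
/SYS/FAN2   | fault_state  | Faulted
/SYS/FAN3   | fault_state  | OK
"""

for target in faulted_targets(sample):
    print("set " + target + " clear_fault_action=true")
```

Only clear a fault this way after the faulty component has actually been replaced or repaired.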

    Conclusion

    In this article we have learned how to identify a fault and clear it manually on an Exadata InfiniBand switch. The ILOM commands come in handy for clearing the fault. You can also clear the fault using the Browser User Interface (BUI).

  • Exadata – Display and Clear Fault using Fault Manager (faultmgmt)

    Exadata Database Machine consists of a storage grid, a compute grid, and a network grid. Each grid, or hardware layer, is built with multiple high-performing, industry-standard Oracle servers to provide hardware and system fault tolerance. The hardware components are still subject to failure; the most common failure on Exadata is a hard disk failure on a storage cell. With the latest generations of Exadata, hardware failures are rare and less troublesome.


    The Exadata storage cells and compute nodes consist of several hardware components, such as:

    • Hard disk
    • Flash disk 
    • Physical Memory
    • Processor  
    • IB ports
    • Mother Board
    • Batteries
    • Power Supply
    • and so on





    In this article we will demonstrate how to view the hardware fault and clear it using ILOM fault manager (faultmgmt).




    Steps to display and clear hardware fault using faultmgmt:




    Step 1: Log in to the ILOM of the compute node where the fault occurred


    [root@dm01db01 ~]# ssh dm01db02-ilom
    Password:


    Oracle(R) Integrated Lights Out Manager


    Version 4.0.0.24 r121523


    Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.


    Warning: HTTPS certificate is set to factory default.


    Hostname: dm01db02-ilom


    Step 2: Check if the fault manager is supported. If you get the output like below then fault manager is supported.

    -> show /SP/faultmgmt/shell


     /SP/faultmgmt/shell
        Targets:


        Properties:


        Commands:
            cd
            show
            start




    Step 3: Start the fault manager shell


    -> start /SP/faultmgmt/shell
    Are you sure you want to start /SP/faultmgmt/shell (y/n)? y


    Step 4: Execute the following command to display the fault. Here we can see that there is no issue with the hardware, but the ILOM file system is 100% full.


    faultmgmtsp> fmadm faulty
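    ------------------- ------------------------------------ -------------- --------
    Time                UUID                                 msgid          Severity
    ------------------- ------------------------------------ -------------- --------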
    2018-06-17/15:55:32 2a854ad2-4a31-e829-e26c-c84ba212d7f2 ILOM-8000-JV   Major


    Problem Status           : open
    Diag Engine              : fdd 1.0
    System
       Manufacturer          : Oracle Corporation
       Name                  : Exadata X5-2
       Part_Number           : Exadata X5-2
       Serial_Number         : AK00XXXXXX


    System Component
       Manufacturer          : Oracle Corporation
       Name                  : ORACLE SERVER X5-2
       Part_Number           : 7090664
       Serial_Number         : 15XXXXXXXX
       Firmware_Manufacturer : Oracle Corporation
       Firmware_Version      : (ILOM)4.0.0.24
       Firmware_Release      : (ILOM)2017.09.23


    ----------------------------------------
    Suspect 1 of 1
       Problem class  : defect.ilom.fs.full
       Certainty      : 100%
       Affects        : /SYS/SP
       Status         : faulted


       FRU
          Status            : faulty
          Location          : /SYS/SP
          Manufacturer      : Oracle Corporation
          Name              : SP
          Part_Number       : PILOT3
          Chassis
             Manufacturer   : Oracle Corporation
             Name           : ORACLE SERVER X5-2
             Part_Number    : 7090664
             Serial_Number  : 1547NM10CX


    Description : An ILOM filesystem has exceeded the filesystem capacity
                  limit.


    Response    : The chassis wide service-required LED will be illuminated.


    Impact      : ILOM commands may fail, especially those which make
                  configuration changes.


    Action      : Please refer to the associated reference document at
                  http://support.oracle.com/msg/ILOM-8000-JV for the latest
                  service procedures and policies regarding this diagnosis.


    Step 5: Execute the below command to clear the fault


    faultmgmtsp> fmadm acquit UUID --> Get the UUID from the output of the above command.


    faultmgmtsp> fmadm acquit 2a854ad2-4a31-e829-e26c-c84ba212d7f2


    Step 6: Verify that the fault is cleared


    faultmgmtsp> fmadm faulty
    No faults found
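When several faults are open, copying each UUID by hand is error-prone. The following Python sketch pulls the UUIDs out of captured `fmadm faulty` text and prints the matching `fmadm acquit` lines. It assumes the UUID appears in the usual 8-4-4-4-12 hexadecimal form, as in the output above; `fault_uuids` is a hypothetical helper.

```python
import re

# Standard 8-4-4-4-12 lowercase hexadecimal UUID, as printed by fmadm faulty.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def fault_uuids(fmadm_output):
    """Return the distinct fault UUIDs in the order they first appear."""
    seen = []
    for uuid in UUID_RE.findall(fmadm_output):
        if uuid not in seen:
            seen.append(uuid)
    return seen


# Summary line copied from the fmadm faulty output above.
sample = "2018-06-17/15:55:32 2a854ad2-4a31-e829-e26c-c84ba212d7f2 ILOM-8000-JV   Major"

for uuid in fault_uuids(sample):
    print("fmadm acquit " + uuid)
```

Review each fault before acquitting it; the script only saves typing and does not decide whether a fault is safe to clear.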


    Step 7: Exit from the fault manager


    faultmgmtsp> exit


    Step 8: Reset the ILOM service processor


    -> reset /SP
    Are you sure you want to reset /SP (y/n)? y
    Performing reset on /SP


    Step 9: Exit from the ILOM


    -> exit
    Connection to dm01db02-ilom closed.


    Step 10: Connect to the ILOM and verify that the ILOM SP has restarted


    [root@dm01db01 ~]# ssh dm01db02-ilom
    Password:


    Oracle(R) Integrated Lights Out Manager


    Version 4.0.0.24 r121523


    Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.


    Warning: HTTPS certificate is set to factory default.


    Hostname: dm01db02-ilom




    -> show -d properties /SP/clock uptime


     /SP/clock
        Properties:
            uptime = 0 days, 00:08:02
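The uptime check above can be automated when several service processors are reset in one session. This is a minimal Python sketch that parses the `uptime = D days, HH:MM:SS` line shown above and compares it with a threshold. The 30-minute threshold and the function name are assumptions chosen for illustration.

```python
import re
from datetime import timedelta


def parse_ilom_uptime(text):
    """Parse 'uptime = D days, HH:MM:SS' from ILOM output into a timedelta."""
    match = re.search(r"uptime\s*=\s*(\d+) days?, (\d+):(\d+):(\d+)", text)
    if match is None:
        return None
    days, hours, minutes, seconds = (int(g) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Line copied from the 'show -d properties /SP/clock uptime' output above.
uptime = parse_ilom_uptime("        uptime = 0 days, 00:08:02")

# Hypothetical threshold: treat an uptime under 30 minutes as a recent reset.
print(uptime < timedelta(minutes=30))  # True
```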




    Conclusion


    In this article we have learned how to display and clear a fault using the fault manager (faultmgmt). The Fault Management Shell is the preferred method for displaying the details of a diagnosed fault. Support for the faultmgmt command shell varies depending on the ILOM release level and server product model.

  • Oracle Database Appliance Pocket Reference Guide

    Here is the link to download Oracle Database Appliance Pocket Reference Guide

    Oracle Database Appliance Pocket Reference Guide

  • Boot Exadata Compute Node With Diagnostic Image Using ILOM Console Remotely

    Introduction



    Exadata compute nodes and storage cells come with an Integrated Lights Out Manager, or simply ILOM. ILOM is an adapter card in each compute node and storage cell that operates independently of the operating system. It boots as soon as power is applied to the server and provides web and SSH access through the management network. Using ILOM you can perform several tasks remotely that would otherwise require physical access to the server, including accessing the remote console, attaching the diag.iso image, powering the server on and off, and rebooting or resetting the server. ILOM also monitors the configuration and the server's internal hardware components.

    In this article I will demonstrate, step by step, how to boot an Exadata compute node from the diagnostic ISO image remotely using the web ILOM.

    Steps to Mount diag.iso On An Exadata Compute Node Using ILOM Console:

    • Copy/Download diag.iso to desktop machine
    You can copy the diag.iso image from a healthy compute node or storage cell to your desktop as shown below. You can also download the image file from My Oracle Support; see MOS note 2001454.1 for details.

    Locate the diag.iso image on a good working server


    Using WinSCP, copy the diag.iso image to the desktop/laptop






    • Connect to the WEB ILOM as shown below
    Open a web browser and enter the hostname or IP address of the ILOM to which you want to attach the diag.iso image

    Enter the root user credentials


    This is the ILOM home page, which gives brief information about the server. The left pane has different options that you can use to manage the server remotely.



    • Launch Remote Console by following the steps below
    On the left pane Expand “Remote Control” and Click on “Redirection”

    Click on “Launch Remote Console” button


    Click Ok


    Click Ok


    Click Continue


    Click Run


    Now we can access the server remotely.



    • Attach the diag.iso to Remote Console as follows
    Click “KVMS” and Click on “Storage”

    Click “Add” Button


    Select the “diag.iso” image file on the local desktop/laptop


    Click Ok



    • Reboot the Server to boot from ISO image as follows
    On the left pane expand “Host Management”, Click on “Host Control” and Select “CDROM” as Next Boot Device and Click Save button

    On the left pane expand “Host Management”, Click on “Power Control” and Select “Power Cycle” and Click Save button


    Click Ok



    • Now the System is booting from diag.iso image

    • Perform the desired action
    Enter ‘e’ to enter into interactive mode or
    Enter ‘r’ to perform a system restore from NFS backup

    At this stage the server has booted from diag.iso. Enter interactive mode, restore or recover the machine, or correct the OS configuration as needed. When you are done, disable redirection.


    Conclusion


    In this article we have learned how to attach diag.iso to an Exadata compute node and boot the node from it. Using ILOM you can perform several tasks remotely that would otherwise require physical access to the server, including accessing the console, powering the server on and off, and rebooting or resetting the server.

  • Exadata: Clear Hardware Fault Post Hardware Replacement

    Introduction


    I was working on a hardware (processor) failure on an Exadata X5-2 compute node. An automatic SR was generated for the hardware failure, and the Oracle Field Engineer contacted us and replaced the faulty hardware. Everything went smoothly up to this point, but we noticed that the fault was not cleared automatically even after the replacement, so we ended up clearing the hardware fault manually.


    In this article I will demonstrate how to clear a hardware (processor) fault manually. The same steps can be used to clear a fault on any type of hardware by substituting the hardware name/path.


    • To identify faulty hardware, execute the following ILOM command:

    [root@dm01db01 ~]# ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted"
    Connected. Use ^D to exit.
    -> show -d properties -level all /SYS/MB fault_state==Faulted
      /SYS/MB/P1
        Properties:
            type = Host Processor
            ipmi_name = MB/P1
            fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
            fru_version = 02
            fru_part_number = 060F
            fault_state = Faulted
            clear_fault_action = (none)




    -> Session closed
    Disconnected


    From the output above we can see that processor P1 (/SYS/MB/P1) is faulty and needs replacement.


    You can also check for hardware failures using Web ILOM



    Steps to Clear a hardware fault post hardware replacement:


    • Identify the hardware fault

    [root@dm01db01 ~]# ipmitool sunoem cli "show -d properties -level all /SYS/MB fault_state==Faulted"
    Connected. Use ^D to exit.
    -> show -d properties -level all /SYS/MB fault_state==Faulted
      /SYS/MB/P1
        Properties:
            type = Host Processor
            ipmi_name = MB/P1
            fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
            fru_version = 02
            fru_part_number = 060F
            fault_state = Faulted
            clear_fault_action = (none)




    -> Session closed
    Disconnected


    • Connect to problematic Compute node ILOM

    [root@dm01db01 ~]# ssh dm01db01-ilom
    The authenticity of host 'dm01db01-ilom (10.10.10.11)' can't be established.
    RSA key fingerprint is 52:45:af:c4:08:29:c4:6a:15:d9:5f:6d:14:cb:23:b1.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'dm01db01-ilom,10.10.10.11' (RSA) to the list of known hosts.
    Password:


    Oracle(R) Integrated Lights Out Manager


    Version 3.2.8.24 r114580


    Copyright (c) 2016, Oracle and/or its affiliates. All rights reserved.


    Warning: HTTPS certificate is set to factory default.


    Hostname: dm01db01-ilom


    -> show -d properties -level all /SYS/MB fault_state==Faulted
      /SYS/MB/P1
        Properties:
            type = Host Processor
            ipmi_name = MB/P1
            fru_name = Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
            fru_version = 02
            fru_part_number = 060F
            fault_state = Faulted
            clear_fault_action = (none)


    • Execute the following command to clear the fault

    -> set /SYS/MB/P1 clear_fault_action=true
    Are you sure you want to clear /SYS/MB/P1 (y/n)? y
    Set 'clear_fault_action' to 'true'


    • Verify the fault is cleared

    -> show -d properties -level all /SYS/MB fault_state==Faulted
    show: Query found no matches.


    No Faulty hardware found.


    Verify from Web ILOM



    • Exit from ILOM

    -> exit
    Connection to dm01db01-ilom closed.
    [root@dm01db01 ~]#
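The identification step in this procedure can be scripted so that the faulted path does not have to be copied by hand. This is a minimal Python sketch, assuming the indented `show -d properties` layout shown above, where each target path starting with /SYS is followed by its properties; `faulted_components` is a hypothetical helper.

```python
def faulted_components(show_output):
    """Return the /SYS targets whose fault_state property is Faulted."""
    faulted = []
    current = None
    for line in show_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("/SYS"):
            # A new target block begins; remember its path.
            current = stripped
        elif stripped.replace(" ", "") == "fault_state=Faulted" and current:
            faulted.append(current)
    return faulted


# Sample text in the same layout as the ILOM output above.
sample = """\
-> show -d properties -level all /SYS/MB fault_state==Faulted
  /SYS/MB/P1
    Properties:
        type = Host Processor
        ipmi_name = MB/P1
        fault_state = Faulted
        clear_fault_action = (none)
"""

for target in faulted_components(sample):
    print("set " + target + " clear_fault_action=true")
```

The printed lines match the `set` command used in the clearing step and still need to be run in the ILOM CLI, where ILOM asks for confirmation.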


    Conclusion


    In this article we have learned how to identify a hardware fault and clear it manually after the hardware has been replaced.