X7 TOI – Vail HBA
Engineered Systems / X86
February 16, 2016
Paul Lodrige
Engineered Systems / X86 Systems Quality Group
Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Confidential – Oracle Restricted
Agenda
• What is Vail
• What’s new for Vail
• Vail / HDD issue triage and debugging for the field
• References
• Case Study – LIVE!
• Vail – in depth – if interest and time exist!
What is Vail
• Based on the Gen 3 LSI SAS 9361-16i controller
• Utilizes the Gen 3 LSI 3316 IO controller chip
• Supports x8 PCIe 3.0 at 8 GT/s per lane (same as before)
• Supports 16 individual SAS ports operating at 12 Gb/s
• Backward compatible with previous PCIe and SAS generations 1 and 2
• SAS data transfer rates of 12, 6, and 3 Gb/s per lane
• On-board ESM (SuperCap), no PM required
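As a rough back-of-the-envelope check on the figures above, the sketch below compares the raw host-side and drive-side bandwidth, taking the PCIe 3.0 lane rate as 8 GT/s. The line-coding factors (128b/130b for PCIe 3.0, 8b/10b for 12 Gb/s SAS) are assumptions added here and are not stated on the slide; protocol overhead is ignored, so real throughput will be lower.

```python
# Rough link-bandwidth arithmetic for the figures on the slide.
# Assumptions (not from the slide): PCIe 3.0 uses 128b/130b line coding,
# 12 Gb/s SAS uses 8b/10b line coding, and protocol overhead is ignored.

def usable_gbytes_per_s(rate_gbit, encoding_efficiency, lanes):
    """Aggregate payload bandwidth in GB/s for `lanes` links."""
    return rate_gbit * encoding_efficiency * lanes / 8.0

# Host side: x8 PCIe 3.0 at 8 GT/s per lane.
host_side = usable_gbytes_per_s(8.0, 128 / 130, 8)

# Drive side: 16 SAS ports at 12 Gb/s each.
drive_side = usable_gbytes_per_s(12.0, 8 / 10, 16)

print(f"PCIe 3.0 x8   : {host_side:.2f} GB/s")
print(f"16 x SAS 12G  : {drive_side:.2f} GB/s")
```

Under these assumptions the 16 SAS ports can collectively carry more than the x8 PCIe link, so the host interface is the likelier ceiling when all ports stream at full rate.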
What’s new for Vail ? ( continued )
Per LSI / Broadcom: “There are not any significant changes from a debug / triage perspective between Aspen and Vail. The two main differences between these products are 1) Vail adds 8 more SAS ports (alleviating the need for expanders) 2) FW jumps from MR 6.3 to MR 6.13. So there's a whole lot of defects that have been fixed which should provide better reliability for the Vail Product.”
--- (pf) And numerous RFEs have been implemented!
What’s new for Vail ? ( continued )
• FW logs are now persistent across reboots and power cycles. This improves diagnosis by showing the activity leading up to fatal fw faults.
• Reduced spam of repeating messages in the fw log file. This improves readability of fw logs.
• Controller will go into Write Thru mode upon a drive failure when the HBA has pinned cache.
• Support for different I/O request sizes, up to 1MB per request.
• Improved error handling. Previously, many faults would immediately stop fw, forcing a power cycle. The controller is now reset via OCR, allowing recovery from numerous faults.
• HDD fw upgrades can now be done on multiple HDDs at a time rather than sequentially, significantly decreasing overall time for upgrades.
What’s new for Vail ? ( continued )
• Improved SuperCap monitoring. Log additional parameters specific to SC behavior.
• StorCLI adds sanitize to crypto Erase function.
• Additional events being logged to help distinguish cable vs HDD errors.
• Power throttling implemented to help with high temperature situations.
Vail I/O Architecture – same as before
Vail Hints, Tips and References
Sample Storcli Commands
• Controller termlogs – storcli /c0 show termlog
• Controller configuration – storcli /c0/v0 show all
• Events (NEW!) – /opt/MegaRAID/storcli/storcli64 /c0/eall/sall show errorcounters

Description = Show Drive/Cable Error Counters Succeeded.

Drive      Error counter for Drive  Error counter for Slot
/c0/e8/s0  0                        0
/c0/e8/s1  0                        0
/c0/e8/s2  0                        0
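For triage across many drives, the error-counter listing above can be scanned for non-zero values. The following is a minimal sketch that works on text in the layout shown on the slide (drive path followed by two counters); the helper names are hypothetical, and real storcli output may carry extra headers or separators that this pattern simply skips.

```python
import re

# Sample rows in the layout shown on the slide. This is illustrative text,
# not captured output; real storcli output may differ in detail.
SAMPLE = """\
Drive      Error counter for Drive  Error counter for Slot
/c0/e8/s0  0                        0
/c0/e8/s1  0                        0
/c0/e8/s2  0                        0
"""

# A data row: /cX/eY/sZ followed by two integer counters.
ROW = re.compile(r"^(/c\d+/e\d+/s\d+)\s+(\d+)\s+(\d+)$")

def parse_error_counters(text):
    """Return {drive_path: (drive_errors, slot_errors)} for each data row."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counters[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return counters

def drives_with_errors(counters):
    """List drive paths whose drive or slot counter is non-zero."""
    return sorted(path for path, (drive, slot) in counters.items() if drive or slot)

parsed = parse_error_counters(SAMPLE)
print(parsed)
print(drives_with_errors(parsed))
```

Header and separator lines are ignored because they do not match the row pattern, so the same function can be fed the whole command output.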
SAS 1/2/3 LogInfo Decoding
[Snippet of FW termlog]
10/05/13 11:14:33: isForeignCfgComplete: MR_CFG - totAr: 0x1, totLd: 0x1, totSpare: 0x0
10/05/13 11:27:51: Disabling UART for 120s due to IDR on devH c
10/05/13 11:27:51: iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
10/05/13 11:27:51: DM_HandleDevStatusChgEvent: devHandle=x000c SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x31110d00 IOCStatus x8000 ReasonCode x08 - INTERNAL_DEVICE_RESET
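When scanning a long termlog, it can help to pull the interesting fields out of each DM_HandleDevStatusChgEvent line. The sketch below does this with a regular expression built only from the one sample line in the snippet above; the function name and dictionary keys are hypothetical, and other firmware versions may format the line differently.

```python
import re

# The device-status-change line from the termlog snippet.
LINE = (
    "10/05/13 11:27:51: DM_HandleDevStatusChgEvent: devHandle=x000c "
    "SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 "
    "IOCLogInfo x31110d00 IOCStatus x8000 ReasonCode x08 - INTERNAL_DEVICE_RESET"
)

# Pattern inferred from the single sample line above (an assumption).
PATTERN = re.compile(
    r"devHandle=x(?P<dev_handle>[0-9a-fA-F]+)\s+"
    r"SASAdd=(?P<sas_addr>[0-9a-fA-F]+).*?"
    r"IOCLogInfo\s+x(?P<log_info>[0-9a-fA-F]+)\s+"
    r"IOCStatus\s+x(?P<ioc_status>[0-9a-fA-F]+)\s+"
    r"ReasonCode\s+x(?P<reason_code>[0-9a-fA-F]+)"
)

def parse_dev_status_change(line):
    """Return the hex fields of a status-change line as a dict, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "dev_handle": int(match.group("dev_handle"), 16),
        "sas_addr": match.group("sas_addr"),
        "log_info": int(match.group("log_info"), 16),
        "ioc_status": int(match.group("ioc_status"), 16),
        "reason_code": int(match.group("reason_code"), 16),
    }

event = parse_dev_status_change(LINE)
print(event)
```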
The IOCLogInfo field of the Reply message includes the following subfields.
• [31:28] – MPI2_IOCLOGINFO_TYPE_SAS (3)
• [27:24] – IOC_LOGINFO_ORIGINATOR: 0 = IOP, 1 = PL, 2 = IR
• [23:16] – LOGINFO_CODE
• [15:0] – LOGINFO_CODE Specific

IOCLogInfo 0x31110d00 splits into 3 | 1 | 11 | 0d00:
3     ; MPI2_IOCLOGINFO_TYPE_SAS
1     ; error generated by the controller PL (protocol layer), the layer below FW, i.e. on the chip
11    ; LogInfo Code = PL_LOGINFO_CODE_RESET
0d00  ; subcode = PL_LOGINFO_SUB_CODE_SATA_LINK_DOWN (SATA direct-attached link went down)
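The bit layout above can be applied mechanically. This is a minimal sketch that splits an IOCLogInfo value into the four subfields listed on the slide; the function name is hypothetical, and mapping the code and subcode numbers to symbolic names still requires the SAS error code references.

```python
# Originator values as listed on the slide.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}

def decode_ioc_loginfo(value):
    """Split a 32-bit IOCLogInfo value into its four subfields."""
    return {
        "type": (value >> 28) & 0xF,                                   # bits [31:28]
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),  # bits [27:24]
        "code": (value >> 16) & 0xFF,                                  # bits [23:16]
        "subcode": value & 0xFFFF,                                     # bits [15:0]
    }

fields = decode_ioc_loginfo(0x31110D00)
print(
    f"type={fields['type']} originator={fields['originator']} "
    f"code=0x{fields['code']:02x} subcode=0x{fields['subcode']:04x}"
)
# prints: type=3 originator=PL code=0x11 subcode=0x0d00
```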
Vail Reference Guides
You can find these materials in Paul’s Useful Stuff workspace
• MRDiag – MegaRaid Diagnostic Tool Users Guide
• STORCLI – 12Gb/s MegaRAID® SAS Software Users Guide
• Drivers – MegaRaid SAS Device Driver Users Guide
• Fusion MPT – Fusion-MPT™ 2.5 Message Passing Interface (MPI) Spec. Guide
• Firmware Guide – LSISAS MegaRAID® Firmware Functional Specification Guide
• SAS 3 Error Codes – SAS Generation 3 Error Codes Systems Engineering Note
• SuperCap Events – SuperCap Events doc
Vail – What’s missing?
• parser.sh – needs SAS3 & HDD specific updates
• Case study
• Send me your comments!