The Analysis of Drive Issues

From Unraid | Docs
Revision as of 23:22, 27 March 2009 by RobJ (talk | contribs) (→‎Firmware upgrades: add link (unbricking))


page under construction

Many drive-related errors look similar but point to very different problems. Some experience in analyzing them is therefore recommended, because the steps to resolve a problem depend heavily on its real cause. For example, some errors point to a bad cable, very similar messages point to a failing drive, and others may point to a bad disk controller, a poor or incorrect driver, or bad or insufficient power to the drive. Too many drives have been thrown out or returned through an RMA process when the problem was just a bad cable. This page is designed to help with the analysis of drive problems, and where possible to recommend the next steps to take.

The two most important tools for initial analysis are the syslog and the SMART report.


Drive problems by keyword


  • 10B8B
    • "10 bit to 8 bit" error flag
    • see the Drive interface issues below
  • ATA bus error
    • see Drive interface issue #2 below
  • BadCRC
    • often indicates a bad cable
    • check each of the Drive interface issues below
  • Directory bread
    • see Unexpected loss of removable drive below
  • DisPar
    • see the Drive interface issues below
  • DRDY
    • Drive ReaDY flag, not a problem so ignore it
  • frozen
    • means the exception handler is 'frozen' while dealing with the error; uninformative so just ignore it
  • Handshk
    • Handshake error flag
    • see the Drive interface issues below
  • HostInt
    • Host interface error flag
    • see the Drive interface issues below
  • IDNF
    • sector ID Not Found error flag
    • see ...
  • interface fatal error
    • see the Drive interface issues below
    • see ...
  • media error
    • generally indicates a bad sector, but should be confirmed by an increase in the reallocated and/or current pending sector counts on the SMART report
    • see Drive media issue #1
  • timeout
    • see ...
  • UNC
    • UNCorrectable media error flag
    • see ...
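
The keyword list above can be turned into a small lookup helper. The following is only a minimal sketch in Python: the hints are copied from the list, entries that the list leaves open ("see ...") carry only their description, and the names KEYWORD_HINTS and explain_line are made up for this illustration.

```python
import re

# Keyword -> hint, copied from the keyword list on this page.
# Keywords for which the page gives no description yet are left out.
KEYWORD_HINTS = {
    "10B8B": '"10 bit to 8 bit" error flag; see the Drive interface issues',
    "BadCRC": "often indicates a bad cable; check the Drive interface issues",
    "DisPar": "see the Drive interface issues",
    "DRDY": "Drive ReaDY flag; not a problem",
    "frozen": "exception handler state; uninformative",
    "Handshk": "Handshake error flag; see the Drive interface issues",
    "HostInt": "Host interface error flag; see the Drive interface issues",
    "IDNF": "sector ID Not Found error flag",
    "interface fatal error": "see the Drive interface issues",
    "media error": "likely a bad sector; confirm with the SMART report",
    "UNC": "UNCorrectable media error flag",
}


def explain_line(line):
    """Return (keyword, hint) pairs for every known keyword found in a syslog line."""
    found = []
    for keyword, hint in KEYWORD_HINTS.items():
        # Match whole tokens only (case-insensitive), so that "UNC" is not
        # reported for a word such as "UnrecovData".
        pattern = r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9])"
        if re.search(pattern, line, re.IGNORECASE):
            found.append((keyword, hint))
    return found
```

For example, explain_line("ata3: SError: { 10B8B BadCRC }") reports the 10B8B and BadCRC entries. A match is only a pointer to the right section of this page, not a diagnosis.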



Drive problems by error message


There are many kinds of drive errors. Examine each section below for
the highlighted words that most closely match the errors you have.



Unexpected loss of removable drive

I get a lot of messages like the following in the syslog...
What are they and should I be concerned?
---
Mar 10 14:59:10 Tower kernel: FAT: Directory bread(block 510) failed
Mar 10 14:59:10 Tower kernel: FAT: Directory bread(block 511) failed

Usually when those errors appear, the system has lost contact with the flash drive.

  • It could be the USB port (loose or faulty)
    • Try re-seating the flash drive
    • Try connecting to a different USB port
  • It could be the flash drive is going bad
    • Test it on another machine
  • It could be a shared IRQ has been disabled, one that serviced this USB port
    • Check the syslog for evidence related to its IRQ
  • more to be added, as discovered


You will have to power off to get the system back. Most likely it will then want to start a parity check, because it could not update the flash drive with a proper shutdown. Settings changes won't be saved either until the flash drive is accessible again.


Drive interface issue #1

   ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
   ata3.00: irq_stat 0x08000000, interface fatal error
   ata3: SError: { 10B8B BadCRC }   often also DisPar and UnrecovData and HostInt

Your machine seems to be suffering a genuine link layer problem. In most cases, this indicates a hardware problem, and in my experience the common causes are (in order of ballpark frequency)...

  1. inadequate power supply
  2. device and controller don't like each other on 3Gbps
  3. cable too long or flaky connector (especially with eSATA cables or genders or backplanes)
  4. faulty controller or drive

-- tejun


Drive interface issue #2

   res 40/00:00:48:19:67/00:00:1e:00:00/40 Emask 0x50 (ATA bus error)
   ata3: SError: { UnrecovData HostInt 10B8B BadCRC }

These are usually related to a bad cable or connector.


Drive interface issue #3

   ata2.00: exception Emask 0x10 SAct 0x7ff4f SErr 0x400100 action 0x6 frozen
   ata2.00: irq_stat 0x08000000, interface fatal error
   ata2: SError: { UnrecovData Handshk }

This is a transmission error. The most common causes are power related or an unreliable connection, especially if backplanes are involved. Is the problem still reproducible? If so, can you please try moving it to a different power connector and SATA port and see what changes? -- tejun
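
The SErr value on the "exception" line and the flag names on the "SError" line appear to describe the same thing. The sketch below (Python) decodes the hex value into flag names. The bit positions are an assumption taken from the SATA SError register layout, not from this page; they do agree with the excerpts above (0x280900 and 0x400100). Only flags named on this page are included, and the names SERR_BITS and decode_serr are invented for the example.

```python
import re

# Assumed bit positions (SATA SError register); only flags named on this page.
# Any other bits that are set are silently ignored by this sketch.
SERR_BITS = {
    8: "UnrecovData",
    11: "HostInt",
    19: "10B8B",
    20: "DisPar",
    21: "BadCRC",
    22: "Handshk",
}


def decode_serr(line):
    """Return the known flag names set in the SErr value of a log line.

    Returns an empty list when the line carries no "SErr 0x..." field.
    """
    match = re.search(r"SErr (0x[0-9a-fA-F]+)", line)
    if match is None:
        return []
    value = int(match.group(1), 16)
    # Walk the table in ascending bit order and keep the flags that are set.
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]
```

Applied to the exception line of Drive interface issue #3, this returns UnrecovData and Handshk, the same flags as on its SError line.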


Drive media issue #1

(media error) and UNC

These indicate bad sectors, which need to be confirmed by the SMART report ...
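
One way to confirm such an increase is to compare two saved SMART reports. This is a rough sketch in Python, assuming the reports are attribute tables in the format printed by smartctl -A (smartmontools), where the attribute name is the second column and the raw value the tenth. The attribute names Reallocated_Sector_Ct and Current_Pending_Sector are the smartctl names assumed to correspond to the counts meant here; the function names are made up.

```python
# Attributes to watch (assumed smartctl names for the reallocated and
# current pending sector counts).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def raw_values(report):
    """Extract the raw values of the watched attributes from a SMART report."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the raw value is the tenth.
        if len(fields) >= 10 and fields[1] in WATCHED and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def increases(before, after):
    """Return {attribute: increase} for watched attributes that went up."""
    old, new = raw_values(before), raw_values(after)
    return {
        name: new[name] - old[name]
        for name in WATCHED
        if name in old and name in new and new[name] > old[name]
    }
```

An empty result means neither count rose between the two reports, which makes a bad sector a less likely explanation for the error.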




Firmware upgrades

Warning!  highly disorganized and overlapping information below, copied from Internet sources

Seagate #1

"There are a few drives which are currently marked to disable NCQ and warn the user that the firmware that should be upgraded:"

  • ST31500341AS
  • ST31000333AS
  • ST3640623AS
  • ST3640323AS
  • ST3320813AS
  • ST3320613AS
  • all for firmware versions SD15 through SD19.

Firmware Update for ST31500341AS, ST31000333AS, ST3640323AS, ST3640623AS, ST3320613AS, ST3320813AS, ST3160813AS

Firmware Update for STM31000334AS, STM3640323AS, STM3320614AS, STM3160813AS

Firmware Update for ST3500320AS, ST3500620AS, ST3500820AS, ST3640330AS, ST3640530AS, ST3750330AS, ST3750630AS, ST31000340AS

Firmware Update for ST3250310NS, ST3500320NS, ST3750330NS, ST31000340NS

Seagate #2

Tech sites everywhere are reporting a massive flaw in Seagate drives that can lock up the drive and make it unusable (the BIOS doesn't detect it, and you can't read the data). I haven't read anything about it here on the lists. Seagate has acknowledged the problem:

So, apparently there are a lot of drives on the market (including mine) that can die any day. Are those drives going to be blacklisted? It's still not clear whether the firmware update is safe (some affected but working drives are dying after the firmware update), so some people, like me, are still waiting for more stable firmware updates (and hoping that the drive doesn't die)...

Here is the list of drives+firmware affected, according to the support site as of now. Some models are still being diagnosed.


Seagate Barracuda 7200.11 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951)

Models Affected:

  • ST3500320AS
  • ST3640330AS
  • ST3750330AS
  • ST31000340AS

Firmware Affected

  • SD15, SD16, SD17, SD18, SD19, AD14

Recommended Firmware Update

  • SD1A

Seagate Barracuda 7200.11, page 2 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957)

Models Affected:

  • ST31500341AS
  • ST31000333AS
  • ST3640323AS
  • ST3640623AS
  • ST3320613AS
  • ST3320813AS
  • ST3160813AS

Firmware Affected

  • Still Unknown

Recommended Firmware Update

  • Still Unknown

Seagate Barracuda ES.2 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207963)

Models Affected:

  • ST3250310NS
  • ST3500320NS
  • ST3750330NS
  • ST31000340NS

Firmware Affected

  • Still Unknown

Recommended Firmware Update

  • Still Unknown

DiamondMax 22 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207969)

Models Affected:

  • STM3500320AS
  • STM3750330AS
  • STM31000340AS

Firmware Affected

  • MX15 (or higher)

Recommended Firmware Update

  • MX1A

DiamondMax 22 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207975)

Models Affected:

  • STM31000334AS
  • STM3320614AS
  • STM3160813AS

Firmware Affected

  • Still Unknown

Recommended Firmware Update

  • Still Unknown
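
The Seagate #2 lists above can be checked mechanically against a drive's model and firmware strings. The sketch below (Python) only encodes what those lists say at the time of writing: where the page gives "Still Unknown" or the inexact "MX15 (or higher)", the affected firmware is stored as None and the drive is flagged for a manual check. FIRMWARE_TABLE and check_drive are names invented for this example.

```python
# Each entry: (models, exact affected firmware versions or None, recommended update or None).
# None means this page gives no exact list ("Still Unknown" or "MX15 (or higher)").
FIRMWARE_TABLE = [
    ({"ST3500320AS", "ST3640330AS", "ST3750330AS", "ST31000340AS"},
     {"SD15", "SD16", "SD17", "SD18", "SD19", "AD14"}, "SD1A"),
    ({"ST31500341AS", "ST31000333AS", "ST3640323AS", "ST3640623AS",
      "ST3320613AS", "ST3320813AS", "ST3160813AS"}, None, None),
    ({"ST3250310NS", "ST3500320NS", "ST3750330NS", "ST31000340NS"}, None, None),
    ({"STM3500320AS", "STM3750330AS", "STM31000340AS"}, None, "MX1A"),
    ({"STM31000334AS", "STM3320614AS", "STM3160813AS"}, None, None),
]


def check_drive(model, firmware):
    """Classify a drive against the lists copied from this page."""
    for models, affected, recommended in FIRMWARE_TABLE:
        if model in models:
            if affected is None:
                return "listed model, affected firmware not enumerated: check manually"
            if firmware in affected:
                return "affected, recommended update: " + recommended
            return "listed model, firmware not in the affected list"
    return "model not listed"
```

For example, check_drive("ST3500320AS", "SD15") reports the drive as affected with SD1A as the recommended update. The result is only as current as the lists above, several of which were still incomplete.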