The Analysis of Drive Issues

From Unraid | Docs
  '''Page under construction, but hopefully is starting to be helpful'''
  
: '''Preface'''

: There are many kinds of drive-related errors, and many of them look very similar yet point to very different issues.  Some experience in analyzing these errors is necessary, because the steps to resolve a problem depend heavily on what the '''real''' problem is.  For example, some errors point to a bad cable, other very similar messages point to a failing drive, and still others may point to a bad disk controller, a poor or incorrect driver, or bad or insufficient power to the drive.  <font color=blue>'''There have been far too many cases of drives thrown out or returned for RMA replacement, when the problem was just a bad cable!'''</font>  This page is designed to help with the analysis of drive problems and, where possible, to recommend what steps to take.

: Consider the many components between a program that wants to read or save data and the physical drive surface that actually stores it.  Your program issues a 'read block of data' request, but instead of the data it receives an error code or message, called a 'read error'.  Assuming the program itself is running fine, which other components, software or hardware, handled the request and might be reporting the error?
:* the file system (the read request goes to it first)
:* the OS I/O caching system (any I/O may go through the various system buffers and caches)
:* the drive controller driver
:* the PCI interface to the controller (includes the actual transfer of packets from the buses to the controller)
:* the drive controller firmware (the software on the controller that actually manages the data flow)
:* the controller port connector
:* the SATA cable connector
:* the SATA cable (could be a backplane connection)
:* the SATA cable connector to the drive
:* the drive's SATA connector
:* the low level SATA link part of the drive firmware
:* the higher level data managing part of the drive firmware
:* the drive's caching systems
:* the drive's mechanical data storage systems (platters and platter spinning, heads and head movement, seeking, sector identification, reading, and writing)
:* the actual drive surface
: Power issues must be added to the list above: power to the system, power to the controller, and power to the drive.  Any of these can cause unusual issues that are hard to diagnose.

: And that's not all of them!  More components could be inserted between some of the items above, and a number of the items could be broken down into further subsystems.

: The point that should really stand out is that the physical drive is only a small part of the whole data handling path!  To be fair, possible issues are not evenly distributed across the list.  Bad sector errors are much more likely than PCI system errors, and bad SATA cable errors are also much more likely than many of the others.  But the big point is that when you get a read or write error, it often has NOTHING to do with the drive!  That's why I divide the errors into two classes: interface issues and drive issues.  Some errors actually involve the physical drive; the rest involve the interface to the drive.  Understanding that distinction is important to diagnosing drive errors.

: So just to repeat the lesson:  <font color=blue>'''There have been far too many drives discarded, when there was nothing wrong with them!'''</font>
<br>

== Analytical Tools ==

: '''The two most important tools for initial analysis of drive issues are the syslog and the SMART report.'''

: For unRAID v6 and above, you will find both in the Diagnostics zip file.  Please see [http://lime-technology.com/forum/index.php?topic=39257 Need help? Read me first!].

: For older unRAID versions (any v4 or v5):
:* Getting the syslog
:** Using [[UnRAID_Add_Ons#UnMENU|UnMENU]], go to the Syslog plugin, then view and download the syslog from there
:** [[Troubleshooting#Capturing_your_syslog|Capturing your syslog]]
:** [[Viewing the System Log]]
:** As of unRAID version 4.5beta2, you can read the syslog directly in a browser at 'http://tower/log/syslog' (substitute your server name for 'tower')
:* Getting the SMART report
:** [[Troubleshooting#Obtaining_a_SMART_report|Troubleshooting page, Obtaining a SMART report]] section

<br>
==Drive problems by keyword==
 
<br>
These words or phrases are found in the syslog, in the lines that include and follow the phrase "exception Emask".  Some are error flags, some are not.  Look up the words you see in the list below to help determine what kind of issue you have and what steps to take.  Drive issues are primarily either drive interface issues (the disk controller, drivers, cables, and power to the drive) or physical issues with the drive itself.  In general, if it looks like an interface issue, then the drive is almost certainly completely fine, no matter how many errors you see, and even if it has been disabled!
<br><br>

* 10B8B
** "10 bit to 8 bit" decode error flag
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.231|Drive interface issue #1]], [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.232|Drive interface issue #2]], and [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]]
* AMNF
** data Address Mark Not Found error flag
** serious: a physical drive issue that may indicate a failing drive
** the sector ID was found, but the start of the data could not be found, so the data for the sector is lost
** see ... ''(need an example)''
* ATA bus error
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.232|Drive interface issue #2]] and [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]]
* BadCRC
** usually indicates a bad cable
** check each of the [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issues]] below, but most likely it is [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.232|Drive interface issue #2]]
* Directory bread
** see [[The_Analysis_of_Drive_Issues#Unexpected_loss_of_removable_drive|Unexpected loss of removable drive]]
* DisPar
** disparity error flag (appears as ''Dispar'' in the syslog)
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.231|Drive interface issue #1]] and [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]]
* DRDY
** Drive ReaDY status flag; not a problem, so ignore it
* failed to IDENTIFY
** bad: the drive is not able to identify itself
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]], possibly others too
* failed to recover
** bad: no communication even after resetting the drive
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]], possibly others too
* frozen
** means the exception handler is 'frozen' while dealing with the error; uninformative, so just ignore it
* Handshk
** Handshake error flag
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.231|Drive interface issue #1]], but could be others too
* hard resetting link
** not an error; a very common message indicating that the error handler is trying to reset the channel and attached drive(s) in order to resume normal communications
* HostInt
** Host internal error flag (the host controller reported an internal error)
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.231|Drive interface issue #1]] and [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.232|Drive interface issue #2]], but could be others too
* HSM violation
** invalid 'Host State Machine' state or response, "STATUS value doesn't match HSM requirement"
** this error could be caused by almost anything, such as a buggy driver, a faulty device (buggy or crashed firmware on the drive), buggy firmware on the disk controller, and/or a bad SATA cable
** invariably, this error is ultimately fixed by an upgrade somewhere, to the driver or to one of the firmwares; unfortunately an upgrade may not yet be available, so a downgrade may be necessary instead (or live with it!)
* ICRC
** interface CRC error
** usually indicates a bad cable
** check each of the [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issues]] below, but most likely it is [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.232|Drive interface issue #2]]
* IDNF
** sector ID Not Found error flag
** serious: a physical drive issue that may indicate a failing drive
** since the sector ID could not be found, the sector cannot be located, and the data for the sector is lost
** see ... ''(need an example)''
* interface fatal error
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.231|Drive interface issue #1]] and [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.233|Drive interface issue #3]]
* LinkSeq
** link sequence error flag
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]], possibly others too
* media error
** generally indicates a bad sector, but should be confirmed by an increase in the reallocated sector and/or current pending sector counts on the SMART report
** see [[The_Analysis_of_Drive_Issues#Drive_media_issue_.231|Drive media issue #1]]
* PHYRdyChg
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]], possibly others too
* qc timeout
** unsure, but not good; one of a number of timeouts
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]], possibly others too
* revalidation failed
** not good
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.234|Drive interface issue #4]], possibly others too
* soft resetting link
** not an error; a very common message indicating that the error handler is trying to reset the channel and attached drive(s) in order to resume normal communications
* timeout
** may be associated with almost any error, but is usually associated with [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issues]]
* UNC
** UNCorrectable media error flag, usually associated with a bad sector
** see [[The_Analysis_of_Drive_Issues#Drive_media_issue_.231|Drive media issue #1]]
* UnrecovData
** usually a [[The_Analysis_of_Drive_Issues#Drive_Interface_Issues|Drive Interface Issue]]
** see [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.231|Drive interface issue #1]], [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.232|Drive interface issue #2]], and [[The_Analysis_of_Drive_Issues#Drive_interface_issue_.233|Drive interface issue #3]], possibly others too
 
<br><br>
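As a rough illustration of the keyword lookup described above, the following Python sketch scans syslog lines and groups the keywords it finds into "interface", "drive", "removable", "other", and "info" (status messages that are not errors). The category assigned to each keyword is a simplification of the list on this page, and the names <code>KEYWORDS</code> and <code>classify</code> are made up for this example; it is a first-pass triage aid, not a diagnosis. The generic keyword "timeout" is left out because it would also match every "qc timeout" line.

```python
import re

# Keyword -> rough category, simplified from the keyword list on this page.
# "info" marks status messages that are not errors in themselves.
KEYWORDS = {
    "10B8B": "interface", "AMNF": "drive", "ATA bus error": "interface",
    "BadCRC": "interface", "Directory bread": "removable",
    "Dispar": "interface", "DRDY": "info",
    "failed to IDENTIFY": "interface", "failed to recover": "interface",
    "frozen": "info", "Handshk": "interface", "hard resetting link": "info",
    "HostInt": "interface", "HSM violation": "other", "ICRC": "interface",
    "IDNF": "drive", "interface fatal error": "interface",
    "LinkSeq": "interface", "media error": "drive",
    "PHYRdyChg": "interface", "qc timeout": "interface",
    "revalidation failed": "interface", "soft resetting link": "info",
    "UNC": "drive", "UnrecovData": "interface",
}

# Match each keyword only as a whole token, ignoring case
# (the syslog prints "Dispar", this page writes "DisPar").
PATTERNS = {
    keyword: re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    for keyword in KEYWORDS
}


def classify(lines):
    """Return {category: set of keywords seen} for the given syslog lines."""
    found = {}
    for line in lines:
        for keyword, pattern in PATTERNS.items():
            if pattern.search(line):
                found.setdefault(KEYWORDS[keyword], set()).add(keyword)
    return found


if __name__ == "__main__":
    sample = [
        "ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen",
        "ata3.00: irq_stat 0x08000000, interface fatal error",
        "ata3: SError: { 10B8B BadCRC }",
    ]
    for category, keywords in sorted(classify(sample).items()):
        print(category, sorted(keywords))
```

If most of the keywords found fall under "interface", start with the cable, connector, and power checks before suspecting the drive.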
 
<br><br>
==Drive problems by error message==
There are many kinds of drive errors.  Examine each section below for the '''highlighted key words''' that most closely match the errors you see in your syslog.

The examples below often include the ATA channel number for a particular drive.  The actual numbers are not important and will differ for each drive.  The channel itself is usually something like ''ata3'' or ''ata12''; the attached drive will be something like ''ata2.00'' or ''ata13.01''.  Most will end in ''.00'', as there is only one drive per SATA channel, but IDE or IDE-emulating channels may have two (e.g. ata4.00 and ata4.01, master and slave), and port multipliers and SAS channels can have even more, such as ata5.00, ata5.01, ata5.02, ata5.03, and ata5.04.  For more information on these ata drive symbols, see [[Drive Symbols]].
<br><br>
-----
===Drive Interface Issues===

These are problems with the cables and connections to the drive, both power and data, or with the quality of the power supplied.  If your errors match one of these, then almost certainly, '''your drive is completely fine'''.  Many drives have been returned or thrown out after numerous errors like the ones below, when the fault lay entirely with the cables, power, or connectors used, NOT the drive itself.

====Drive interface issue #1====

An example:
 ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
 ata3.00: irq_stat 0x08000000, '''interface fatal error'''
 ata3: SError: { '''10B8B BadCRC''' }

The SError line often also includes '''DisPar''', '''UnrecovData''', and/or '''HostInt'''.

From an expert:
 "Your machine seems to be suffering genuine link layer problem.
 In most cases, this indicates hardware problem and in my experience,
 common causes are (in the order of ballpark frequency)...
 # inadequate power supply
 # device and controller don't like each other on 3Gbps
 # cable too long or flaky connector (especially with eSATA cables or genders or backplanes)
 # faulty controller or drive"
 --
 tejun  (http://lkml.org/lkml/2008/12/2/426)
 ''(written by one of the foremost experts)''

The presence of '''BadCRC''' is a pretty good indicator of a poor quality SATA cable.  However, if a better cable does not solve the issue, then it is probably a power problem (loose power cable or backplane connection, poor connectors, poor power splitter, overloaded power supply, too many drives on one power rail, bad power supply, etc).
<br><br>
-----
====Drive interface issue #2====
  
An example:
      res 40/00:00:48:19:67/00:00:1e:00:00/40 Emask 0x50 '''(ATA bus error)'''
 ata3: SError: { UnrecovData HostInt 10B8B '''BadCRC''' }

These errors are usually related to a bad cable or cable connector, or possibly bad power.  The presence of '''BadCRC''' or '''ICRC''' is a pretty good indicator of a poor quality SATA cable.  However, if a better cable does not solve the issue, then it is probably a power problem (loose power cable or backplane connection, poor connectors, poor power splitter, overloaded power supply, too many drives on one power rail, bad power supply, etc).
<br><br>
-----
====Drive interface issue #3====

An example:
 ata2.00: exception Emask 0x10 SAct 0x7ff4f SErr 0x400100 action 0x6 frozen
 ata2.00: irq_stat 0x08000000, '''interface fatal error'''
 ata2: SError: { '''UnrecovData Handshk''' }

From an expert:
 "This is transmission error. Most common causes are power related or
 unreliable connection especially if backplanes are involved. Is the
 problem still reproducible? If so, can you please try to move it to
 different power connector and SATA port and see what changes?"
 --
 tejun
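
The hexadecimal ''SErr'' value on the "exception" line is a bit mask, and the names inside the braces on the following ''SError'' line are its decoded bits. The Python sketch below decodes the value directly, which can help when the ''SError'' line is missing from an excerpt. The bit table is an assumption based on the SATA SError register layout and the short names the Linux libata driver prints; check it against your kernel source before relying on it. The function names are made up for this example. The table does reproduce the flag names shown in the issue #3 and issue #4 examples on this page (0x400100 and 0x990000).

```python
import re

# Assumed mapping of SError bit positions to the short names libata prints.
SERR_FLAGS = [
    (0, "RecovData"), (1, "RecovComm"), (8, "UnrecovData"), (9, "Persist"),
    (10, "Proto"), (11, "HostInt"), (16, "PHYRdyChg"), (17, "PHYInt"),
    (18, "CommWake"), (19, "10B8B"), (20, "Dispar"), (21, "BadCRC"),
    (22, "Handshk"), (23, "LinkSeq"), (24, "TrStaTrns"), (25, "UnrecFIS"),
    (26, "DevExch"),
]


def decode_serr(value):
    """Return the flag names whose bits are set in an SErr value."""
    return [name for bit, name in SERR_FLAGS if value & (1 << bit)]


def serr_from_line(line):
    """Find 'SErr 0x...' in a syslog line and decode it ([] if absent)."""
    match = re.search(r"SErr (0x[0-9a-fA-F]+)", line)
    return decode_serr(int(match.group(1), 16)) if match else []


if __name__ == "__main__":
    line = "ata2.00: exception Emask 0x10 SAct 0x7ff4f SErr 0x400100 action 0x6 frozen"
    print(serr_from_line(line))
```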
 
<br><br>
-----
====Drive interface issue #4====

This is an example of what is probably a loose backplane or cable connection issue (it could be the SATA connection, the power connection, or both):
 ata7.00: exception Emask 0x10 SAct 0x7 SErr 0x990000 action 0xa frozen
 ata7.00: irq_stat 0x00400000, '''PHY RDY changed'''
 ata7: SError: { '''PHYRdyChg 10B8B Dispar LinkSeq''' }
 ata7.00: cmd 60/48:00:af:1b:97/00:00:10:00:00/40 tag 0 ncq 36864 in
          res 40/00:10:87:5f:96/00:00:10:00:00/40 Emask 0x10 ('''ATA bus error''')
 ata7.00: status: { DRDY }
 
 ata7: hard resetting link
 ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
 ata7.00: '''qc timeout''' (cmd 0xec)
 ata7.00: '''failed to IDENTIFY''' (I/O error, err_mask=0x4)
 ata7.00: '''revalidation failed''' (errno=-5)
 ata7: '''failed to recover some devices, retrying in 5 secs'''
Note:  There are no CRC errors here; CRC errors normally implicate a bad cable or two.

These problems are often related to a backplane, perhaps loose, perhaps vibration-related, perhaps defective.  If the SATA link remains up for a while but communications are clearly bad, then the emphasis should probably be on the power connection.  The easiest way to test whether the backplane is at fault is to reinstall the drive outside of the backplane.

If there is no backplane involved, then the same considerations apply to the cable connections: each end of both the SATA and power cables, including any power cable splitters that may be involved.  It is common to jostle the cables after opening a computer case, and SATA cables are notorious for coming loose if they aren't the locking type.  It is a good habit to check all SATA connections just before closing a case up.

Good quality SATA and power cables and splitters are strongly recommended.  Always make certain that they are firmly connected and not subject to vibration.  This is even more important for backplanes: make sure that drives are firmly and well seated in their trays and cannot be vibrated loose.
<br><br>
<br>
-----
===Physical Drive Issues===

These are actual errors from the drive itself: perhaps a failing drive, or perhaps just failing sectors.  In general, you will always want to [[Troubleshooting#Obtaining_a_SMART_report|Obtain a SMART report]] for the drive.

====Drive media issue #1====

A typical example:
 ata3.00: cmd 60/00:10:4f:80:81/04:00:66:00:00/40 tag 2 ncq 524288 in
          res 41/40:64:eb:80:81/85:03:66:00:00/40 Emask 0x409 ('''media error''') <F>
 ata3.00: status: { DRDY '''ERR''' }
 ata3.00: error: { '''UNC''' }

These are almost always associated with bad sectors.  They should be confirmed by examining a SMART report.  See the [[Troubleshooting#Obtaining_a_SMART_report|Troubleshooting page, Obtaining a SMART report]] section.  Then run a SMART long test (instructions in the same section) to locate the bad sectors.  You may need to seek advice as to what to do next, as it will depend on your specific situation.
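
To show what "confirmed by a SMART report" can look like in practice, here is a small Python sketch that pulls the reallocated and pending sector counts out of two saved SMART reports and lists which ones increased. It assumes the usual ten-column attribute table printed by <code>smartctl -A</code>, with the raw value in the last column and the attribute names <code>Reallocated_Sector_Ct</code> and <code>Current_Pending_Sector</code>; some drives label or format these differently. The function names and the sample rows are made up for illustration, not taken from a real drive.

```python
# Attribute names to watch, as they usually appear in smartctl output.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def read_watched(report_text):
    """Return {attribute name: raw value} for the watched attributes."""
    values = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Assumed row layout:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def increased(before, after):
    """Return {name: (old, new)} for watched counts that went up."""
    return {
        name: (before.get(name, 0), new)
        for name, new in after.items()
        if new > before.get(name, 0)
    }


if __name__ == "__main__":
    # Illustrative rows only, not taken from a real drive.
    earlier = (
        "  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0\n"
        "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0\n"
    )
    later = (
        "  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8\n"
        "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2\n"
    )
    print(increased(read_watched(earlier), read_watched(later)))
```

An increase in either count alongside '''media error''' or '''UNC''' messages supports the bad sector diagnosis; unchanged counts suggest looking again at the interface issues above.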
<br><br>
<br>
-----
===Other Drive Issues===
====Unexpected loss of removable drive====

"I get a lot of messages like the following in the syslog...
What are they and should I be concerned?"
---
 Mar 10 14:59:10 Tower kernel: '''FAT: Directory bread(block 510) failed'''
 Mar 10 14:59:10 Tower kernel: FAT: Directory bread(block 511) failed

Usually when those errors appear, the system has lost contact with the flash drive.
* It could be the USB port (loose or faulty)
** Try re-seating the flash drive
** Try connecting to a different USB port
* It could be that the flash drive is going bad
** Test it on another machine
* It could be that a shared IRQ that serviced this USB port has been disabled
** Check the syslog for evidence related to its IRQ
* more to be added, as discovered
<br>
You will have to power off to get the system back, and most likely unRAID will want to start a parity check, because it could not record a proper shutdown on the flash drive.  Any settings changes won't be saved either until the flash drive is accessible again.
<br><br>

====Unexpected loss of hard drive====
  
Communication with the drive is suddenly lost, and the kernel very quickly disables the drive...  ''(more info and examples to come...)''
<br><br>

====File system errors====
  
''info and examples and instructions to use [[Check_Disk_Filesystems]] to come...''
<br><br>

==Additional references==
* [http://ata.wiki.kernel.org/index.php/Libata_error_messages Libata error messages]
* [http://docs.blackfin.uclinux.org/kernel/generated/libata/ch07.html ATA errors and exceptions]
* [http://ata.wiki.kernel.org/index.php/Main_Page Linux ATA wiki]
* [http://en.wikipedia.org/wiki/Advanced_Technology_Attachment Parallel ATA (PATA)]
* [http://datarecovery.net/articles/hard-drive-sector-damage.html A simple intro to sector structure and errors] - UNC, IDNF, AMNF, etc
* [http://www.mindshare.com/files/ebooks/SATA%20Storage%20Technology.pdf SATA Storage Technology] - pdf file (MindShare ebook)
* [http://goliath.ecnext.com/coms2/gi_0199-712119/The-challenges-of-testing-SATA.html The challenges of testing SATA] - dispar, crc, 10b8b
-----
<br><br>

[[Category: Hard drives]]
[[Category: Troubleshooting]]

Latest revision as of 18:28, 26 November 2016


Page under construction, but hopefully is starting to be helpful
Preface
There are many kinds of drive related errors, and many of them can appear to be very similar, but point to very different issues. Some experience in analyzing these errors is very necessary, because the steps to resolve the problem are highly dependent on what the real problem is. For example, some errors point to a bad cable, other very similar messages point to a failing drive, and others may point to a bad disk controller, poor or incorrect driver, or bad or insufficient power to the drive, etc. There have been far too many cases of drives thrown out or returned for RMA replacement, when the problem was just a bad cable! This page has been designed to help with the analysis of drive problems, and often to recommend what steps to take.
Consider the many components between a program that wants to read or save data, and the physical drive surface that will actually store it. Your program wants to access some data, so requests it, with a 'read block of data' request, but instead of the data it receives an error code or message back, called a 'read error'. We will assume that the program is running fine, so what other components, software or hardware, touched it and might be reporting an error?
  • the file system (the read request goes first to it)
  • the OS I/O caching system (any I/O may go through the various system buffers and caches)
  • the drive controller driver
  • the PCI interface to the controller (includes the actual transfer of packets from the busses to the controller)
  • the drive controller firmware (the software on the controller that actually manages the data flow)
  • the controller port connector
  • the SATA cable connector
  • the SATA cable (could be a backframe connection)
  • the SATA cable connector to the drive
  • the drive's SATA connector
  • the low level SATA link part of the drive firmware
  • the higher level data managing part of the drive firmware
  • the drive's caching systems
  • the drive's mechanical data storage systems (platters and platter spinning, heads and head movement, seeking, sector ID'ing, reading, and writing)
  • the actual drive surface
And you have to add in power issues to the above, power to the system, power to the controller, and power to the drive - any of which could cause unusual issues, hard to diagnose
And that's not all of them! There are more components that could be added between some of the above, and a number of them could be broken down into additional subsystems that could be listed above.
But the point that should really stand out from the above is that the physical drive is only a small part of the whole data handling path! To be fair, possible issues are not evenly distributed across the list. Bad sector errors are much more likely than PCI system errors. And bad SATA cable errors are also much more likely than many of the others. But the big point is that when you get a read or write error, it often has NOTHING to do with the drive! That's why I divide the errors into two classes - interface issues and drive issues. Some errors actually involve the physical drive, the rest involve the interface to the drive. Understanding that distinction is important to diagnosing drive errors.
So just to repeat the lesson: There have been far too many drives discarded, when there was nothing wrong with them!


Analytical Tools

The two most important tools for initial analysis of drive issues are the syslog and the SMART report.
For unRAID v6 and above, you will find them in the Diagnostics zip file. Please see Need help? Read me first!.
For older unRAID versions (any v4 or v5):


Drive problems by keyword


These words or phrases are found within the syslog within lines including and following the phrase "exception EMASK". Some are error flags, some are not. Look up the words you see in the list below, to help you determine what kind of issue you have, and possible steps to take. Drive issues are primarily either drive interface issues (the disk controller, drivers, cables, and power to the drive) or physical issues with the drive itself. In general, if it looks like an interface issue, then the drive is almost certainly completely fine, no matter how many errors you see, and even if it has been disabled!



Drive problems by error message


There are many kinds of drive errors. Examine each section below for the highlighted key words that most closely match the errors you see in your syslog.

The examples below will often include the ATA channel number involved with a particular drive. The actual numbers are not important, and will be different for each drive. The channel itself is usually something like ata3 or ata12, and the actual attached drive will be something like ata2.00 or ata13.01. Most will end in .00, as there is only one drive per SATA channel, but IDE or IDE emulating channels may have 2 (e.g. ata4.00 and ata4.01, master and slave), and port multipliers and SAS channels can have even more, such as ata5.00, ata5.01, ata5.02, ata5.03, and ata5.04. For more information on these ata drive symbols, see Drive Symbols.
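To make the numbering concrete, the following small Python sketch splits an identifier such as ata13.01 into its channel and device parts. The helper name `parse_ata_id` is hypothetical, and the pattern assumes the identifier is followed by a colon, as in the examples on this page.

```python
import re

# "ata3:" refers to the channel, "ata3.00:" to a drive attached to it
ATA_ID_RE = re.compile(r"\bata(\d+)(?:\.(\d+))?:")


def parse_ata_id(line):
    """Hypothetical helper: return (channel, device) or None.

    device is None when the line refers to the channel as a whole.
    """
    m = ATA_ID_RE.search(line)
    if not m:
        return None
    channel = int(m.group(1))
    device = int(m.group(2)) if m.group(2) is not None else None
    return channel, device
```

Grouping syslog lines by the channel number returned here keeps the messages for one drive together, which makes the sections below easier to match against.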


Drive Interface Issues

These are problems with the cables and connections to the drive, both power and data, or the quality of the power supplied. If your errors match one of these, then almost certainly, your drive is completely fine. There have been many drives returned or thrown out, after numerous errors similar to the following issues, that were entirely the fault of the cables or power or connectors used, NOT the drive itself.

Drive interface issue #1

An example:

ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
ata3.00: irq_stat 0x08000000, interface fatal error
ata3: SError: { 10B8B BadCRC }
(the SError line often also includes Dispar, UnrecovData, and/or HostInt)

From an expert:

"Your machine seems to be suffering genuine link layer problem.
In most cases, this indicates hardware problem and in my experience,
common causes are (in the order of ballpark frequency)...
1. inadequate power supply
2. device and controller don't like each other on 3Gbps
3. cable too long or flaky connector (especially with eSATA cables or genders or backplanes)
4. faulty controller or drive"
--
tejun  (http://lkml.org/lkml/2008/12/2/426)
(written by one of the foremost experts)

The presence of BadCRC is a pretty good indicator of a poor quality SATA cable. However, if a better cable does not solve the issue, then it is probably a power problem (loose power cable or backplane connection, poor connectors, poor power splitter, overloaded power supply, too many drives on power rail, bad power supply, etc).
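The rule of thumb above can be sketched in Python as follows. This is only an illustration of the reasoning on this page: the function names `serror_flags` and `first_suspect` are made up for the example, and the returned hints simply restate the advice given here.

```python
import re

# Captures the flag list between the braces of "SError: { ... }"
SERROR_RE = re.compile(r"SError:\s*\{([^}]*)\}")


def serror_flags(line):
    """Hypothetical helper: return the SError flags on a syslog line."""
    m = SERROR_RE.search(line)
    return m.group(1).split() if m else []


def first_suspect(flags):
    """Restate this page's rule of thumb; not an authoritative diagnosis."""
    if "BadCRC" in flags:
        return "SATA cable first; if a better cable does not help, check power"
    return "no BadCRC flag; compare with the other sections"
```

For the example at the top of this section, `serror_flags` yields the two flags 10B8B and BadCRC, so the first thing to try is a better SATA cable.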




Drive interface issue #2

An example:

   res 40/00:00:48:19:67/00:00:1e:00:00/40 Emask 0x50 (ATA bus error)
   ata3: SError: { UnrecovData HostInt 10B8B BadCRC }

These errors are usually related to a bad cable or cable connector, or possibly bad power. The presence of BadCRC or ICRC is a pretty good indicator of a poor quality SATA cable. However, if a better cable does not solve the issue, then it is probably a power problem (loose power cable or backplane connection, poor connectors, poor power splitter, overloaded power supply, too many drives on power rail, bad power supply, etc).




Drive interface issue #3

An example:

   ata2.00: exception Emask 0x10 SAct 0x7ff4f SErr 0x400100 action 0x6 frozen
   ata2.00: irq_stat 0x08000000, interface fatal error
   ata2: SError: { UnrecovData Handshk }

From an expert:

"This is transmission error. Most common causes are power related or
unreliable connection especially if backplanes are involved. Is the
problem still reproducible? If so, can you please try to move it to
different power connector and SATA port and see what changes?"
--
tejun




Drive interface issue #4

This is an example of what is probably a loose backplane or cable connection (the SATA connection, the power connection, or both):

ata7.00: exception Emask 0x10 SAct 0x7 SErr 0x990000 action 0xa frozen
ata7.00: irq_stat 0x00400000, PHY RDY changed
ata7: SError: { PHYRdyChg 10B8B Dispar LinkSeq }
ata7.00: cmd 60/48:00:af:1b:97/00:00:10:00:00/40 tag 0 ncq 36864 in
         res 40/00:10:87:5f:96/00:00:10:00:00/40 Emask 0x10 (ATA bus error)
ata7.00: status: { DRDY }
ata7: hard resetting link
ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata7.00: qc timeout (cmd 0xec)
ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata7.00: revalidation failed (errno=-5)
ata7: failed to recover some devices, retrying in 5 secs

Note: There are no CRC errors here. CRC errors would normally implicate a bad cable.

These problems are often related to a backplane, perhaps loose, perhaps vibration-related, perhaps defective. If the SATA link remains up for a while, but communications are clearly bad, then the emphasis should probably be on the power connection. The easiest way to test whether it is the fault of the backplane is to reinstall the drive outside of the backplane.

If there is no backplane involved, then the same considerations apply to the cable connections, at each end of both the SATA and power cables, including any power cable splitters that may be involved. It is common to jostle the cables after opening a computer case, and SATA cables are notorious for coming loose if they aren't the locking type. It is a good habit to check all SATA connections just before closing a case up.

Good quality SATA and power cables and splitters are strongly recommended. Always make certain that they are firmly connected, and not subject to vibration. This is even more important for backplanes: make sure that drives are firmly and well seated in their trays, and cannot be vibrated loose.
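When several drives share a backplane or a power splitter, it can help to see which port is being reset most often before reseating anything. The Python sketch below counts the "hard resetting link" messages per port. The helper name `count_link_resets` is hypothetical, and a high count is only a hint about where to look first, not proof of a fault.

```python
import re
from collections import Counter

# e.g. "ata7: hard resetting link"
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")


def count_link_resets(lines):
    """Hypothetical helper: count link reset messages per ATA port."""
    counts = Counter()
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Calling `count_link_resets(lines).most_common()` lists the ports with the most resets first.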



Physical Drive Issues

These are actual errors from the drive itself, perhaps a failing drive, or perhaps just failing sectors. In general, you will want to obtain a SMART report for the drive.

Drive media issue #1

A typical example:

ata3.00: cmd 60/00:10:4f:80:81/04:00:66:00:00/40 tag 2 ncq 524288 in
         res 41/40:64:eb:80:81/85:03:66:00:00/40 Emask 0x409 (media error) <F>
ata3.00: status: { DRDY ERR }
ata3.00: error: { UNC }

These are almost always associated with bad sectors. They should be confirmed by examining a SMART report. See the Troubleshooting page, Obtaining a SMART report section. Then run a SMART long test (instructions are in the same section) to locate the bad sectors. You may need to seek advice as to what to do next, as it will depend on your specific situation.
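To decide which drives need a SMART report, a short Python sketch like the one below can list the devices that logged an uncorrectable read (the UNC flag shown in the example). The helper name `devices_with_unc` is an assumption for illustration, and the result still has to be confirmed against the SMART report as described above.

```python
import re

# Device identifier such as "ata3.00:"
DEVICE_RE = re.compile(r"\b(ata\d+\.\d+):")
# e.g. "ata3.00: error: { UNC }"
UNC_RE = re.compile(r"error:\s*\{[^}]*\bUNC\b[^}]*\}")


def devices_with_unc(lines):
    """Hypothetical helper: return sorted devices that reported UNC."""
    found = set()
    for line in lines:
        if UNC_RE.search(line):
            m = DEVICE_RE.search(line)
            if m:
                found.add(m.group(1))
    return sorted(found)
```

Each identifier returned still has to be matched to its physical drive before requesting the SMART report.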



Other Drive Issues

Unexpected loss of removable drive

"I get a lot of messages like the following in the syslog...
What are they and should I be concerned?"
---
Mar 10 14:59:10 Tower kernel: FAT: Directory bread(block 510) failed
Mar 10 14:59:10 Tower kernel: FAT: Directory bread(block 511) failed

Usually when those errors appear, the system has lost contact with the flash drive.

  • It could be the USB port (loose or faulty)
    • Try re-seating the flash drive
    • Try connecting to a different USB port
  • It could be the flash drive is going bad
    • Test it on another machine
  • It could be a shared IRQ has been disabled, one that serviced this USB port
    • Check the syslog for evidence related to its IRQ
  • more to be added, as discovered


You will have to power off to get the system back, and most likely, unRAID will want to start a parity check, because it cannot update the flash drive with a proper shutdown. Any settings changes won't be saved either, until the flash drive is accessible again.
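If you want to know how often the flash drive was lost, the messages quoted above can be picked out of the syslog with a few lines of Python. This is a minimal sketch: the helper name `failed_fat_blocks` is hypothetical, and the pattern is based only on the two example lines shown in this section.

```python
import re

# e.g. "Tower kernel: FAT: Directory bread(block 510) failed"
BREAD_RE = re.compile(r"FAT: Directory bread\(block (\d+)\) failed")


def failed_fat_blocks(lines):
    """Hypothetical helper: return the block numbers that failed to read."""
    blocks = []
    for line in lines:
        m = BREAD_RE.search(line)
        if m:
            blocks.append(int(m.group(1)))
    return blocks
```

An empty result means no such messages were logged; any other result points to the checks in the list above.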


Unexpected loss of hard drive

Communication with the drive is suddenly lost, and the kernel very quickly disables the drive... (more info and examples to come...)


File system errors

info and examples and instructions to use Check_Disk_Filesystems to come...


Additional references