The Analysis of Drive Issues

From Unraid | Docs
(Undo revision 5348 by Acynoxum (Talk) removed - spam & hijack)
Revision as of 13:28, 24 November 2010


Page under construction, but hopefully it is starting to be helpful

There are a number of drive-related errors, and many of them look similar but point to very different issues. Some experience in analyzing these errors is therefore recommended, because the steps to resolve a problem depend heavily on what the real problem is. For example, some errors point to a bad cable, other very similar messages point to a failing drive, and still others may point to a bad disk controller, a poor or incorrect driver, or bad or insufficient power to the drive. There have been too many cases of drives thrown out or returned through an RMA process when the problem was just a bad cable. This page is designed to help with the analysis of drive problems, and often to recommend the next steps to take.

The two most important tools for initial analysis are the syslog and the SMART report.


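Drive problems by keyword

  • 10B8B
    • "10 bit to 8 bit" error flag
    • usually a Drive Interface Issue
    • see Drive interface issue #1, Drive interface issue #2, and Drive interface issue #4

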
Drive problems by error message


There are many kinds of drive errors. Examine each section below for the highlighted key words that most closely match the errors you see in your syslog.

The examples below will often include the ATA channel number involved with a particular drive. The actual numbers are not important, and will be different for each drive. The channel itself is usually something like ata3 or ata12; the attached drive will be something like ata2.00 or ata13.01. Most will end in .00, as there is only one drive per SATA channel, but IDE or IDE-emulating channels may have two (e.g. ata4.00 and ata4.01, master and slave), and port multipliers and SAS channels can have even more, such as ata5.00, ata5.01, ata5.02, ata5.03, and ata5.04.
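
As a rough illustration of the naming scheme just described, the following Python sketch splits an identifier such as ata13.01 into its channel number and device number. The function name parse_ata_id and the regular expression are illustrative assumptions, not part of unRAID or the kernel.

```python
import re

# Matches identifiers such as "ata3" (a channel) or "ata13.01"
# (a device on a channel). The pattern is an assumption based on the
# examples on this page, not an official format definition.
ATA_ID = re.compile(r"\bata(\d+)(?:\.(\d{2}))?\b")


def parse_ata_id(text):
    """Return (channel, device) for the first ATA identifier in text.

    device is None when only the channel (e.g. "ata3") is named.
    Returns None when no identifier is found.
    """
    match = ATA_ID.search(text)
    if match is None:
        return None
    channel = int(match.group(1))
    device = int(match.group(2)) if match.group(2) is not None else None
    return channel, device
```

Grouping syslog lines by the channel number returned here is one way to see all messages that belong to a single drive together.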


Drive Interface Issues

These are problems with the cables and connections to the drive, both power and data, or the quality of the power supplied. If your errors match one of these, then almost certainly, your drive is completely fine. There have been many drives returned or thrown out, after numerous errors similar to the following issues, that were entirely the fault of the cables or connectors used, NOT the drive itself.

Drive interface issue #1

ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
ata3.00: irq_stat 0x08000000, interface fatal error
ata3: SError: { 10B8B BadCRC }   often also DisPar and UnrecovData and HostInt
"Your machine seems to be suffering genuine link layer problem.
In most cases, this indicates hardware problem and in my experience,
common causes are (in the order of ballpark frequency)...
# inadequate power supply
# device and controller don't like each other on 3Gbps
# cable too long or flaky connector (especially with eSATA cables or genders or backplanes)
# faulty controller or drive"
--
tejun  (http://lkml.org/lkml/2008/12/2/426)
(written by one of the foremost experts)




Drive interface issue #2

   res 40/00:00:48:19:67/00:00:1e:00:00/40 Emask 0x50 (ATA bus error)
   ata3: SError: { UnrecovData HostInt 10B8B BadCRC }

These errors are usually related to a bad cable or cable connector. The presence of BadCRC is a pretty good indicator of a poor quality SATA cable.




Drive interface issue #3

   ata2.00: exception Emask 0x10 SAct 0x7ff4f SErr 0x400100 action 0x6 frozen
   ata2.00: irq_stat 0x08000000, interface fatal error
   ata2: SError: { UnrecovData Handshk }
"This is transmission error. Most common causes are power related or
unreliable connection especially if backplanes are involved. Is the
problem still reproducible? If so, can you please try to move it to
different power connector and SATA port and see what changes?"
--
tejun




Drive interface issue #4

This is an example of what is probably a loose backplane or cable issue (it could be the SATA connection, the power connection, or both):

ata7.00: exception Emask 0x10 SAct 0x7 SErr 0x990000 action 0xa frozen
ata7.00: irq_stat 0x00400000, PHY RDY changed
ata7: SError: { PHYRdyChg 10B8B Dispar LinkSeq }
ata7.00: cmd 60/48:00:af:1b:97/00:00:10:00:00/40 tag 0 ncq 36864 in
         res 40/00:10:87:5f:96/00:00:10:00:00/40 Emask 0x10 (ATA bus error)
ata7.00: status: { DRDY }
ata7: hard resetting link
ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata7.00: qc timeout (cmd 0xec)
ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata7.00: revalidation failed (errno=-5)
ata7: failed to recover some devices, retrying in 5 secs

Note: There are no CRC errors here. CRC errors would normally implicate a bad cable.

These problems are often related to a backplane, perhaps loose, perhaps vibration-related, perhaps defective. If the SATA link remains up for a while, but communications are clearly bad, then the emphasis should probably be on the power connection. The easiest way to test whether the backplane is at fault is to reinstall the drive outside of the backplane.

If there is no backplane involved, then the same considerations apply to the cable connections: each end of both the SATA and power cables, including any power cable splitters that may be involved. It is common to jostle the cables after opening a computer case, and SATA cables are notorious for coming loose if they aren't the locking type. It is a good habit to check all SATA connections just before closing a case up.

Good quality SATA and power cables and splitters are strongly recommended. Always make certain that they are firmly connected and not subject to vibration. This is even more important for backplanes: make sure that drives are firmly and well seated in their trays, and cannot be vibrated loose.
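
The SError lines in the interface examples above can be picked apart mechanically. The sketch below, in Python, extracts the flags between the braces and reports whether BadCRC is among them. The helper names and the set of "interface" flags are assumptions drawn only from the examples on this page, not a complete or authoritative list.

```python
import re

# Captures the text between the braces of an "SError: { ... }" line.
SERROR = re.compile(r"SError:\s*\{([^}]*)\}")

# Flags that appear in the interface-issue examples on this page.
# This grouping is an illustrative assumption, not a kernel definition.
INTERFACE_FLAGS = {
    "10b8b", "badcrc", "dispar", "unrecovdata",
    "hostint", "handshk", "phyrdychg", "linkseq",
}


def serror_flags(line):
    """Return the list of SError flags found in a syslog line."""
    match = SERROR.search(line)
    return match.group(1).split() if match else []


def interface_flags(line):
    """Return only the flags this page associates with interface issues."""
    return [f for f in serror_flags(line) if f.lower() in INTERFACE_FLAGS]


def has_bad_crc(line):
    """True if the line reports BadCRC, which often points at the cable."""
    return any(f.lower() == "badcrc" for f in serror_flags(line))
```

A line with interface flags but no BadCRC, like the issue #4 example, fits the pattern described above where the backplane or power connection deserves attention before the SATA cable.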



Physical Drive Issues

These are actual errors from the drive itself, perhaps a failing drive, or perhaps just failing sectors. In general, you will always want to obtain a SMART report for the drive.

Drive media issue #1

A typical example:

ata3.00: cmd 60/00:10:4f:80:81/04:00:66:00:00/40 tag 2 ncq 524288 in
         res 41/40:64:eb:80:81/85:03:66:00:00/40 Emask 0x409 (media error) <F>
ata3.00: status: { DRDY ERR }
ata3.00: error: { UNC }

These are almost always associated with bad sectors. They should be confirmed by examining a SMART report. See the Troubleshooting page, Obtaining a SMART report section. Then run a SMART long test (instructions are in the same section) to locate the bad sectors. You may need to seek advice as to what to do next, as it will depend on your specific situation.
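
To separate these media errors from the interface errors described earlier, a syslog line can be checked for the two markers highlighted in the example: the "(media error)" text and the UNC flag in the error line. The following Python sketch assumes the line layout shown above; the function name is_media_error is hypothetical.

```python
import re

# Captures the flags of a line such as "ata3.00: error: { UNC }".
ERROR_FLAGS = re.compile(r"ata\d+\.\d{2}: error: \{([^}]*)\}")


def is_media_error(line):
    """True if the line carries one of the media-error markers shown above."""
    if "(media error)" in line:
        return True
    match = ERROR_FLAGS.search(line)
    return bool(match) and "UNC" in match.group(1).split()
```

A positive result is only a hint that the drive itself is involved; the SMART report remains the confirmation step.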



Other Drive Issues

Unexpected loss of removable drive

"I get a lot of messages like the following in the syslog...
What are they and should I be concerned?"
---
Mar 10 14:59:10 Tower kernel: FAT: Directory bread(block 510) failed
Mar 10 14:59:10 Tower kernel: FAT: Directory bread(block 511) failed

Usually when those errors appear, the system has lost contact with the flash drive.

  • It could be the USB port (loose or faulty)
    • Try re-seating the flash drive
    • Try connecting to a different USB port
  • It could be the flash drive is going bad
    • Test it on another machine
  • It could be a shared IRQ has been disabled, one that serviced this USB port
    • Check the syslog for evidence related to its IRQ
  • more to be added, as discovered


You will have to power off to get the system back, and most likely, unRAID will want to start a parity check, because it cannot update the flash drive with a proper shutdown. Any settings changes won't be saved either, until the flash drive is accessible again.
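
The FAT messages quoted above follow a fixed pattern, so a short Python sketch can collect the block numbers that failed. The function name failed_fat_blocks is an assumption for illustration, and the pattern is based only on the two sample lines shown.

```python
import re

# Matches "FAT: Directory bread(block 510) failed" and captures the block.
BREAD = re.compile(r"FAT: Directory bread\(block (\d+)\) failed")


def failed_fat_blocks(lines):
    """Return the block numbers from FAT bread failures, in order seen."""
    blocks = []
    for line in lines:
        match = BREAD.search(line)
        if match:
            blocks.append(int(match.group(1)))
    return blocks
```

Any non-empty result suggests checking the flash drive and its USB port as listed above.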




Firmware upgrades

Warning! Highly disorganized and overlapping information below, copied from Internet sources.

Seagate #1

"There are a few drives which are currently marked to disable NCQ and warn the user that the firmware that should be upgraded:"

  • ST31500341AS
  • ST31000333AS
  • ST3640623AS
  • ST3640323AS
  • ST3320813AS
  • ST3320613AS
  • all for firmware versions SD15 through SD19.

Firmware Update for ST31500341AS, ST31000333AS, ST3640323AS, ST3640623AS, ST3320613AS, ST3320813AS, ST3160813AS

Firmware Update for STM31000334AS, STM3640323AS, STM3320614AS, STM3160813AS

Firmware Update for ST3500320AS, ST3500620AS, ST3500820AS, ST3640330AS, ST3640530AS, ST3750330AS, ST3750630AS, ST31000340AS

Firmware Update for ST3250310NS, ST3500320NS, ST3750330NS, ST31000340NS

Seagate #2

Tech sites are reporting everywhere a massive flaw in seagate drives that can lock up the drive and make it unusable (the bios doesn't detect it, you can't read the data). Haven't read anything about it here on the lists. Seagate has ack'ed the problem:

So, apparently there're a lot of drives on the market (including mine) that can die any day. Are those drives going to be blacklisted? It's still not clear if the firmware update is safe (some affected but working drives are dying after the firmware update), so some people like me is still waiting (and hoping that the drive doesn't die) for more stable firmware updates...

Here is the list of drives+firmware affected, according to the support site as of now. Some models are still being diagnosed.


Seagate Barracuda 7200.11 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951)

Models Affected:

  • ST3500320AS
  • ST3640330AS
  • ST3750330AS
  • ST31000340AS

Firmware Affected

  • SD15, SD16, SD17, SD18, SD19, AD14

Recommended Firmware Update

  • SD1A

Seagate Barracuda 7200.11, page 2 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957)

Models Affected:

  • ST31500341AS
  • ST31000333AS
  • ST3640323AS
  • ST3640623AS
  • ST3320613AS
  • ST3320813AS
  • ST3160813AS

Firmware Affected

  • Still Unknown

Recommended Firmware Update

  • Still Unknown

Seagate Barracuda ES.2 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207963)

Models Affected:

  • ST3250310NS
  • ST3500320NS
  • ST3750330NS
  • ST31000340NS

Firmware Affected

  • Still Unknown

Recommended Firmware Update

  • Still Unknown

DiamondMax 22 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207969)

Models Affected:

  • STM3500320AS
  • STM3750330AS
  • STM31000340AS

Firmware Affected

  • MX15 (or higher)

Recommended Firmware Update

  • MX1A

DiamondMax 22 (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207975)

Models Affected:

  • STM31000334AS
  • STM3320614AS
  • STM3160813AS

Firmware Affected

  • Still Unknown

Recommended Firmware Update

  • Still Unknown
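
As a small worked example, the first Barracuda 7200.11 entry above is the only Barracuda entry with a complete list of affected firmware, so it can be written as a lookup. This Python sketch covers that one entry only; the names are hypothetical, and the entries marked "Still Unknown" are deliberately left out rather than guessed.

```python
# Data copied from the first Barracuda 7200.11 entry on this page.
AFFECTED_MODELS = {"ST3500320AS", "ST3640330AS", "ST3750330AS", "ST31000340AS"}
AFFECTED_FIRMWARE = {"SD15", "SD16", "SD17", "SD18", "SD19", "AD14"}
RECOMMENDED_FIRMWARE = "SD1A"


def recommended_update(model, firmware):
    """Return the recommended firmware if this model/firmware pair is listed.

    Returns None when the pair is not in the entry above, which does not
    mean the drive is unaffected: other entries are still marked unknown.
    """
    if (model.strip().upper() in AFFECTED_MODELS
            and firmware.strip().upper() in AFFECTED_FIRMWARE):
        return RECOMMENDED_FIRMWARE
    return None
```

The model and firmware strings would come from the drive's SMART report or its label.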