Difference between revisions of "Troubleshooting"

From unRAID

Revision as of 05:06, 11 April 2013

Trouble with your unRAID server?

This is the best place to start. Although the unRAID forums are a great place to find help, it is best to work through the tips and suggestions of this page first.

If you still need assistance, then there are a lot of helpful unRAID users on the unRAID forums. If appropriate, start by searching for posts similar in nature to your issue. Registration is not necessary to browse the forums, although we would like to hear from you. However, if you need to ask for help, you MUST register first.

If you have questions, please check the FAQ and the Best of the Forums first.

Here is a community statement about unRAID support, how it is handled, how long it may take, commercial vs. community support, etc.

How to get help

If, after searching the forums (use advanced search) and this wiki, you are still not able to resolve your problem, here is an (almost) foolproof way of asking for help that will give the experts the info they need to help you quickly. If you don't provide enough detail, you risk annoying the gurus, who may then not answer your cry for help. This procedure will work on ALL versions of unRAID from 4.2.1 through 4.4.2.

Please post only after you have at least tried to find help on the topic. Many problems are common, and a little research can quickly have you up and running again. Try the FAQ first, then the Best of the Forums page, then the general wiki pages. Going the forum route takes more time, but is needed in some situations.

The rest of this wiki explains some of these commands in greater detail and provides other options. What is presented here is a "what but not why" set of steps for capturing and posting a log. This is a very basic process that can be used for reporting any problem, from network outages to disks not being recognized to other oddities.

Sometimes the worst thing you can do is try something that could make the problem worse, so avoid the urge to try anything risky!

1. Go to the unRAID server

2. Log on (unless you changed it, user name is "root" with no password) (Press the enter key first if you do not see a login prompt)

3. Enter these commands

 cp /var/log/syslog /boot/syslog.txt
 chmod a-x /boot/syslog.txt

4. Shut down the unRAID box by one of the following methods:

  • Stop the array from the unRAID Web Management page with the Stop button. Then use the Power Down button to shut down your server. (For information about installing a stop script, please see here.)
  • If you have installed the powerdown package, then you can type powerdown at the console prompt, and a safe shutdown will proceed. (For information about installing powerdown, please see here and here and here.)
  • If you have installed a stop script on your flash drive, then you can use it in the next method.
  • The stop command is no longer included with unRAID. But if you are running an older version, earlier than around v4.3, you can try the following sequence of commands:
 stop
 sync
 powerdown

Wait for each command to finish, or display an error, or turn the machine off. If the stop command results in an error, keep going. The sync command does not produce output. The powerdown command may not be available either.

  • If none of the methods above was available or worked (the machine is still running), then type the poweroff command. Unfortunately, if you have to use the poweroff command, the array may not be stopped correctly, and a parity check may begin on the next boot. Your array is safe though.

5. Wait for computer to power off

6. Remove the USB stick from your unRAID server and put into your desktop computer

7. Start a new thread in the unRAID forums (see Troubleshooting#Creating a forum post about your problem)

8. Enter a description of the problem, and give it a useful title, such as "HELP: Drive with all my data says it's unformatted"

9. Attach the syslog.txt file from your flash drive to the post

10. Take a Valium and wait patiently for a response.


Capturing your syslog

The more information you can give others about the problem, the quicker you can get back to normal operation. There is probably nothing more important to helping others help you than capturing your syslog.

Every time you reboot, the syslog file is replaced. So if you have a failure, it is important to capture the syslog before you reboot! Otherwise, any chance of understanding what caused your failure will be lost.

Here's how:

To obtain a copy of your current syslog, at the unRAID console or in a Telnet session, type the command:

 cp  /var/log/syslog  /boot

This will make a copy of the system log in the root directory of your flash drive, which you can either copy directly from the flash share of your server, or plug the flash drive into your PC and access the syslog there. Any file manager such as Windows Explorer can access the file across the network. For example, if your unRAID server name is Tower, then you can access your newly created syslog as \\Tower\flash\syslog. It is recommended to rename it with the date and time and the .txt extension, for example syslog-2007-08-28-1630.txt.

One warning: many of the files on the flash drive, including the syslog, may be flagged System and Hidden, so don't forget to adjust Windows Explorer to be able to see System and Hidden files. This command will remove the System and Hidden flags:

 chmod a-x /boot/syslog

The above syslog copy command is the simplest form, but leaves it on the flash with just the name syslog. Here's another form, that gives it a fuller name, ready to copy to an unRAID logs folder, or attach to an email or forum post.

 cp /var/log/syslog /boot/syslog-2008-04-10.txt
 chmod a-x /boot/syslog-2008-04-10.txt
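If you would rather not type the date by hand, the two commands above can be wrapped in a small function. This is only a sketch: the function name save_syslog is made up for illustration, and it assumes the standard date command is available at your console. It follows the date-and-time naming suggested earlier (for example syslog-2007-08-28-1630.txt).

```shell
# save_syslog [SOURCE] [DEST_DIR]
# Hypothetical helper (not part of unRAID): copies the syslog to a file
# named with the current date and time, clears the execute bits, and
# prints the name of the copy.
save_syslog() {
    src="${1:-/var/log/syslog}"
    dest_dir="${2:-/boot}"
    dest="$dest_dir/syslog-$(date +%Y-%m-%d-%H%M).txt"
    cp "$src" "$dest" && chmod a-x "$dest" && printf '%s\n' "$dest"
}

# Typical use at the console (defaults to /var/log/syslog and /boot):
#   save_syslog
```

Called with no arguments it should behave like the two-line form above, only with today's date and time filled in.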

An easier way to access the syslog, if you have installed the UnMENU Add On, and you have network access to your unRAID server, is to go to the UnMENU web page, click the Syslog plugin link, then click the syslog download link. This will save it directly to your computer.

As of unRAID v4.5-beta2, you can access your syslog directly by browsing to http://tower/log/syslog (substitute 'tower' with your server name). You will want to put that into your browser's location box, change the name to your unRAID server's name, go there, then save a bookmark or favorite for the page. Now you can use your browser's Save Page option to save a copy of your syslog to your computer.

See also Viewing the System Log

How often should you save a copy of your syslog? Right now, of course, for a baseline copy, and then as often as needed. You can always delete the old and extra ones later. A syslog contains most of the unRAID setup, especially of the drives, plus most or all of the issues reported for that session. But once you reboot, it is gone forever, unless the syslog was saved.

Having multiple syslogs saved allows you, or someone helping you, to use file comparison tools that help to quickly isolate what is different between 2 syslogs, especially if there is a baseline syslog and another syslog that covers a problem period. Total Commander has a built-in file comparison tool for quick analysis. WinMerge (with a suggested 'prediffer' of 30 columns) provides better analysis, with better handling of moved lines and added or missing sections. Quick isolation of just the changes is important when reading syslogs, because it's often more about what to ignore, than what to look for. 'Before and after' syslogs or 'baseline and problem' syslog pairs are ideal for syslog analysis.
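If you do not have a graphical comparison tool handy, a rough equivalent of the '30 column prediffer' idea can be sketched with standard command line tools on any machine that has them. The function name syslog_diff is hypothetical, and the assumption is that the first 30 columns of each line hold only the timestamp and host prefix, which you want ignored.

```shell
# syslog_diff FILE_A FILE_B [COLUMNS]
# Hypothetical helper: drops the first COLUMNS characters (default 30)
# of every line, then compares what is left with diff.
# Returns diff's own status: 0 = same, 1 = different, 2 = trouble.
syslog_diff() {
    a="$1"
    b="$2"
    cols="${3:-30}"
    start=$((cols + 1))
    ta=$(mktemp) || return 2
    tb=$(mktemp) || { rm -f "$ta"; return 2; }
    cut -c"$start"- "$a" > "$ta"
    cut -c"$start"- "$b" > "$tb"
    status=0
    diff "$ta" "$tb" || status=$?
    rm -f "$ta" "$tb"
    return "$status"
}

# Example (file names are placeholders):
#   syslog_diff syslog-baseline.txt syslog-problem.txt
```

Lines that differ only in their timestamps drop out, leaving the messages that actually changed between the baseline and the problem syslog.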


If you cannot copy your syslog

If the instructions above were not successful, then there may be other problems. If you cannot mount your flash drive, or have lost access to it, then /boot may have disappeared, and the instructions above won't work. If so, try this variation, which copies the syslog to the first data disk:

cp /var/log/syslog /mnt/disk1/syslog.txt

Then after rebooting, if you have network access to your unRAID server, you can copy it from Disk 1.
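The choice between the flash drive and Disk 1 can also be made by checking what is actually mounted. The sketch below is an illustration only: pick_syslog_dest is a made-up name, and it assumes the kernel's mount list is readable at /proc/mounts, with the mount point in the second field of each line.

```shell
# pick_syslog_dest [MOUNTS_FILE]
# Hypothetical helper: prints /boot if it is mounted, otherwise
# /mnt/disk1 if that is mounted; fails if neither is.
pick_syslog_dest() {
    mounts="${1:-/proc/mounts}"
    for mp in /boot /mnt/disk1; do
        if awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$mounts"; then
            printf '%s\n' "$mp"
            return 0
        fi
    done
    return 1
}

# Typical use at the console:
#   dest=$(pick_syslog_dest) && cp /var/log/syslog "$dest/syslog.txt"
```

If neither location is mounted, nothing is copied, which is your cue to fall back on the screen photo approach described below.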

If the reason you cannot access the syslog is because the system appears to have crashed, then it is probably too late. Do try to capture a syslog BEFORE the system becomes unresponsive. It may not cover the time of failure, but may have the info needed for troubleshooting. Even a syslog copied immediately after booting may be helpful, certainly better than nothing.

Also, you can try the following command at the console prompt. It will fill your unRAID screen with the tail end of the syslog, and *may* show you the error(s) as they happen.

tail -f --lines=100 /var/log/syslog

Using your digital camera, you can take a photo of your screen, avoiding as much glare as possible, and post the image. A screen pic is better than nothing at all.

If you are running headless (no monitor and possibly no keyboard or graphics card), then you can try directing the output of the command above to your flash drive or a data drive. For example, the following will output the last syslog lines to syslogtail.txt on your flash drive. This should allow you to obtain the very last message that the system was able to log.

tail -f --lines=100 /var/log/syslog >/boot/syslogtail.txt

Be aware though that there are some troubleshooting issues where you *must* hook up a monitor!


Creating a forum post about your problem

The first thing to do is to select the appropriate forum for your issue. If your problem seems related to a specific version of unRAID, then select that forum. Select the unRAID Server 4.4 forum for problems related to the latest unRAID releases. If it is a motherboard-related question, then use the Motherboards forum. If it is a hardware problem, then use the Hardware forum. If it doesn't seem to fit in the already mentioned forums, then try the Software forum.

After entering the correct forum, click the New Topic button. Your soapbox is now ready. Start with an appropriate subject heading, not too general, but something specific to your problem. Then clearly indicate what the problem is, and include the exact wording of any error messages (if any). Add some detail of your hardware setup, including motherboard, CPU, amount of RAM, your flash drive, addon cards - especially disk controllers, and the drives you have installed. If appropriate, indicate the version of unRAID.

And of course, as mentioned in the section above, if a syslog would be useful to a troubleshooter, attach it now. Don't wait to be asked for it! Some problems don't need a syslog, but most do. It is our window into the internal workings of the Linux kernel and the unRAID driver, what it is seeing and what it is doing. A problem getting a USB flash drive to boot is one type of issue that usually does NOT involve a syslog. For USB boot problems, see below.

There is a limit on the size of attachments, so if your syslog is too large, perhaps because of many error log entries, you should zip it, and attach the zip file. The syslog, especially if there are repetitive error entries, will compress very small. Normal syslogs zip to around 15% of original size, and those with errors are typically much larger and contain lots of repetition, which compresses way down, usually to 7% to 9% of original size. Zipping them also ensures they are received intact.
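To get a feel for how far your own syslog will shrink, you can run a quick check on any machine with gzip installed. This is a rough sketch under assumptions: compressed_percent is a made-up name, and gzip is used as a stand-in for zip, so the figure will be close to, but not exactly, what a zip file would give.

```shell
# compressed_percent FILE
# Hypothetical helper: prints the gzip-compressed size of FILE as a
# whole-number percentage of its original size.
compressed_percent() {
    file="$1"
    orig=$(wc -c < "$file" | tr -d ' ')
    comp=$(gzip -c "$file" | wc -c | tr -d ' ')
    [ "$orig" -gt 0 ] || return 1
    echo $(( comp * 100 / orig ))
}

# Example (file name is a placeholder):
#   compressed_percent syslog-2008-04-10.txt
```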

Note: if the syslog is large, attaching a zipped copy of it is preferred. Please DO NOT split your syslog! Also, do not attach it as an .rtf or .doc, or as anything but the original text file you captured.


Boot problems

There is probably nothing more frustrating than getting all excited about unRAID, spending hours reading about it, then grabbing a spare flash drive, only to waste many more hours trying fruitlessly to get it to boot unRAID. There have possibly been more potential unRAID users lost for this reason, than any other. Some flash drives are harder to prepare than others, and certainly, some motherboard BIOS's are much more picky about booting from a USB drive.

The instructions at USB Flash Drive Preparation are very complete, especially the troubleshooting tips. If still unsuccessful booting unRAID, then check the tips below (some are already covered in the previous instructions). If still unsuccessful, then it is time to post a question on the unRAID forums.

  • One of the most common issues is forgetting to set the Volume Label on the flash drive to UNRAID, exactly 6 capital letters.
  • Some BIOS's reshuffle boot order, especially when a new hard drive has been added. In your BIOS Setup Menus, try Harddisk-USB first. See here for more BIOS Setup Tips and Other BIOS Suggestions.
  • Check USB Boot Issues


Network not working

  • If you have added a NIC, make sure that any onboard LAN is disabled in the BIOS Setup Menus, and don't plug the network cable into the onboard NIC.
  • Make sure the workgroup name is the same on each of your machines.
  • See also the Networking FAQ.

Name Resolution

If you are having trouble connecting to your unRAID server from other machines on the network using the hostname (e.g. tower), but using the server's IP address works, you are having a name resolution problem.

  • If you are having trouble accessing the web management page, make sure you are using //tower or http://tower and not \\tower.
  • Make the unRAID machine the local master network browser by logging into the web management page (use the IP address to access it) and set Local Master on the settings page to Yes. Reboot the unRAID server and the computer that cannot resolve the network name and try again.
  • If name resolution still does not work, the workaround is to use the hosts file to set its IP address manually. For Windows, the hosts file is located at
%WINDIR%\System32\drivers\etc\hosts

For Linux it is at

/etc/hosts

Open the file in a text editor and at the bottom add

192.168.x.y<tab>hostname

Replace 192.168.x.y with the IP address of your unRAID server and replace hostname with the server's hostname (e.g. tower). Don't actually type <tab>, just press the tab key.
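On a Linux machine, the same edit can be made from a shell instead of a text editor. The following is a sketch only: add_host_entry is a made-up name, 192.168.1.50 is a placeholder address, and you will normally need root privileges to write to /etc/hosts. It adds the line only if the hostname is not already listed.

```shell
# add_host_entry HOSTS_FILE IP HOSTNAME
# Hypothetical helper: appends "IP<tab>HOSTNAME" to HOSTS_FILE unless a
# non-comment line already lists HOSTNAME.
add_host_entry() {
    hosts_file="$1"
    ip="$2"
    name="$3"
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file" 2>/dev/null; then
        return 0    # already present, nothing to do
    fi
    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}

# Example (placeholder address, run as root):
#   add_host_entry /etc/hosts 192.168.1.50 tower
```

Running it twice is harmless, since the second call finds the existing entry and leaves the file alone.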

Hard drive failures

See also the Hard Drives FAQ. unRAID can recover from a SINGLE disk failure. It is actually easy to miss a failure unless you notice degraded performance. (Note you will only have degraded performance reading from the one drive that has actually failed, not from reading other disks in the array.) You may not even notice the degraded performance, as it is very likely it will still be sufficient to serve media files fast enough over the LAN that you would not notice. The only way to tell for sure if unRAID has detected a drive failure is to look for a red ball next to one of your drives on the Main page of the Web interface. But it is easy for even unRAID to miss the fact that a drive has failed if it is not accessed for a while. This is another reason to run the monthly parity check - to make sure that unRAID "knows" that a drive has failed.


How to prevent drive failures?

The best way to avoid failures is some preventative maintenance.

  • Do not use round IDE cables. Although sold as premium cables, they do not meet the technical specifications for high speed use. Just because they have worked fine for 5 years on your Windows box does not mean you will have good luck with them with unRAID. (You have been warned!) Instead use the flat cables that come with most motherboards (use the 80-conductor cables, not the 40-conductor cables made for CD ROM use - see this picture. Use a cable that looks like the top cable, not the bottom one).
  • Make sure your cooling is adequate. High heat stresses all parts of your computer, including your hard drives. Although it is hard to give a precise temperature to start to be concerned about, temps below 40C are good, between 41-45C are getting warm, and temps above 45C should steer you towards adding active cooling on your hard disks. I (personally) would shut down my server with hard drive temps over 50C. See also the UnRAID Topical Index, Fans topic.
  • Run a periodic parity check (every month). We recommend running the Monthly parity check script from the UnRAID Add Ons wiki page. Although you may not realize it, hard drives have an internal error checking system, known as S.M.A.R.T., that monitors all drive operations, including the media surface condition. If a spot on the disk starts to go bad, the drive can "remap" the sector, and avoid reporting a bad sector error. It does this by taking the bad sector offline, and mapping a spare sector (from a reserved pool of spares) into its place, and moving the contents of the bad sector to the replacement one. It does this quietly, transparently to the system, so that no errors are reported, just logged within the SMART system of the drive. But if the drive doesn't read a sector for a very long time, that sector can go from good to marginal to bad without the drive noticing. Running a parity check (besides verifying that your parity is being properly maintained) will also cause each and every sector of every disk to be read, and give your drive's SMART monitoring a chance to take corrective action, and prevent a future error.
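The temperature guidance in the cooling tip above can be written down as a small check. This is a sketch of those rules of thumb, not an official threshold table: the function name is made up, temperatures are whole degrees Celsius, and treating exactly 40 as "good" is an assumption.

```shell
# classify_drive_temp DEGREES_C
# Hypothetical helper that maps a drive temperature to the advice given
# in the cooling tip: 40 or less good, 41-45 getting warm, above 45 add
# active cooling, above 50 shut down.
classify_drive_temp() {
    t="$1"
    if [ "$t" -gt 50 ]; then
        echo "shut down"
    elif [ "$t" -gt 45 ]; then
        echo "add active cooling"
    elif [ "$t" -gt 40 ]; then
        echo "getting warm"
    else
        echo "good"
    fi
}

# Example:
#   classify_drive_temp 43
```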

Cabling Problems

Hard drives can sometimes appear to fail or be failing when actually the problem is a loose cable. Although this is most common when installing new drives, the vibrations inside the computer can also cause cables to work loose over time. The first step when you lose a drive, hear pops and clicks from your drive, or see resets or other errors in the syslog, is to identify the drive causing problems, and to unplug and replug (and if that doesn't work, replace) the data cable. Also check the power cable while you're at it. If you are using backplanes, they become a part of the cabling from unRAID's perspective. Make sure that the backplane isn't causing the problem. (See the smartctl section below for more hints of cabling problems.)


What if I get an error?

  • If your array has been running fine for days/weeks/months/years and suddenly you notice a non-zero value in the error column of the web interface, what does that mean? Should I be worried?
  • Occasionally unRAID will encounter a READ error (not a WRITE error) on a disk. When this happens, unRAID will read the corresponding sector contents of all the other disks + parity to compute the data it was unable to read from the source. It will then WRITE that data back to the source drive. Without going into the technical details, this allows the source drive to fix the bad sector so next time, a read of that sector will be fine. Although this will be reported as an "error", the error has actually been corrected already. This is one of the best and least understood features of unRAID!
  • There may be OTHER types of errors than this one, so it is certainly worth your while to capture a syslog after an error is detected, but this is likely what has happened. Also, if you notice this happening more than once in a very great while, you might want to consider testing and replacing the disk in question. Remapped sectors have been linked with higher than normal drive failure.
  • After getting an error, run a parity check soon after, to make sure that all is well.
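To see why the other disks plus parity are enough to recompute unreadable data, here is a toy illustration. It assumes simple XOR parity, which is the usual scheme for single-parity arrays but is not spelled out on this page, and it shrinks each "sector" down to a single number. The names are made up for the example.

```shell
# xor_all N1 N2 ...
# Prints the bitwise XOR of all its integer arguments.
xor_all() {
    result=0
    for v in "$@"; do
        result=$(( result ^ v ))
    done
    printf '%d\n' "$result"
}

# Three data "sectors", each reduced to one byte for illustration
d1=165
d2=60
d3=15

# Parity is the XOR of all data values
parity=$(xor_all "$d1" "$d2" "$d3")

# Pretend d2 could not be read: XOR of everything else gives it back
rebuilt_d2=$(xor_all "$d1" "$d3" "$parity")
```

After this runs, rebuilt_d2 holds the same value as d2, which is the value that would be written back to the drive that reported the read error.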


What do I do if I get a red ball next to a hard disk?

  • If you have moved your drives around (or sometimes even if you haven't), unRAID can get confused about what drive is assigned to what slot. It will NOT START the array, and some drives may have red balls next to them. You will also see italicized drive serial numbers. You need to go to the Devices page and re-assign the right drives to the right slots. (The italicized serial numbers on the main page will guide you to assign the right drive to the right slot.) You can then safely start the array.

But if you see a red ball next to a single drive, and the unRAID array status indicates it is "Started", that disk has been taken out of service because an attempt to write to it has failed.

  • First you should know that unRAID does not take a disk out of service casually, but if a disk experiences a write failure, it will do exactly that, it will take the disk out of service. A write failure is serious. A single write failure will take a disk out of service and unRAID will then show a red indicator next to it in the management interface.
  • Many things can cause such a failure that have nothing to do with the drive. Cables can (and do) go bad or wiggle loose. SATA cables in particular are notorious for slipping off their connectors, if they aren't the locking type. PSU's (power supplies) can do weird things and induce failures. Motherboards can go bad. At a minimum, it is worth a little time to recheck all of the connections, to make sure something hasn't come loose. Whenever the computer case is opened, especially just before closing it up, cabling can shift and cause a connection to a drive to fail. When checking for loose connections, take care not to disturb connections to other drives, complicating your failure.
  • Drives are self-monitoring through their SMART features. There is a very nice utility called smartctl that is included by unRAID (click this LINK if using pre 4.3 final version of unRAID). Here are some instructions on using it from Tom. Also see unRAID Addons and UnRAID Topical Index, SMART for more Smartctl links. If when trying the smartctl commands below, you get an error about a missing library, then see this post for instructions for installing it.

Obtaining a SMART report

At the unRAID console, or from a Telnet or PuTTY session, type:

 smartctl  -a  -d  ata  /dev/sda
or, if your drive controller does not accept the "-d ata" option (some newer ones do not):
 smartctl -a -A /dev/sda

Notes:

  • If you get an error like "error while loading shared libraries: libstdc++.so.6", then you are using a version (such as v4.4.2) that is missing a required library. Please see this post.

Look at the Devices page for the device identifier (within the parentheses) for each disk, and substitute that for 'sda' on the command line.

This command will print out the SMART info for the drive. Refer to this article to better understand the SMART report.

To copy the results to a file called smart.txt on your USB stick that you can use to post to the forums, use this command:

 smartctl  -a  -d  ata  /dev/sda >/boot/smart.txt
or, if your drive controller does not accept the "-d ata" option (some newer ones do not):
 smartctl -a -A /dev/sda >/boot/smart.txt

This form makes it easier to look at the smart.txt file from a Windows workstation.

 smartctl  -a  -d  ata  /dev/sda | todos >/boot/smart.txt
or, if your drive controller does not accept the "-d ata" option (some newer ones do not):
 smartctl -a -A /dev/sda | todos >/boot/smart.txt


The smartctl output will provide a bunch of statistics that the drive captures about itself.

  • Perhaps the most important attribute to look at is "Reallocated_Sector_Ct"; its RAW_VALUE is a count of sectors that have been reallocated/remapped. If a sector goes bad, the drive has the ability to "remap" a spare sector to the bad sector. This is done at a low level, within the drive itself, so the OS doesn't even know it happened. (unRAID actually uses this feature to maintain the integrity of your array.) Each time this happens, the reallocated sector count is incremented. Seeing a few reallocated sectors is not necessarily a bad thing, but seeing that number start to go up is often a sign that the drive is failing. Anytime you see a value other than 0 you should closely monitor the drive. If the number holds steady and does not increase even after several parity checks, your drive is likely okay. But if it seems to be going up by even 1 or 2 at a time, start to be concerned. This is likely the first hint that the drive is failing. A special note about reallocated sectors: bad cabling CANNOT cause reallocated sectors to occur.
  • An equally important attribute is "Current_Pending_Sector"; its RAW_VALUE is a count of suspect sectors pending reallocation. It should ALWAYS be zero. If not, then you will probably (but not always) see the Reallocated Sector Count increase in the future, when this does return to zero. Before remapping a suspect sector, the drive tests it one last time, and *may* pass it and not remap it. (There are good reasons why it is designed to work this way.)
  • Another important stat to look at is "Temperature_Celsius". It tracks the current and min/max temperatures of the drive. If your drives are running hot (see the recommendations in the "How to prevent drive failures?" section above), consider adding active cooling to your hard drives.
  • One user had the "UDMA_CRC_Error_Count" greater than zero. Research showed that this can be caused by bad cabling ("Possible causes of UDMA CRC errors are bad interface cables or cable routing problems through electrically noisy environments (e.g., cables are too close to the power supply).")
  • Near the end of the smartctl report is a list of the last few errors the drive encountered. Errors indicating that commands were not recognized are a sign of bad cabling, and not necessarily of a bad drive.
  • (Add more info about specific smartctl statistics)
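If you only want one number out of the report, such as the reallocated sector count, the attribute table can be filtered with awk. This is a sketch under assumptions: smart_raw_value is a made-up name, and it assumes the usual ten-column attribute table in which the attribute name is the second column and RAW_VALUE starts in the tenth.

```shell
# smart_raw_value ATTRIBUTE_NAME
# Hypothetical helper: reads a smartctl report on standard input and
# prints the first word of RAW_VALUE for the named attribute.
smart_raw_value() {
    awk -v name="$1" '$1 ~ /^[0-9]+$/ && $2 == name { print $10; exit }'
}

# Typical use at the console (substitute your own device for sda):
#   smartctl -a -d ata /dev/sda | smart_raw_value Reallocated_Sector_Ct
#   smart_raw_value Current_Pending_Sector < /boot/smart.txt
```

Saving the number after each parity check makes it easy to tell whether the count is holding steady or creeping up.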

Each of the smartctl attributes is provided in "raw" format (RAW_VALUE) as well as in "normalized" format (VALUE). The raw format is sometimes more human readable (like the temperature in Celsius or the reallocated sector count), but not always. Raw values can also vary wildly from vendor to vendor. The normalized format shows the current value as a normalized value between 255 and 0 (higher is better). If the value falls to or below the "THRESH" value, it means that the drive is failing. The WORST normalized value is also shown.
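The VALUE versus THRESH comparison can be checked mechanically. The sketch below makes a few assumptions: smart_failing_attrs is a made-up name, the report uses the usual column order (VALUE fourth, THRESH sixth), an attribute counts as failed when VALUE is less than or equal to THRESH, and attributes with a threshold of 000 are skipped because they are informational.

```shell
# smart_failing_attrs
# Hypothetical helper: reads a smartctl report on standard input and
# prints the name of every attribute whose normalized VALUE is at or
# below a non-zero THRESH.
smart_failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) { print $2 }'
}

# Typical use at the console (substitute your own device for sda):
#   smartctl -a -d ata /dev/sda | smart_failing_attrs
```

No output means no attribute has reached its threshold, which is reassuring but still no substitute for reading the whole report.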

When reviewing SMART attributes, see this helpful chart of Known S.M.A.R.T. attributes.

Running a SMART test

Smartctl provides drive tests that you can run. Smartctl does not actually conduct the tests, it just tells the drive to initiate a test on itself. You need to run the SMART report command, shown above, to get the results. If the test is still in progress, the SMART report will tell you that as well.

This short test takes 1 to 3 minutes (remember to substitute your drive's identifier for 'sda' as described above)

 smartctl -d ata -tshort /dev/sda
or, if your drive controller does not accept the "-d ata" option
 smartctl -t short /dev/sda

This long test takes about 2 to 4 hours

 smartctl -d ata -tlong /dev/sda
or, if your drive controller does not accept the "-d ata" option
 smartctl -t long /dev/sda

To see the results, or the progress of the test, use the SMART report command, shown above in the Obtaining a SMART report section.

How do I re-enable the drive?

Okay, the cable was loose, or I think the failure was a fluke - how can I get unRAID to reuse this same disk that it thought had failed?

If you are sure that the drive is fine, and the SMART report confirms it, and you have not written to the drive since it was taken off-line, then use the Trust My Array procedure to quickly recover your drive and the array to an all green condition. Remember, it was taken out of service when a "write" to it failed, so using the "trust" procedure will effectively forget the data written while the drive was disabled. Unless you are certain you have not written to the disk, a reconstruction is much better. The safest option is to reconstruct the drive. Only use the Trust procedure if a reconstruction is not possible.

You can re-enable the hard drive and reconstruct it as follows:

  • Stop the array.
  • Go to the Devices page and un-assign the disk.
  • Go to the main page and start the array.
  • Stop the array again.
  • Go to the Devices page and re-assign the disk.
  • Go to the Main page - system should indicate there is a "new" drive to replace the disabled one. Check the confirmation box and click Start to start a parity-reconstruct of the disk.

What to do if a hard drive fails?

If a hard drive fails and you need to replace it, refer to the unRAID documentation or this wiki page to replace the drive. This is a tried and true process that many many users have used. It is the same process used to replace a disk with a larger capacity disk.


General hardware issues

Note: we are still waiting for editors (including you if you are reading this!) to provide good step by step analysis of hardware issues, like testing memory, checking cables, checking temps and airflow, disabling components, BIOS and firmware updates, etc.


More Links