Trouble Booting Related to Hard Disk Errors [ClearOS Documentation]

Trouble Booting Related to Hard Disk Errors

When you boot your system, you are told that there were problems with the disk and are prompted either to press Ctrl+D or to supply the root password to continue. This error can occur when the file system crashes or detects an inconsistency that the normal journal recovery cannot fix.

What's wrong

This can happen for the following reasons:

Bad shutdown: power to the system was interrupted in a way that left the data volume unmountable.
Forced check: the startup sequence reached the interval at which it forces a file system check, and that check detected inconsistencies.
Hardware fault: the drive is failing or has developed a defect.

How to fix it

Depending on the problem, this could be an easy fix or a difficult one. You need to think about what data is on this disk and whether or not you can tolerate data loss on the disk. If you are using ClearOS as a mere firewall, the data is probably not as critical as it would be if you are using it as a file server. There are several paths to take here:

Determine whether the physical disk is bad.
Image the volume to ensure optimal recovery.
Try the repair and see what happens.

Determine whether the physical disk is bad

Disk drives fail. Most modern disk drives have built-in monitoring called SMART (Self-Monitoring, Analysis and Reporting Technology). Its job is to watch the disk for problems. You can boot a live Linux distribution such as Parted Magic to see whether SMART is reporting any problems. If your system has a hardware RAID controller, you may not be able to query SMART on the individual drives to determine whether a drive is bad.
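As a rough illustration of the SMART check, the sketch below wraps smartctl from the smartmontools package. It assumes smartmontools is present in the live environment (the original does not say so), that you are running as root, and that /dev/sda is the disk in question. check_smart is a hypothetical helper name, not part of ClearOS.

```shell
#!/bin/sh
# Hypothetical helper: print the drive's SMART health verdict and the
# attributes that most often point to a failing disk.
# Assumes smartmontools is installed and the caller is root.
check_smart() {
    disk="$1"
    # -H prints the drive's overall health self-assessment.
    smartctl -H "$disk"
    # -A prints the vendor attribute table; keep only the sector-related rows.
    smartctl -A "$disk" |
        grep -i -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'
}

# Example (the device name is an assumption, adjust it to your system):
#   check_smart /dev/sda
```

A failed health assessment, or non-zero raw values on the sector rows, is a reasonable cue to image the disk before attempting any repair.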

Many people skip this step because it takes a long time. If your system has critical data that is not on a backup, we encourage you to take the steps necessary to determine if your physical disk is good or bad.

Image the volume to ensure optimal recovery

Imaging the disk is a great way to ensure the successful recovery of data. This process requires that the disk be removed from the system and mounted under another system that has:

a valid and modern Linux operating system
sufficient storage to capture the WHOLE disk twice over!
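The "twice over" requirement can be checked before any copying starts. This is a minimal sketch, not an official procedure: has_room_for_two_copies is a hypothetical helper that takes the source size in bytes and the destination directory, and the example paths are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if DIR has room for two copies of SIZE_BYTES
# (one for the original snapshot, one for the working copy).
has_room_for_two_copies() {
    size_bytes="$1"
    dir="$2"
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is free space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    # Round the requirement up to whole kilobytes.
    needed_kb=$(( (size_bytes * 2 + 1023) / 1024 ))
    [ "$avail_kb" -ge "$needed_kb" ]
}

# Example (as root; blockdev --getsize64 reports a block device's size in bytes):
#   has_room_for_two_copies "$(blockdev --getsize64 /dev/sda1)" /path/to/large/storage \
#       && echo "enough space" || echo "not enough space"
```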

With this process, we take a full data dump of the disk. Even if the disks are part of a software RAID, they can be imaged, repaired and subsequently mounted. This process can take a long time to complete, and the larger the disk, the more time it will take. But it is by far the safest route: once we have a data dump of the disk, we can keep it untouched as the original snapshot, make a second copy of it, and try our repairs on that second copy (the guinea pig data).

How long the repair takes depends on how badly the disk is damaged and how large it is. Some repairs take just a few minutes; others can take weeks.

AUTHOR NOTE: The longest repair I've been involved with took 3 months. The data was very corrupt, and a single fsck run took 2 weeks to complete. The first pass was unsuccessful. We originally tried an fsck without an image of the disk, and when the system still failed to come up, we suspected the worst. We backed up the disk to an image, made a copy, and started the repairs on the copy. Eventually the repair completed and we got back 100% of the data that the customer was concerned with (1.5 TB).

To take your initial snapshot, add the disk to a separate system and copy the data to an image file on storage that has plenty of free space. Here is an example:

dd if=/dev/sda1 of=/path/to/large/storage/imagefile.img bs=512
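Once the snapshot exists, repairs should run against a duplicate rather than the snapshot itself. This sketch shows one way to make that duplicate and confirm it is byte-for-byte identical. make_working_copy and the working.img file name are hypothetical. The commented e2fsck line assumes the image holds an ext2/ext3/ext4 file system, which the original does not state.

```shell
#!/bin/sh
# Hypothetical helper: duplicate the original snapshot and verify the copy.
make_working_copy() {
    original="$1"
    working="$2"
    cp "$original" "$working" || return 1
    # cmp exits non-zero at the first differing byte, so success means an exact copy.
    cmp "$original" "$working"
}

# Example (paths are placeholders):
#   make_working_copy /path/to/large/storage/imagefile.img /path/to/large/storage/working.img
#   e2fsck -f /path/to/large/storage/working.img   # assumes an ext2/3/4 file system
```

If a repair on the working copy goes badly, delete it and make a fresh copy from the untouched snapshot.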

Try the repair and see what happens

If you just want to see what a repair will do and you aren't worried about how it will turn out (i.e., you have good backups or can rebuild the system in the worst case), then proceed with the repair.

On boot, after you have provided the root password, you can start the repair by running the following (where '/dev/sda1' is the partition reported during startup as being the problem):

fsck /dev/sda1

This process can take a while, and you must answer 'y' to each question for fsck to repair the particular problem indicated. The bigger the disk, the longer it will take.
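If answering every prompt by hand is impractical, many file system checkers (e2fsck among them) accept -n to report problems without changing anything and -y to answer yes to every question. The sketch below combines the two. check_then_repair is a hypothetical wrapper. Because -y accepts every proposed fix, it only suits a working copy or a system whose data you can afford to lose.

```shell
#!/bin/sh
# Hypothetical wrapper: look first, then repair without prompting.
# -n and -y are passed through to the file-system-specific checker;
# not every checker supports them.
check_then_repair() {
    part="$1"
    # -n: report problems only, make no changes.
    if fsck -n "$part"; then
        echo "No problems reported on $part"
        return 0
    fi
    # -y: assume "yes" for every repair question.
    fsck -y "$part"
}

# Example (the partition name is an assumption):
#   check_then_repair /dev/sda1
```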

Once the disk is repaired, you will need to restart the system. Type the following to reboot the system:

reboot

If the system comes up to the same prompt again, note which volume is reported. When several partitions have crashed, the partition indicated may be different from the one you repaired, and you may need to repeat the steps on that other volume.

If the same partition is reported again, you may have an issue with the superblock. Also check the command you issued: if it targets the whole disk and NOT a partition, you will ALWAYS get an error about the superblock. For example, fsck /dev/sda tries to repair the disk rather than a partition. Partitions are indicated by a number, as in fsck /dev/sda1.
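The disk-versus-partition rule can be checked before fsck is run. This is a minimal sketch, assuming the traditional /dev/sdX and /dev/hdX naming used in this document. is_partition_name is a hypothetical helper, and other naming schemes (for example NVMe or software RAID devices) are not covered.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the name looks like a partition
# (/dev/sda1, /dev/hdb2) rather than a whole disk (/dev/sda).
# Only covers sdX/hdX style names.
is_partition_name() {
    case "$1" in
        /dev/sd[a-z]*[0-9] | /dev/hd[a-z]*[0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   is_partition_name /dev/sda1 && fsck /dev/sda1
```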

Help

Links

Troubleshooting Boot Process Dropping to Shell
