A new Ubuntu-based server that I set up recently had a power failure, which unexpectedly left the box unable to boot again. There were actually two problems:
- fsck failed on the data mount, apparently because one of the data drives had failed. It took forever, but eventually prompted for user input: “S” to skip or “M” to fix manually.
- The first time this happened, I just power-cycled the computer, hoping it would come up. Unfortunately, Grub detected the failed boot and disabled the timeout for the boot menu, so the box just sat there in the Grub boot menu.
Unfortunately this server is supposed to be headless (and is mounted to the wall 4m above ground), so there was not even a keyboard on which somebody could blindly press one of these keys or press return to select an option in the Grub menu. Since sshd hadn’t been started yet, I could ping the server (the IP stack was working) but not ssh into it to fix the problem. So I got myself a really long VGA cable and a USB extension cable to connect a monitor and a keyboard and look at the actual console.
The second issue can be solved easily:
In /etc/default/grub add the following entry:
GRUB_RECORDFAIL_TIMEOUT=5
This makes Grub show the boot menu for 5 seconds after a failed boot and then boot normally. I used 5 seconds rather than 0 so I can actually use that menu if the need arises. The change only takes effect after running update-grub.
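If the setting has to be applied to more than one box, the edit can be scripted. The following is only a minimal sketch: set_grub_default is a hypothetical helper name (not part of Grub or Ubuntu), and it assumes GNU sed and a value that does not contain the "|" character.

```shell
# set_grub_default FILE KEY VALUE
# Hypothetical helper: replace an existing KEY= line in FILE,
# or append KEY=VALUE if no such line exists yet.
set_grub_default() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite the line in place (GNU sed).
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key missing: append it as a new line.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Intended use on the server, as root:
#   set_grub_default /etc/default/grub GRUB_RECORDFAIL_TIMEOUT 5
#   update-grub
```

Calling it a second time with a different value rewrites the existing line instead of adding a duplicate, so it is safe to run repeatedly.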
The first issue is a bit more involved. I want the box to at least boot to a state where I can access it through ssh, even if the data drives fail. That means I have to remove the entry for the data mount from /etc/fstab and put the mount command somewhere later in the boot process. One option is to mount it in /etc/rc.local like this (suggested here):
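fsck -n UUID=...
rc=$?   # save fsck's exit status before the next command overwrites $?
if [ "$rc" -ne 0 ]; then   # plain [ ] also works when rc.local runs under /bin/sh
    logger -p user.warning "/etc/rc.local: fsck failed with status $rc"
else
    mount ....
fi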
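The exit status of fsck is a bit mask (see the fsck man page), so a bare number in the log is not very readable. As a rough sketch, a small helper could translate it into text before it is passed to logger. The function name describe_fsck_status is made up for this example.

```shell
# describe_fsck_status STATUS
# Hypothetical helper: print the meaning of every bit set in an
# fsck exit status, using the values documented in fsck(8).
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return
    fi
    out=""
    for pair in "1:errors corrected" "2:reboot recommended" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage or syntax error" "32:canceled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}    # number before the colon
        text=${pair#*:}    # description after the colon
        if [ $(( $status & $bit )) -ne 0 ]; then
            out="${out:+$out, }$text"
        fi
    done
    echo "$out"
}
```

In rc.local the warning could then be logged as logger -p user.warning "/etc/rc.local: fsck: $(describe_fsck_status "$rc")", assuming the status was saved in a variable named rc.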
I won’t be going that way, because the system is not that critical. If it doesn’t come up, we will notice, ssh into it, and then fsck and mount the data volume manually.