Workstation crashed, filesystem corrupted
The title says it all, folks. This is the first time in my Linux usage history (which begins somewhere in 1995-1996) that I have had a workstation crash due to something out of my control. I will admit that I have caused my workstations to crash a number of times through my own stupidity or carelessness, no sense in denying that. What is very different in this case is that my workstation crashed this morning due to a kernel bug and, what's more, this is the worst kind of bug -- one that is known to exist.
Over the course of the last few days I have had several incidents where I had to reboot the computer and do an extended, manual filesystem check because of the following error:
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
It's not exactly that error that causes the problem; apparently it's a page allocation failure. I don't know. I'm never around when it happens, and all I usually see is the above error in my really long dmesg buffer.
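Since the dmesg buffer is long and I am never watching when the failure hits, a small filter helps pull out the relevant lines. What follows is only a rough sketch in Python, not anything from the bug report: the function name `find_suspect_lines` and the two patterns are my own assumptions, based on the journal error quoted above and on the phrase "page allocation failure". The exact kernel wording may differ between versions.

```python
import re

# Assumed patterns: the first matches the EXT3 journal error quoted above,
# the second matches the phrase "page allocation failure" anywhere in a line.
# Real kernel messages may be worded differently on other versions.
PATTERNS = {
    "journal_abort": re.compile(r"EXT3-fs error .*Journal has aborted"),
    "page_alloc_failure": re.compile(r"page allocation failure", re.IGNORECASE),
}


def find_suspect_lines(log_text):
    """Return (line_number, label, line) tuples for every matching log line."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits


# Hypothetical sample; in practice the text would come from saved dmesg output.
sample = (
    "some unrelated kernel message\n"
    "EXT3-fs error (device dm-0) in start_transaction: Journal has aborted\n"
)
for number, label, line in find_suspect_lines(sample):
    print(number, label, line)
```

Feeding it the saved output of dmesg instead of the sample string should show whether a page allocation failure appears in the buffer before the journal abort.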
A little background is in order. I am using CentOS 4.2 with ext3 and LVM support. It's the default install for a 'Workstation' computer in the graphical installer. CentOS is the 'Community Enterprise OS', which is built from the RHEL source RPMs. The CentOS people do a good job rebuilding the packages and I have been very happy with my experience. I don't use RHEL because I would have to pay $179 for it. Just to let you know, I can purchase Microsoft Windows XP for less than that! Please! RedHat, I don't need to be able to call you for support; I can read your Bugzilla Bug 149088 about my problem. In fact, Google found it for me - it was even at the top of the search results. I can also implement the 'fix', which is basically to not use the Enterprise Linux kernel (the EL series) and use the FC4 'community' kernel instead. This fix isn't sanctioned by RedHat, by the way; it serves to solve the bug reporter's problem and isolate the problem to the EL kernel itself.
Scary stuff that the 'Enterprise' package is inferior to the 'Community' package. Even scarier, the bug was opened on February 18, 2005 and I can reliably reproduce the error on my system. All I have to do is run some disk benchmark tools or use an application that exercises the disk, such as running qmail and receiving several hundred emails (mostly spam).
I am currently downloading the CentOS 4.2 RPMs from the CentOS 4.0 rescue disk so that I can reinstall the files that were corrupted and lost during the fsck. I have to use 'rpm --force --noscripts' to install them because neither the rpm command nor yum has a 'reinstall' command line option.
In any case, I think RedHat needs to rethink the decision to make RHEL a non-mainstream distribution. They have plenty of smart people working for them, just not enough of them, and I doubt they could afford the numbers needed to catch all of these issues and fix them.
I wonder how long it would have taken for this to be fixed in Fedora? Isn't RHEL supposed to be based on Fedora, or just the 'idea' of Fedora?