[SlugLUG] When Bad Hard Drives Happen to Good Linuxes
Thomas Leavitt
thomas at thomasleavitt.org
Mon Oct 2 13:02:14 PDT 2006
Hmm... SMART doesn't generally tell me when a drive is going to die, but
if I see problems, it is a signal to dump the drive... for instance,
here's the output on my server's current main drive, which I'm migrating
off of within a day or two...
I typically see problems here:
195 Hardware_ECC_Recovered  0x000a 253   252   000    Old_age  Always  -           15205
196 Reallocated_Event_Count 0x0008 183   183   000    Old_age  Offline -           70
197 Current_Pending_Sector  0x0008 248   247   000    Old_age  Offline -           51
198 Offline_Uncorrectable   0x0008 231   193   000    Old_age  Offline -           22
prior to a disk failure... now, the error log below says nothing much
has happened recently, but my instincts to date indicate that this drive
is suspect, and has been for a while...
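
The rule of thumb above (watch the reallocation, pending-sector and uncorrectable counters) can be sketched in Python. This is a minimal illustration, not a smartmontools feature: the WATCHED set and the function name are assumptions made for the example, and the sample rows are copied from this message. Attribute 195 is left out of the watch list because the healthy drive further down also reports a non-zero count (913) for it.

```python
# Sketch: flag selected SMART attributes whose raw value is non-zero.
# WATCHED is an illustrative choice, not a smartmontools default.
WATCHED = {196, 197, 198}

# Attribute rows as printed by "smartctl -A", one row per line.
SAMPLE = """\
195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 15205
196 Reallocated_Event_Count 0x0008 183 183 000 Old_age Offline - 70
197 Current_Pending_Sector 0x0008 248 247 000 Old_age Offline - 51
198 Offline_Uncorrectable 0x0008 231 193 000 Old_age Offline - 22
199 UDMA_CRC_Error_Count 0x0008 199 199 000 Old_age Offline - 0
"""


def suspect_attributes(text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched IDs with raw > 0."""
    hits = {}
    for line in text.splitlines():
        fields = line.split()
        # A complete attribute row has 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


print(suspect_attributes(SAMPLE))
```

Any non-empty result here is only a prompt to look closer and check backups; it is not a prediction that the drive will fail.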
SMART Error Log Version: 1
Warning: ATA error count 1909 inconsistent with error log pointer 5
ATA Error Count: 1909 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
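
The 49.710-day wrap is what you would get if the timestamp is an unsigned 32-bit count of milliseconds. That storage format is an assumption here, though it matches the figure smartctl prints. A short Python sketch of the arithmetic and of the DDd+hh:mm:SS.sss layout (the function names are made up for the example):

```python
# Sketch: why Powered_Up_Time wraps after about 49.710 days, assuming the
# timestamp is kept as an unsigned 32-bit count of milliseconds.
MS_PER_DAY = 24 * 60 * 60 * 1000


def wrap_days(bits=32):
    """Days until a millisecond counter of the given width overflows."""
    return 2 ** bits / MS_PER_DAY


def format_powered_up(ms):
    """Render milliseconds in the DDd+hh:mm:SS.sss layout used in the log."""
    ms %= 2 ** 32  # emulate the wrap-around
    days, rem = divmod(ms, MS_PER_DAY)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    clock = f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
    return f"{days}d+{clock}" if days else clock


print(round(wrap_days(), 3))
print(format_powered_up(567_280))
print(format_powered_up(756_020_848))
```

Because of the wrap, a Powered_Up_Time value only orders commands within one power cycle; the "power-on lifetime" hours on each error are the figure to compare across errors.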
Error 1909 occurred at disk power-on lifetime: 12996 hours (541 days + 12 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 3b fa 2d 20 e0 Error: UNC 59 sectors at LBA = 0x00202dfa = 2108922
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 fa 2d 20 e0 08 00:09:27.280 READ DMA EXT
25 00 40 fa 2d 20 e0 08 00:09:24.432 READ DMA EXT
25 00 08 f2 2d 20 e0 08 00:09:24.432 READ DMA EXT
25 00 40 fa 2d 1c e0 08 00:09:24.432 READ DMA EXT
25 00 08 f2 2d 1c e0 08 00:09:24.432 READ DMA EXT
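
The LBA in the "Error:" line can be rebuilt from the register bytes printed next to it. The sketch below assumes the 28-bit layout (SN is the low byte, then CL, then CH, with the low nibble of DH on top). READ DMA EXT is a 48-bit command, but this log format shows only these registers, so the sketch covers exactly what is printed and nothing more. The function name is hypothetical.

```python
# Sketch: rebuild the LBA from the SN, CL, CH and DH register bytes,
# assuming the 28-bit LBA layout.
def lba_from_registers(sn, cl, ch, dh):
    """Combine register bytes into an LBA; only the low nibble of DH is used."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers after error 1909: SN=fa CL=2d CH=20 DH=e0
lba = lba_from_registers(0xFA, 0x2D, 0x20, 0xE0)
print(f"0x{lba:08x} = {lba}")
```

The same function reproduces the LBAs printed for errors 1907 and 1906 from their register bytes.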
Error 1908 occurred at disk power-on lifetime: 12996 hours (541 days + 12 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
01 51 3b fa 2d 20 e0 Error: AMNF 59 sectors at LBA = 0x00202dfa = 2108922
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 fa 2d 20 e0 08 00:09:24.432 READ DMA EXT
25 00 08 f2 2d 20 e0 08 00:09:24.432 READ DMA EXT
25 00 40 fa 2d 1c e0 08 00:09:24.432 READ DMA EXT
25 00 08 f2 2d 1c e0 08 00:09:24.432 READ DMA EXT
25 00 40 fa 2d 18 e0 08 00:09:24.432 READ DMA EXT
Error 1907 occurred at disk power-on lifetime: 11073 hours (461 days + 9 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 06 1b 35 74 e0 Error: UNC 6 sectors at LBA = 0x0074351b = 7615771
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 1b 35 74 e0 08 8d+18:00:20.848 READ DMA EXT
25 00 08 6b 87 71 e0 08 8d+18:00:20.848 READ DMA EXT
35 00 08 7b 0f 07 e0 08 8d+18:00:20.816 WRITE DMA EXT
35 00 f8 83 0e 07 e0 08 8d+18:00:20.816 WRITE DMA EXT
35 00 08 fb 3f 2e e0 08 8d+18:00:20.816 WRITE DMA EXT
Error 1906 occurred at disk power-on lifetime: 6426 hours (267 days + 18 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 02 f6 2d 84 e0 Error: UNC 2 sectors at LBA = 0x00842df6 = 8662518
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 02 f6 2d 84 e0 08 00:21:24.512 READ DMA EXT
25 00 04 f4 2d 84 e0 08 00:21:23.296 READ DMA EXT
25 00 06 f2 2d 84 e0 08 00:21:22.064 READ DMA EXT
35 00 08 fa 2d 2c e0 08 00:21:22.064 WRITE DMA EXT
25 00 02 f8 2d 84 e0 08 00:21:22.048 READ DMA EXT
smartctl version 5.33 [i586-mandriva-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: Maxtor 6Y250P0
Serial Number: Y64B29GE
Firmware Version: YAR41BW0
User Capacity: 251,000,193,024 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Mon Oct 2 13:00:07 2006 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 114) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 363) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 106) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027 181   181   063    Pre-fail Always  -           10437
  4 Start_Stop_Count        0x0032 253   253   000    Old_age  Always  -           35
  5 Reallocated_Sector_Ct   0x0033 245   244   063    Pre-fail Always  -           83
  6 Read_Channel_Margin     0x0001 253   253   100    Pre-fail Offline -           0
  7 Seek_Error_Rate         0x000a 253   252   000    Old_age  Always  -           0
  8 Seek_Time_Performance   0x0027 250   243   187    Pre-fail Always  -           42539
  9 Power_On_Minutes        0x0032 209   209   000    Old_age  Always  -           221h+29m
 10 Spin_Retry_Count        0x002b 253   252   157    Pre-fail Always  -           0
 11 Calibration_Retry_Count 0x002b 253   252   223    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 253   253   000    Old_age  Always  -           255
192 Power-Off_Retract_Count 0x0032 253   253   000    Old_age  Always  -           0
193 Load_Cycle_Count        0x0032 253   253   000    Old_age  Always  -           0
194 Temperature_Celsius     0x0032 253   253   000    Old_age  Always  -           43
195 Hardware_ECC_Recovered  0x000a 253   252   000    Old_age  Always  -           15205
196 Reallocated_Event_Count 0x0008 183   183   000    Old_age  Offline -           70
197 Current_Pending_Sector  0x0008 248   247   000    Old_age  Offline -           51
198 Offline_Uncorrectable   0x0008 231   193   000    Old_age  Offline -           22
199 UDMA_CRC_Error_Count    0x0008 199   199   000    Old_age  Offline -           0
200 Multi_Zone_Error_Rate   0x000a 253   252   000    Old_age  Always  -           0
201 Soft_Read_Error_Rate    0x000a 253   252   000    Old_age  Always  -           4
202 TA_Increase_Count       0x000a 253   183   000    Old_age  Always  -           0
203 Run_Out_Cancel          0x000b 253   252   180    Pre-fail Always  -           3
204 Shock_Count_Write_Opern 0x000a 253   252   000    Old_age  Always  -           0
205 Shock_Rate_Write_Opern  0x000a 253   252   000    Old_age  Always  -           0
207 Spin_High_Current       0x002a 253   252   000    Old_age  Always  -           0
208 Spin_Buzz               0x002a 253   252   000    Old_age  Always  -           0
209 Offline_Seek_Performnce 0x0024 195   189   000    Old_age  Offline -           0
 99 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
100 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
101 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
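
One reason the overall self-assessment can still read PASSED with raw counts like these: smartctl reports an attribute as failed only when its normalized VALUE drops to or below THRESH, and every row in this table is still above its threshold. A minimal Python sketch of that comparison, using a few rows from this drive. Treating a threshold of 000 as "no threshold" is an assumption of this sketch, and the function names are made up for the example.

```python
# Sketch: compare normalized SMART values against their thresholds.
# Rows are (id, name, value, thresh), taken from the table for this drive.
ROWS = [
    (5, "Reallocated_Sector_Ct", 245, 63),
    (8, "Seek_Time_Performance", 250, 187),
    (11, "Calibration_Retry_Count", 253, 223),
    (196, "Reallocated_Event_Count", 183, 0),
    (198, "Offline_Uncorrectable", 231, 0),
]


def failed_now(rows):
    """Names whose normalized VALUE is at or below a non-zero THRESH."""
    return [name for _id, name, value, thresh in rows
            if thresh > 0 and value <= thresh]


def headroom(rows):
    """VALUE minus THRESH for each attribute that has a non-zero threshold."""
    return {name: value - thresh
            for _id, name, value, thresh in rows if thresh > 0}


print(failed_now(ROWS))
print(headroom(ROWS))
```

So a PASSED result says only that no threshold has been crossed; the raw counts in 196 through 198 still have to be read by hand.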
the other drive is not in use, and shows no errors...
smartctl version 5.33 [i586-mandriva-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: Maxtor 6Y200P0
Serial Number: Y66QQVSE
Firmware Version: YAR41BW0
User Capacity: 203,928,109,056 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Mon Oct 2 13:05:13 2006 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 362) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 88) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027 252   252   063    Pre-fail Always  -           4564
  4 Start_Stop_Count        0x0032 253   253   000    Old_age  Always  -           14
  5 Reallocated_Sector_Ct   0x0033 253   253   063    Pre-fail Always  -           0
  6 Read_Channel_Margin     0x0001 253   253   100    Pre-fail Offline -           0
  7 Seek_Error_Rate         0x000a 253   252   000    Old_age  Always  -           0
  8 Seek_Time_Performance   0x0027 252   249   187    Pre-fail Always  -           48471
  9 Power_On_Minutes        0x0032 237   237   000    Old_age  Always  -           348h+29m
 10 Spin_Retry_Count        0x002b 252   252   157    Pre-fail Always  -           0
 11 Calibration_Retry_Count 0x002b 253   252   223    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 253   253   000    Old_age  Always  -           16
192 Power-Off_Retract_Count 0x0032 253   253   000    Old_age  Always  -           0
193 Load_Cycle_Count        0x0032 253   253   000    Old_age  Always  -           0
194 Temperature_Celsius     0x0032 253   253   000    Old_age  Always  -           40
195 Hardware_ECC_Recovered  0x000a 253   252   000    Old_age  Always  -           913
196 Reallocated_Event_Count 0x0008 253   253   000    Old_age  Offline -           0
197 Current_Pending_Sector  0x0008 253   253   000    Old_age  Offline -           0
198 Offline_Uncorrectable   0x0008 253   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0008 199   199   000    Old_age  Offline -           0
200 Multi_Zone_Error_Rate   0x000a 253   252   000    Old_age  Always  -           0
201 Soft_Read_Error_Rate    0x000a 253   252   000    Old_age  Always  -           6
202 TA_Increase_Count       0x000a 253   252   000    Old_age  Always  -           0
203 Run_Out_Cancel          0x000b 253   252   180    Pre-fail Always  -           0
204 Shock_Count_Write_Opern 0x000a 253   252   000    Old_age  Always  -           0
205 Shock_Rate_Write_Opern  0x000a 253   252   000    Old_age  Always  -           0
207 Spin_High_Current       0x002a 252   252   000    Old_age  Always  -           0
208 Spin_Buzz               0x002a 252   252   000    Old_age  Always  -           0
209 Offline_Seek_Performnce 0x0024 194   190   000    Old_age  Offline -           0
 99 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
100 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
101 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        5394             -
# 2  Extended offline  Completed without error  00%        5236             -
# 3  Extended offline  Completed without error  00%        5079             -
# 4  Extended offline  Completed without error  00%        4922             -
# 5  Extended offline  Completed without error  00%        4765             -
# 6  Extended offline  Completed without error  00%        4608             -
# 7  Extended offline  Completed without error  00%        4451             -
# 8  Extended offline  Completed without error  00%        4294             -
# 9  Extended offline  Completed without error  00%        4138             -
#10  Extended offline  Completed without error  00%        3981             -
#11  Extended offline  Completed without error  00%        3824             -
#12  Extended offline  Completed without error  00%        3667             -
#13  Extended offline  Completed without error  00%        3510             -
#14  Extended offline  Completed without error  00%        3353             -
#15  Extended offline  Completed without error  00%        3196             -
#16  Extended offline  Completed without error  00%        3039             -
#17  Extended offline  Completed without error  00%        2882             -
#18  Extended offline  Completed without error  00%        2726             -
#19  Extended offline  Completed without error  00%        2569             -
#20  Extended offline  Completed without error  00%        2412             -
#21  Extended offline  Completed without error  00%        2255             -
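
The LifeTime(hours) column also shows how regularly the extended tests ran. A small Python sketch of the spacing between consecutive entries, with the values copied from the log above (newest first); the function name is made up for the example:

```python
# Sketch: spacing between consecutive extended self-tests, in power-on hours.
LIFETIME_HOURS = [
    5394, 5236, 5079, 4922, 4765, 4608, 4451, 4294, 4138, 3981, 3824,
    3667, 3510, 3353, 3196, 3039, 2882, 2726, 2569, 2412, 2255,
]


def intervals(hours):
    """Differences between each entry and the next older one."""
    return [newer - older for newer, older in zip(hours, hours[1:])]


gaps = intervals(LIFETIME_HOURS)
print(min(gaps), max(gaps), sum(gaps) / len(gaps))
```

The gaps all fall between 156 and 158 power-on hours, which is what a fixed recurring test schedule would produce. What that schedule actually was is not stated in the log.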
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
cerise at armory.com wrote:
> Given that one of my drives had a similar problem not so long ago and it
> managed around 2 years of hard use as my media server, I tend towards agreement.
>
> I haven't had any luck with SMART though. I've never had an inkling that a
> problem was on the way using smartd, nor have I ever identified a problem from
> querying SMART values after the fact. Your mileage may vary.
>
> It is, in any case, why running LVM with striping or an actual RAID is a
> wise thing to do.
>
> -Phil/CERisE
>
> On Mon, Oct 02, 2006 at 12:46:01PM -0700, Thomas Leavitt wrote:
>
>> In my experience, your typical cheap ass Fry's ATA hard drive has a life
>> expectancy of well under two years under any type of serious load.
>> Backups and redundancy are essential in that scenario, as is monitoring
>> how close to kicking the bucket the damn thing is with smartd.
>>
>> Thomas
>>
>> Peter Belew wrote:
>>
>>> Hi -
>>>
>>> On 10/2/06, Eric Carter <Ecnassianer at greenstorm.net> wrote:
>>>
>>>
>>>> So my recent install of Ubuntu ended rather catastrophically. I came out of
>>>> my room early in the morning to the sound of a very tiny sword coming out of
>>>> a very tiny sheath over and over again. It was coming from my computer...
>>>> from the drive I installed Ubuntu on.... my computer wouldn't wake up... I
>>>> powered it off and went to work.
>>>>
>>>> When I got enough time to deal with the situation with the care and
>>>> compassion that I desire my computers to be dealt with I found that the
>>>> drive boots up and seems fine, but occasionally makes evil noises. The drive
>>>> is on my desk, and I've had no indication of data failure. I'd like to
>>>> salvage all the work I put into getting Ubuntu set up. A new (HUGE!) drive
>>>> is on its way. What's the easiest way to salvage my Ubuntu install when it
>>>> arrives?
>>>>
>>>>
>>> I would get ahold of a USB drive or another networked computer and
>>> backup your home directory (-ies) and other data (including any
>>> config files that were a lot of work to set up, and databases) ASAP.
>>>
>>> Then reinstall from scratch on the new drive, and restore your data.
>>>
>>> I typically use a separate /home partition, on the theory that this
>>> facilitates the future installation of a different distro, etc.
>>>
>>> Peter
>>>
>>>
>>>> Partition table looks something like this:
>>>> ~1 Gig Swap
>>>> ~29 Gigs EXT 3 Maps to: /
>>>>
>>>> Thanks in advance,
>>>> EC
>>>>
>>>>
>>>> _______________________________________________
>>>> Sluglug mailing list
>>>> Sluglug at sluglug.ucsc.edu
>>>> http://sluglug.ucsc.edu/cgi-bin/mailman/listinfo/sluglug