[SlugLUG] When Bad Hard Drives Happen to Good Linuxes

Thomas Leavitt thomas at thomasleavitt.org
Mon Oct 2 13:02:14 PDT 2006


Hmm... SMART doesn't generally tell me when a drive is going to die, but
if I see problems, it is a signal to dump the drive... for instance,
here's the output on my server's current main drive, which I'm migrating
off of within a day or two...

I typically see problems here:

195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       15205
196 Reallocated_Event_Count 0x0008   183   183   000    Old_age   Offline      -       70
197 Current_Pending_Sector  0x0008   248   247   000    Old_age   Offline      -       51
198 Offline_Uncorrectable   0x0008   231   193   000    Old_age   Offline      -       22

prior to a disk failure... now, the error log below says nothing much
has happened recently, but my instincts to date indicate that this drive
is suspect, and has been for a while...
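A minimal sketch of how that check could be scripted, assuming the attribute rows have been captured unwrapped (one row per line, as smartctl prints them before a mail client re-wraps them). The function name, the sample constant, and the choice of watched IDs are illustrative assumptions, not part of smartmontools. Attribute 195 is left out of the watch list because the idle drive further down also reports a nonzero raw value for it.

```python
# Attribute IDs to watch; 196-198 are the ones called out above.
# The set itself is an assumption for illustration, adjust to taste.
WATCHED = {196, 197, 198}


def suspicious_attributes(smartctl_output, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes whose raw value is nonzero."""
    hits = {}
    for line in smartctl_output.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # this skips the header row and any other text.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9].strip()
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


# Rows copied from the failing drive's report, unwrapped.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       15205
196 Reallocated_Event_Count 0x0008   183   183   000    Old_age   Offline      -       70
197 Current_Pending_Sector  0x0008   248   247   000    Old_age   Offline      -       51
198 Offline_Uncorrectable   0x0008   231   193   000    Old_age   Offline      -       22
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for name, raw in suspicious_attributes(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

Any hit here is only a prompt to look closer and check backups, not a prediction of when the drive will fail.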



SMART Error Log Version: 1
Warning: ATA error count 1909 inconsistent with error log pointer 5

ATA Error Count: 1909 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
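
(Aside on the 49.710-day figure: it is what you get if the uptime is kept as a millisecond count in a 32-bit field. A rough sketch of that arithmetic follows; the 32-bit width is an assumption inferred from the number, and the helper names are made up for illustration.)

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def wrap_period_days(counter_bits=32):
    """Days until a millisecond counter of the given width rolls over."""
    return 2 ** counter_bits / MS_PER_DAY


def format_powered_up_time(ms):
    """Render milliseconds in the DDd+hh:mm:SS.sss style used in the log."""
    ms %= 2 ** 32  # assumed 32-bit wraparound
    days, rem = divmod(ms, MS_PER_DAY)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    prefix = f"{days}d+" if days else ""
    return f"{prefix}{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


if __name__ == "__main__":
    print(f"wraps after {wrap_period_days():.3f} days")
    print(format_powered_up_time(567_280))
```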

Error 1909 occurred at disk power-on lifetime: 12996 hours (541 days + 12 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 3b fa 2d 20 e0  Error: UNC 59 sectors at LBA = 0x00202dfa = 2108922

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 fa 2d 20 e0 08      00:09:27.280  READ DMA EXT
  25 00 40 fa 2d 20 e0 08      00:09:24.432  READ DMA EXT
  25 00 08 f2 2d 20 e0 08      00:09:24.432  READ DMA EXT
  25 00 40 fa 2d 1c e0 08      00:09:24.432  READ DMA EXT
  25 00 08 f2 2d 1c e0 08      00:09:24.432  READ DMA EXT

Error 1908 occurred at disk power-on lifetime: 12996 hours (541 days + 12 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 3b fa 2d 20 e0  Error: AMNF 59 sectors at LBA = 0x00202dfa = 2108922

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 fa 2d 20 e0 08      00:09:24.432  READ DMA EXT
  25 00 08 f2 2d 20 e0 08      00:09:24.432  READ DMA EXT
  25 00 40 fa 2d 1c e0 08      00:09:24.432  READ DMA EXT
  25 00 08 f2 2d 1c e0 08      00:09:24.432  READ DMA EXT
  25 00 40 fa 2d 18 e0 08      00:09:24.432  READ DMA EXT

Error 1907 occurred at disk power-on lifetime: 11073 hours (461 days + 9 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 1b 35 74 e0  Error: UNC 6 sectors at LBA = 0x0074351b = 7615771

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 1b 35 74 e0 08   8d+18:00:20.848  READ DMA EXT
  25 00 08 6b 87 71 e0 08   8d+18:00:20.848  READ DMA EXT
  35 00 08 7b 0f 07 e0 08   8d+18:00:20.816  WRITE DMA EXT
  35 00 f8 83 0e 07 e0 08   8d+18:00:20.816  WRITE DMA EXT
  35 00 08 fb 3f 2e e0 08   8d+18:00:20.816  WRITE DMA EXT

Error 1906 occurred at disk power-on lifetime: 6426 hours (267 days + 18 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 02 f6 2d 84 e0  Error: UNC 2 sectors at LBA = 0x00842df6 = 8662518

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 02 f6 2d 84 e0 08      00:21:24.512  READ DMA EXT
  25 00 04 f4 2d 84 e0 08      00:21:23.296  READ DMA EXT
  25 00 06 f2 2d 84 e0 08      00:21:22.064  READ DMA EXT
  35 00 08 fa 2d 2c e0 08      00:21:22.064  WRITE DMA EXT
  25 00 02 f8 2d 84 e0 08      00:21:22.048  READ DMA EXT
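
(For anyone wondering where the "LBA = ..." numbers in the error entries come from: they can be rebuilt from the register bytes on the same line. This is a sketch under the assumption of 28-bit addressing, which matches what this log prints; the function names are illustrative, and a log with 48-bit addresses would need more register bytes than are shown here.)

```python
def parse_register_row(row):
    """Parse a row such as '40 51 3b fa 2d 20 e0' (ER ST SC SN CL CH DH, all hex)."""
    names = ("ER", "ST", "SC", "SN", "CL", "CH", "DH")
    values = (int(token, 16) for token in row.split()[:7])
    return dict(zip(names, values))


def lba28_from_registers(regs):
    """Combine the registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23,
    and the low nibble of DH holds bits 24-27.
    """
    return ((regs["DH"] & 0x0F) << 24) | (regs["CH"] << 16) | (regs["CL"] << 8) | regs["SN"]


if __name__ == "__main__":
    regs = parse_register_row("40 51 3b fa 2d 20 e0")  # Error 1909 above
    lba = lba28_from_registers(regs)
    print(f"{regs['SC']} sectors at LBA = 0x{lba:08x} = {lba}")
```

Feeding it the register rows from errors 1907 and 1906 gives the other two LBAs listed above, so the same approach works for picking the failing addresses out of the log for a later selective self-test.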


smartctl version 5.33 [i586-mandriva-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     Maxtor 6Y250P0
Serial Number:    Y64B29GE
Firmware Version: YAR41BW0
User Capacity:    251,000,193,024 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Mon Oct  2 13:00:07 2006 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 114) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 ( 363) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 106) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   181   181   063    Pre-fail  Always       -       10437
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   245   244   063    Pre-fail  Always       -       83
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   243   187    Pre-fail  Always       -       42539
  9 Power_On_Minutes        0x0032   209   209   000    Old_age   Always       -       221h+29m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       255
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       43
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       15205
196 Reallocated_Event_Count 0x0008   183   183   000    Old_age   Offline      -       70
197 Current_Pending_Sector  0x0008   248   247   000    Old_age   Offline      -       51
198 Offline_Uncorrectable   0x0008   231   193   000    Old_age   Offline      -       22
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       4
202 TA_Increase_Count       0x000a   253   183   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       3
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   195   189   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

the other drive is not in use, and shows no errors...

smartctl version 5.33 [i586-mandriva-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     Maxtor 6Y200P0
Serial Number:    Y66QQVSE
Firmware Version: YAR41BW0
User Capacity:    203,928,109,056 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Mon Oct  2 13:05:13 2006 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 362) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  88) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   252   252   063    Pre-fail  Always       -       4564
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   252   249   187    Pre-fail  Always       -       48471
  9 Power_On_Minutes        0x0032   237   237   000    Old_age   Always       -       348h+29m
 10 Spin_Retry_Count        0x002b   252   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       913
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       6
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   252   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   252   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   194   190   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5394         -
# 2  Extended offline    Completed without error       00%      5236         -
# 3  Extended offline    Completed without error       00%      5079         -
# 4  Extended offline    Completed without error       00%      4922         -
# 5  Extended offline    Completed without error       00%      4765         -
# 6  Extended offline    Completed without error       00%      4608         -
# 7  Extended offline    Completed without error       00%      4451         -
# 8  Extended offline    Completed without error       00%      4294         -
# 9  Extended offline    Completed without error       00%      4138         -
#10  Extended offline    Completed without error       00%      3981         -
#11  Extended offline    Completed without error       00%      3824         -
#12  Extended offline    Completed without error       00%      3667         -
#13  Extended offline    Completed without error       00%      3510         -
#14  Extended offline    Completed without error       00%      3353         -
#15  Extended offline    Completed without error       00%      3196         -
#16  Extended offline    Completed without error       00%      3039         -
#17  Extended offline    Completed without error       00%      2882         -
#18  Extended offline    Completed without error       00%      2726         -
#19  Extended offline    Completed without error       00%      2569         -
#20  Extended offline    Completed without error       00%      2412         -
#21  Extended offline    Completed without error       00%      2255         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



cerise at armory.com wrote:
> Given that one of my drives had a similar problem not so long ago and it 
> managed around 2 years of hard use as my media server, I tend towards agreement.
>
> I haven't had any luck with SMART though.  I've never had an inkling that a 
> problem was on the way using smartd, nor have I ever identified a problem from
> querying SMART values after the fact.  Your mileage may vary.
>
> It is, in any case, why running LVM with striping or an actual RAID is a
> wise thing to do.
>
> -Phil/CERisE
>
> On Mon, Oct 02, 2006 at 12:46:01PM -0700, Thomas Leavitt wrote:
>   
>> In my experience, your typical cheap ass Fry's ATA hard drive has a life
>> expectancy of well under two years under any type of serious load.
>> Backups and redundancy are essential in that scenario, as is monitoring
>> how close to kicking the bucket the damn thing is with smartd.
>>
>> Thomas
>>
>> Peter Belew wrote:
>>     
>>> Hi -
>>>
>>> On 10/2/06, Eric Carter <Ecnassianer at greenstorm.net> wrote:
>>>   
>>>       
>>>> So my recent install of Ubuntu ended rather catastrophically. I came out of
>>>> my room early in the morning to the sound of a very tiny sword coming out of
>>>> a very tiny sheath over and over again. It was coming from my computer...
>>>> from the drive I installed Ubuntu on.... my computer wouldn't wake up... I
>>>> powered it off and went to work.
>>>>
>>>> When I got enough time to deal with the situation with the care and
>>>> compassion that I desire my computers to be dealt with I found that the
>>>> drive boots up and seems fine, but occasionally makes evil noises. The drive
>>>> is on my desk, and I've had no indication of data failure. I'd like to
>>>> salvage all the work I put into getting Ubuntu set up. A new (HUGE!) drive
>>>> is on its way. What's the easiest way to salvage my Ubuntu install when it
>>>> arrives?
>>>>     
>>>>         
>>> I would get ahold of a USB drive or another networked computer and
>>> back up your home directory (-ies) and other data (including any
>>> config files that were a lot of work to set up, and databases) ASAP.
>>>
>>> Then reinstall from scratch on the new drive, and restore your data.
>>>
>>> I typically use a separate /home partition, on the theory that this
>>> facilitates the future installation of a different distro, etc.
>>>
>>> Peter
>>>   
>>>       
>>>> Partition table looks something like this:
>>>> ~1 Gig Swap
>>>> ~29 Gigs EXT 3 Maps to: /
>>>>
>>>> Thanks in advance,
>>>> EC
>>>>
>>>>
>>>> _______________________________________________
>>>> Sluglug mailing list
>>>> Sluglug at sluglug.ucsc.edu
>>>> http://sluglug.ucsc.edu/cgi-bin/mailman/listinfo/sluglug
>>>>
>>>>
>>>>
>>>>     
>>>>         


