LINUX.ORG.RU

Posts by Livito

 

RAID-1 rebuild fails

Hello! I have a server with 4 disks. They are paired into two software RAID1 arrays: /dev/md0 (/dev/sda7 + /dev/sdb3) and /dev/md1 (/dev/sdc1 + /dev/sdd1). Both arrays belong to a single LVM volume group, which holds one logical volume with an XFS filesystem spanning the entire capacity.

Recently one of the disks (/dev/sdd, which held /dev/sdd1) failed and was no longer detected by the system. An identical disk was purchased and installed in the server.

I copied the partition table from the working disk to the new one with:

sfdisk -d /dev/sdc | sfdisk /dev/sdd
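Before adding the new partition to the array it may be worth confirming that the copy took effect. The following is only a sketch: `normalize_dump` and `same_layout` are hypothetical helper names, the /tmp paths are placeholders, and it assumes an MBR-style `sfdisk -d` dump in which each partition line contains `start=`.

```shell
# Keep only the partition lines and strip the device name,
# so that dumps taken from two different disks can be compared.
normalize_dump() {
  grep 'start=' | sed -E 's#^/dev/[a-z]+##'
}

# Succeeds when both files (each holding `sfdisk -d` output)
# describe the same partition layout.
same_layout() {
  [ "$(normalize_dump < "$1")" = "$(normalize_dump < "$2")" ]
}

# Usage (as root; device names taken from the post):
#   sfdisk -d /dev/sdc > /tmp/sdc.dump
#   sfdisk -d /dev/sdd > /tmp/sdd.dump
#   same_layout /tmp/sdc.dump /tmp/sdd.dump && echo "partition layout matches"
```

On GPT disks the dump also carries per-partition UUIDs that legitimately differ, so this comparison would need adjusting there.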

I added the new disk to /dev/md1 with:

mdadm --manage /dev/md1 --add /dev/sdd1

After a reboot the rebuild starts, but it aborts at 2.3%, and /dev/sdd1 reverts to SPARE status.
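To see exactly where the rebuild stops, the progress line in /proc/mdstat can be polled. This is a minimal sketch, assuming the usual `recovery = N%` (or `resync = N%`) progress line; `rebuild_progress` is a hypothetical helper name.

```shell
# Print the percentage from an md recovery/resync progress line read on stdin.
# Prints nothing when no rebuild is in progress.
rebuild_progress() {
  sed -nE 's/.*(recovery|resync) *= *([0-9.]+)%.*/\2/p'
}

# Usage:
#   rebuild_progress < /proc/mdstat
```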

dmesg reports a read error on /dev/sdc1.

SMART output for the disk behind /dev/sdc1:

smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     MB2000EAMZF
Serial Number:    9WM0ETAB
Firmware Version: HPG1
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Not recognized. Minor revision code: 0x28
Local Time is:    Wed Oct 11 11:29:24 2017 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   070   056   044    Pre-fail  Always       -       25835348947
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       76
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       168
  7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       369643805
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       51821
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       78
180 Unknown_Attribute       0x003b   100   100   000    Pre-fail  Always       -       420267169
184 Unknown_Attribute       0x0032   100   100   003    Old_age   Always       -       0
187 Unknown_Attribute       0x0032   085   085   000    Old_age   Always       -       15
188 Unknown_Attribute       0x0032   100   097   000    Old_age   Always       -       26
189 Unknown_Attribute       0x003a   099   099   000    Old_age   Always       -       1
190 Unknown_Attribute       0x0022   061   055   045    Old_age   Always       -       690094119
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       78
194 Temperature_Celsius     0x0022   039   045   000    Old_age   Always       -       39 (Lifetime Min/Max 0/20)
195 Hardware_ECC_Recovered  0x001a   048   018   000    Old_age   Always       -       65545171
196 Reallocated_Event_Count 0x0033   096   096   036    Pre-fail  Always       -       168
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 174 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 174 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 65 70 20 05

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 01 70 20 45 00   8d+00:50:15.291  [RESERVED FOR SERIAL ATA]
  ef 10 02 00 00 00 a0 00   8d+00:50:15.291  SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 00   8d+00:50:15.290  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   8d+00:50:15.290  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00   8d+00:50:15.290  SET FEATURES [Reserved for Serial ATA]

Error 173 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 65 70 20 05

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 01 70 20 45 00   8d+00:50:12.635  [RESERVED FOR SERIAL ATA]
  ef 10 02 00 00 00 a0 00   8d+00:50:12.635  SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 00   8d+00:50:12.634  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   8d+00:50:12.634  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00   8d+00:50:12.633  SET FEATURES [Reserved for Serial ATA]

Error 172 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 65 70 20 05

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 01 70 20 45 00   8d+00:50:09.971  [RESERVED FOR SERIAL ATA]
  60 00 80 01 71 20 45 00   8d+00:50:09.970  [RESERVED FOR SERIAL ATA]
  ef 10 02 00 00 00 a0 00   8d+00:50:09.970  SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 00   8d+00:50:09.969  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   8d+00:50:09.969  SET FEATURES [Set transfer mode]

Error 171 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 65 70 20 05

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 01 71 20 45 00   8d+00:50:07.314  [RESERVED FOR SERIAL ATA]
  60 00 80 01 70 20 45 00   8d+00:50:07.314  [RESERVED FOR SERIAL ATA]
  ef 10 02 00 00 00 a0 00   8d+00:50:07.314  SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 00   8d+00:50:07.313  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   8d+00:50:07.313  SET FEATURES [Set transfer mode]

Error 170 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 65 70 20 05

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 01 70 20 45 00   8d+00:50:04.631  [RESERVED FOR SERIAL ATA]
  60 00 80 01 71 20 45 00   8d+00:50:04.630  [RESERVED FOR SERIAL ATA]
  60 00 80 81 71 20 45 00   8d+00:50:04.630  [RESERVED FOR SERIAL ATA]
  60 00 80 01 72 20 45 00   8d+00:50:04.630  [RESERVED FOR SERIAL ATA]
  60 00 80 01 7c 20 45 00   8d+00:50:04.629  [RESERVED FOR SERIAL ATA]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
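In a table this long, the attributes that usually matter for a failing read are easy to miss. The sketch below filters a smartctl attribute table for IDs 5, 187, 197 and 198 with a non-zero raw value. `flag_smart` is a hypothetical helper name, and the choice of IDs is an assumption: smartctl 5.36 labels 187 as Unknown_Attribute, while newer smartmontools databases commonly name it Reported_Uncorrect.

```shell
# Read a smartctl attribute table on stdin and print ID, name and raw value
# for sector-health attributes whose raw value is greater than zero.
flag_smart() {
  awk '$1 ~ /^(5|187|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Usage (as root):
#   smartctl -A /dev/sdc | flag_smart
```

Applied to the table above, this prints attributes 5 (raw 168), 187 (raw 15) and 197 (raw 1). So although the overall self-assessment says PASSED, the disk has already reallocated sectors and has one sector pending.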

For completeness, here is the state of /dev/md1:

[root@e5 mapper]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Thu Oct 27 14:33:11 2011
     Raid Level : raid1
     Array Size : 1863013184 (1776.71 GiB 1907.73 GB)
  Used Dev Size : 1863013184 (1776.71 GiB 1907.73 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Oct 11 11:38:49 2017
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : 2bafe702:89d5e11f:da4519c3:ba339ffc
         Events : 0.20868067

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      spare rebuilding   /dev/sdd1
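The five logged errors all show the same registers (SN CL CH DH = 65 70 20 05), which can be compared with the point where the rebuild stops. The following is a rough sketch under two assumptions: that these registers encode a 28-bit LBA in the classic layout (the failing 0x60 commands are 48-bit NCQ reads, which this smartctl version does not decode fully), and that the start offset of the partition is negligible. `lba28` is a hypothetical helper name.

```shell
# Combine the ATA registers into a 28-bit LBA.
# Arguments: SN CL CH DH as hex digits without the 0x prefix.
lba28() {
  echo $(( ((0x$4 & 0x0f) << 24) | (0x$3 << 16) | (0x$2 << 8) | 0x$1 ))
}

lba=$(lba28 65 70 20 05)

# mdadm reports Array Size in KiB; convert to 512-byte sectors.
array_sectors=$(( 1863013184 * 2 ))

echo "LBA $lba, about $(( lba * 1000 / array_sectors )) per mille of the array"
```

This prints `LBA 86012005, about 23 per mille of the array`, roughly 2.3%, which is consistent with the rebuild aborting at 2.3% because of a single unreadable area on /dev/sdc.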

How can I bring the /dev/md1 array back to a working state?


Livito