LINUX.ORG.RU

Is the disk dying?

 



Hello.

Please take a look at the SMART data.

Is the disk dying?

root@fserver:~# smartctl -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-128-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Patriot P200 1TB
Serial Number:    AA000000000000000054
Firmware Version: S0424A0
User Capacity:    1,024,209,543,168 bytes [1.02 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec 15 14:36:04 2020 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       7594
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       79
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       20
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       22449
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       75
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       46
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       7000
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       45
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       4458908
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       1
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       260562
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       2933040
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       749952

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d0 01 00 4f c2 00 08      00:00:00.000  SMART READ DATA
  b0 d1 01 01 4f c2 00 08      00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  b0 da 00 00 4f c2 00 08      00:00:00.000  SMART RETURN STATUS
  b0 d5 01 00 4f c2 00 08      00:00:00.000  SMART READ LOG
  b0 d5 01 01 4f c2 00 08      00:00:00.000  SMART READ LOG

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7594         -
# 2  Extended offline    Completed without error       00%      7592         -
# 3  Extended offline    Interrupted (host reset)      90%      7592         -
# 4  Extended offline    Completed without error       00%      7522         -
# 5  Extended offline    Completed without error       00%      7493         -
# 6  Short offline       Completed without error       00%      7492         -

One more SSD:

https://bit.ly/2JZiguC


Have you tried reseating the cables?

I once had a component in a controller's supporting circuitry fail. I nearly died before I tracked down the fault. :-D My point is that it is not certain the disk itself is broken. Don't jump to conclusions. By the way, nothing in this output clearly shows that the disk is dying.

I haven't worked with firmware of this kind so far, but the firmware could be the cause.

anonymous ()
In reply to: comment by anonymous
root@fserver:~# smartctl -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-128-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Silicon Motion based SSDs
Device Model:     Patriot P200 1TB
Serial Number:    AA000000000000000054
Firmware Version: S0424A0
User Capacity:    1,024,209,543,168 bytes [1.02 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec 15 15:07:22 2020 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       7594
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       79
160 Uncorrectable_Error_Cnt 0x0032   100   100   050    Old_age   Always       -       0
161 Valid_Spare_Block_Cnt   0x0033   100   100   050    Pre-fail  Always       -       100
163 Initial_Bad_Block_Count 0x0032   100   100   050    Old_age   Always       -       20
164 Total_Erase_Count       0x0032   100   100   050    Old_age   Always       -       22449
165 Max_Erase_Count         0x0032   100   100   050    Old_age   Always       -       75
166 Min_Erase_Count         0x0032   100   100   050    Old_age   Always       -       5
167 Average_Erase_Count     0x0032   100   100   050    Old_age   Always       -       46
168 Max_Erase_Count_of_Spec 0x0032   100   100   050    Old_age   Always       -       7000
169 Remaining_Lifetime_Perc 0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Runtime_Invalid_Blk_Cnt 0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       45
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       4458908
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       1
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Host_Writes_32MiB       0x0030   100   100   050    Old_age   Offline      -       260562
242 Host_Reads_32MiB        0x0030   100   100   050    Old_age   Offline      -       2933040
245 TLC_Writes_32MiB        0x0032   100   100   050    Old_age   Always       -       749952

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d0 01 00 4f c2 00 08      00:00:00.000  SMART READ DATA
  b0 d1 01 01 4f c2 00 08      00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  b0 da 00 00 4f c2 00 08      00:00:00.000  SMART RETURN STATUS
  b0 d5 01 00 4f c2 00 08      00:00:00.000  SMART READ LOG
  b0 d5 01 01 4f c2 00 08      00:00:00.000  SMART READ LOG

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7594         -
# 2  Extended offline    Completed without error       00%      7592         -
# 3  Extended offline    Interrupted (host reset)      90%      7592         -
# 4  Extended offline    Completed without error       00%      7522         -
# 5  Extended offline    Completed without error       00%      7493         -
# 6  Short offline       Completed without error       00%      7492         -

Selective Self-tests/Logging not supported

INDIGO ()
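
Once the drive is in the smartctl database, attributes 241 and 245 are labelled in units of 32 MiB, so the raw values can be converted to something readable. Below is a minimal Python sketch that parses a few lines copied from the table above and does that conversion. The regular expression, the function names and the reading of the TLC/host ratio as a rough write-amplification figure are assumptions made for illustration, not something smartctl or the vendor states.

```python
import re

# A few attribute lines copied verbatim from the smartctl output above.
SAMPLE = """\
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       7594
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       4458908
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       1
241 Host_Writes_32MiB       0x0030   100   100   050    Old_age   Offline      -       260562
245 TLC_Writes_32MiB        0x0032   100   100   050    Old_age   Always       -       749952
"""

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(text):
    """Return {attribute_name: raw_value} for lines shaped like the table above."""
    values = {}
    for line in text.splitlines():
        match = ATTR_RE.match(line)
        if match:
            values[match.group(2)] = int(match.group(3))
    return values


def units_32mib_to_tib(units):
    """Convert a counter expressed in 32 MiB units to TiB."""
    return units * 32 / (1024 * 1024)


raw = parse_raw_values(SAMPLE)
host_tib = units_32mib_to_tib(raw["Host_Writes_32MiB"])
tlc_tib = units_32mib_to_tib(raw["TLC_Writes_32MiB"])
ratio = tlc_tib / host_tib  # assumption: a rough write-amplification estimate

print(f"host writes: {host_tib:.2f} TiB")
print(f"TLC writes:  {tlc_tib:.2f} TiB")
print(f"TLC/host ratio: {ratio:.2f}")
```

With these numbers the host has written roughly 8 TiB in 7594 power-on hours, which is small for a 1 TB drive, so wear alone does not look like an obvious explanation.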
In reply to: comment by anonymous

The symptoms are as follows.

In the machine:

3 SSDs assembled into a software RAID0 and mounted at /storage1

1 NVMe disk, mounted at /storage3

When large files (15 GB) are copied from /storage3 to /storage1, the checksums do not match.


root@fserver:~# md5sum /storage1/b85pr4_0.GHO
22166acda60db094e7538254b0ac4416  /storage1/b85pr4_0.GHO
root@fserver:~# md5sum /storage3/b85pr4_0.GHO
70fb16a9a21435cc4ed25ae9a56fa0e4  /storage3/b85pr4_0.GHO
INDIGO ()
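
A mismatching md5sum only says that the copies differ, not where. Below is a minimal Python sketch that reports the byte offset of the first difference between two files; the function name `first_mismatch` and the demo files are hypothetical and exist only for illustration. The standard `cmp` utility reports the first differing byte as well.

```python
import os
import tempfile


def first_mismatch(path_a, path_b, block_size=1024 * 1024):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if a != b:
                # Narrow the difference down to a single byte in this block.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other: they differ where
                # the shorter one ends.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without any difference
            offset += len(a)


# Demo on two small temporary files (hypothetical data, not the .GHO image).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bin")
    bad = os.path.join(tmp, "bad.bin")
    data = bytearray(os.urandom(2 * 1024 * 1024))
    with open(good, "wb") as f:
        f.write(data)
    data[700000] ^= 0xFF  # flip every bit of one byte to simulate corruption
    with open(bad, "wb") as f:
        f.write(data)
    demo_offset = first_mismatch(good, bad)
    same_offset = first_mismatch(good, good)

print(demo_offset)
```

On the real machine the two arguments would be /storage3/b85pr4_0.GHO and /storage1/b85pr4_0.GHO. Running it after several separate copies shows whether the corruption always lands at the same offset or moves around.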
In reply to: comment by anonymous

The machine has been running for a year.

/storage1 xfs RAID0

/storage3 xfs

root@fserver:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.8G     0  7.8G   0% /dev
tmpfs           1.6G  2.4M  1.6G   1% /run
/dev/sda1       220G   44G  165G  22% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1  477G   14G  463G   3% /storage3
/dev/md0        2.8T   17G  2.8T   1% /storage1
tmpfs           1.6G     0  1.6G   0% /run/user/1000
root@fserver:/storage1/storage1# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Tue Nov 19 16:26:29 2019
        Raid Level : raid0
        Array Size : 3000213504 (2861.23 GiB 3072.22 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Tue Nov 19 16:26:29 2019
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              Name : fserver:0  (local to host fserver)
              UUID : a0efb3f0:f709ff4e:db80c3ac:3a35cc8c
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
INDIGO ()
Last edited by: INDIGO (1 edit in total)
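
With a 512K chunk size and three members, consecutive chunks of /dev/md0 are spread across the disks in turn. Below is a minimal Python sketch of that mapping, assuming all three members are the same size (so the array has a single striping zone) and the default layout. The function name is hypothetical, the input is an offset within /dev/md0 rather than within a file, and the returned member offset is relative to the start of the member's data area, not to the start of the partition.

```python
def raid0_member_for_offset(array_offset, chunk_size=512 * 1024, n_devices=3):
    """Map a byte offset in the array to (member index, offset in member data)."""
    chunk_index = array_offset // chunk_size
    member = chunk_index % n_devices       # chunks rotate over the members
    stripe = chunk_index // n_devices      # full stripes that came before
    member_offset = stripe * chunk_size + array_offset % chunk_size
    return member, member_offset


# Member index 0 corresponds to RaidDevice 0 (/dev/sdb1) in the output above.
for offset in (0, 512 * 1024, 2 * 512 * 1024, 3 * 512 * 1024 + 5):
    print(offset, raid0_member_for_offset(offset))
```

Translating a file offset into an array offset additionally needs the file's extent map from the filesystem, so this is only one step toward working out which disk holds a damaged region.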
In reply to: comment by pinus_nigra

I checked the memory first. It was not stable at its rated frequency, so I replaced it.

I found what the problem is...

I took the RAID0 apart and tested copying the file to each disk separately...

The copy errors showed up on /dev/sdb1...

Looks like that's it...

But the SMART data was fine! How come?

INDIGO ()
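
The per-disk check described above can be repeated without a reference file. Below is a minimal Python sketch that writes random data to a temporary file in a given directory, reads it back and compares SHA-256 digests. The function name and the sizes are assumptions; the read-back may also be served from the page cache, so a meaningful test needs a file larger than RAM or a dropped cache before reading.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, size_bytes=64 * 1024 * 1024,
                     block_size=1024 * 1024):
    """Write random data under `directory`, read it back, compare digests."""
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                block = os.urandom(min(block_size, remaining))
                written.update(block)
                f.write(block)
                remaining -= len(block)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(block_size), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)


# Demo against the system temp directory with a small file.
ok = write_and_verify(tempfile.gettempdir(), size_bytes=4 * 1024 * 1024)
print("match" if ok else "MISMATCH")
```

For the case in this thread, `directory` would be the mount point of a filesystem created on a single member disk (for example a hypothetical /mnt/sdb1), with the test run once per disk.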
In reply to: comment by INDIGO

SMART is, roughly speaking, a program for collecting statistics and sensor readings, something like firmware. The sensors can be in states that the firmware does not record, or the sensor readings can be misinterpreted. SMART is meant as a preliminary diagnostic tool.

Have you ever worked on printers? There is a nasty case, especially with modern printers, where the printer seems to be faulty and ought to go in for repair, but you flash fresh firmware and it works. Why? The cause of the fault is slight, micron-level wear and play in the moving parts. The sensors end up in a blind spot and error signals go to the processor. New firmware fixes this, or compensates by lowering the precision with which the sensor input is processed. One guy even modified the firmware himself, because the new firmware cost a lot of money (printing and publishing equipment) and the modification turned out to be less of a hassle than buying a subscription to the manufacturer's software. Nobody will bother with this in the consumer segment: the effort costs more than a new device. But what if the printer fleet is over 100500 units? Then buying new ones becomes a huge expense item. The topic caught my interest and I am putting together an application for it. But I digress. My point is that SMART is not the final authority. So maybe you should update the firmware?

anonymous ()