FreeNAS 9.3 Disk Replacement

Symptoms

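Errors such as "Error: UNC at LBA = 0x0fffffff = 268435455" were being logged in large numbers, and reads had become extremely slow. This is now my third disk replacement; it feels like I end up swapping one about every six months.
The HDDs I use are Seagate ST3000DM001-9YN166 drives, which I chose because they are fast at 7200 rpm, but their failure rate may be on the high side. Or perhaps it is fairer to say that quality varies from unit to unit. The drive that failed was manufactured in China, but according to smartctl it had been running for about two years (17401 power-on hours), so maybe it is not that bad after all.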

# smartctl -a -q noserial /dev/ada1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-9YN166
Firmware Version: CC4B
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Oct  3 17:23:53 2016 JST

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 329) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   071   006    Pre-fail  Always       -       1131277
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       13
  5 Reallocated_Sector_Ct   0x0033   096   051   036    Pre-fail  Always       -       6112
  7 Seek_Error_Rate         0x000f   067   060   030    Pre-fail  Always       -       610669677384
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17401
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       36
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   070   070   099    Old_age   Always   FAILING_NOW 30
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       65535
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8 8 107
189 High_Fly_Writes         0x003a   043   043   000    Old_age   Always       -       57
190 Airflow_Temperature_Cel 0x0022   055   051   045    Old_age   Always       -       45 (Min/Max 40/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       72
194 Temperature_Celsius     0x0022   045   049   000    Old_age   Always       -       45 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   088   084   000    Old_age   Always       -       2088
198 Offline_Uncorrectable   0x0010   088   084   000    Old_age   Offline      -       2088
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       28
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       17362h+50m+08.655s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       124291172200740
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       274690123817205

SMART Error Log Version: 1
ATA Error Count: 12773 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 12773 occurred at disk power-on lifetime: 17401 hours (725 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 ff ff ff 4f 00  49d+10:03:54.637  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:54.637  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:54.637  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:54.637  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:54.637  READ FPDMA QUEUED

Error 12772 occurred at disk power-on lifetime: 17401 hours (725 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 ff ff ff 4f 00  49d+10:03:51.423  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:51.423  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:51.423  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:51.423  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:51.422  READ FPDMA QUEUED

Error 12771 occurred at disk power-on lifetime: 17401 hours (725 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 ff ff ff 4f 00  49d+10:03:48.148  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:48.147  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:48.147  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:48.147  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:48.147  READ FPDMA QUEUED

Error 12770 occurred at disk power-on lifetime: 17401 hours (725 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 ff ff ff 4f 00  49d+10:03:44.953  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:44.953  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:44.953  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:44.953  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:44.953  READ FPDMA QUEUED

Error 12769 occurred at disk power-on lifetime: 17401 hours (725 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 48 ff ff ff 4f 00  49d+10:03:40.498  READ FPDMA QUEUED
  60 00 d8 ff ff ff 4f 00  49d+10:03:40.498  READ FPDMA QUEUED
  61 00 80 ff ff ff 4f 00  49d+10:03:40.498  WRITE FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:40.497  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  49d+10:03:40.497  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      80%      3516         -
# 2  Extended offline    Completed: read failure       50%      3514         3203156864
# 3  Extended offline    Completed: read failure       50%      3511         3203156864
# 4  Extended offline    Completed: read failure       50%      3508         3203156864
# 5  Extended offline    Completed: read failure       50%      3505         3203156864
# 6  Extended offline    Completed: read failure       50%      3502         3203156864
# 7  Extended offline    Completed: read failure       50%      3499         3203156864
# 8  Extended offline    Completed: read failure       50%      3496         3203156864
# 9  Extended offline    Completed: read failure       50%      3493         3203156864
#10  Extended offline    Completed: read failure       50%      3490         3203156864
#11  Extended offline    Completed: read failure       50%      3487         3203156864
#12  Extended offline    Completed: read failure       50%      3484         3203156864
#13  Extended offline    Completed: read failure       50%      3481         3203156864
#14  Extended offline    Completed: read failure       50%      3478         3203156864
#15  Extended offline    Completed: read failure       50%      3475         3203156864
#16  Extended offline    Completed: read failure       50%      3472         3203156864
#17  Extended offline    Completed: read failure       50%      3469         3203156864
#18  Extended offline    Completed: read failure       50%      3466         3203156864
#19  Extended offline    Completed: read failure       50%      3463         3203156864
#20  Extended offline    Completed: read failure       50%      3460         3203156864
#21  Extended offline    Completed: read failure       50%      3457         3203156864

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
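
As a rough aid for reading an attribute table like the one above, here is a minimal Python sketch that picks out the rows that usually matter when deciding whether to replace a disk. The choice of watched IDs (5, 187, 197, 198) and the rule "non-zero raw value or non-empty WHEN_FAILED" are assumptions of this example, not a smartmontools rule, and the sample rows are copied from the output above.

```python
import re

# Attribute IDs this sketch treats as critical when their raw value is non-zero.
# This set is an assumption for illustration, not an official list.
WATCHED_IDS = {5, 187, 197, 198}

# One row of the "Vendor Specific SMART Attributes" table.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d{3})\s+(\d{3})\s+(\d{3})"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$"
)

SAMPLE = """
  5 Reallocated_Sector_Ct   0x0033   096   051   036    Pre-fail  Always       -       6112
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17401
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
184 End-to-End_Error        0x0032   070   070   099    Old_age   Always   FAILING_NOW 30
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       65535
197 Current_Pending_Sector  0x0012   088   084   000    Old_age   Always       -       2088
198 Offline_Uncorrectable   0x0010   088   084   000    Old_age   Offline      -       2088
"""


def parse_attributes(text):
    """Return one dict per line that matches the attribute table layout."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # headers, error-log lines, blank lines, etc.
        rows.append({
            "id": int(m.group(1)),
            "name": m.group(2),
            "value": int(m.group(3)),
            "thresh": int(m.group(5)),
            "when_failed": m.group(8),
            "raw": m.group(9),
        })
    return rows


def suspicious(rows):
    """Names of rows that are failing or are watched with a non-zero raw value."""
    flagged = []
    for r in rows:
        first = r["raw"].split()[0]
        nonzero = first.isdigit() and int(first) > 0
        if r["when_failed"] != "-" or (r["id"] in WATCHED_IDS and nonzero):
            flagged.append(r["name"])
    return flagged


for name in suspicious(parse_attributes(SAMPLE)):
    print(name)
```

With the sample rows, the sketch flags Reallocated_Sector_Ct, End-to-End_Error, Reported_Uncorrect, Current_Pending_Sector and Offline_Uncorrectable, which matches the impression that this drive should go even though the overall self-assessment says PASSED.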

Response

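The Alert icon turns red when an error occurs, but once you dismiss the alert it goes back to green and everything looks fine at a glance. The disk status also shows as online. Based on the smartctl output, however, I decided the disk should be replaced as soon as possible. The first priority is to identify exactly which disk is the faulty one. The pool is a RAID 10 of four disks, so there is not much time to waste. One advantage of Western Digital and Seagate drives is that a replacement of the same model is easy to get from Amazon and similar shops.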
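
One common way to pin down the physical disk is to match the serial number reported by smartctl against the label printed on the drive. The following is a small Python sketch of that idea, assuming you have collected the output of `smartctl -i` for each device (without `-q noserial`, which hides the serial). The device names and serial numbers in the sample are made up for illustration.

```python
import re


def serial_from_info(text):
    """Extract the serial number from `smartctl -i` output, or None if absent."""
    m = re.search(r"^Serial Number:\s*(\S+)", text, re.MULTILINE)
    return m.group(1) if m else None


def device_serials(info_by_device):
    """Map each device path to the serial number found in its info text."""
    return {dev: serial_from_info(text) for dev, text in info_by_device.items()}


# Hypothetical captured output; the serial numbers are placeholders.
SAMPLE = {
    "/dev/ada0": "Device Model:     ST3000DM001-9YN166\nSerial Number:    EXAMPLE0001\n",
    "/dev/ada1": "Device Model:     ST3000DM001-9YN166\nSerial Number:    EXAMPLE0002\n",
}

for dev, serial in sorted(device_serials(SAMPLE).items()):
    print(dev, serial)
```

Writing the device-to-serial mapping down before opening the case makes it harder to pull the wrong drive.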

FreeNAS 9.3 disk replacement #1

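Swap the disk without delay, and make sure you do not pull the wrong one. While you are in there, it is also worth checking the cables for damage and, if dust has built up, blowing it out with an air duster.
Under [Storage]-[Volumes], select the pool name at the top and click Volume Status. You will get a screen like the one below; select the new disk there and you are done.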

FreeNAS 9.3 disk replacement #2

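Finally, check the state with zpool status. The resilver is estimated to take 211 hours, but read performance already improved once the disk was swapped. For now I will keep using the system and see how it goes.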
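
To put the resilver estimate in perspective, here is a minimal Python sketch that pulls the "to go" figure out of zpool status text and converts it to days. It assumes the progress line has the form "NNNhMMm to go", as printed by ZFS of this era; the sample line is hypothetical apart from the 211-hour figure, and the sizes and rate in it are placeholders.

```python
import re


def resilver_eta_hours(status_text):
    """Return the remaining resilver time in hours, or None if not reported."""
    m = re.search(r"(\d+)h(\d+)m to go", status_text)
    if m is None:
        return None  # e.g. no resilver running, or no estimate available
    return int(m.group(1)) + int(m.group(2)) / 60.0


# Hypothetical progress line; only the 211 hours comes from the text above.
SAMPLE = "        1.00T scanned out of 5.00T at 5.00M/s, 211h0m to go"

hours = resilver_eta_hours(SAMPLE)
print("%.0f hours is about %.1f days" % (hours, hours / 24))
```

At roughly nine days, the pool runs with reduced redundancy for quite a while, so it is worth rechecking zpool status periodically until the resilver finishes.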

FreeNAS 9.3 disk replacement #3
