Sorry for the long narrative, but I'm thoroughly confused.
I'm using FreeNAS-8.0.4-RELEASE-p2-x64 (11367) on a box with 5x3TB SATA disks configured as a raidz volume.
A few days ago, the console gave me this alert:
CRITICAL: The volume raid-5x3 (ZFS) status is DEGRADED
zpool status
gave:
pool: raid-5x3
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver completed after 3h25m with 7607009 errors on Sun Aug 12 06:26:44 2012
config:
NAME STATE READ WRITE CKSUM
raid-5x3 DEGRADED 0 0 7.29M
raidz1 DEGRADED 0 0 14.7M
ada0p2 ONLINE 0 0 0
10739480653363274060 FAULTED 0 0 0 was /dev/ada1p2
ada2p2 ONLINE 0 0 0
ada3p2 ONLINE 0 0 3 254M resilvered
ada1p2 ONLINE 0 0 0
errors: 7607009 data errors, use '-v' for a list
I did a zpool status -v
and got:
Permanent errors have been detected in the following files:
and it listed 2,660 files (out of 50,000 or so)
plus things like:
raid-5x3/alpha:<0x0>
raid-5x3/alpha:<0xf5ec>
raid-5x3/alpha:<0xf5ea>
We turned the server off, put in a new drive, in addition to the five already in there.
Went to console and view disks
, it just said "loading" forever *couldn't get to the "Replace" option!
Then we got:
zpool status -v
pool: raid-5x3
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
raid-5x3 DEGRADED 0 0 0
raidz1 DEGRADED 0 0 0
ada0p2 ONLINE 0 0 0
10739480653363274060 UNAVAIL 0 0 0 was /dev/ada1p2
ada2p2 ONLINE 0 0 0
ada3p2 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
raid-5x3/alpha:<0x0>
/mnt/raid-5x3/alpha/staff/Sound FX jw/Sound FX - scary horror/11 DR-EerieAct3-Waterphone..aif
/mnt/raid-5x3/alpha/staff/Wheelhouse Shoots/ROCKY_THE_MUSICAL/ SHOOTS/WESTPORT/Cannon-CARD-B/CONTENTS/CLIPS001/AA0876/AA087601.SIF
... then 2,860 files and "raid-5x3/alpha:<....>" entries ...
camcontrol devlist
:
<ST3000DM001-9YN166 CC4C> at scbus4 target 0 lun 0 (ada0,pass0)
<WDC WD30EZRX-00MMMB0 80.00A80> at scbus4 target 1 lun 0 (aprobe1,pass6,ada4)
<WDC WD30EZRX-00MMMB0 80.00A80> at scbus5 target 0 lun 0 (ada1,pass1)
<ST3000DM001-9YN166 CC4C> at scbus5 target 1 lun 0 (ada2,pass2)
<ASUS DRW-24B1ST a 1.04> at scbus6 target 0 lun 0 (cd0,pass3)
<Hitachi HDS5C3030ALA630 MEAOA580> at scbus7 target 0 lun 0 (ada3,pass4)
< USB Flash Memory 1.00> at scbus8 target 0 lun 0 (da0,pass5)
gpart show
=> 63 7831467 da0 MBR (3.7G)
63 1930257 1 freebsd [active] (943M)
1930320 63 - free - (32K)
1930383 1930257 2 freebsd (943M)
3860640 3024 3 freebsd (1.5M)
3863664 41328 4 freebsd (20M)
3904992 3926538 - free - (1.9G)
=> 0 1930257 da0s1 BSD (943M)
0 16 - free - (8.0K)
16 1930241 1 !0 (943M)
=> 34 5860533101 ada0 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada1 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada2 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada3 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada4 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
glabel status
Name Status Components
ufs/FreeNASs3 N/A da0s3
ufs/FreeNASs4 N/A da0s4
ufs/FreeNASs1a N/A da0s1a
gptid/446dd91d-8f15-11e1-a14c-f46d049aaeca N/A ada4p1
gptid/447999cb-8f15-11e1-a14c-f46d049aaeca N/A ada4p2
Seemed the new drive wasn't connected properly?
Re-attached it and rebooted.
Now console showed green light alert.
But when I went to "View All Volumes", it just said "Loading..."
Then:
glabel status
Name Status Components
ufs/FreeNASs3 N/A da0s3
ufs/FreeNASs4 N/A da0s4
ufs/FreeNASs1a N/A da0s1a
camcontrol devlist: Code: at scbus0 target 0 lun 0 (ada0,pass0) at scbus4 target 0 lun 0 (ada1,pass1) at scbus4 target 1 lun 0 (ada2,pass2) at scbus5 target 0 lun 0 (ada3,pass3) at scbus5 target 1 lun 0 (ada4,pass4) at scbus6 target 0 lun 0 (cd0,pass5) at scbus7 target 0 lun 0 (ada5,pass6) < USB Flash Memory 1.00> at scbus8 target 0 lun 0 (da0,pass7)
gpart show
=> 63 7831467 da0 MBR (3.7G)
63 1930257 1 freebsd [active] (943M)
1930320 63 - free - (32K)
1930383 1930257 2 freebsd (943M)
3860640 3024 3 freebsd (1.5M)
3863664 41328 4 freebsd (20M)
3904992 3926538 - free - (1.9G)
=> 0 1930257 da0s1 BSD (943M)
0 16 - free - (8.0K)
16 1930241 1 !0 (943M)
=> 34 5860533101 ada1 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada2 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada3 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada4 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
=> 34 5860533101 ada5 GPT (2.7T)
34 94 - free - (47K)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)
zpool status
:
pool: raid-5x3
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
raid-5x3 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
ada2p2 ONLINE 0 0 2
ada4p2 ONLINE 0 0 0
ada5p2 ONLINE 0 0 0
ada3p2 ONLINE 0 0 0
errors: 7607009 data errors, use '-v' for a list
At this point, someone on the FreeNAS forums said, "You're screwed, at some point you had 2 disks fail, bye bye data."
Is this true?
I clicked the 'scrub' button ... zpool status showed "resilver in progress .... 900h to go" ... which is like a month... and which kept going up to 30,000hrs...
Cut to: today, we rechecked all the connections on all the drives.
Then it started resilvering again, but much faster.
Several of the files – which were previously reported as corrupt – I randomly checked, and they now "seem" to be OK. (Meaning I was able to copy them and play them – most of our data is video files.)
What I'd like to do is COPY everything for which we do not have a backup, and which is not corrupt, to another machine, and then upgrade this one to RAIDZ2.
I'm thinking maybe what happened is that 2 drives became dislodged. I think the hotswap bay we have is poor quality.
But, then again, they DID appear connected, just faulted ... I don't know.
The resilver completed, in 3.5 hours.
Now zpool status says:
pool: raid-5x3
state: ONLINE
scrub: resilver completed after 3h31m with 0 errors on Fri Aug 17 21:46:12 2012
config:
NAME STATE READ WRITE CKSUM
raid-5x3 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
ada2p2 ONLINE 0 0 0 236G resilvered
ada4p2 ONLINE 0 0 0
ada5p2 ONLINE 0 0 0 252G resilvered
ada3p2 ONLINE 0 0 0
errors: No known data errors
Does this mean the data is recovered?? "No known errors" sounds promising!
I've now initiated a scrub. (8 hours to go.)
We don't have a backup for ALL the data ... so we need to figure out which of those files are corrupt, and which are usable.
Did a drive fail? If so, which one? Or did it just come loose?
Do I need to replace one? Two?
Is any of our data safe? If so, which files?