Change failed hdd in submirror in Solaris Volume Manager (SVM)
I had an HDD fail under SVM. Nothing special: everyone has been in this situation, when output like this on a prod box scares the hell out of you:
# iostat -En
...
c1t0d0 Soft Errors: 607 Hard Errors: 122 Transport Errors: 130
Vendor: SEAGATE Product: ST373207LSUN72G Revision: 065A Serial No: 054432ZX97
Size: 73.40GB
Media Error: 99 Device Not Ready: 0 No Device: 11 Recoverable: 607
Illegal Request: 0 Predictive Failure Analysis: 9
c1t1d0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 5
Vendor: FUJITSU Product: MAT3073N SUN72G Revision: 0602 Serial No: 0517B04DKE
Size: 73.40GB
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t2d0 Soft Errors: 60 Hard Errors: 119 Transport Errors: 78
Vendor: SEAGATE Product: ST373207LSUN72G Revision: 065A Serial No: 064135KFJE
Size: 73.40GB
...
So what do all these "errors" actually mean?
Device Not Ready: The drive returned the sense key 0x2 (Not Ready).
Media Error: The drive returned the sense key 0x3 (Medium Error).
No Device: The drive returned the sense key 0x6 (Unit Attention) or, in the case of a removable device, it must have happened multiple times.
Hard Errors: All of the above conditions are counted as hard errors, along with the SCSI sense key 0x4 (Hardware Error).
Illegal Request: The drive returned the sense key 0x5 (Illegal Request). This is also counted as a soft error, so that kstat is incremented as well.
Recoverable: The drive returned the sense key 0x1 (Recovered Error) to indicate that the last command completed successfully but the drive had to take some recovery action. This is also counted as a soft error, so that kstat is incremented as well.
Predictive Failure Analysis: The drive returned the sense key 0x6 (Unit Attention) with an ASC (Additional Sense Code) of 0x5D, indicating that the drive has exceeded its predictive failure threshold. This is treated as a soft error.
Transport Error: This error occurs for a number of reasons, all related to being unable to transport the command. The command could have timed out or been reset, or the host bus adapter could not put the command onto the SCSI bus. This is neither a soft nor a hard error.
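To keep the mapping above straight, here is a minimal sketch of it as a shell lookup. The function name `sense_to_counters` is made up for illustration and is not part of Solaris; it only restates the list above, with the ASC passed as an optional second argument.

```shell
# Map a SCSI sense key to the iostat -En counters described in the list above.
# Hypothetical helper: it restates the list, it does not query any device.
# Unit Attention (0x6) counts as Predictive Failure Analysis only when the
# ASC (optional second argument) is 0x5D; otherwise it is "No Device".
sense_to_counters() {
    case "$1" in
        0x1) echo "Recoverable, Soft Errors" ;;
        0x2) echo "Device Not Ready, Hard Errors" ;;
        0x3) echo "Media Error, Hard Errors" ;;
        0x4) echo "Hard Errors" ;;
        0x5) echo "Illegal Request, Soft Errors" ;;
        0x6) if [ "${2:-}" = "0x5D" ]; then
                 echo "Predictive Failure Analysis, Soft Errors"
             else
                 echo "No Device, Hard Errors"
             fi ;;
        *)   echo "not described in this list" ;;
    esac
}
```

For example, `sense_to_counters 0x6 0x5D` names the predictive failure counter, while `sense_to_counters 0x6` alone names "No Device".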
Since I had a lot of hard and transport errors, and my company has Platinum support from Sun, I ordered replacement parts (2 HDDs) to swap out the old ones that were throwing errors.
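If you'd rather not eyeball the counters, the following is a rough sketch that picks out disks whose hard error count is above a threshold. The name `flag_bad_disks` and the threshold argument are my own assumptions; the function only looks at the summary line of each device (the one containing "Soft Errors:") and expects `iostat -En` style text on stdin.

```shell
# Print devices whose "Hard Errors" count exceeds a threshold (default 0).
# Hypothetical helper: reads iostat -En style text on stdin and examines
# only the per-device summary line that contains "Soft Errors:".
flag_bad_disks() {
    awk -v limit="${1:-0}" '
        /Soft Errors:/ {
            hard = ""; transport = ""
            for (i = 1; i < NF; i++) {
                if ($i == "Hard" && $(i + 1) == "Errors:")      hard = $(i + 2)
                if ($i == "Transport" && $(i + 1) == "Errors:") transport = $(i + 2)
            }
            if (hard + 0 > limit + 0)
                printf "%s hard=%d transport=%d\n", $1, hard, transport
        }'
}
```

Usage would look like `iostat -En | flag_bad_disks 100`. Which threshold is worth acting on is a judgment call, not something the tool defines.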
These are my steps for replacing a faulted HDD under SVM with RAID-1 (mirroring). In this example, the faulty HDD is c1t1d0, the submirror living on it is d100, and the mirror using this submirror is d1:
1. Identify the submirrors that reside on the faulty disk drives
# metastat
<look for the slices that are in use on the faulty drives and note them>
2. Detach these submirrors from their mirrors
# metadetach d1 d100
<example: detaching submirror d100, which lives on the faulty hdd, from mirror d1>
3. Clear the SVM configuration of the detached submirror
# metaclear d100
4. Check that the submirror is gone (this should print nothing):
# metastat -p | grep d100
5. Look for SVM state database replicas on the disk that will be replaced
# metadb | grep c1t1d0
6. If you find any replicas on this disk, delete them and check that they are gone
# metadb -d /dev/dsk/c1t1d0s7 && metadb | grep c1t1d0
7. If the disk holds any mounted filesystems that are not controlled by SVM, unmount them
8. Find & unconfigure the faulty hdd, replace the hdd - checking dmesg - and configure the new one
# cfgadm -al
# cfgadm -c unconfigure c1::dsk/c1t1d0
# dmesg
# cfgadm -c configure c1::dsk/c1t1d0
# cfgadm -al
<this assumes you put the new hdd into the same slot that the faulty one occupied>
9. Copy the VTOC onto the new hdd from the disk holding the other submirror (or save the prtvtoc output before replacing the HDD and use that)
# prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
10. Update the DevID
# metadevadm -u c1t1d0
11. Recreate the replica DB, if necessary
# metadb -a -c 2 c1t1d0s1
12. Recreate the submirror and attach it to the corresponding mirror
# metainit d100 1 1 c1t1d0s0
# metattach d1 d100
13. Check that everything is OK and the submirrors are resyncing
# metastat d1
...
# metastat d100
d100: Mirror
Submirror 0: d120
State: Resyncing
Submirror 1: d110
State: Okay
Resync in progress: 14 % done
...
14. Grab a beer!
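The steps above can be sketched as a dry-run helper that only prints the commands for review and runs nothing. The function name `plan_disk_swap` and its argument order are my own invention; slice s0 for the submirror and a single slice for the replicas are assumptions taken from the example, so adjust them to your layout.

```shell
# Print (do not run) the replacement commands for one submirror.
# Hypothetical helper. Arguments, all supplied by the caller:
#   mirror submirror failed_disk healthy_disk replica_slice
# Assumes the submirror lives on slice s0 of the failed disk, as in the example.
plan_disk_swap() {
    mirror=$1 sub=$2 bad=$3 good=$4 dbslice=$5
    ctrl=${bad%%t*}    # controller part of the disk name, e.g. c1t1d0 -> c1
    cat <<EOF
metadetach $mirror $sub
metaclear $sub
metadb -d ${bad}${dbslice}
cfgadm -c unconfigure ${ctrl}::dsk/${bad}
# ... physically replace the disk here ...
cfgadm -c configure ${ctrl}::dsk/${bad}
prtvtoc /dev/rdsk/${good}s2 | fmthard -s - /dev/rdsk/${bad}s2
metadevadm -u $bad
metadb -a -c 2 ${bad}${dbslice}
metainit $sub 1 1 ${bad}s0
metattach $mirror $sub
EOF
}
```

For the example in this post, `plan_disk_swap d1 d100 c1t1d0 c1t0d0 s7` prints the whole sequence. I would still run the commands one by one and check `metastat` and `metadb` in between, rather than piping the output into a shell.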
The only thing that keeps me wondering is how to reset the error counters in the "iostat -En" output. A reboot will do it, but I want to find another way...
Cheers!