Welcome to Nik Maslov's blog

Change failed hdd in submirror in Solaris Volume Manager (SVM)

Permalink 11/03/09 12:29, by Nik Maslov, Categories: Background

One of my hdds failed under SVM. Nothing special - everyone has been in this situation, where output like this on a prod box scares you as hell:

# iostat -En
...
c1t0d0          Soft Errors: 607 Hard Errors: 122 Transport Errors: 130
Vendor: SEAGATE  Product: ST373207LSUN72G  Revision: 065A Serial No: 054432ZX97
Size: 73.40GB
Media Error: 99 Device Not Ready: 0 No Device: 11 Recoverable: 607
Illegal Request: 0 Predictive Failure Analysis: 9
c1t1d0          Soft Errors: 0 Hard Errors: 2 Transport Errors: 5
Vendor: FUJITSU  Product: MAT3073N SUN72G  Revision: 0602 Serial No: 0517B04DKE
Size: 73.40GB
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t2d0          Soft Errors: 60 Hard Errors: 119 Transport Errors: 78
Vendor: SEAGATE  Product: ST373207LSUN72G  Revision: 065A Serial No: 064135KFJE
Size: 73.40GB
...

 

Well, what is all this "errors" stuff about?


Device Not Ready: The drive returned the sense key 0x2 (Not ready).

Media Error: The drive returned the sense key 0x3 (Medium Error).

No Device: The drive returned the sense key 0x6 (Unit Attention); in the case of a removable device, this must have happened multiple times.

Hard Errors: All the above conditions are counted as Hard errors with the addition of the SCSI sense key 0x4 (Hardware Error).

Illegal Request: The drive returned the sense key 0x5 (Illegal Request). This is also treated as a Soft Error, so that kstat is incremented as well.

Recoverable: The drive returned the sense key 0x1 (Recovered Error) to indicate that the last command completed successfully but the drive had to take some recovery action. This is also treated as a Soft Error, so that kstat is incremented as well.

Predictive Failure Analysis: The drive returned sense key 0x6 (Unit Attention) with an ASC (Additional Sense Code) of 0x5D, indicating that the drive has exceeded its predictive failure threshold. This is treated as a soft error.

Transport Error: This error occurs for a number of reasons, all related to being unable to transport the command. The command could have timed out or been reset, or the host bus adapter could have been unable to put the command onto the SCSI bus. This is neither a soft nor a hard error.
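The definitions above boil down to a small lookup table. Here it is as a shell sketch, purely as a memory aid: the function name `sense_to_counter` is made up for this illustration, and the mapping only restates the list above rather than anything read from the driver source.

```shell
# Map a SCSI sense key (and optional ASC) to the iostat -En counter it
# increments, restating the list above. Illustrative helper only.
sense_to_counter() {
    key=$1
    asc=${2:-}
    case "$key" in
        0x1) echo "Recoverable (soft)" ;;
        0x2) echo "Device Not Ready (hard)" ;;
        0x3) echo "Media Error (hard)" ;;
        0x4) echo "Hard Errors only (hard)" ;;
        0x5) echo "Illegal Request (soft)" ;;
        0x6)
            # Unit Attention: ASC 0x5D means predictive failure.
            case "$asc" in
                0x5[Dd]) echo "Predictive Failure Analysis (soft)" ;;
                *)       echo "No Device (hard)" ;;
            esac
            ;;
        *) echo "unknown" ;;
    esac
}
```

For example, `sense_to_counter 0x6 0x5D` reports the Predictive Failure Analysis counter, while a plain `sense_to_counter 0x6` reports No Device.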

 



Since I had a lot of hard/transport errors, and my company has Platinum support from Sun, I ordered replacement parts (2 hdds) to swap out the old ones that throw errors.
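To shortlist the drives worth replacing, you can filter the per-disk header lines of `iostat -En` by their hard and transport error counts. This is a minimal sketch: the function name `failing_disks` and the default threshold of 10 are my own illustrative choices, not Sun guidance. On Solaris you may need to set `AWK=nawk`, since the stock awk may not accept `-v`.

```shell
# Print disks whose hard or transport error count exceeds a threshold.
# Reads `iostat -En` output on stdin. The default threshold of 10 is an
# arbitrary illustrative value.
failing_disks() {
    ${AWK:-awk} -v limit="${1:-10}" '
        # Per-disk header lines look like:
        # c1t0d0  Soft Errors: 607 Hard Errors: 122 Transport Errors: 130
        $2 == "Soft" && $3 == "Errors:" {
            hard = $7 + 0
            transport = $10 + 0
            if (hard > limit || transport > limit)
                printf "%s hard=%d transport=%d\n", $1, hard, transport
        }
    '
}
```

Typical use would be `iostat -En | failing_disks 10`.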


These are my steps for replacing the faulty hdd in SVM with RAID-1 (mirroring). In this example the faulty hdd is c1t1d0, the submirror living on it is d100, and the mirror using this submirror is d1:

1. Identify the submirrors which reside on faulting disk drives

 

# metastat

 

<look for the slices that are being used on faulting drives, note them>
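On a box with many metadevices, the compact `metastat -p` form is easier to scan than the full output. The sketch below lists the metadevices built on a given disk and the mirror that owns a given submirror. Both helper names are hypothetical, and the parsing assumes the usual one-line-per-metadevice layout, where mirrors appear as `d1 -m d100 d101 1` and simple submirrors as `d100 1 1 c1t1d0s0`.

```shell
# List metadevices that sit directly on slices of the given disk.
# Reads `metastat -p` output on stdin.
metadevices_on_disk() {
    ${AWK:-awk} -v disk="$1" '
        {
            for (i = 2; i <= NF; i++) {
                name = $i
                sub(/^\/dev\/dsk\//, "", name)   # accept full paths too
                if (index(name, disk "s") == 1) { print $1; next }
            }
        }
    '
}

# Print the mirror (a "-m" line) that lists the given submirror.
# Reads `metastat -p` output on stdin.
mirror_of() {
    ${AWK:-awk} -v sub_md="$1" '
        $2 == "-m" {
            for (i = 3; i <= NF; i++)
                if ($i == sub_md) { print $1; next }
        }
    '
}
```

For the example in this post, `metastat -p | metadevices_on_disk c1t1d0` should list d100, and `metastat -p | mirror_of d100` should print d1.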

2. Detach these submirrors from their mirrors

# metadetach d1 d100

<example: detaching submirror d100, which lives on the faulty hdd, from mirror d1>

 

3. Clear the info SVM keeps about the detached submirror

 

# metaclear d100

4. Check that you got rid of that submirror (this should print nothing):

# metastat -p | grep d100

5. Check for SVM state database replicas on the disk that will be replaced

# metadb | grep c1t1d0

6. If you find any replicas on this disk, delete them and check that they were deleted

# metadb -d c1t1d0s7 && metadb | grep c1t1d0
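Before deleting anything, it helps to know how many replicas live on the disk and how many will be left elsewhere, so you know what to recreate later. A rough sketch, assuming the device path is the last column of plain `metadb` output; the helper name `replica_count` is my own.

```shell
# Count state database replicas on one disk versus all other disks.
# Reads plain `metadb` output on stdin; the device path is the last field.
replica_count() {
    ${AWK:-awk} -v disk="$1" '
        $NF ~ /^\/dev\/dsk\// {
            total++
            if (index($NF, "/dev/dsk/" disk "s") == 1) on_disk++
        }
        END { printf "on_disk=%d elsewhere=%d\n", on_disk, total - on_disk }
    '
}
```

Run it as `metadb | replica_count c1t1d0` and note the `on_disk` number for the recreate step.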

7. If there are any mounted filesystems on this disk that are not controlled by SVM, unmount them

8. Find and unconfigure the faulty hdd, replace it (checking dmesg), and configure the new one

# cfgadm -al

# cfgadm -c unconfigure c1::dsk/c1t1d0

# dmesg

# cfgadm -c configure c1::dsk/c1t1d0

# cfgadm -al

<this assumes that you put the new hdd into the same slot that the faulty one occupied>
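If the `cfgadm -al` listing is long, the attachment point for one disk can be pulled out with a small filter. This sketch assumes the standard column order (Ap_Id, Type, Receptacle, Occupant, Condition); the helper name `ap_id_for_disk` is hypothetical.

```shell
# Print the attachment point ID and occupant state for one disk.
# Reads `cfgadm -al` output on stdin.
ap_id_for_disk() {
    ${AWK:-awk} -v disk="$1" '
        {
            # Disk Ap_Ids look like c1::dsk/c1t1d0
            n = split($1, parts, "/")
            if (n == 2 && parts[2] == disk) print $1, $4
        }
    '
}
```

`cfgadm -al | ap_id_for_disk c1t1d0` should show `configured` before the unconfigure step and after the final configure step.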

9. Label the new hdd with the same VTOC as the disk holding the other submirror (or use VTOC output that you saved before replacing the hdd)

# prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
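The "saved VTOC" variant could look like the sketch below: dump the label to a file while the old disk still answers, then feed that file to `fmthard` once the new disk is in. The function names and the backup path in the usage note are assumptions for illustration; `prtvtoc` and `fmthard -s` are used the same way as in the one-liner above.

```shell
# Save the VTOC of a disk to a file (run this BEFORE pulling the disk).
save_vtoc() {
    prtvtoc "/dev/rdsk/${1}s2" > "$2"
}

# Write a previously saved VTOC onto the replacement disk.
restore_vtoc() {
    fmthard -s "$2" "/dev/rdsk/${1}s2"
}
```

For example, `save_vtoc c1t1d0 /var/tmp/c1t1d0.vtoc` before the swap and `restore_vtoc c1t1d0 /var/tmp/c1t1d0.vtoc` after it.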

10. Update the DevID

# metadevadm -u c1t1d0

11. Recreate the replica DB, if necessary, on the slice reserved for it

# metadb -a -c 2 c1t1d0s1

12. Recreate the submirror and attach it to the corresponding mirror

# metainit d100 1 1 c1t1d0s0

# metattach d1 d100

 

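13. Check that everything is OK and the submirrors are resyncing

# metastat d1

<sample output below was taken from another mirror, so the device names differ from the example>

...
metastat d100
d100: Mirror
Submirror 0: d120
State: Resyncing
Submirror 1: d110
State: Okay
Resync in progress: 14 % done
...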

 

 

14. Grab a beer!
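While the beer is cooling, the resync from step 13 can be watched with a small polling loop instead of rerunning `metastat` by hand. A minimal sketch: both function names are invented here, and the parsing relies only on the "Resync in progress: N % done" line shown in the output above.

```shell
# Print the resync percentage found in `metastat` output on stdin,
# or nothing if no resync is running.
resync_percent() {
    sed -n 's/.*Resync in progress: *\([0-9][0-9]*\) *% done.*/\1/p'
}

# Poll a mirror until metastat no longer reports a resync.
# $1 = mirror name, $2 = seconds between polls (default 60).
wait_for_resync() {
    while pct=$(metastat "$1" | resync_percent); [ -n "$pct" ]; do
        echo "$1: resync $pct % done"
        sleep "${2:-60}"
    done
}
```

Usage would be `wait_for_resync d1 60`; the loop exits as soon as the "Resync in progress" line disappears.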

The only thing that keeps me wondering is how to reset the error counters in the "iostat -En" output. That requires a reboot, but I want to find another way to do it...

Cheers!


4 comments

Comment from: anandraj [Visitor]
Yeah, I am also searching for an option to clear the errors displayed by the command iostat -En without restarting the machine. Can anyone help me?


with regards
Anandraj
04/10/10 @ 08:49
Comment from: Rev. [Visitor]
did you ever find a way to clear errors displayed via command iostat without rebooting?

Thanks in advance
05/10/10 @ 21:29
Comment from: Nik Maslov [Member] Email · http://www.nikmaslov.com
Rev - no, AFAIK that's impossible without a reboot; Sun engineer approved :)
06/10/10 @ 19:52


©2010 by Nik Maslov
